RADIATION HARDENED BY DESIGN ASYNCHRONOUS FRAMEWORK
by
Moises Fernando Herrera Buitrago
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of
the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)
August 2022
Copyright 2022 Moises Fernando Herrera Buitrago
Epigraph
“...If thou are not versed
in the business of adventures
get thee aside and pray
whilst I engage these giants in combat...”
“The Ingenious Gentleman Don Quixote of La Mancha”
by Miguel de Cervantes Saavedra, 1605
Acknowledgements
This dissertation was made possible by the guidance and support of Professor Peter Beerel,
whose support ranged from encouraging research paths to fruitful discussions that led
to this effort. I am grateful to past and present members of my research team who
have helped, challenged, and collaborated with me; without their help and guidance I would
not have been able to explore the research directions that led to the material compiled in
this document. Special mention goes to the Niobium Microsystems hardware engineering team for
their support in guiding, learning, and exploring the intricacies of Chisel's synthesis flow.
Also worthy of mention is the support provided by Diane Demetras throughout this period,
providing insight on academics and procedures.
Heartfelt acknowledgements go to my family, especially my wife Angela, for their constant
love and support. This achievement would not have been possible without them.
Table of Contents
Epigraph
Acknowledgements
List of Figures
List of Tables
List of Algorithms
Abstract
1 Introduction
1.1 Motivation
1.2 Dissertation Contributions
1.3 Dissertation Organization
2 Background
2.1 Asynchronous Templates
2.1.1 Quasi-Delay-Insensitive (QDI)
2.1.2 Bundled Data (BD)
2.1.2.1 Burst mode (BM)
2.1.2.2 Click template
2.2 BD Timing Resilient Templates
2.2.1 Blade Template
2.2.1.1 Blade Handshaking Protocol
2.2.1.2 Error Detection Logic (EDL)
2.2.2 Sharp template
2.3 Radiation Effects in CMOS Circuits
2.3.1 Single Event Transient (SET) and Single Event Upset (SEU)
2.3.2 Linear Energy Transfer (LET)
2.4 Radiation Hardening By Design Techniques
2.4.1 Temporal Masking
2.4.1.1 Glitch filters
2.4.1.2 Dual Interlocked Cell (DICE) latch
2.4.2 Spatial redundancy or modular redundancy
2.4.2.1 Triple Modular Redundancy (TMR)
2.4.2.2 Guarded Dual Modular Redundancy (GDMR)
2.4.3 Gate Sizing
2.4.3.1 Guard Gates (GG)
2.5 Hardware Components selected for RHBD Mutex
2.5.1 Set-Reset (SR) latch
2.5.2 Metastability filter
2.5.3 CMOS Schmitt Trigger
2.6 SERAD Template
2.6.0.1 Soft Error Resilient Timing
2.7 Hardware Description Language (HDL)
2.7.1 Verilog
2.8 Hardware Construct Language (HCL)
2.8.1 Scala Language
2.8.2 FIRRTL: Flexible Internal Representation for RTL
2.8.2.1 Creating a FIRRTL transform
2.9 Network on Chip (NoC) Fundamentals
2.9.1 NoC components
2.9.1.1 Router
2.9.1.2 Network Interface (NI)
2.9.1.3 Links
2.9.2 NoC Topology
2.9.3 Omega-Network Topology
2.9.3.1 Network Interconnection
2.9.3.2 Routing
2.10 Fully Homomorphic Encryption (FHE) Fundamentals
3 Novel Timing Resilient Asynchronous Protocols and Templates
3.1 Blade Template Optimization
3.1.1 Blade-Open Template
3.1.2 Burst Mode Implementation
3.1.3 Blade compared to Blade-Open
3.2 Novel Blade Open-Close (Blade-OC) Template
3.2.1 Blade-OC Handshaking Protocol
3.2.2 Time Resiliency Window Comparison
3.2.3 Blade-OC Timing Resiliency Window
3.2.4 Managing Hold Timing Constraints
3.2.5 Performance Analysis
3.3 Timing Resilient templates as support for RHBD templates
4 Novel Radiation Hardened By Design (RHBD) Templates and Cells
4.1 SERAD-Click Template
4.1.1 SERAD-Click-EDL Handshaking
4.2 RHBD Architectures
4.2.1 Hardening of the Click Controller
4.2.1.1 Guard Gate applied in SERAD-Click templates
4.2.1.2 RHBD Click Approach
4.2.2 SERAD-Click Controller Design
4.2.2.1 SERAD-Click-EDL Controller
4.2.3 Design Flow
4.2.4 SERAD-Click-EDL Controller Evaluation
4.2.4.1 SERAD-Click-EDL Area Evaluation
4.2.4.2 Fault Verification
4.3 Radiation Hardened By Design Mutex
4.3.1 Related work
4.3.1.1 Baseline mutual exclusion element
4.3.1.2 State-of-the-Art Mutex design
4.3.2 Novel RHBD Mutex Cell
4.3.2.1 3-input NAND gate implementing Schmitt Trigger circuitry
4.3.2.2 Modified SR latch with feedback path control
4.3.2.3 Guard Gate implementation
4.3.3 Experiments and Results
5 RHBD Asynchronous Flows
5.1 Original Bundled Data (BD) Flow
5.1.1 BD modules and channels
5.1.2 BD Chisel Design file
5.1.3 Chisel Compiler
5.1.4 Intermediate Representation (IR) Circuit file
5.1.5 Verilog Compiler
5.1.6 RTL Verilog file
5.2 Novel RHBD flows
5.2.1 RHBD flow setup
5.2.2 Timing Resilient (TR) flow
5.2.3 Triple Modular Redundant (TMR) flow
6 Application - Case Study
6.1 Fully Homomorphic Encryption (FHE) Accelerator
6.1.1 FHE accelerator components
6.1.2 AXI modules
6.1.2.1 Instruction Queue
6.1.3 MAC PE
6.1.3.1 Traffic Control Unit (TCU)
6.1.3.2 Number Theoretic Transform (NTT)
6.1.3.3 Cipher-text Buffer (CTB)
6.1.3.4 Permutation NoC (PNoC)
6.1.4 RHBD Permutation NoC Application
6.1.4.1 Block Diagram
6.1.4.2 Sub Blocks
6.1.5 Application Results
7 Conclusions
References
List of Figures
2.1 Asynchronous Handshaking protocols: a. Four phase, b. Two phase
2.2 Bundled data template
2.3 Click template
2.4 Click block diagram
2.5 Bubble Razor Flip-Flop, taken from [32]
2.6 Blade template
2.7 Blade timing for a 4-stage pipeline
2.8 Blade Handshake signals
2.9 EDL Block level diagram
2.10 Sharp template, taken from [36]
2.11 Radiation particle hit on a sensitive area and the radiation particle strike model
2.12 Traditional glitch Filter
2.13 Glitch Filter implementing a C-element
2.14 SR latch implementing cross-coupled 2-input NAND implementation
2.15 Inverting Schmitt Trigger and the baseline mutex
2.16 SERAD template
2.17 SERAD timing diagram, taken from [7]
2.18 SERAD Burst Mode specification
2.19 Network on Chip Components
2.20 NoC Direct router-core topologies
2.21 NoC Indirect router-core topologies
2.22 Omega-Network topology
2.23 End-to-End encrypted computation
3.1 Blade-Open template
3.2 Blade-Open Handshaking and controller’s internal signals
3.3 Burst Mode specification
3.4 Token Burst Mode specification
3.5 Blade-Open timing diagram
3.6 Blade-OC stage
3.7 Blade-OC controller
3.8 Blade-OC controller burst mode diagrams
3.9 Blade-OC timing diagram showing latch pulses on the Master (M) and Slave (S) stages for ideal TRW
3.10 Blade timing diagram showing latch pulses on the Master (M) and Slave (S) for ideal TRW
3.11 Sharp timing diagram showing latch pulses on the Master (M) and Slave (S) stages for ideal TRW
3.12 Blade-OC consecutive latch pulse overlap
3.13 Effective cycle time vs. TRW for a one-token 2-stage ring implementing log-normal delays a: µ = 0.219 ns, σ = 0.069 ns; b: µ = 0.366 ns, σ = 0.114 ns
3.14 Effective cycle time vs. TRW for a one-token 3-stage ring implementing log-normal delays a: µ = 0.219 ns, σ = 0.069 ns; b: µ = 0.366 ns, σ = 0.114 ns
3.15 Effective cycle time vs. TRW for a: one-token 3-stage ring. b: one-token 2-stage ring assuming best performance case
3.16 Effective cycle time vs. TRW for a: one-token 3-stage ring. b: one-token 2-stage ring assuming worst-case delay
4.1 SERAD-Click-EDL template
4.2 SERAD-Click-TMR architecture
4.3 SERAD-Click-EDL architecture
4.4 SERAD-Click-EDL BM specification
4.5 SERAD-Click-EDL controller timing diagram
4.6 Guard Gate [76]
4.7 Muller C-element [77]
4.8 Spice simulations showing GG output degradation
4.9 Click controller
4.10 Generalized RH-Click controller diagram
4.11 SERAD-Click-EDL controller implementation
4.12 Arrangement of SR latches
4.13 Proposed RHBD mutex
4.14 Special 3-input NAND implementing Schmitt Trigger circuitry
4.15 Iset current model for LET values 2, 4 and 7 MeV-cm²/mg, and voltage upset at the intermediate node. Inset: Design under test
4.16 Waveforms for Iset 7 MeV-cm²/mg and SET events at outputs and at the subsequent logic stage
4.17 Waveforms for Iset 7 MeV-cm²/mg at proposed mutex’s node X0 after arbitration
4.18 Proposed mutex waveforms for Iset of duration 100 ps at node X0 after arbitration
5.1 Bundled data flow
5.2 Timing resilient flow
5.3 Triple Modular Redundant flow
6.1 FHE accelerator block diagram, adapted from [102]
6.2 Permutation block diagram
6.3 TMR and TR layout images obtained for Permutation NoC size = 4
List of Tables
3.1 Simulation parameters for the two-stage implementation
3.2 Minimum average cycle time obtained for the two-stage implementation
3.3 Minimum average cycle time obtained for the three-stage implementation
3.4 Best case cycle time for two-stage implementation
3.5 Average cycle time for the three-stage implementation assuming worst-case delays
3.6 Comparison of Blade and Blade-OC on Plasma
4.1 Comparison of Click controllers in terms of area
4.2 Single 7 MeV-cm²/mg LET event in nodes
4.3 Comparison to baseline and state-of-the-art circuits
6.1 Table with parameters
6.2 Permutation NoC I/O ports
6.3 Router core I/O channels
List of Algorithms
1 Chisel module renaming
2 EDL datapath insertion
3 TMR datapath Annotation insertion
Abstract
The complex interrelationships between technology, design, and fabrication require that
radiation hardness be considered when circuits will be used in low-orbit or deep-space
environments. Unlike traditional synchronous circuits, asynchronous circuits do not use a
global clock signal to control the update of the state registers, and they offer potential
benefits that include soft-error tolerance with low overhead. The traditional method to design
digital circuits, including asynchronous circuits, uses a Hardware Description Language
(HDL) such as Verilog, which includes variable assignments, explicit notations to express
concurrency, and control flow structures. However, HDLs have been compared to assembly
languages because of their low level of abstraction, explicitly describing the structure of the
circuit in detail. In contrast, Chisel is a modern hardware construct language (HCL) that
provides a more abstract description of digital circuits and is part of a synthesis framework
that automatically produces lower-level register-transfer level (RTL) circuit descriptions.
This dissertation proposes a synthesis framework aimed at generating novel Radiation
Hardened by Design (RHBD) asynchronous circuits from Chisel specifications using novel
timing and soft-error tolerant control templates and an RHBD mutual exclusion element to
support arbitration. Specifying these designs in Chisel enables scalability and reusability by
reducing the non-recurring engineering costs of design, test, and verification. We demonstrate
the utility of this flow on a design case study of an asynchronous network on chip (ANoC) that
is part of an accelerator for fully homomorphic encryption. Two different RHBD templates,
the traditional Triple Modular Redundant (TMR) template and the timing resilient template,
were used to explore different ANoC designs. The resulting layouts show that the timing
resilient design is 2.46x smaller than the more traditional TMR counterpart. Our flow highlights
the potential benefits of asynchronous circuits, not just for soft-error tolerance, but also for
modularity, reusability, and composability.
Chapter 1
Introduction
The traditional method to design a digital circuit is to describe its structure and behavior
using a Hardware Description Language (HDL) such as Verilog. An HDL resembles
software programming languages in that it includes variable assignments, explicit notations
to express concurrency, and control flow structures. HDLs have been compared to assembly
languages because of their low level of abstraction, describing circuit components in detail
(i.e., describing gate netlists). This low level of abstraction is the limitation that hardware
construct languages (HCLs) try to address by bringing in the power of a modern underlying
programming language.
1.1 Motivation
Chisel [1] is a hardware construct language (HCL) that provides a high-level description of
digital circuits from which lower-level register-transfer level (RTL) circuit abstractions,
generally described in Verilog, are produced. An RTL description models a digital circuit in
terms of the flow of signals (data and control signals) and the logic operations applied to them
between consecutive hardware registers. The use of an HDL enables a precise, formal
description of a digital circuit and, additionally, enables the synthesis of the HDL text
description into a gate-level netlist using commercial synthesis tools. Customizing generic RTL
for reuse in specific ASIC or FPGA implementations involves complexity that designers try to
manage by writing collections of custom scripts, in C or Python, that perform ad-hoc
programmatic RTL modifications, but these scripts are neither reusable, reliable, nor composable.
Commercial CAD tools do not completely solve the reusability problem either. Instead,
CAD tools primarily focus on the separate problems of RTL generation, synthesis, and
place-and-route flows. While existing CAD tools can perform all of these tasks, each step requires
a significant amount of customization to integrate orthogonal concerns, such as the logical
design features (bus bandwidths, signal fan-outs/fan-ins, scalable-design constraints, etc.).
The main advantage of Chisel is that it makes the interrelationships of system components
explicit at a high level [2]. This approach allows designers to manipulate components that can
be selected, assembled, and transformed in various combinations to satisfy specific custom
applications. Additionally, the use of an HCL makes it possible to parameterize and create
fine-grained RTL, ideal for reuse, scalability, and productivity. For example, a Network on Chip
(NoC) can be explored with different templates and constraints where multiple specifications
need to be considered.
Asynchronous circuits are digital circuits that do not use a global clock signal to control
register and state updates [3]. They offer multiple benefits (low power, high
performance) when used for dataflow-dependent circuits, i.e., circuits that idle
while there is no data. This behavior is characteristic of NoC applications, whose activity
consists mainly of momentary burst transfers of data.
The complex interrelationships between technology, design, and fabrication require that
radiation-hardness objectives be considered when circuits are intended for radiation-harsh
environments, such as low orbit or deep space [4]. Radiation-induced events in sub-micron
technologies can temporarily or permanently compromise the normal operation of a digital
circuit. As charged particles pass through the chip, they can produce electron-hole
pairs that may be collected by the junctions in the transistors, which can flip a latch
or generate a transient pulse (glitch) at the output of a gate.
Generally, charged particles induce a current in susceptible nodes that can create a voltage
pulse lasting about 100-200 picoseconds (ps), and this transient can propagate as a circuit
signal that disrupts normal operation. Making matters worse for operation in space, single-
particle disruptions often result from high-energy cosmic rays traveling through the silicon,
just under the IC surface, with sufficient range to induce currents in multiple electrical nodes
[5].
Asynchronous designs, in general, can adapt to timing variations caused by process, voltage,
and temperature (PVT) extremes better than their synchronous counterparts [6]. This
flexibility has recently been extended to mitigate soft errors via an asynchronous
Radiation Hardened by Design (RHBD) template, the Soft Error Resilient Asynchronous template
(SERAD) [7].
A traditional means of radiation hardening involves altering the fabrication process and
is called radiation hardening by process (RHBP). RHBP has the advantage of being very
robust and reliable, but it has high manufacturing costs.
In contrast, radiation hardening by design (RHBD) uses standard CMOS technologies.
The hardening is implemented at the system architecture, circuit design (topology and sizing
of components), or physical layout level, or through a combination of these techniques, and
does not rely on the fabrication process itself. For sub-micron technologies, the primary
concern is circuit-level optimizations leveraged by layout and other possible hardening
techniques [8].
A primary motivation for using dataflow networks as a hardware representation is that
throughput analysis, latency analysis, and slack matching (balancing of FIFO channels) can be
performed. This is extremely difficult in an FSM representation due to the state-explosion
problem, as described in [9].
The Radiation Hardened by Design (RHBD) asynchronous framework explained in this
dissertation develops an HDL framework that allows the reuse and redesign of existing non-
hardened asynchronous designs. This is demonstrated by implementing an application test
case.
1.2 Dissertation Contributions
The main contribution of this dissertation is the development of an RHBD framework that
implements asynchronous circuits for radiation-harsh environments, using Chisel
specifications that enable reusability and scalability. This is in contrast to state-of-the-art HDL
asynchronous flows such as the Asynchronous Circuit Toolkit (ACT) [10] and Loihi's synthesis
flow [11]. Both are custom flows that use the communicating sequential processes
(CSP) language as their HDL and in-house compilers to obtain RTL specifications. More
specifically, our contributions include:
• Development of timing resilient asynchronous templates, starting with the optimization
of the original Blade template and including the novel Blade-OC template [12].
• Development of RHBD templates and cells, specifically SERAD-Click-TR [13],
SERAD-Click-TMR, and the RHBD Mutex cell [14].
• RHBD framework, implementing the HCL paradigm using two different RHBD
techniques, Timing Resilient (TR) and Triple Modular Redundant (TMR).
• Case study implementing an asynchronous NoC using the RHBD framework.
1.3 Dissertation Organization
This dissertation is organized as follows:
Chapter 2, Background, provides the general background that supports this dissertation.
Specifically, it covers traditional asynchronous templates, describing the common paradigms
and foundations; radiation effects in modern CMOS circuits; Radiation Hardening By Design
techniques, used to develop the asynchronous template and key elements used in this
framework; hardware description languages and the state-of-the-art approach of construct
languages; fundamental concepts of Network on Chip; and, last but not least, concepts of fully
homomorphic encryption (FHE). These last two are important for the test case presented later in
Chapter 6.
Chapter 3, Timing Resilient Asynchronous Protocols and Templates, provides a brief
description of the specific asynchronous templates developed, namely the Blade template
optimizations and the novel Blade Open-Close (Blade-OC) template.
Chapter 4, Radiation Hardened By Design Templates and Cells, describes the novel
SERAD-Click controller and the RHBD Mutex cell.
Chapter 5, RHBD Asynchronous Design Flow, explains in detail the design flow that
enables reusability and design flexibility by implementing the HCL paradigm.
Chapter 6, Application - Case Study, describes the case study implementing the novel
templates and testing the novel design flow. This includes the FHE accelerator in detail and
the NoC module that serves as the case study.
Chapter 7, Conclusions, summarizes the major contributions of the research and
discusses potential future work.
Chapter 2
Background
This chapter provides the general background that supports this dissertation.
2.1 Asynchronous Templates
The two most common design templates are explained and some of the bundled data tem-
plates are described in detail.
2.1.1 Quasi-Delay-Insensitive (QDI)
Quasi-Delay-Insensitive (QDI) is the asynchronous template most resilient to process variation
and temperature/voltage fluctuations because of its data-completion paradigm [3], in which a
constant-weight data encoding, i.e., one-of-N encoding, is used. This communication encoding
generally requires twice as many wires for the same data width. In particular, QDI
templates use completion-detection logic to signal when data is valid, which
makes them robust to delay variations at the cost of increased area and high switching
activity.
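As a minimal sketch of the data-completion idea, the following Python model uses dual-rail encoding, the one-of-two case of one-of-N encoding, so that each bit occupies two wires. The function names and the treatment of an all-zero pair as "not yet valid" are illustrative assumptions, not taken from a specific QDI library or from the templates discussed later.

```python
# Behavioral model of dual-rail (one-of-two) encoding with completion detection.
# All names are hypothetical; an all-zero pair is assumed to mean "no data yet".

def encode_dual_rail(bits):
    """Map each bit to a (false_rail, true_rail) pair with exactly one rail high."""
    return [(0, 1) if bit else (1, 0) for bit in bits]


def is_complete(word):
    """Completion detection: data is valid once every pair has exactly one rail high."""
    return all(false_rail ^ true_rail for false_rail, true_rail in word)


def decode_dual_rail(word):
    """Recover the bits; refuse to decode a word that is not yet valid."""
    if not is_complete(word):
        raise ValueError("data not valid yet")
    return [true_rail for _, true_rail in word]


def wire_count(word):
    """Two wires per bit, i.e. twice the single-rail width."""
    return 2 * len(word)
```

A partially arrived word such as `[(0, 1), (0, 0)]` is reported as incomplete, so a receiver simply waits however long the second bit takes to arrive, which is where the robustness to delay variation comes from.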
In a QDI circuit, all gates and wires can have arbitrary delays, except for some wire
forks, labelled isochronic forks [3], which are constrained to ensure correct functionality.
Isochronic forks have the additional constraint that the delay to the different ends of the
fork must be the same. An interesting feature of QDI circuits is that transitions on the primary
inputs need not arrive in any particular order, and the arrival times of the signals are not
bounded. A common gate used in QDI implementations is the Muller C-element. This state-holding
memory element implements the function $C' = AB + AC + BC$, where $A$ and $B$ are the
inputs, $C$ is the current output, and $C'$ is the next output.
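The state-holding behavior of the C-element can be checked with a short behavioral model. This is only a sketch: the function names and the input sequence are illustrative, and a real cell is a transistor-level or standard-cell implementation rather than software.

```python
# Behavioral model of a two-input Muller C-element: C' = AB + AC + BC.
# Function names and the example sequence are illustrative.

def c_element_next(a, b, c):
    """Return the next output given inputs a, b and the current output c."""
    return (a & b) | (a & c) | (b & c)


def run_c_element(input_pairs, initial=0):
    """Apply a sequence of (a, b) input pairs and record the output after each."""
    outputs = []
    c = initial
    for a, b in input_pairs:
        c = c_element_next(a, b, c)
        outputs.append(c)
    return outputs
```

The output rises only when both inputs are 1, falls only when both are 0, and otherwise holds its previous value.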
2.1.2 Bundled Data (BD)
Bundled data templates implement the same datapath as a synchronous design together with a
bundled control path (e.g., micropipelines [15]), where the handshaking signals control the
flow of the datapath, replacing the traditional clock tree. The control path uses delay lines
matched to single-rail combinational logic, providing a low-area, low-switching-activity
asynchronous solution (e.g., [16]). The delay lines must be implemented with sufficiently large
margins in the presence of on-chip variations, reducing the advantages of this approach. Some
researchers have proposed solutions to mitigate these margins, such as duplicating the
bundled-data delay lines [17], constraining the design to regular structures such as PLAs [18],
using soft latches [19], or implementing programmable delay lines [20]. This design style uses
standard static logic in the datapath, similar to traditional synchronous design. This has the
advantage that standard libraries and synthesis tools can automatically generate the datapath
logic, so the designer just needs to design the local clock tree, i.e., the BD controllers that
satisfy the timing requirements of the datapath.
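The cost of the delay-line margin can be illustrated with a small sizing calculation. This is a sketch under stated assumptions: the logic delay, margin percentage, and per-cell delay are hypothetical numbers, and a real flow would derive them from static timing analysis of the target library.

```python
# Sizing a matched delay line for a bundled-data stage (illustrative only).
# Delays are integer picoseconds so that the arithmetic is exact.

def delay_line_cells(logic_delay_ps, margin_percent, cell_delay_ps):
    """Smallest number of delay cells whose total delay covers the
    worst-case logic delay plus the requested margin."""
    numerator = logic_delay_ps * (100 + margin_percent)
    denominator = 100 * cell_delay_ps
    return -(-numerator // denominator)  # ceiling division


def meets_bundling_constraint(cells, cell_delay_ps, logic_delay_ps):
    """The request must not arrive before the data has settled."""
    return cells * cell_delay_ps >= logic_delay_ps
```

With a hypothetical 400 ps worst-case logic delay, a 20% margin, and 50 ps delay cells, the line needs 10 cells instead of 8; the two extra cells are the kind of overhead that margin-mitigation techniques aim to reduce.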
Templates developed using bundled data channels that implement timing resiliency are
described in Section 2.2, and those that implement soft-error resiliency for radiation
environments [7] are described in Section 2.6. The datapath logic is matched with a delay line
that is triggered by the control part when the input data is valid. Note that the control
circuitry can follow a QDI implementation style and requires timing matching with the datapath
delays.
The bundled data channels consist of request and acknowledge lines and a data bus. The
sequential elements in a bundled datapath can be flip-flops (FF) or latches. Right after
reset, the controller of a token buffer initiates the transfer of valid data between pipeline
stages by asserting the request signal. Once the data has been consumed by the receiving stage,
which controls the datapath, the acknowledge wire is asserted. Note that after the initial
send of a token, a token buffer behaves like a normal buffer.
Figure 2.1: Asynchronous Handshaking protocols: a. Four phase, b. Two phase
The request (Req) and acknowledge (Ack) signals are part of the handshaking protocol.
In four-phase signaling, a rising edge on request followed by a rising edge on acknowledge
indicates a valid transaction; these signal transitions are illustrated in Figure 2.1. The rising
edge of the Req signal indicates to the receiver that there is new valid data on the datapath.
This event is referred to as a token. The rising edge of the Ack signal indicates to the sender that the data
has been consumed. After these transactions, both Req and Ack return to logic 0, in order
to be ready to start a new handshaking for the next data propagation.
For the two-phase handshaking protocol, there is no difference in meaning between the rising
and falling transitions of the request (Req) and acknowledge (Ack) handshaking lines. A rising
or falling transition on Req indicates the presence of a new token, and a transition on Ack
acknowledges the Req signal. Regardless of which handshaking protocol is implemented, the
bundled data design matches its delay lines to the delay of the combinational stage.
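The difference between the two protocols can be sketched by listing the signal events needed per token. The function names below are illustrative, and delays and data are left out of this simplified model.

```python
def four_phase(tokens: int):
    """Events for `tokens` transfers with return-to-zero (four-phase) signaling."""
    events = []
    for _ in range(tokens):
        # Req+, Ack+ carry the token; Req-, Ack- return both wires to logic 0.
        events += [("Req", 1), ("Ack", 1), ("Req", 0), ("Ack", 0)]
    return events


def two_phase(tokens: int):
    """Events for `tokens` transfers where every transition is meaningful."""
    events = []
    req = ack = 0
    for _ in range(tokens):
        req ^= 1  # any Req transition announces a new token
        events.append(("Req", req))
        ack ^= 1  # any Ack transition acknowledges it
        events.append(("Ack", ack))
    return events
```

Under this model a four-phase transfer costs four transitions per token, while a two-phase transfer costs two.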
Figure 2.2: Bundled data template
Bundled-data (BD) designs have a significant advantage over QDI designs because BD designs
show switching activity similar to that of their synchronous counterparts. This is because the
combinational logic datapath is unchanged. The total area is also similar because the added
area of the control circuits and delay lines is roughly comparable to that required by the clock
tree. The traditional bundled data template is illustrated in Figure 2.2. In general, a key
advantage of asynchronous implementations is that when no data activity is required in the
system, no token is sent into the pipeline, which remains idle and consumes only static power.
This offers a form of optimal clock gating, as explained in [21, 22].
2.1.2.1 Burst mode (BM)
The BD controllers can be implemented as a Burst-Mode state machine [3] and synthesized
using the tool 3D [23]. In burst-mode circuits, multiple inputs can change simultaneously in a
burst fashion, after which a number of output signals can also change. In this method, circuits
are specified using a standard state machine diagram, in which each arc is labeled by a set
of input transitions (an input burst that cannot be empty) and a set of output transitions
(note that an output burst can be empty). States are designated as nodes with exclusive
labels. In each state, the inputs specified on one of the input bursts leaving that state can
occur with no required order of arrivals. The machine’s state reacts to the inputs only when
all the expected inputs in the burst have occurred. Subsequently, the given output burst
occurs and then the state machine enters the next state. The new set of inputs are allowed
to occur after the system has reacted and settled completely from the previous input burst.
In this way, the circuit requires a generalized notion of the fundamental-mode assumption
to hold [3].
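The burst-mode behavior described above can be sketched as a small Python state machine. This is only a behavioral illustration: the class name, the signal names, and the two-arc specification are hypothetical, and timing (the fundamental-mode assumption) is not modeled.

```python
class BurstModeMachine:
    """Toy burst-mode FSM: an arc fires once its whole input burst has arrived."""

    def __init__(self, arcs, start):
        # arcs maps state -> list of (input_burst, output_burst, next_state).
        for transitions in arcs.values():
            for input_burst, _, _ in transitions:
                if not input_burst:
                    raise ValueError("an input burst cannot be empty")
        self.arcs = arcs
        self.state = start
        self.arrived = set()  # input edges seen since entering this state

    def on_input(self, edge):
        """Record one input edge (e.g. 'a+'); return the output burst if an arc fires."""
        self.arrived.add(edge)
        for input_burst, output_burst, next_state in self.arcs[self.state]:
            if set(input_burst) <= self.arrived:
                self.state = next_state
                self.arrived = set()
                return tuple(output_burst)
        return None


# Hypothetical two-arc specification, invented for illustration.
spec = {
    "s0": [(("a+", "b+"), ("x+",), "s1")],
    "s1": [(("a-",), (), "s0")],  # an empty output burst is allowed
}
machine = BurstModeMachine(spec, "s0")
first = machine.on_input("b+")   # burst incomplete, nothing fires
second = machine.on_input("a+")  # burst complete, x+ is emitted
third = machine.on_input("a-")   # fires with an empty output burst
```

The inputs of a burst may arrive in any order; the machine reacts only after the last one.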
2.1.2.2 Click template
The Click template uses two-phase handshaking and implements flip-flops (FF) as state-holding
elements [24]. It is an asynchronous bundled-data design style that uses a 2-phase
handshake protocol to communicate between Click controllers through a push channel, where
all input handshakes are passive and all output handshakes are active. Figure 2.3 shows
a simple example of a Click controller with a single input handshake and a single output
handshake. For its correct operation, two constraints must be satisfied: (i) the clock pulse width
must satisfy the minimum pulse width of the target technology library; (ii) the input handshake
signals must remain stable during the active phase of the clock. The first
constraint can be checked by a static timing analysis tool, and if necessary additional
buffers or cells with different drive strength will be automatically placed to increase the
internal propagation delays. The second is readily achieved by the implementation of the
asynchronous 2-phase handshaking protocol, as described in [3].
Figure 2.3: Click template (signals: rst, L.req, L.ack, R.req, R.ack, clk; state element: FF; delay lines: $t_{ctrl}$, $t_{comb}$)
Regardless of the Click controller complexity, it can be generalized as illustrated
in Figure 2.4. The input and output signals can be separated into handshake signals that
interact with neighboring input and output controllers and control signals that interact
with the datapath. The design features a combinational logic block that interacts with
edge-triggered sequential elements, here represented by a single FF. Additionally, the general
design includes two delay lines. The internal delay line $t_{ctrl}$ represents the internal
propagation delays associated with the combinational logic, while the external $t_{comb}$ is set
to match the delay of the critical path in the datapath. Regarding Figure 2.4, $t_{ctrl}$ is
equivalent to the propagation delay of the AND-OR gates plus the FF setup time, and $t_{comb}$
is equal to zero.
Figure 2.4: Click block diagram (blocks: controller combinational logic, FF; signals: rst, L.req, L.ack, R.req, R.ack, clk; delay lines: $t_{ctrl}$, $t_{comb}$)
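A behavioral sketch of a single Click controller is given below. The gate-level equations are not listed in this section, so the firing rule used here is an assumption based on the commonly described two-phase Click behavior: the controller fires when the input request differs from its own phase (a new token is waiting) and the output acknowledge equals its phase (the previous token was consumed). The class and method names are illustrative, and the delay lines are not modeled.

```python
class ClickController:
    """Two-phase Click stage; one FF output drives both L.ack and R.req."""

    def __init__(self):
        self.phase = 0  # FF state after reset (assumed low)

    def fire(self, l_req: int, r_ack: int) -> bool:
        # Assumed rule: new token on L (l_req != phase), R channel idle (r_ack == phase).
        return l_req != self.phase and r_ack == self.phase

    def step(self, l_req: int, r_ack: int) -> bool:
        """Evaluate once; on fire, toggle the FF (a clk pulse) and return True."""
        if self.fire(l_req, r_ack):
            self.phase ^= 1
            return True
        return False


ctrl = ClickController()
pulses = [
    ctrl.step(l_req=1, r_ack=0),  # first token arrives, stage fires
    ctrl.step(l_req=1, r_ack=0),  # same request seen again, no new token
    ctrl.step(l_req=0, r_ack=0),  # second token, but downstream has not acked
    ctrl.step(l_req=0, r_ack=1),  # downstream acked, stage fires again
]
```

Because the only state is an ordinary flip-flop, this style maps onto a standard-cell flow without custom asynchronous cells.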
The main advantage of a Click implementation is the use of positive edge-triggered flip-flops
(FF) as the storage elements, without requiring any special custom asynchronous cell
like a C-element. This characteristic makes this template well suited for a traditional syn-
chronous synthesis flow. Additionally, the area overhead implementing an asynchronous
controller is low compared to the BM controller counterpart. One of the major advantages
of implementing asynchronous circuits with Click is the ease of integrating with commercial
EDA tools and the consequent optimizations through static timing analysis. Moreover, the
required logic gates are commonly found in commercial standard-cell libraries.
Click templates are used in Blade-Open (Section 3.1) and in the SERAD-Click templates
described in Chapter 4.
2.2 BD Timing Resilient Templates
Traditional synchronous designs must incorporate substantial timing margins to ensure cor-
rect operation considering worst-case delays on critical paths caused by process, voltage,
aging, and temperature variations [25]. In contrast, asynchronous designs are more adap-
tive because they replace the clock distribution circuitry with asynchronous controllers that
automatically adapt to variations [26]. Timing resilient templates promise to remove increasingly
large margins due to process, voltage, and temperature variations and take advantage of
average-case data to improve performance. However, previously proposed synchronous resilient
templates have either suffered from metastability or required expensive architecture modifications
to add replay-based logic that recovers from timing errors, which leads to high timing
error penalties and poses a tough design challenge.
In particular, asynchronous latch-based designs have more flexible handling of hold timing
constraints [27] and can hide the performance overheads associated with the asynchronous
handshaking [28]. Incorporating error-detecting latches and adjusting the associated con-
trol, makes the asynchronous designs timing resilient and enables them to operate closer
to the average delay of their associated data path [29]. Compared to their synchronous
timing-resilient counterparts, these asynchronous designs offer several advantages, includ-
ing metastability-safe operation, low timing error penalties, and universal application, not
requiring complex replay-based logic [29–31]. The synchronous resilient template Bubble
Razor [32] requires no architectural replay and can be applied to any pipeline enabling real
time error detection and correction. The Razor Flip-Flop is shown in Figure 2.5.
Figure 2.5: Bubble Razor Flip-Flop, taken from [32]
Bubble Razor implementations are based on a two-phase latch design and use error detection logic (EDL)
to detect timing violations, as described in more detail in Section 2.2.1.2. When a timing
violation is detected, the template stalls the neighboring pipeline stages through clock gating.
The main drawback of this template, however, is the inability to deal with metastability in
the Error Detecting Latch (EDL). In particular, metastability can occur if a data transition
occurs at the very beginning of the TRW (when the latch opens). Determining if this event
is a timing violation or not may result in metastability that can take more than one clock
cycle to resolve. Unfortunately, Bubble Razor has no protection for this and consequently
metastable values can corrupt the control logic operation, significantly reducing reliability
and mean time between failures [33].
2.2.1 Blade Template
Based on this paradigm, a bundled data timing resilient template, Blade [29], was developed
at USC. The timing resilient bundled-data design is an extension of the bundled data design
that adds logic and complexity to the sequential logic on datapaths and controllers, aiming
to provide timing error detection and correction in the template. In addition to
mitigating the effect of PVT variations, it can also increase the average performance
by taking advantage of data dependencies.
Some of this work has focused on bundled data timing resilient templates: Blade-O
and the latest Blade Open-Close template. Each Blade template takes advantage of the
average-case delay in the datapath and implements different strategies to provide
improved timing performance compared to its synchronous counterparts.
The Blade template is a bundled data template that operates at average-case
performance, is robust to metastability issues, and requires no replay-based logic.
Additionally, it has a low timing error penalty, which is explained in Section 3.2.2.
As shown in Figure 2.6, Blade uses single-rail logic followed by error detecting latches (EDLs),
two reconfigurable delay lines, and an asynchronous Blade controller. The Blade controller
implements two-phase asynchronous channels L/R [3]. The L/R channels are typical
bundled-data push channels, comprised of Req, Ack, and Data.
Figure 2.6: Blade template
The controller speculatively assumes that the data at the input of the EDL is stable
when it becomes transparent and thus sends an output request along the typical bundled
data channel L/R.
Figure 2.7: Blade timing for a 4-stage pipeline
In Figure 2.7, each stage communicates with the next stage using a handshaking protocol
implemented with an additional error channel (RE/LE) to control whether opening of the next latch
is required due to a timing violation.
Blade has four two-phase asynchronous channels [3]: L, LE, R, and RE, connecting stages
in the pipeline. The L channel is a bundled-data push channel, comprised of Req, Ack, and
Data. The LE channel, which is used to connect to the previous stage and delay the opening
of the latch if that stage suffered a timing violation, is a pull channel, i.e., the controller itself will
initiate a request. The RE channel becomes the LE channel of the next stage, allowing the
current stage to communicate a timing violation to the next. Note that the LE/RE channels do
not send a token at reset time. The controller initiates the handshaking only when it needs to
check for timing violations.
2.2.1.1 Blade Handshaking Protocol
Blade implements a speculative handshaking protocol that assumes average-case delay and
only stalls if a timing violation (the worst case) occurs in the datapath. An example of
the speculative handshaking protocol that achieves this behavior using two-phase signaling
is shown in Figure 2.8. In (a), handshake without extension, the solid green
circles show the simultaneous L.ack and LE.ack transitions. In (b), handshake with extension,
when a TRW error occurs, the dotted green circles show the delay on L.ack due to a delayed LE.ack.
The previous stage sends a request on channel R (R.req) that propagates through the δ delay line before
reaching the next Blade controller while the data propagates through the stage's combinational logic.
Figure 2.8: Blade Handshake signals
The Blade controller checks with the previous stage's controller whether the speculative request
was sent before the input data was actually stable, i.e., whether the previous stage experienced a
timing violation during the high phase of the latch. This check is performed via a second
handshake on the pull channel LE. When no timing violation occurs in the previous stage
(see Figure 2.8.a), the LE.req signal is immediately acknowledged by LE.ack, indicating that
the speculative request was correct and no extension is required. If a timing violation occurs
in the previous stage (see Figure 2.8.b), the LE.ack signal is delayed by ∆ time units while the
final, committed input data passes through the stage's combinational logic. In both cases
this stage is given a nominal delay of δ to process stable data.
In addition, note that the information of whether a timing violation occurred is not
directly transmitted between stages; rather, this information is encoded into the variable
response time between LE.req and LE.ack. Additionally, the R.req signal of the controller,
not shown in Figure 2.8, is coincident with the arrival of LE.ack, which forces the R channel
request to be delayed by ∆ as well when an extension is necessary.
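The timing consequence of this protocol can be sketched with a simple additive model: each stage contributes the nominal delay δ, plus the extension ∆ only when it experienced a timing violation. The function name and the numeric values below are hypothetical, and controller overheads are ignored.

```python
def blade_request_latency(violations, delta, extension):
    """Forward latency of one token; violations[i] is True if stage i extended."""
    return sum(delta + (extension if violated else 0.0) for violated in violations)


# Hypothetical 4-stage pipeline with delta = 1.0 and extension = 0.5 time units.
no_errors = blade_request_latency([False, False, False, False], 1.0, 0.5)
one_error = blade_request_latency([False, True, False, False], 1.0, 0.5)
all_errors = blade_request_latency([True, True, True, True], 1.0, 0.5)
```

In this model a design with worst-case margins would always pay the `all_errors` latency, whereas the speculative protocol pays the extension only for the stages that actually need it.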
2.2.1.2 Error Detection Logic (EDL)
The EDL block diagram is described in Figure 2.9 and is comprised of a latch, an error
detector, a sampling circuit, and a metastability filter. The output is a mutually exclusive
dual-rail output signal: Err0, Err1. Err0 is asserted high if there is no transition detected
in the data channel during the TRW. Otherwise, Err1 is asserted, flagging the observed
transition as a timing violation. The error detector circuit is designed to detect any transition
at the input of the latches while they are transparent and flag them as timing violation events.
Figure 2.9: EDL Block level diagram
Moreover, the EDL contains an asymmetric C-element that remembers if a timing viola-
tion was detected until it is reset by Sample going low [29]. Metastability in the EDL latch
is avoided by ensuring the clock pulse is large enough to capture the worst case delay of the
combinational logic. Metastability in the detection logic, however, cannot be avoided. For
this reason, the sampling circuit contains a metastability filter as part of a metastable-safe
Q-Flop [34] that drives the dual-rail error signals. The dual-rail outputs remain neutral
until any metastability is resolved. The unbounded time required by the metastable event
to resolve simply stalls the control logic and consequently the pipeline. On average, how-
ever, the time needed to resolve metastability is small and the resulting impact on average
performance is negligible [35]. Collecting the Err0 signals from multiple latches
takes a considerable amount of time. For the original Blade, this delay imposes a practical
limit on the maximum width of the TRW.
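The dual-rail error outputs can be illustrated with a purely functional sketch. It assumes that a stage reports no error only when every latch asserts Err0 and reports an error when any latch asserts Err1; metastability and the sampling circuit are not modeled, and the function names and time values are illustrative.

```python
def edl_sample(transition_times, trw_open, trw_close):
    """Return (Err0, Err1) for one latch given its input transition times."""
    violated = any(trw_open <= t <= trw_close for t in transition_times)
    return (0, 1) if violated else (1, 0)


def collect(latch_results):
    """Combine per-latch dual-rail flags into stage-level (Err0, Err1)."""
    err0 = int(all(e0 for e0, _ in latch_results))  # needs every latch
    err1 = int(any(e1 for _, e1 in latch_results))  # needs only one latch
    return err0, err1


early = edl_sample([0.2], trw_open=1.0, trw_close=2.0)  # settled before the TRW
late = edl_sample([1.5], trw_open=1.0, trw_close=2.0)   # changed inside the TRW
stage = collect([early, late])
```

The `all` over Err0 hints at why its collection time grows with the number of latches.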
2.2.2 Sharp template
Another asynchronous template derived from Blade, Sharp [36], enables a
performance improvement over Original Blade and Blade Optimized: instead of delaying
the opening of the downstream stage, it delays the closing of the downstream stage.
Figure 2.10 shows the Sharp controller. The controller implements a two-phase handshake
channel using 3 handshaking wires and has been implemented using Click elements [37].
In the case of Sharp, the collection time for the Err0 signal does not impact performance,
as the template controls the closing time of the next latch, for which there is typically a larger
timing margin to cover any collection delay.
Figure 2.10: Sharp template, taken from [36]
2.3 Radiation Effects in CMOS Circuits
A charged particle hitting the chip loses energy as it passes through layers of material,
generating electron-hole pairs along its track. In silicon, these electron-hole pairs have no
consequence since they recombine. In junctions with an applied electric field, however, the
electron-hole pairs are collected, and the induced current appears as a voltage perturbation.
Some floating nodes rely on capacitive charge storage to maintain data states between
charging cycles. These nodes will not recover if they are upset by an ionizing particle, and
consequently the incorrect state will persist [5].
Figure 2.11.a depicts the interaction of a radiation particle striking the sensitive depletion
regions or reverse-biased p-n junctions inside a transistor, generating a trail of electron-hole
pairs. These free carriers can drift creating a transient current pulse which results in charge
collection that generates a transient voltage at the drain node.
2.3.1 Single Event Transient (SET) and Single Event Upset (SEU)
The generated transient voltage is known as a Single Event Transient (SET), and depending
on the induced voltage fluctuation and duration, can propagate to downstream logic. If
the SET reaches a state-holding element at the right time and under the right conditions, it can
cause the state to flip, a condition known as a Single Event Upset (SEU). Both SETs and SEUs are
non-destructive forms of Single Event Effects (SEE) [38].
Figure 2.11: Radiation particle hit on a sensitive area and the radiation particle strike model
2.3.2 Linear Energy Transfer (LET)
Linear Energy Transfer (LET) is the energy per unit path length that is transferred into the material when an
ionizing particle passes through it [39]. The LET event is traditionally represented in SPICE
simulations as an independent current source connected between the drain and body transistor
terminals. This current source has traditionally been implemented as a double exponential
waveform [40], but different waveform implementations are explored in [41]. Depending on
the logic state, the junctions of the off transistors in a gate are vulnerable to SETs. Figure
2.11.b depicts the independent current source connection in CMOS circuits. For an n-hit, or
a hit on an off NMOS transistor, the direction of the independent current is from ground
to drain. In contrast, for a p-hit, or a hit on an off PMOS transistor, the direction of the
induced current changes, from $V_{dd}$ to drain.
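The double exponential waveform mentioned above is commonly written as $I(t) = \frac{Q}{\tau_f - \tau_r}\left(e^{-t/\tau_f} - e^{-t/\tau_r}\right)$, whose integral over time equals the collected charge $Q$. The sketch below evaluates this form in Python; the charge and the two time constants are hypothetical values chosen for illustration, not parameters taken from this work.

```python
import math


def set_current(t, q_coll, tau_fall, tau_rise):
    """Double-exponential strike current (A) at time t (s) for charge q_coll (C)."""
    if t < 0:
        return 0.0
    scale = q_coll / (tau_fall - tau_rise)
    return scale * (math.exp(-t / tau_fall) - math.exp(-t / tau_rise))


# Hypothetical parameters: 100 fC collected, 200 ps fall and 50 ps rise constants.
Q_COLL, TAU_FALL, TAU_RISE = 100e-15, 200e-12, 50e-12
DT = 1e-12

# Numerically integrate the pulse over 5 ns to recover the collected charge.
collected = sum(set_current(i * DT, Q_COLL, TAU_FALL, TAU_RISE) for i in range(5000)) * DT
```

Checking that `collected` is close to `Q_COLL` is a quick sanity test on the chosen time constants before using such a source in a circuit simulation.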
Another important characteristic is the critical transient pulse width, defined as the pulse
width required to propagate through an infinitely long chain of inverters. At pulse widths
smaller than the critical width, the inherent inertial delay of the gate attenuates the transient.
As an approximation, transients with a width less than half of the critical width will fade after
the affected gate. Note that a transient pulse between 100 and 200 ps becomes critical for
technologies under 350 nm [5].
2.4 Radiation Hardening By Design Techniques
These are techniques that introduce changes to a circuit design to mitigate non-destructive
SEEs, namely SET and SEU events, at critical circuit nodes, i.e., nodes where an SET
event can impair the functioning of the circuit.
To mitigate the impact of radiation on digital circuits, there are three general
categories to consider:
• Increasing node capacitance is the most common way to increase node robustness
to SET. This can be achieved by adding a capacitor, increasing transistor sizing, or
adding transistor redundancy to the node.
• Logical masking occurs in the absence of a functional radiation-sensitive
path from the inputs to the primary outputs. This is usually estimated by fault
simulations.
• Electrical masking occurs if the SET is attenuated as it propagates along the
radiation-sensitive path to the primary outputs. This can be estimated by SPICE
simulations that deposit charge on key nodes.
In subsequent subsections some important RHBD techniques are explained.
2.4.1 Temporal Masking
Temporal masking occurs if a SET reaches the primary outputs at a different moment
than the capturing window [42]. (Note: the literature refers to Clock period, not cycle
period).
2.4.1.1 Glitch filters
Spurious transitions (also called glitches) are generated when a SET occurs in a combinatorial
circuit. Glitch filters (GF) take advantage of the inertial delay of logic gates,
suppressing input pulses that are narrower than the inertial delay of the logic gate and
passing unattenuated the pulses that are longer than the gate's inertial delay. GFs are
used to remove unwanted pulses generated by SETs propagating in the combinatorial circuits,
providing temporal masking of SET pulses by suppressing propagated SETs before they can
be captured in latches/flip-flops (becoming an SEU). A glitch filter is tuned during design by
adjusting the maximum width of the propagated SET perturbation that can be suppressed.
The timing performance penalty is proportional to the width of the maximum propagated
SET glitch perturbation. A Tunable Transient Filter (TTF), designed for filtering short pulses
with fine tuning, is proposed in [43]. Another application, implementing a C-element,
is proposed in [44].
Figure 2.12: Traditional glitch Filter
Figure 2.13: Glitch Filter implementing a C-element
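A C-element-based glitch filter can be sketched in discrete time by feeding the input and a delayed copy of it into a C-element, so that the output changes only when both agree. The sketch below is a behavioral approximation with a hypothetical function name; the delay is expressed in samples rather than in real gate delays.

```python
def glitch_filter(samples, delay, initial=0):
    """Filter a sampled signal: output follows input only when it matches its delayed copy."""
    out = initial
    result = []
    for i, x in enumerate(samples):
        delayed = samples[i - delay] if i >= delay else initial
        if x == delayed:  # C-element: update on agreement, otherwise hold
            out = x
        result.append(out)
    return result


short_pulse = glitch_filter([0, 0, 1, 1, 0, 0, 0, 0], delay=3)
long_pulse = glitch_filter([0, 1, 1, 1, 1, 1, 0, 0, 0, 0], delay=3)
```

In this model a pulse no wider than `delay` samples is suppressed, while a wider pulse passes with its edges postponed by `delay` samples, which illustrates the performance penalty that grows with the filtered width.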
2.4.1.2 Dual Interlocked Cell (DICE) latch
A DICE latch implements two nodes to represent a logic 1 and two nodes to represent a logic 0. This latch presents a
write delay and uses node redundancy to provide radiation tolerance. If the two inputs are
driven by separate but identical combinational logic circuits, or “rails,” then a SET occurring
in only one of the rails will not be latched.
2.4.2 Spatial redundancy or modular redundancy
Spatial redundancy or modular redundancy (MR) usually duplicates or triplicates (TMR) the
circuit and implements a voting element at the output to filter SET propagation.
2.4.2.1 Triple Modular Redundancy (TMR)
Triple Modular Redundancy (TMR) is used to provide full correction, at the expense of area and
power. Dual redundancy has been used to detect SETs and block them with a guard gate (GG)
that stops the SET propagation. The TMR technique implements 3 sets of independent
inputs and outputs [45].
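The voting element of a TMR scheme is typically a bitwise majority function, which can be sketched as follows. The function name and the example words are illustrative.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three redundant copies."""
    return (a & b) | (a & c) | (b & c)


good = 0b1011
corrupted = 0b0011  # one copy with a flipped bit
voted = majority(good, good, corrupted)
```

A single corrupted copy is outvoted by the other two, which is why TMR corrects rather than merely detects the error.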
2.4.2.2 Guarded Dual Modular Redundancy (GDMR)
GDMR duplicates the module to provide spatial redundancy, and the dual outputs terminate in a
guard gate. The guard gate filters all SETs occurring at the output of any gate [46], which makes
Guarded Dual Modular Redundancy (GDMR) resilient to
SETs. The two rails are independent of each other, which introduces spatial redundancy.
Whenever there is an SET, only one rail (say, rail 1) is affected, and the SET may eventually
propagate up to the guard gate if there is no masking. But the other input to the guard gate, which
comes from rail 2, will not have an error. So, the two inputs to the guard gate will not match
and the guard gate will hold the previous value. GDMR seems to have better tolerance to MET
than TMR [46].
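The hold-on-mismatch behavior of the guard gate can be sketched with a non-inverting behavioral model. This is an assumption made for readability: the polarity of a physical guard gate and its reliance on parasitic capacitance are not modeled, and the names and sample values are hypothetical.

```python
def guard_gate(rail1: int, rail2: int, previous: int) -> int:
    """Pass the value when both rails agree; otherwise hold the previous output."""
    return rail1 if rail1 == rail2 else previous


# Rail 1 suffers a one-sample SET at index 2; rail 2 stays clean.
rail1 = [0, 1, 0, 1, 1]
rail2 = [0, 1, 1, 1, 1]
out = 0
outputs = []
for r1, r2 in zip(rail1, rail2):
    out = guard_gate(r1, r2, out)
    outputs.append(out)
```

The transient on rail 1 never reaches the output because the two rails disagree for its whole duration.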
2.4.3 Gate Sizing
This is a circuit-level hardening technique that increases the critical charge (Qcritical) of
gate nodes. Qcritical is the minimum amount of charge that needs to be deposited by a
particle strike to produce a SET. A node is hardened by adding capacitance or increasing
drive to dissipate the charge. This is achieved by sizing up gates or transistors inside the
gate itself, altering the W/L ratios of the transistors. This technique is compatible with all
other radiation-hardening techniques [42, 47, 48].
2.4.3.1 Guard Gates (GG)
A GG is a C-element from which the state-holding element (cross-coupled
inverter structure) has been removed. The GG relies only on the parasitic capacitance at its output node to
retain the previous valid state until the inputs agree to update the output value. In the
case of an ion strike on one of the nodes of an input circuit, the SET will be contained at the
input [49].
The above techniques, despite being effective, introduce significant area, power, and per-
formance overheads. For this reason they must be used carefully to achieve the required
robustness while not incurring unnecessary overheads in terms of power, area, or perfor-
mance.
2.5 Hardware Components selected for RHBD Mutex
In this section, some important hardware components that support the RHBD Mutex
are explained.
2.5.1 Set-Reset (SR) latch
The Set-Reset (SR) latch [50] is a state-holding element implemented with two cross-coupled
NAND gates; its state table is presented in Figure 2.14. For this application, we must
consider all input conditions and transitions, even those that are traditionally considered
illegal. For now, let us assume that the $R_0$ and $R_1$ input signals remain stable until the circuit's
outputs become stable. The idle state "a" has both inputs low. In turn this implies both
outputs are high. From this state, the latch can go to state "b" or "c", implying that one
of the outputs will flip to low correspondingly. Furthermore, moving from state "b" to state
"d.2" or from state "c" to "d.1" does not incur any change at the outputs. At the "d.1" or "d.2"
states, the SR latch behaves as a state-holding element, remembering the output values
from the previous state. Conversely, moving from state "b" to "d.1" or "c" to "d.2" is not
possible. Apart from these hold transitions, from states "b" or "c" it is only possible to go back
to state "a". Finally, from "d.1" it is possible to go to state "c" or "a" and from "d.2" it is
possible to go to state "b" or "a". These behaviors are deterministic. On the other hand,
transitioning directly from state "a" when both inputs rise simultaneously causes the circuit to
non-deterministically choose between entering state "d.1" or "d.2", and this choice may depend
on which way the internal metastability in the circuit is resolved (which may take an unbounded
amount of time) [3]. We later call this arbitration, as it provides mutual exclusivity to the outputs.
We emphasize, however, that electrical disturbances to the outputs can force the circuit to
transition between states "d.1" and "d.2". This behavior is not expected during normal SR
operation, but can be caused by an SEU.
Figure 2.14: SR latch implemented with cross-coupled 2-input NAND gates
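The state behavior described above can be reproduced with a small gate-level sketch of two cross-coupled NAND gates. The input names follow $R_0$ and $R_1$; the output names `q0`/`q1` and the `first` parameter are assumptions introduced here. Evaluating one gate before the other is only a crude stand-in for the way metastability resolves in a real circuit.

```python
def nand(x: int, y: int) -> int:
    return 0 if (x and y) else 1


def settle(r0, r1, q0, q1, first=0, max_iters=8):
    """Evaluate the two cross-coupled NAND gates, gate `first` first, until stable."""
    for _ in range(max_iters):
        previous = (q0, q1)
        if first == 0:
            q0 = nand(r0, q1)
            q1 = nand(r1, q0)
        else:
            q1 = nand(r1, q0)
            q0 = nand(r0, q1)
        if (q0, q1) == previous:
            return q0, q1
    raise RuntimeError("outputs did not settle")


idle = settle(0, 0, 1, 1)                   # both inputs low: both outputs high
one_low = settle(1, 0, *idle)               # one input rises: one output goes low
held = settle(1, 1, *one_low)               # second input rises: outputs are held
race_a = settle(1, 1, *idle, first=0)       # both rise together, gate 0 wins
race_b = settle(1, 1, *idle, first=1)       # both rise together, gate 1 wins
```

The two race results differ, which mirrors the non-deterministic choice between the two hold states when both inputs rise simultaneously.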
2.5.2 Metastability filter
The 2-input metastability filter, illustrated in Figure 2.15.b, is comprised of two voltage-
controlled inverters, namely $X_0 - G_0$ and $X_1 - G_1$. They are cross-coupled:
each inverter's input is connected to the other's $V_{dd}$ terminal. Input $X_0$, controlling the
$V_{dd}$ of the $X_1 - G_1$ inverter, guarantees that the output $G_1$ will only go high when the
mutually exclusive logic condition at the inputs, $X_0$ low and $X_1$ high, is guaranteed. Similarly,
the complementary mutually exclusive condition holds for output $G_0$. More precisely, the
voltage at output $G_0$ cannot be raised if the difference between the input voltages at $X_0$
and $X_1$ is below the PMOS threshold voltage. Consequently, when the $X_0$ and $X_1$ voltages are
metastable, near $V_{dd}/2$, the outputs remain zero. We note that for applications where a
voltage-controlled inverter is not available (e.g., an FPGA), an alternative is to replace the
inverters with 4-input NOR gates with inputs tied together [51]. This is effective because
the NOR gates have a relatively low switching threshold and thus still prevent metastable
voltages from propagating.
2.5.3 CMOS Schmitt Trigger
The inverting Schmitt Trigger [52], depicted in Figure 2.15.a, is composed of double-stacked
NMOS ($MN_0$, $MN_1$) and PMOS ($MP_0$, $MP_1$) transistors at the input, followed by $MN_2$ and
$MP_2$ feedback transistors that provide hysteresis; i.e., different input voltage thresholds for
rise and fall transitions at the output. The hysteresis provides better noise-rejecting margins
at the input with low capacitance overhead at the output. The proposed mutex circuit is
designed using novel 3-input NAND gates using a similar transistor topology. This will be
discussed in Section 4.3.2.1.
Figure 2.15: Inverting Schmitt Trigger and the baseline mutex
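The effect of hysteresis can be sketched with an idealized inverting Schmitt trigger that has two input thresholds. The threshold voltages and the input samples below are hypothetical; real thresholds depend on transistor sizing.

```python
def inverting_schmitt(voltages, v_low, v_high, initial_out=1):
    """Idealized inverting Schmitt trigger with separate falling/rising thresholds."""
    out = initial_out
    trace = []
    for v in voltages:
        if v >= v_high:
            out = 0   # input clearly high: output goes low
        elif v <= v_low:
            out = 1   # input clearly low: output goes high
        # between the thresholds the previous output is kept
        trace.append(out)
    return trace


swing = inverting_schmitt([0.0, 0.5, 0.7, 0.5, 0.3, 0.5], v_low=0.4, v_high=0.6)
noise = inverting_schmitt([0.45, 0.55, 0.45, 0.55], v_low=0.4, v_high=0.6)
```

Input noise that stays between the two thresholds never toggles the output, which is the noise-rejection property described above.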
2.6 SERAD Template
The Soft error resilient asynchronous bundled-data template (SERAD) [7] implements a
combination of temporal and spatial redundancy to mitigate Single Event Transients (SETs)
and Single Event Upsets (SEUs) that can arise in harsh environments like space. A transient
voltage pulse, or SET, can be caused by a charged particle strike on the
combinational logic. A SET can be tolerated if it does not cause a timing violation on
registers. An SEU can occur through a particle strike on a state-holding element.
This event is unrecoverable, so registers need to be SEU resilient. SERAD implements
Error Detection Logic, as described in Section 2.2.1.2, in order to detect SETs at the
inputs of sequential elements and correct them via re-sampling. There is a time penalty only
in the presence of a SET, an event that rarely occurs; thus the expected average performance
is comparable to the baseline unhardened bundled data design.
The SERAD stage is depicted in Figure 2.16. The main difference from a traditional
bundled data template is the doubly redundant handshaking lines between modules. The template
implements single-rail combinational logic, error detecting logic (EDL), delay lines, Dual
InterLocked Storage Cell (DICE) latches, and the novel SERAD controller. The SET events
that originate in the sequential elements of the pipeline are covered by DICE latches. A DICE
latch implements a cross-coupled inverter structure that makes it robust to charged particle strikes.
SERAD prevents SETs that originate in the combinational logic from propagating from
one pipeline stage to the next using a special form of temporal redundancy. Any SET which
appears at the input of a pipeline latch when it is transparent is identified as an error by the
EDL and is mitigated by stalling the pipeline until the data is re-sampled. This template
utilizes an Error Detection Logic as described in Section 2.2.1.2.
Figure 2.16: SERAD template
2.6.0.1 Soft Error Resilient Timing
Figure 2.17 shows the expected behavior of the CLK signals associated with two instructions
flowing through a four-stage pipeline. The arrows show the dependencies between CLK
transitions. As Instruction 1 propagates through Stage 1, a SET occurs in the combinational
logic path. This event is detected by the EDL on Stage 2 at the end of δ + σ. Stage
2's latch opens again and checks for any SET occurrence. Because no SET is detected,
the rising edge of Stage 3's CLK signal is delayed by δ after the second falling transition
on the Stage 2 latch. Note that if there are multiple SET occurrences, the re-sample action on
the Stage 2 latch is performed again and the pipeline is halted as necessary to prevent any data
corruption from being propagated downstream in the pipeline.
Figure 2.17: SERAD timing diagram, taken from [7]
The controller is specified as a BM FSM [53], as explained in Section 2.1.2.1. The sequence of
actions in the BM specification is shown in Figure 2.18:
• On reset signal, all initial outputs are set to low.
• When the Lack+ signal is asserted by the previous stage, the FSM transitions from
state 0 to state 1.
• On receiving Lreq+ from the previous stage and Rack+ from the next stage, the
controller sets the clock signal high (CLK+), then waits for the EDL error signal
Corr- and the delay line done+ before setting CLK-. The FSM is now at state 4.
• If the EDL signal Err+ is received, indicating the occurrence of a SET, the controller
sets the clock signal high (CLK+) again; thus the controller loops to state 3.
• If the EDL signal Corr+ is received, indicating that no SET was detected, the controller
sets Rreq+ to the next stage and sets Lack- to the previous stage, signaling that
the latched data is error free (state 6).
Figure 2.18: SERAD Burst Mode specification
The controller follows a two-phase handshaking protocol, so the BM arcs in the path from
state 6 to state 1 are replicated with inverted signal polarity. The previous actions are performed
cyclically until the reset signal is asserted again.
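The re-sampling loop of the controller can be sketched at a very high level: the latch is clocked, the EDL result is examined, and the latch is clocked again for as long as the EDL reports an error. The function below counts CLK pulses for one token. It is a simplification that ignores the handshake wires, the state numbering, and the delays; the function name and the string encoding of the EDL results are hypothetical.

```python
def clk_pulses_until_accept(edl_results):
    """Count CLK pulses for one token; edl_results yields 'Err' or 'Corr' per sample."""
    pulses = 0
    for result in edl_results:
        pulses += 1  # CLK+ followed by CLK- for this sampling attempt
        if result == "Corr":
            return pulses  # data is error free, request goes to the next stage
        if result != "Err":
            raise ValueError(f"unexpected EDL result: {result!r}")
    raise RuntimeError("EDL never reported Corr")


clean = clk_pulses_until_accept(["Corr"])
one_set = clk_pulses_until_accept(["Err", "Corr"])
two_sets = clk_pulses_until_accept(["Err", "Err", "Corr"])
```

Only tokens that observe a SET pay for extra pulses, which matches the claim that the penalty is incurred only when a SET is present.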
2.7 Hardware Description Language (HDL)
In this section, the hardware description languages used are discussed.
2.7.1 Verilog
Verilog is used in the design and verification of digital circuits at the register-transfer level (RTL)
of description. Among other HDLs, like VHDL, Verilog represented a productivity improvement
for circuit designers who were already using graphical schematic capture software and writing
software to describe and simulate logic circuits [54].
Verilog lacks higher-level abstractions, e.g., interfaces, new data types (logic, integer
representation), enumerated types, arrays, hardware-specific expressions (different always
blocks), and others that make RTL designs easier to model. SystemVerilog [55] added
all of these features but lacked industry acceptance, development, and support.
2.8 Hardware Construct Language (HCL)
Chisel [56] (Constructing Hardware In a Scala Embedded Language) is a hardware construction
language. Chisel is embedded in Scala as a domain-specific language (DSL) and
inherits Scala’s object-oriented and functional programming features to describe circuit generators.
Chisel can be used to describe digital circuits at the register-transfer level (RTL), but instead
of expressing an RTL design directly, a user writes a Chisel program to construct
the desired circuit.
The features of Chisel’s host language, Scala, allow designs to be more parameterizable, reusable
and modular than Verilog designs. For example, a Chisel program can implement
a recursive Scala function to construct an adder-reduction tree or a custom logic function,
parameterized on the buses’ bit-width or some other design meta-parameters. Unlike the explicitly
unrolled code required in Verilog, the same Chisel generator can be reused by just
redefining the required parameters. Similarly, a complementary Chisel program can implement
the condition-checking hardware modules required for testing.
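The generator idea described above can be sketched in a few lines. The following is a minimal illustration written in Python rather than Chisel, so it is only a stand-in for a real Chisel generator: the types Wire and Add and the functions adder_tree and make_tree are hypothetical names introduced here, and the rule that each adder output is one bit wider than its widest input is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Wire:
    """A named input bus of a given bit-width (hypothetical toy type)."""
    name: str
    width: int


@dataclass(frozen=True)
class Add:
    """Two-input adder node; output assumed one bit wider than the widest input."""
    left: "Node"
    right: "Node"

    @property
    def width(self) -> int:
        return max(self.left.width, self.right.width) + 1


Node = Union[Wire, Add]


def adder_tree(inputs: List[Node]) -> Node:
    """Recursively build a balanced adder-reduction tree over the inputs."""
    if not inputs:
        raise ValueError("adder_tree needs at least one input")
    if len(inputs) == 1:
        return inputs[0]
    mid = len(inputs) // 2
    return Add(adder_tree(inputs[:mid]), adder_tree(inputs[mid:]))


def depth(node: Node) -> int:
    """Number of adder levels between the deepest input and the output."""
    if isinstance(node, Wire):
        return 0
    return 1 + max(depth(node.left), depth(node.right))


def count_adders(node: Node) -> int:
    """Total number of adder nodes in the tree."""
    if isinstance(node, Wire):
        return 0
    return 1 + count_adders(node.left) + count_adders(node.right)


def make_tree(n_inputs: int, width: int) -> Node:
    """Elaborate one tree instance from two meta-parameters."""
    return adder_tree([Wire(f"in{i}", width) for i in range(n_inputs)])
```

Calling make_tree(8, 16) and make_tree(5, 8) reuses the same generator with different parameters, which is the reuse property the paragraph attributes to Chisel; an unrolled description would need to be rewritten for each case.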
2.8.1 Scala Language
Scala, named by blending the words scalable and language and designed at the École Polytechnique
Fédérale de Lausanne (EPFL) in Lausanne, Switzerland, is a programming language that supports
functional programming. Scala provides interoperability with Java, with interchangeable
libraries [57]. Scala functions make it possible to manipulate circuit components, encode interfaces
in Scala types, and use object-orientation features to create circuit libraries. This form
of meta-programming enhances productivity and robustness. Some of Chisel’s caveats
are listed below:
• Writing custom transforms requires deep knowledge about the Chisel compiler.
• Chisel semantics are under-specified and difficult to match from other languages.
• Error checking is difficult due to under-specified semantics that lead to incomprehen-
sible error messages.
• Learning a functional programming language (Scala) is difficult for RTL designers.
2.8.2 FIRRTL: Flexible Internal Representation for RTL
When the Scala-Chisel program is compiled, an intermediate representation (IR) of the design is
obtained. Instead of directly emitting Verilog, Chisel emits an intermediate representation
called FIRRTL, which represents the elaborated (parameter-resolved) RTL instance. FIRRTL
represents the circuit immediately after Chisel’s elaboration but before any circuit
simplification. Additionally, FIRRTL is a platform: the FIRRTL repository consists of a collection
of circuit-level Scala transformations aimed to simplify, verify and transform the input
circuit [58]. FIRRTL has first-class support for high-level
constructs such as vector types, bundle types, conditional statements, partial connects, and
modules. Starting from the emitted IR, successive FIRRTL transforms gradually remove high-level
constructs (vector types, bundle types, conditional statements, partial connects) through a
sequence of lowering transformations. During each lowering transformation, the circuit is
rewritten into an equivalent circuit using simpler, lower-level constructs. FIRRTL presents
different IR forms as follows (from Chisel output down to FIRRTL netlist): Chirrtl, High
Firrtl, Middle Firrtl, and Low Firrtl.
Eventually the circuit is simplified until it resembles a structured netlist. This netlist is
translated to an output language (e.g., Verilog).
FIRRTL can be serialized (converted to a String for writing to a file), and this serialized
syntax is human readable for review. Internally, it is a data structure organized as a tree of
nodes, called an abstract syntax tree (AST). One FIRRTL use case is the programmatic
transformation of a Chisel design, rather than its generation. This means adding, replacing
or transforming part of the IR circuit representation before emitting the final RTL (Verilog).
The latest FIRRTL specification can be found in [59].
2.8.2.1 Creating a FIRRTL transform
The IR form selected as input to the transform (Chirrtl, High Firrtl,
Middle Firrtl, or Low Firrtl) determines the transform’s position in the compiler’s transform queue.
Writing a Firrtl pass usually requires writing functions that walk the Firrtl data structure to either
collect information on the actual circuit or replace IR nodes with new IR nodes. FIRRTL
transforms can be standalone (i.e., they analyze the IR and take a decision) or can take annotations
(i.e., the user’s markings in the Chisel code) as inputs. Another use for annotations is to
move information between FIRRTL transforms (e.g., modules to flatten, group, and more).
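The two kinds of walks mentioned above, collecting information and replacing nodes, can be illustrated on a toy expression tree. This is a sketch in Python and does not use the FIRRTL API: the node classes Ref, Lit and Op and the functions map_expr, collect_refs and fold_constants are hypothetical names chosen for the example, and constant folding is used only as a convenient stand-in for a rewriting pass.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Ref:
    """Reference to a named signal (toy IR node, not a FIRRTL class)."""
    name: str


@dataclass(frozen=True)
class Lit:
    """Unsigned integer literal."""
    value: int


@dataclass(frozen=True)
class Op:
    """Primitive operation applied to argument expressions."""
    kind: str
    args: Tuple["Expr", ...]


Expr = object  # any of Ref, Lit, Op


def map_expr(fn: Callable[[Expr], Expr], expr: Expr) -> Expr:
    """Rebuild `expr` bottom-up, applying `fn` to every node."""
    if isinstance(expr, Op):
        expr = Op(expr.kind, tuple(map_expr(fn, a) for a in expr.args))
    return fn(expr)


def collect_refs(expr: Expr) -> set:
    """Analysis walk: gather the names of all referenced signals."""
    if isinstance(expr, Ref):
        return {expr.name}
    if isinstance(expr, Op):
        names = set()
        for a in expr.args:
            names |= collect_refs(a)
        return names
    return set()


def fold_constants(expr: Expr) -> Expr:
    """Rewrite step: replace an `add` of literals with a single literal."""
    if (isinstance(expr, Op) and expr.kind == "add"
            and all(isinstance(a, Lit) for a in expr.args)):
        return Lit(sum(a.value for a in expr.args))
    return expr


tree = Op("and", (Ref("a"), Op("add", (Lit(1), Lit(2)))))
lowered = map_expr(fold_constants, tree)
```

Because the nodes are immutable, the rewriting walk returns a new tree and leaves the input untouched, which keeps each pass independent of the ones before it.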
2.9 Network on Chip (NoC) Fundamentals
The need to scale and integrate multiple cores led to computer systems comprising millions
of transistors on a single chip, namely Systems on Chip (SoC). A simple backbone used in
SoCs is the shared-medium arbitrated bus. The issue is that a bus does not scale with the
number of cores in the SoC, so performance is reduced. A Network on Chip (NoC)
provides functionally correct and reliable data transfer by implementing a router-based network
that interconnects the SoC’s cores. The main advantages of the NoC communication architecture
are modularity and scalability [60].
2.9.1 NoC components
The NoC is composed of three main components, depicted in Figure 2.19:
• Channels/Links that implement shared communication among cores.
• Routers that implement the communication protocol.
• Network interfaces (NI) that serve as bridges between the cores and the network.
Figure 2.19: Network on Chip Components
2.9.1.1 Router
The router performs these actions:
• Flow Control: manages packet movement in the network. The management strategy
can be centralized or distributed. In the centralized approach, routing decisions are
made externally to the network and applied to all nodes. This requires a notion of
synchronicity between routers. In the distributed approach, on the other hand, the routing
decision is made locally, and each router works asynchronously from the others.
• Routing Algorithm: the logic function that selects the output port through which
a received data packet is forwarded. In deterministic routing, a received data packet
with a given source and destination is always forwarded to the same output port. In adaptive
routing, alternative output ports can be selected.
• Arbitration: some routers need to perform arbitration if there is contention in the
routing algorithm outcome, i.e., two input ports need to be forwarded to the same
output port.
• Buffering: data storage capability that absorbs any slack in the network while
forwarding data packets.
• Switching: data is transferred among routers along the path from the source node to the
target. In the circuit-switching approach, the whole path is reserved exclusively for each
packet transfer. In the packet-based switching approach, on the other hand, each link
is reserved only as the packet moves along the network path.
2.9.1.2 Network Interface (NI)
This module performs the translation between the IP cores and the network,
serving as the interface between each core’s data protocol and the common network data protocol
and managing the cores’ communication requests. It implements the functional separation between
computation (performed at the cores) and communication (performed at the network). This
allows the communication infrastructure to be reused among multiple disparate cores.
2.9.1.3 Links
A communication link is an interface that interconnects two adjacent routers in the network.
It can implement some level of buffering capability.
Typically in a synchronous system, a single electronic oscillator generates a clock signal
that defines a clock domain, i.e., the set of state-holding elements clocked by that signal.
Because of increased SoC complexity, multiple clock domains are implemented in the
same system; usually each core implements its own clock domain [61].
2.9.2 NoC Topology
The NoC is characterized by the structure of its router interconnections. This structure, the
topology, is represented by a graph G(N,C), where N is the set of routers (vertices) and C is the
set of communication channels (edges). The routers can be connected in two ways:
• Direct topology: each router is associated with a processor, becoming a node in the
network. Additionally, each node is directly connected to a fixed number of neighboring
nodes. Figure 2.20 shows the mesh and the hypercube/torus topologies.
Figure 2.20: NoC Direct router-core topologies
• Indirect topology: not all routers are connected to processing units as in the direct
model. Some routers (leaves) interface to nodes, whilst the others in the network are
used to forward the messages through the network. Figure 2.21 shows the fat-tree and
the omega-network topologies.
Figure 2.21: NoC Indirect router-core topologies
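As a small illustration of the graph view G(N,C), the sketch below builds the router set N and the channel set C of a two-dimensional mesh, one of the direct topologies mentioned above. It is written in Python under the assumption that routers are identified by (row, column) coordinates and that channels are undirected; the function names mesh and degree are hypothetical and introduced only for this example.

```python
from itertools import product
from typing import FrozenSet, Set, Tuple

Router = Tuple[int, int]
Channel = FrozenSet[Router]


def mesh(rows: int, cols: int) -> Tuple[Set[Router], Set[Channel]]:
    """Return (N, C) for a rows x cols mesh with undirected channels."""
    routers = set(product(range(rows), range(cols)))
    channels: Set[Channel] = set()
    for (r, c) in routers:
        # Link each router to its neighbor below and to its neighbor on the right.
        for neighbor in ((r + 1, c), (r, c + 1)):
            if neighbor in routers:
                channels.add(frozenset(((r, c), neighbor)))
    return routers, channels


def degree(router: Router, channels: Set[Channel]) -> int:
    """Number of channels attached to a router."""
    return sum(1 for channel in channels if router in channel)
```

In a 4 x 4 mesh built this way, corner routers have two neighbors and interior routers have four, which reflects the bounded number of neighbors per node that characterizes a direct topology.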
2.9.3 Omega-Network Topology
This is an indirect NoC topology that relies on conflict-free routing, i.e., there is no need for
arbitration, as each input has a mutually exclusive forwarding output. This network topology
is used to provide the highest parallelism between an array of senders and an array of receivers,
while still implementing shared paths and resources with a minimal footprint.
2.9.3.1 Network Interconnection
Figure 2.22 shows an eight-by-eight network in which senders and destinations are given addresses
as shown. Each interconnection column is connected to the next stage following a conflict-free
connection pattern. This means that the connections at each stage implement a cyclic
logical left shift, where each bit in the address is shifted once to the left and the most
significant bit wraps around to the least significant position.
Adjacent pairs of inputs are connected to a router element. For size N, an Omega network
contains N/2 switches at each stage and log2(N) columns. The routing performed on each
router determines the connection paths from senders to destinations at any given time, so a
path can always be made from any input to any output.
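The interconnection pattern and the network dimensions described above can be written down directly. The following Python sketch is illustrative only; the function names shuffle and omega_dimensions are introduced here and are not taken from the original work.

```python
from typing import Tuple


def shuffle(addr: int, n_bits: int) -> int:
    """Cyclic left shift of an n_bits-wide address by one position."""
    msb = (addr >> (n_bits - 1)) & 1
    return ((addr << 1) & ((1 << n_bits) - 1)) | msb


def omega_dimensions(n: int) -> Tuple[int, int]:
    """Return (switches per column, number of columns) for an N-input network."""
    if n < 2 or n & (n - 1):
        raise ValueError("N must be a power of two, at least 2")
    return n // 2, n.bit_length() - 1


# Wiring of one interconnection column for an eight-by-eight network:
# entry i holds the line that input i is connected to.
wiring = [shuffle(a, 3) for a in range(8)]
```

For the eight-by-eight case this gives four switches in each of three columns, and every interconnection column uses the same wiring.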
2.9.3.2 Routing
There are two possible routing algorithms. In destination-tag routing, the tag is determined
by the message destination following this algorithm: first, the most significant bit of the
destination address is used to select the output of the router in the first column; if the most
significant bit is 0, the upper output is selected; otherwise, the lower output is selected. Then,
the next-most significant bit of the destination address is used to choose the output of the
router in the next column. This process continues until the last column’s router is
reached and the data is delivered to the destination.
Figure 2.22: Omega-Network topology
For example, if a data’s destination address is 001, the routing selections will be: upper,
upper, lower. Another example, if the destination address is 101, the routing selections will
be: lower, upper, lower.
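Destination-tag routing as described above can be checked with a short simulation. The Python sketch below assumes that each interconnection column applies a cyclic left shift to the line address and that a router's upper and lower outputs correspond to a least significant bit of 0 and 1; the function names are hypothetical.

```python
from typing import List, Tuple


def shuffle(addr: int, n_bits: int) -> int:
    """Cyclic left shift of an n_bits-wide address by one position."""
    msb = (addr >> (n_bits - 1)) & 1
    return ((addr << 1) & ((1 << n_bits) - 1)) | msb


def destination_tag_route(src: int, dst: int, n_bits: int) -> Tuple[List[str], int]:
    """Return the per-column output choices and the line finally reached."""
    position = src
    choices = []
    for column in range(n_bits):
        bit = (dst >> (n_bits - 1 - column)) & 1  # destination bits, MSB first
        position = shuffle(position, n_bits)       # inter-column wiring
        position = (position & ~1) | bit           # router output: 0 upper, 1 lower
        choices.append("lower" if bit else "upper")
    return choices, position
```

The choices depend only on the destination address, and the simulated position ends at the destination for every source, which matches the claim that a path can be made from any input to any output.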
In XOR-tag routing, the routing selections are based on the formula source XOR
destination. Each bit of the XOR-tag indicates either a swap, i.e., a cross (one), or a pass-thru
(zero). Again, the most significant bit (MSB) is used in the first column’s router, and so on
until the destination is reached.
For example, if source 001 sends a message to destination 010, the XOR-tag will be 011
and routers in the path will follow pass-thru, cross, cross.
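The XOR-tag example can be reproduced with the following Python sketch. As an assumption of this sketch, a pass-thru keeps the least significant bit of the line address after the shuffle and a cross inverts it; the function names are hypothetical.

```python
from typing import List, Tuple


def shuffle(addr: int, n_bits: int) -> int:
    """Cyclic left shift of an n_bits-wide address by one position."""
    msb = (addr >> (n_bits - 1)) & 1
    return ((addr << 1) & ((1 << n_bits) - 1)) | msb


def xor_tag_route(src: int, dst: int, n_bits: int) -> Tuple[List[str], int]:
    """Return the per-column router settings and the line finally reached."""
    tag = src ^ dst
    position = src
    settings = []
    for column in range(n_bits):
        bit = (tag >> (n_bits - 1 - column)) & 1  # tag bits, MSB first
        position = shuffle(position, n_bits)       # inter-column wiring
        position ^= bit                            # cross flips the output line
        settings.append("cross" if bit else "pass-thru")
    return settings, position
```

Unlike the destination tag, the XOR-tag depends on both the source and the destination, but both schemes deliver the message to the same output.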
2.10 Fully Homomorphic Encryption (FHE) Fundamentals
Homomorphic Encryption is a form of encryption that allows third parties to perform
computations on encrypted data (cipher-text) without decrypting it. The computations
are performed on the cipher-text and, when the result is decrypted, it is identical to the
result that would have been obtained if the computations had been performed on plain
(non-encrypted) data.
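The homomorphic property can be illustrated with a toy example. The Python sketch below uses textbook RSA with tiny, insecure parameters, which is homomorphic only with respect to multiplication; it is therefore a partially homomorphic scheme and not an FHE system, and it is not related to the schemes targeted in this work. All names and parameter values are chosen for illustration.

```python
# Textbook RSA with tiny, insecure parameters (illustration only).
P, Q = 61, 53
N = P * Q                            # public modulus
E = 17                               # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent (requires Python 3.8+)


def encrypt(message: int) -> int:
    """Encrypt a message smaller than N with the public key."""
    return pow(message, E, N)


def decrypt(ciphertext: int) -> int:
    """Decrypt with the private key."""
    return pow(ciphertext, D, N)


def multiply_encrypted(c1: int, c2: int) -> int:
    """Third-party operation: multiplies two ciphertexts without the private key."""
    return (c1 * c2) % N


# The third party only sees ciphertexts; the client decrypts the product.
product = decrypt(multiply_encrypted(encrypt(7), encrypt(6)))
```

Here the decrypted result equals the product of the two plain messages (modulo N), even though the party performing the multiplication never had access to them.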
A crypto-system that executes arbitrary computations on ciphertexts is known as an FHE
system. Since the algorithms that the FHE system executes never decrypt their inputs, they can
be run by an untrusted party without revealing their inputs and internal state. These systems
have great practical implications in the outsourcing of private computations, for instance, in
the context of cloud computing [62]. Nowadays, traditional encryption protects data while
stored or in transmission. On the other hand, the information must be decrypted to perform
a computation. This requirement exposes plain data to malicious exploitation or leaks.
Additionally, performing FHE demands a huge amount of computational resources compared
to plain data operations. Running FHE in software on standard processing hardware remains
untenable for practical data security applications due to the massive processing overhead.
Several factors prevent near-term industry adoption of the technology, including performance
issues, lack of standardization and complexity [63].
Figure 2.23, adapted from [64], shows the data flow of an end-to-end encrypted computation
based on homomorphic encryption. The client sends encrypted data (cipher-text) to the
third party (cloud server). The cloud server performs computations on the encrypted data
and then sends the results back to the client. Encoding/decoding and encryption/decryption
are performed on the client side, thus the secret key is not required for the computation
operations [64].
Figure 2.23: End-to-End encrypted computation
Chapter 3
Novel Timing Resilient Asynchronous
Protocols and Templates
This chapter describes the timing-resilient asynchronous templates that were
developed in this work.
Traditional synchronous designs must incorporate a substantial timing margin in the
clock period to ensure correct operation under the worst-case delays on critical paths
caused by process, voltage and temperature (PVT) variations [25]. In contrast, asynchronous
bundled-data resilient designs provide a more adaptive design by replacing the clock distribution
circuitry with asynchronous controllers that implement synchronization between consecutive
stages. This can be achieved by converting the traditional flip-flop design into a
latch-based design that allows for more flexible handling of hold timing violations and hides
the performance overheads generally associated with asynchronous handshaking.
Additionally, asynchronous resilient designs take advantage of the data dependency of the
combinational delay to keep the pipeline working with a low average stage delay instead of the
worst-case delay required by synchronous circuits. Compared to their synchronous timing-resilient
counterparts, they offer several advantages, including metastable-safe operation, low
timing penalties, and universal application, without requiring complex replay-based logic [29,33].
3.1 Blade Template Optimization
The Blade template was further optimized in two ways: reducing the handshaking overhead by
reducing the number of wires required, and adding more flexibility by explicitly controlling
the opening and closing of the registers.
3.1.1 Blade-Open Template
The new Blade-Open template implements two-phase asynchronous channels [3], L and R; see
Figure 3.1. The L/R channels are typical bundled-data push channels, comprised of Req,
Ack, and Data. The first delay line is of duration δ - ∆ and controls when the EDL becomes
transparent, allowing the data to propagate through the latch. The latch transparency window
duration, ∆, is defined by the second delay line.
The Blade-Open controller checks if the data was stable during the latch transparency
window, then sends an output request along the bundled-data channel to the next stage. If
data changes during the high phase of the Clk signal, but stabilizes before the latch becomes
opaque, it is recorded as a timing violation, which can subsequently be corrected by delaying
the handshaking with the downstream stage. Consequently, ∆ defines a timing resiliency
window (TRW) after δ - ∆ during which the average-case timing assumption may be safely
overridden.
Figure 3.1: Blade-Open template
In particular, if the combinational output transitions during the TRW, the error detection
logic flags a timing violation by asserting its Err signal, which is sampled by the controller.
The Blade controller then triggers the ∆ delay line once more in order to provide a worst-case
delay and allow the correct data to propagate through the latch to the next combinational
stage. Details are described in Section 3.2.1.
While Stage 2’s latch is transparent, a timing violation occurs, indicating that the δ delay line
in Stage 1 was shorter in duration than the combinational logic path. The rising edge of the
Stage 3 CLK signal is nominally scheduled to occur δ time units after Stage 2 CLK, which
corresponds to the dotted gray region; however, the timing violation extends this time, giving
Instruction 1 a total of δ + ∆ to pass from Stage 2 to Stage 3, covering a worst-case delay.
Conversely, Instruction 2 does not suffer a timing violation in Stage 2, which allows the Stage 3
CLK signal to activate δ time units after Stage 2 CLK.
Figure 3.2: Blade-Open Handshaking and controller’s internal signals
An example of the handshaking protocol that achieves this behavior using two-phase
signaling is shown in Figure 3.2. The first handshake does not exhibit a timing violation and
thus processes the handshaking signals with no extra delay. The second handshake exhibits a
timing violation and thus illustrates the delayed handshaking signals. The Blade-Open stage
receives a request and data value on its L channel. The Blade controller then opens its own
latch and, at latch closing, checks for glitches during that period, i.e., it flags whether the
previous stage experienced a timing violation. When no timing violations occur in the previous stage
(see Figure 3.9.a), the R.req signal is immediately asserted, indicating that an average case
occurred and no extension is required. In Figure 3.9.b, a timing violation occurs in the
previous stage, causing the R.req signal to be delayed by ∆ time units, which allows extra time
for worst-case-delay data to propagate through the stage’s combinational logic.
In both cases this stage is given a nominal delay of δ to process an average case.
In addition, notice that the information of whether a timing violation occurred is transmitted
between stages encoded in the variable delay to send R.req.
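The timing behavior described in this section can be summarized with a small behavioral model. The Python sketch below is a simplification and not the controller implementation: it assumes that time is measured from the start of the first delay line, that the latch is transparent from δ - ∆ to δ, and that data settling after δ cannot be recovered. The function name and its return convention are hypothetical.

```python
from typing import Tuple


def blade_open_stage(settle_time: float, delta: float, Delta: float) -> Tuple[bool, float]:
    """Classify when the stage input settles.

    Returns (timing_violation, extra_delay_on_Rreq).
    """
    if not 0 < Delta <= delta:
        raise ValueError("expected 0 < Delta <= delta")
    latch_opens = delta - Delta   # end of the first delay line
    latch_closes = delta          # end of the transparency window (TRW)
    if settle_time <= latch_opens:
        # Data was stable while the latch was transparent: average case.
        return False, 0.0
    if settle_time <= latch_closes:
        # Data changed inside the TRW: flagged, R.req is delayed by Delta.
        return True, Delta
    raise ValueError("data settled after the TRW; outside the scope of this model")
```

For example, with δ = 1.0 and ∆ = 0.4, data settling at 0.5 requires no extension, while data settling at 0.8 falls inside the TRW and delays R.req by 0.4.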
3.1.2 Burst Mode Implementation
The intended behavior of Blade-Open controller is specified in a Burst-Mode state machine
[65] and synthesized using the tool 3D [23]. Figure 3.3 shows the state machine for pipeline
stages with EDLs.
Figure 3.3: Burst Mode specification
The Burst Mode FSM takes as inputs signals on the handshaking channels L/R and the
internal signals clk, Err0, Err1 and Sample to control the phases of the latches and the EDL.
The signal Err1 flags a timing violation detected by the EDL. Otherwise, the Err0 signal is
received.
Figure 3.4: Token Burst Mode specification
The template has four controller types: Token EDL, Non-Token EDL, Token Non-EDL,
and Non-Token Non-EDL. The token version, shown in Figure 3.4, sends a token right after
reset. Versions that do not implement EDL modules can be used for time borrowing
only, as the datapath in Blade most closely resembles a standard time-borrowing design [66].
Non-EDL modules do not check for timing violations. These modules make it possible to reduce
the pipeline area by omitting EDL modules where the pipeline implements non-critical
datapaths and some pipelining optimization strategy is applied.
3.1.3 Blade compared to Blade-Open
The Blade handshaking protocol was optimized in the Blade-Open template by removing
channel LE/LR and embedding the required delay associated with a timing violation in the
delay of the right request signal (Rreq) (see footnote 1). Additionally, the number of internal
delay lines was reduced from 3 to 2, improving the implementation area. The Speculative Handshaking
Protocol implemented in the original Blade was reduced to a 2-phase dual-rail push channel,
as shown in Figure 3.5. Because an L.req transition is always followed by an LE.req
transition after a δ time delay, with or without the TRW error, the LE/RE channel can be
removed by embedding the delays (δ and ∆) and the TRW delay in the L.req signal transitions.
Footnote 1: The simplified Blade protocol was independently derived by myself and Dylan Hand.
Figure 3.5: Blade-Open timing diagram
3.2 Novel Blade Open-Close (Blade-OC) Template
The Blade Open-Close template [12] also replaces the synchronous global clock with asynchronous
controllers that drive error-detecting latches. Compared to its predecessors, the
template supports a larger timing resiliency window (TRW), enabling higher performance
for designs with wide variations in delay, such as those seen in near- and sub-threshold computing.
To quantify the performance improvement, a variety of pipeline structures assuming log-normal
datapath delay distributions were evaluated. Blade-OC, similar to the Sharp template
[36] briefly explained in Section 2.2.2, controls the opening and the closing of subsequent
stages. This template implements a four-phase handshaking channel on two
wires, Request (Req) and Acknowledge (Ack), similar to a traditional micropipeline implementation [67].
Compared to the original Blade and Sharp, which use a two-phase handshaking protocol, the
use of 4-phase handshaking reduces the required number of handshaking wires as well as the
complexity of the controller logic. A pipeline stage implementing an EDL module is depicted
in Figure 3.6. The template is supported by an automated design flow that converts synchronous
register-transfer level (RTL) designs to gate-level asynchronous timing-resilient
circuits. It enables a larger timing resiliency window (TRW) than the original Blade template,
which yields higher performance for designs with wide variations in delay [35], as
often occurs in near- and sub-threshold voltage domains [68]. More specifically, the TRW
is defined as the period of time in which the controller can recover from a timing violation
at the latches, as explained in detail in Section 3.1. The larger Blade-OC TRW is achieved
by careful control of the opening and closing of the local clocks, allowing an overlap period
between neighboring clock pulses. This overlap period is controlled by a programmable delay
line, providing a post-silicon tunable trade-off between hold margins and higher average performance.
Figure 3.6: Blade-OC stage
The controller behavior is specified using Burst-Mode Finite State Machines. The controller
is comprised of two burst-mode (BM) machines, as depicted in Figure 3.7. The
left BM machine manages the L handshaking channel and the dic delay that controls the
closing of the latch. The right machine controls the R handshaking channel and the dir
delay that controls the opening of the next latch. Note that the request signal Go has a
direct acknowledge signal ackGo. The acknowledge signal ackErr indirectly acknowledges
Sample and controls the sampling of the EDL module, which eventually outputs an Err
signal to the right BM machine. The BM specifications were synthesized to an unmapped
gate-level description using the 3D tool [23]. This tool produces hazard-free, Mealy-type
finite state machines, allowing multiple-input bursts. State transitions in the burst-mode
specification take the form of an input burst that in turn generates an output burst [53].
Figure 3.8 shows the burst-mode specification of the Blade-OC controller. The gate-level
implementation was manually mapped to an STMicroelectronics 28 nm gate library and tested
to ensure all fundamental mode timing assumptions were satisfied [69].
Figure 3.7: Blade-OC controller
(a) Left machine
(b) Right machine
Figure 3.8: Blade-OC controller burst mode diagrams
Note that, in addition to the control logic area overhead, the delay lines represent the biggest
portion of the area overhead of bundled-data asynchronous designs [29], accounting for up to 25% of the
total area overhead. For this reason, we designed the BM controllers to reuse delay lines,
sharing a common portion of the δ and ∆ delay lines. To provide the extra time required
when a timing violation is detected, the delay lines Di and dir are triggered twice in the
same operational cycle. As depicted in Figure 3.7, the controller
is comprised of left and right Burst-Mode machines and three reconfigurable delay lines.
Additionally, it implements control wires to the EDL module. The implementation provides
a well-controlled overlapping of neighboring clocks and saves area in the delay lines, avoiding
redundancy in overlapping delay periods. This special configuration enables fine tuning of
when the latches become transparent and opaque, and of the overlap period.
3.2.1 Blade-OC Handshaking Protocol
The handshaking protocol is specifically designed to provide a simple way to synchronize
internal stage events without creating additional asynchronous channels between non-neighboring
stages. A Blade-OC stage has asynchronous channels to both its left and right
neighbors. There is an input channel (L) connected to its predecessor stage and an output
channel (R) connected to its successor stage. The implemented channels L and R are
typical bundled-data push channels [3], comprised of request (req) and acknowledge (ack)
signals implementing a four-phase protocol. The use of a 4-phase instead of a two-phase protocol
simplifies the implementation of the control, because there is no complementary logic
of the kind typically required to implement the second phase of a two-phase design. Figure 3.9
illustrates a simplified timing diagram of the handshaking protocol. We assume a homogeneous
two-stage ring composed of Master (M) and Slave (S) stages. Following Figure 3.8, at
first, the M stage sets Rreq high, starting the handshake with the S stage. This triggers
the opening of the latch of the next stage, S, which in turn replies by setting its left acknowledge
Lack high, acknowledging the receipt of the handshaking token (input burst from state 0
to 1 on the Left BM). After a predetermined delay Di, the S controller senses the Lreq signal going
low (input burst from state 1 to 2 on the Left BM), triggering Lack to go low (output burst from
state 1 to 2 on the Left BM) and the start of its own control delay lines: dic, to set the precise
time to close the latch (input burst from state 3 to 4 on the Left BM), and dir (input burst from
state 1 to 2 and output burst from state 2 to 3 on the Right BM), to send the Rreq signal to
the next stage M, closing the Rreq signal loop. Note that the state transition between states 0
and 1 on the Right BM is for synchronization between the BM machines. As a consequence, the δ and ∆
delays are partially independent, enabling overlapped transparency of neighboring latches and
saving area in the delay lines, as the Di delay line is shared.
Additionally, the implemented handshaking protocol allows a reduction of the controller
and delay line implementation area, further improving the implementation flexibility in
linear pipelines and rings.
After the TRW finishes, the controller sends the Sample signal to the EDL module
(output burst from state 3 to 4 in the Left BM). This signal checks the status of the EDL
logic. If there is no timing violation detected by the EDL module, the signal Err0 is set
high and received by the controller (input burst from state 3 to 4 in the Right BM), so the
closing of the next stage is asserted without any extra delay. The transition from state 4
to 0 in the Right BM is for synchronization.
If a timing violation occurs, the EDL module sets the signal Err1 high (input burst from
state 3 to 5 in the Right BM), so the closing of the next stage is delayed by an additional
latch pulse width ∆ (input burst from state 5 to 6 and output burst from state 6 to 0 in the
Right BM) and the opening of the second-next stage latch is delayed by the same amount, as
illustrated in Figure 3.9b. After checking the EDL status, Sample is set low (output
burst from state 4 to 0 in the Left BM), preparing the controller for the next handshake.
(a) No timing violation, no time extension needed
(b) Timing violation, time extension needed
Figure 3.9: Blade-OC timing diagram showing latch pulses on the Master (M) and Slave (S)
stages for ideal TRW.
The Blade-OC handshaking protocol allows the next stage to open, independent of
whether the latch in the previous stage is open or closed. However, closing the next
stage before closing the current stage is not allowed.
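The two paths through the Right BM machine mentioned in the text can be traced with a small table-driven model. The Python sketch below only encodes the state numbers given above; the event labels other than Err0 and Err1 (sync, dir_elapsed, rreq_out, extra_delay_elapsed, close_next) are hypothetical placeholders and do not reproduce the actual input and output bursts of the specification in Figure 3.8.

```python
from typing import Dict, List

# Right BM state transitions as listed in the text; labels are placeholders.
RIGHT_BM: Dict[int, Dict[str, int]] = {
    0: {"sync": 1},
    1: {"dir_elapsed": 2},
    2: {"rreq_out": 3},
    3: {"Err0": 4, "Err1": 5},
    4: {"sync": 0},
    5: {"extra_delay_elapsed": 6},
    6: {"close_next": 0},
}


def run_cycle(timing_violation: bool) -> List[int]:
    """Return the sequence of Right BM states visited in one handshake cycle."""
    events = ["sync", "dir_elapsed", "rreq_out",
              "Err1" if timing_violation else "Err0"]
    if timing_violation:
        events += ["extra_delay_elapsed", "close_next"]
    else:
        events += ["sync"]
    state, visited = 0, [0]
    for event in events:
        state = RIGHT_BM[state][event]
        visited.append(state)
    return visited
```

In this reading, the cycle without a timing violation visits states 0, 1, 2, 3, 4 and returns to 0, while a violation takes the longer branch through states 5 and 6, which is where the additional ∆ is spent.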
3.2.2 Time Resiliency Window Comparison
The analysis covers the previous asynchronous templates, original Blade and Sharp, and
the Blade-OC template, determining the ideal TRW (in other words, the longest latch
pulse duration) that each template is able to achieve and how a wider TRW can improve
timing performance. For the analytic comparison among asynchronous templates, we
define equations that allow us to obtain the ideal TRW. The original Blade [29] controller
implements two two-phase handshake channels over 4 wires. It provides timing resiliency by
delaying the opening of the downstream stage’s latches upon detecting a timing violation.
The template requires two delay lines, δ and ∆. δ accounts for a preferred short delay and
∆ is the time window in which the latch is transparent and thus also represents the TRW.
The sum of these delay lines, δ and ∆, defines the worst-case stage delay (C) that
can be supported. Note that C also accounts for the period of time between the opening and
closing of consecutive stages.
We can define
$$ C = \delta + \Delta \tag{B.1} $$
The ideal TRW is defined as the longest latch pulse duration; for the Blade template it is
denoted as $\omega_b$. To quantify this ideal delay, consider a two-stage ring. Each Blade controller
speculatively assumes that the data at the input of its associated latches is stable before
it makes them transparent and sends an open request to the downstream pipeline stage.
The request goes through the δ delay before it is received at the downstream stage. Upon
receiving this request, the downstream stage sends back a request to the upstream stage
asking for confirmation regarding the speculative assumption. To process this request, the
upstream stage checks to see if a timing violation occurred in any of its latches and, if
detected, delays the response to the downstream stage by an additional ∆. Otherwise, it
responds immediately, allowing the downstream stage to open its latch and proceed without
delay. A simplified timing diagram is shown in Figure 3.10. In summary, an original Blade
controller checks if a timing violation occurs during the δ time units after opening its own
latch. But, the latch must be opaque before this decision can be made. Consequently, to
avoid slowing down the system, the TRW, i.e., the time the latch is transparent, needs to
be less than or equal to δ, as formalized in B.2. Hence, there is a constraint between δ and ∆:
$$ \Delta \le \delta \tag{B.2} $$
(a) No timing violation, no time extension needed
(b) Timing violation, time extension needed
Figure 3.10: Blade timing diagram showing latch pulses on the Master (M) and Slave (S)
for ideal TRW.
The system cycle time is ideally limited by the sum of the forward latencies around the
ring. This implies that the backward latencies associated with the asynchronous handshaking
should ideally be hidden. One stage has its latch open during ∆ and the complementary
stage is closed during the same time by δ. There is no overlapping of the local clocks in the
system. To determine the ideal $\omega_b$ for Blade and its relationship with the worst-case stage
delay (C), note from B.2 that the best case for $\omega_b$ is when ∆ is equal to δ. Replacing B.3 in B.1 we
obtain B.4.
$$ \omega_b = \Delta \tag{B.3} $$
$$ C = 2\,\omega_b \tag{B.4} $$
Thus, we can conclude that the original Blade ideal TRW is 50% of C. In other words, assuming
that no timing errors occur, the performance improvement can be up to two times the
performance of a non-resilient implementation. This performance improvement is explained
by the fact that the preferred average stage delay is 50% of the worst-case combinational stage
delay. Unfortunately, the collection of error signals from all latches within the EDL can take
a significant amount of time. This added delay represents a time penalty between closing
latches and opening downstream latches, effectively reducing the ideal TRW significantly
to less than 50% of C [29] and impacting the achievable performance gain. It may also be useful
to note that because original Blade manages timing violations by delaying the opening of
the next stage, it immediately imposes a time penalty when a timing violation is detected,
especially for those datapath delay cases that are close to the average case but suffer a
time penalty like that of a worst case.
Sharp defines two main delays, θ and λ . λ is tuned to the preferred average case and is
used to open downstream latches, similar to δ in Blade. θ is matched to the worst case delay
of the combinational stage logic and is used to close downstream latches, see S.1.
λ is independent of θ , and satisfies the constraints 0 ≤ λ ≤ θ . After opening the stage’s
latch, the controller opens the subsequent latch after λ and closes it after θ . In addition, the
controller triggers another delay line with the same delay θ after closing its own latch.
Figure 3.11: Sharp timing diagram showing latch pulses on the Master (M) and Slave (S)
stages for ideal TRW. (a) No timing violation, no time extension needed. (b) Timing
violation, time extension needed.
It uses
this latter delay line to close the subsequent latch if the source latch experiences a timing
violation. By not explicitly delaying the opening of the latch upon a timing violation, the
template defers executing the penalty associated with a timing violation. In particular, this
deferment enables the chance for a subsequent short path to mitigate the timing violation
in such a way that the next stage exhibits no timing violation and the penalty associated
with the initial timing violation is avoided, increasing the average performance. This is
particularly useful in multistage rings but not applicable in a two-stage ring in which the
handshaking forces a delay in the opening of the downstream latch. A simplified timing
diagram is presented in Figure 3.11. Consider a homogeneous two-stage ring and let us
denote the ideal TRW for Sharp ($\omega_s$) as the difference between θ and the minimum λ
($\lambda_m$) in the system. We obtain S.2:
(S.1) $C = \theta$
(S.2) $\omega_s = \theta - \lambda_m$
where $\lambda_m$ is the minimum possible delay used to set an average system delay. Note that
the Sharp handshaking protocol requires the immediate downstream stage latch
to be closed in order for the upstream latch to open. Consequently, by examining the delays
around the loop we can conclude that the maximum TRW satisfies
(S.3) $\omega_s = \lambda_m$
Thus, $\lambda_m$ needs to accommodate the $\omega_s$ of the next stage. Replacing S.3 and S.2
in S.1 we obtain
(S.4) $C = 2\,\omega_s$
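The substitution leading to S.4 can be written out step by step. The following is only a
restatement of S.1 to S.3 under the homogeneous two-stage ring assumption made above; it
introduces no new symbols.

```latex
\begin{align*}
\omega_s  &= \theta - \lambda_m            && \text{(S.2)} \\
\lambda_m &= \omega_s                      && \text{(S.3)} \\
\theta    &= \omega_s + \lambda_m = 2\,\omega_s && \text{(S.3 in S.2)} \\
C         &= \theta = 2\,\omega_s          && \text{(S.1), i.e., S.4} \\
\omega_s / C &= 1/2
\end{align*}
```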
Following the same reasoning described for Blade, no overlap of the high latch phase is
allowed between neighboring stages. Consequently, we can conclude that Sharp's ideal TRW
in a two-stage loop is at most 50% of C. This limit is the same as in Blade because neither
template allows latch pulse overlap on consecutive stages. This limitation is relaxed in the
Blade-OC template.
In Blade-OC, similar to original Blade, δ accounts for the preferred average case and ∆ for
the difference between the worst and preferred average cases. Moreover, the sum of
the delay values δ + ∆ must be set equal to C. However, the implementation of these delays
is different from previous templates. Each delay is divided across two connected controllers.
Di is common to both the δ and ∆ delays and is implemented in the preceding stage. dir and
dic, which are smaller delay lines than Di, are implemented in the following stage.
The relationship is stated in BOC.1 and BOC.2.
(BOC.1) ∆ = Di + dic
(BOC.2) δ = Di + dir
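As a rough illustration of BOC.1 and BOC.2, the short Python sketch below composes δ and ∆
from a shared line Di and the two smaller lines dic and dir. The function name and the
numeric values are hypothetical, chosen only so that δ + ∆ equals a normalized C of 1.0.
The "saved" delay assumes that δ and ∆ would otherwise be built as two independent lines,
which is an assumption of this sketch.

```python
def split_delays(di: float, dic: float, dir_: float) -> dict:
    """Compose the Blade-OC delays from the shared and per-stage delay lines."""
    big_delta = di + dic      # BOC.1: Delta = Di + dic
    small_delta = di + dir_   # BOC.2: delta = Di + dir
    built = di + dic + dir_   # delay actually implemented when Di is shared
    return {
        "Delta": big_delta,
        "delta": small_delta,
        "C": big_delta + small_delta,  # delta + Delta is set equal to C
        "built_delay": built,
        # Assumption: without sharing, delta and Delta would be two separate lines.
        "saved_delay": (big_delta + small_delta) - built,
    }


# Hypothetical values, normalized to C = 1.0, with dic and dir smaller than Di.
result = split_delays(di=0.40, dic=0.15, dir_=0.05)
print(result)
```

Under these assumptions the saved delay equals Di, since that portion is common to both
δ and ∆.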
This delay implementation allows the controller to be configured so that δ speculatively
controls the opening of the next stage's latches, targeting a preferred average case in the
combinatorial logic delay. Moreover, ∆ defines the TRW, i.e., the period of time the latch
is transparent in which any data transition at the input of the latch is deemed a timing
violation. This timing resiliency window can partially overlap with the TRW of the next
stage. The implementation provides a well-controlled overlapping of neighboring clocks and
saves area in the delay lines, avoiding redundancy in overlapping delay periods. In stages
that are more susceptible to hold time violations, a delay line on the Lack wire can also be
added. More details are discussed in Section 3.2.4. If a transition at the input of a latch
occurs during the TRW, a timing violation is detected and two events in downstream stages
are delayed. In particular, the controllers delay the closing of the latch in the downstream
stage and the opening of the latch in the stage that follows the downstream stage. Notice
that, in contrast to Sharp, this template delays the opening of the second next downstream
stage. This is also in contrast with Blade which only delays the opening of its nearest
downstream neighbor. This strategy is necessary to allow TRW overlap, as by the time
the actual controller knows that a timing violation has occurred (at the end of the TRW),
its nearest downstream controller may already have opened its latch. The combination of
delaying the closing and opening of downstream latches allows enough time to cover a worst
case delay through the combinational datapath. The uniqueness of Blade-OC is that it allows
overlapping of the transparency phase of neighboring latch pulses. In fact, the timing of the
closing of the next stage and the opening of the second next stage can be tuned post-silicon,
allowing the fixing of any hold time issues.
3.2.3 Blade-OC Timing Resiliency Window
Let us assume the same homogeneous two-stage ring to obtain the Blade-OC ideal
TRW ($\omega_B$) and its relationship with C. Note that C accounts for the time period from the
opening of the master latch to its next opening. C in Blade-OC can be defined as follows:
(BOC.3) $C = \Delta + \delta = 2\Delta - o$
where o is the overlap of TRWs illustrated in Figure 3.12. Let us denote the minimum
combinational stage delay in both stages as $\delta_C$ to generalize a basic hold time
verification for the system.
(BOC.4) $\delta_C \ge t_h + o$
where $t_h$ is the hold time constraint of a latch. Note that any overlap introduced implies a
larger $\delta_C$. To obtain the maximum overlap, let us assume in BOC.4 that $t_h$ is negligible;
then BOC.4 becomes
(BOC.5) $\delta_C \ge o$
Moreover, to avoid having a timing violation every cycle, δ cannot be too small, as it should
account for the average case delay and at least for the minimum combinational stage delay
($\delta_C$) plus any latch overlap o introduced in the path; see BOC.6. The overlap period
o can start no earlier than $\delta_C$ after the latch opens. Moreover, to satisfy the hold time
verification, o must start no sooner than $\delta_C$ before the closing of the latch. The maximum
overlap period happens when the overlap period starts right at the middle of ∆. Thus, the
overlap o needs to satisfy BOC.7.
(BOC.6) $\delta \ge \delta_C + o$
(BOC.7) $\Delta/2 \ge o$
Plugging in the expression of C from BOC.3, $\omega_B$, expressed as the ratio of ∆ to C, can be
derived as follows:
(BOC.8) $\Delta/C = \Delta/(2\Delta - o)$
Combining this with the maximum value of o from BOC.7 we obtain
(BOC.9) $\Delta/C = \Delta/(2\Delta - \Delta/2) = 2/3$
The ideal Blade-OC TRW is 66.6% of C, 16.6 percentage points higher than the ideal TRW of
previous asynchronous resilient templates. In other words, Blade-OC can support an average
case delay δ that is 3 times smaller than the worst-case delay. For very wide delay variations
in which long delays are rare, the system can be up to 3 times faster compared to a
non-resilient implementation. Moreover, the handshaking protocol provides at least ∆/2 time
units for the controller to collect all error signals from the EDL and determine if the
downstream latches must be delayed. For most systems implementing wide datapaths, this should
be sufficient to not force the template to move away from this ideal TRW.
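The dependence of the TRW fraction on the overlap can be checked numerically. The sketch
below is a minimal illustration of BOC.3, BOC.7 and BOC.8 only; the function name
trw_ratio is hypothetical and ∆ is normalized to 1.0 for convenience.

```python
def trw_ratio(big_delta: float, overlap: float) -> float:
    """Return Delta / C using BOC.3 (C = 2*Delta - o), with o bounded by BOC.7."""
    if not 0.0 <= overlap <= big_delta / 2.0:
        raise ValueError("overlap must satisfy 0 <= o <= Delta/2 (BOC.7)")
    return big_delta / (2.0 * big_delta - overlap)  # BOC.8


no_overlap = trw_ratio(1.0, 0.0)   # no pulse overlap, as in Blade and Sharp
max_overlap = trw_ratio(1.0, 0.5)  # o = Delta/2, the maximum allowed by BOC.7

print(f"TRW/C without overlap: {no_overlap:.3f}")
print(f"TRW/C with maximum overlap: {max_overlap:.3f}")
```

With no overlap the ratio falls back to the 50% limit derived for Blade and Sharp, and with
o = ∆/2 it reaches the 2/3 of BOC.9.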
3.2.4 Managing Hold Timing Constraints
Because Blade-OC allows overlapping clocks in consecutive stages, the hold
timing verification becomes critical. A hold violation can occur when the data propagation
in the combinational logic is faster than the overlap period of the consecutive latch pulses
and data passes through the next stage earlier than expected.
Figure 3.12: Blade-OC consecutive latch pulse overlap.
In general, there are three main strategies to address hold issues:
1. Adding buffers on the data-path,
2. Adding an intermediate latch stage,
3. Delaying the acknowledge back-propagation, decreasing the overlap.
Let us review each briefly. Adding buffers can be accomplished at the design stage before
fabrication. This task is generally handled by the CAD tool that performs the Static
Timing Analysis (STA), inserting buffers as required [70, 71].
Adding an intermediate lock-up latch delays the propagation on the data-path and
provides enough time for the previous data to meet the hold constraint. From a Blade-OC
flow perspective, when necessary, this would place three pipeline stages in a loop. This
increases hold margin greatly at the cost of a third bank of latches. Last, but not least,
delaying the acknowledge back-propagation, which decreases the overlap period, is an option
unique to asynchronous controller implementations. In particular, the Blade-OC template can
contain a third small delay line inserted in the acknowledge line in those stages that are
susceptible to hold time violations. Inserting this delay decreases the overlap period, thereby
improving hold margins at the cost of some decreased performance, since it slows down the
handshaking back-propagation with the previous stage.
Each strategy presents a different trade-off in area and performance that should be con-
sidered carefully.
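To make the third strategy concrete, the following Python sketch applies the hold check of
BOC.4 before and after adding an acknowledge delay. It is only an illustration: the function
name, the numeric values, and in particular the assumption that the acknowledge delay
shortens the overlap one-for-one are hypothetical simplifications.

```python
def hold_ok(min_stage_delay: float, hold_time: float, overlap: float,
            ack_delay: float = 0.0) -> bool:
    """Check BOC.4 (delta_C >= t_h + o) for one stage.

    Assumption of this sketch: a delay inserted on the acknowledge wire
    reduces the latch pulse overlap by the same amount.
    """
    effective_overlap = max(0.0, overlap - ack_delay)
    return min_stage_delay >= hold_time + effective_overlap


# Hypothetical values in ns.
before = hold_ok(min_stage_delay=0.10, hold_time=0.02, overlap=0.15)
after = hold_ok(min_stage_delay=0.10, hold_time=0.02, overlap=0.15, ack_delay=0.08)

print("hold met without acknowledge delay:", before)
print("hold met with acknowledge delay:   ", after)
```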
3.2.5 Performance Analysis
To test Blade-OC and compare it to original Blade and Sharp, a one-token two-stage ring and
a one-token three-stage homogeneous ring were evaluated in terms of average
cycle time and achievable TRW. A one-token two-stage ring structure is commonly obtained
through various forms of desynchronization [72] of typical synchronous designs, where each
Flip-Flop is decomposed into two stages, namely Master and Slave, performing datapath
retiming between them. For simplicity, we will assume the ring is homogeneous with balanced
stage delays.
Each stage in the loop was given a time-varying combinatorial delay determined by an
independent truncated log-normal distribution with common parameters µ, σ, a and b,
where b represents the upper bound that truncates the tail to 0.1%. The truncation bounds
correspond to a = 0ns and b = 1ns, where b matches the
worst combinatorial stage delay (C). During each cycle, a minimum delay $\delta_C$ constraint was
applied to each stage to accommodate the possible overlapping period and hold time.
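A per-cycle stage delay of this kind could be drawn as in the Python sketch below. This is
not the simulation code used in this work. It assumes that µ and σ denote the mean and
standard deviation of the delay in ns (an interpretation, not stated explicitly above), uses
simple rejection sampling for the truncation, and models the minimum delay constraint as a
clamp. All function names and the 0.05ns minimum delay are hypothetical.

```python
import math
import random


def lognormal_params(mean: float, std: float) -> tuple:
    """Convert a mean/std given in ns to the (mu, sigma) of the underlying normal."""
    sigma2 = math.log(1.0 + (std / mean) ** 2)
    return math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2)


def lognormal_mode(mean: float, std: float) -> float:
    """Mode of the untruncated log-normal, exp(mu - sigma^2)."""
    mu, sigma = lognormal_params(mean, std)
    return math.exp(mu - sigma ** 2)


def sample_stage_delay(rng: random.Random, mean: float, std: float,
                       a: float = 0.0, b: float = 1.0,
                       min_delay: float = 0.0) -> float:
    """Draw one delay truncated to [a, b], then apply the minimum delay."""
    mu, sigma = lognormal_params(mean, std)
    while True:  # rejection sampling: redraw samples outside [a, b]
        d = rng.lognormvariate(mu, sigma)
        if a <= d <= b:
            return max(d, min_delay)


rng = random.Random(1)  # fixed seed so the run is repeatable
delays = [sample_stage_delay(rng, 0.366, 0.114, min_delay=0.05) for _ in range(1000)]
mode = lognormal_mode(0.366, 0.114)
print(f"mode = {mode:.3f} ns, sample mean = {sum(delays) / len(delays):.3f} ns")
```

Under this interpretation, the mode for µ = 0.366ns and σ = 0.114ns lands near one third of
the 1ns upper bound.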
Figure 3.13: Effective cycle time vs. TRW for a one-token 2-stage ring implementing log-
normal delays a: µ =0.219ns, σ =0.069ns; b: µ =0.366ns, σ =0.114ns
Figure 3.14: Effective cycle time vs. TRW for a one-token 3-stage ring implementing log-
normal delays a: µ =0.219ns, σ =0.069ns; b: µ =0.366ns, σ =0.114ns
Each template was implemented using a 28nm FD-SOI cell library at nominal voltage,
following the same automated flow described in [29, 35]. Each sample delay was created and
evaluated by a behavioral EDL module. If the delay is less than δ (or λ for Sharp), the Err0
signal is raised so the controller can immediately trigger downstream logic. On the other
hand, if the delay value is greater than δ, the sample is considered a timing violation: the
EDL module raises the Err1 signal, and the controller takes the appropriate countermeasures
to provide timing resiliency.
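The decision made by the behavioral EDL module can be summarized with the Python sketch
below. It is a simplified stand-in rather than the module used in the experiments: the
function names are hypothetical, and a delay exactly equal to δ is treated as a violation,
a case the description above leaves open.

```python
def edl_classify(sample_delay: float, delta: float) -> str:
    """Return 'Err0' if the data settled before delta, 'Err1' otherwise.

    Assumption: sample_delay == delta is counted as a timing violation.
    """
    return "Err0" if sample_delay < delta else "Err1"


def error_rate(delays, delta: float) -> float:
    """Fraction of samples flagged as timing violations for a given delta."""
    flags = [edl_classify(d, delta) for d in delays]
    return flags.count("Err1") / len(flags)


# Hypothetical sample delays in ns.
samples = [0.10, 0.20, 0.40, 0.50]
print(edl_classify(0.20, 0.33), edl_classify(0.40, 0.33))
print("error rate:", error_rate(samples, 0.33))
```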
TRW was swept from 25% to 67% of C using two log-normal distributions. The first
log-normal distribution was selected to have the mode at 33% of C, corresponding to µ: 0.366ns
and σ: 0.114ns. These initial log-normal parameters were selected to test the ideal Blade-OC
TRW, $\delta_B$, described in BOC.9. The second log-normal distribution was set to test the case
in which the distribution is even more displaced to the left. The value of the mode was set to
20% of C, corresponding to µ: 0.219ns and σ: 0.069ns. Selecting a relatively small σ implies
that the main weight of the probability density is located close to µ and resides in the main
bell body. Thus, the most frequent samples can be covered by a smaller δ. Additionally, because
of the low probability density in the tail, a wide TRW implementation would be optimal,
flagging few timing violations.
Figure 3.15: Effective cycle time vs. TRW for a: one-token 3-stage ring. b: one-token
2-stage ring assuming best performance case
Figure 3.16: Effective cycle time vs. TRW for a: one-token 3-stage ring. b: one-token
2-stage ring assuming worst-case delay
Table 3.1: Simulation parameters for the two-stage implementation
Parameter     Value            Description
C             2.0ns            Worst two-stage combinatorial delay
θ             1.0ns            Worst stage combinatorial delay
δ, λ          (33 - 75) % C    Swept
∆             (67 - 25) % C    Swept
Err0 delay    150ps            Collect delay time
Err1 delay    150ps            Collect delay time
The general simulation settings are listed in Table 3.1. The relevant results for the two-
stage ring implementation are shown in Table 3.2. Figure 3.13 plots the Effective Cycle
Time (EC) versus the TRW for both log-normal distributions applied to the two-stage ring.
Effective Cycle time is defined as the average time to process a token in the system, as
defined in [73]. For the two-stage implementations using log-normal distribution ( µ : 0.219ns
and σ : 0.069ns), the optimal Blade-OC TRW is a higher fraction of C and is much flatter.
The resulting EC is also much lower, demonstrating in general the advantage of Blade-OC
for these types of delay distributions and implementations.
Table 3.2: Minimum average cycle time obtained for the two-stage implementation
Parameter           Value              Description
Blade-OC Min. EC    1.1ns @ 58% TRW    Avg. cycle time (µ = 0.219ns, σ = 0.069ns)
Blade-O Min. EC     1.2ns @ 44% TRW    Avg. cycle time (µ = 0.219ns, σ = 0.069ns)
Sharp Min. EC       1.6ns @ 52% TRW    Avg. cycle time (µ = 0.219ns, σ = 0.069ns)
Blade-OC Min. EC    1.7ns @ 44% TRW    Avg. cycle time (µ = 0.366ns, σ = 0.114ns)
Blade-O Min. EC     1.7ns @ 42% TRW    Avg. cycle time (µ = 0.366ns, σ = 0.114ns)
Sharp Min. EC       1.7ns @ 43% TRW    Avg. cycle time (µ = 0.366ns, σ = 0.114ns)
Table 3.3: Minimum average cycle time obtained for the three-stage implementation
Parameter           Value              Description
Blade-OC Min. EC    1.5ns @ 57% TRW    Avg. cycle time (µ = 0.219ns, σ = 0.069ns)
Blade-O Min. EC     1.8ns @ 42% TRW    Avg. cycle time (µ = 0.219ns, σ = 0.069ns)
Sharp Min. EC       1.6ns @ 49% TRW    Avg. cycle time (µ = 0.219ns, σ = 0.069ns)
Blade-OC Min. EC    2.1ns @ 42% TRW    Avg. cycle time (µ = 0.366ns, σ = 0.114ns)
Blade-O Min. EC     2.1ns @ 42% TRW    Avg. cycle time (µ = 0.366ns, σ = 0.114ns)
Sharp Min. EC       1.7ns @ 43% TRW    Avg. cycle time (µ = 0.366ns, σ = 0.114ns)
As mentioned earlier, however,
small values of δ lead to large values of overlap for which managing the hold time violations
in a two-stage ring can be difficult. One solution to this problem is to add a third stage in
the ring, effectively implementing a lock-up latch. The resulting clock waveforms mitigate
hold time problems at a cost of a third bank of latches. For this reason, we repeated our
experiments on a three-stage one-token ring, as reported in Table 3.3 and plotted in Figure
3.14. The plot shows that Blade-OC's minimum average cycle time on the three-stage ring
is close to 2/3 of C for the distribution with lower mean µ = 0.219ns and σ = 0.069ns. For
µ = 0.366ns and σ = 0.114ns, as the TRW is reduced, the error rate reduces but δ increases,
leading to a relatively flat portion of the EC time vs. TRW curve. Sharp obtains a
lower EC because, in multi-stage rings, it can mitigate time penalties by not delaying the
handshaking upon timing violations.
Table 3.4: Best case cycle time for two-stage implementation
Parameter           Value               Description
Blade-OC Min. EC    1.3ns @ 58% TRW     Avg. cycle time (3-stage ring)
Blade-O Min. EC     1.8ns @ 44% TRW     Avg. cycle time (3-stage ring)
Sharp Min. EC       1.6ns @ 48% TRW     Avg. cycle time (3-stage ring)
Blade-OC Min. EC    0.85ns @ 58% TRW    Avg. cycle time (2-stage ring)
Blade-O Min. EC     1.21ns @ 44% TRW    Avg. cycle time (2-stage ring)
Sharp Min. EC       1.12ns @ 52% TRW    Avg. cycle time (2-stage ring)
Table 3.5: Average cycle time for the three-stage implementation assuming worst-case delays
Parameter           Value                 Description
Blade-OC Min. EC    2.6ns @ 58-67% TRW    Avg. cycle time (3-stage ring)
Blade-O Min. EC     3.1ns @ 25-43% TRW    Avg. cycle time (3-stage ring)
Sharp Min. EC       2.1ns @ 33% TRW       Avg. cycle time (3-stage ring)
Blade-OC Min. EC    1.7ns @ 57-67% TRW    Avg. cycle time (2-stage ring)
Blade-O Min. EC     2.1ns @ 25-44% TRW    Avg. cycle time (2-stage ring)
Sharp Min. EC       1.5ns @ 32-35% TRW    Avg. cycle time (2-stage ring)
In Table 3.4 and Figure 3.15, the plotted performance assumes zero error penalty (no
timing violations are flagged in the EDLs), so only the handshaking and controller
overhead is taken into account. Blade-OC shows a lower EC due to a higher TRW in both
implementations. On the other hand, Table 3.5 and Figure 3.16 show the scenario where
the performance is impacted by a maximum error penalty (timing violations are flagged
in each EDL). This experiment aimed to test how the implemented handshaking mechanism
mitigates timing violations. Blade-OC and Sharp follow different strategies, with Blade-OC
aiming to support larger TRWs.
To test the correctness and real-life applicability of Blade-OC, the template was
implemented on a three-stage version of Plasma, a MIPS OpenCore CPU [74], using a 28nm
FD-SOI cell library at nominal voltage, following the same automated flow described in
[29, 35]. For the purposes of this comparison, the same delay line values δ and ∆ were used
for both Blade and Blade-OC designs. Because of that, Blade-OC does not implement latch
pulse overlapping in this experiment. As shown in Table 3.6, under this configuration
Blade-OC performs better than Blade, primarily due to the reduction of timing overheads
described in Section 2.2.1.2. In fact, the performance of Blade-OC may be further improved
by tuning the delay lines or comparing these designs at a lower voltage in which the delay
variations are larger. Experimental results show that Blade-OC works best when the
variability of delay is high, such as when the delays exhibit skewed log-normal distributions.
Table 3.6: Comparison of Blade and Blade-OC on Plasma
Template    Avg. Cycle Time    δ          ∆
Blade-OC    1.86951ns          0.603ps    0.804ps
Blade       1.92617ns          0.603ps    0.804ps
3.3 Timing Resilient templates as support for RHBD templates
The previous templates support the RHBD protocol and template implemented in
SERAD-Click for the following reasons:
• Blade-Open reduced the number of wires required to implement handshaking from 4 to 2,
compared to the original Blade.
• Blade Open-Close explored managing not just the opening but also the closing
of the downstream stage, implementing a configurable delay on the request and
acknowledge paths for hold timing management.
• Blade Open-Close increased the TRW, enabling an even higher level of data dependency
and higher performance for designs with wide variations in combinational delay,
very similar to a radiation event affecting the datapath.
• Blade Open-Close compared log-normal data delay distributions against previous
timing resilient templates.
• Timing resiliency is a trade-off against spatial redundancy (TMR), appropriate for designs
with complicated stage logic where spatial redundancy imposes an important area
penalty.
• Timing resiliency opens the door to soft-error tolerance through the use of a similar EDL
that is able to detect transients in the datapath due to voltage disturbances caused by
radiation strikes.
Chapter 4
Novel Radiation Hardened By Design
(RHBD) Templates and Cells
This chapter describes the novel SERAD-Click-EDL and SERAD-Click-TMR templates and the
RHBD Mutex cell. These templates are used by the SERAD Chisel flows described in Chapter
5.
4.1 SERAD-Click Template
The original SERAD controller was implemented as a Burst Mode (BM) Finite State
Machine (FSM) [53].
Because of the unbalanced delays at the output signals and the difficulties in taking a
hand-made BM specification through technology mapping, a Click version was developed,
following the Click design style described in [24]. Click templates, in general, use two-phase
handshaking and implement flip-flops (FFs) as state-holding elements. The main advantage of
a Click implementation is the use of positive edge-triggered flip-flops as the storage
elements, without requiring any special asynchronous custom cell, such as the C-element
required by the BM implementation. This characteristic makes the template well suited for a
traditional synchronous synthesis flow. Additionally, the area overhead of implementing an
asynchronous controller is low compared to the BM counterpart.
4.1.1 SERAD-Click-EDL Handshaking
The SERAD-Click template follows the same timing behavior as shown in Figure 2.17 and
described in Section 2.6.0.1. The Click implementation is based on the single-rail controller
described in [7]. The developed template is shown in Figure 4.1.
Figure 4.1: SERAD-Click-EDL template
The Click implementation performs the following sequence of actions:
• Initially, the outputs of all gates are set low by the reset signal.
• When Lreq+ is received, indicating new data is available, the And-2 Handshaking Phase
0 gate transitions to 1 and the Handshaking AND-OR tree resolves to 1, generating
a rising transition on the Handshaking FF's input and, through the OR-2 gate, a rising
transition on the Clk output.
• Because the D port of the Handshaking FF is driven by its own inverted Q, the
Handshaking-δ delay line makes the And-2 Handshaking Phase 0 gate transition to 0,
creating a falling transition on the Handshaking FF's clock input.
• If the EDL signal Corr+ is received, indicating that no SET was detected, a rising
transition is generated on the Lack-FF's clock input. This sets Rreq+ to the next stage
and Lack+ to the previous stage, signaling that the latched data in the datapath
is error free and that the controller is ready to start a new handshaking sequence.
• If the EDL signal Err+ is received, indicating the occurrence of a SET, the And-2 Err
Phase 0 gate transitions to 1 and the Err AND-OR tree resolves to 1, generating
a rising transition on the Err FF's clock input.
• Because the D port of the Err FF is driven by its own inverted Q, the Err δ delay
line makes the And-2 Err Phase 0 gate transition to 0, creating a falling transition on
the Err FF's clock input that makes the Clk output fall, finishing the re-sampling
action on the datapath.
The Clk signal is the output of the OR gate that is connected to the Handshaking FF
and Err FF clock inputs. Note that the Sample signal is the inverted Clk signal.
The controller follows a two-phase handshaking protocol, so the complementary gates,
And-2 Phase 1 will be enabled by the output of the delayed Handshaking FF Q signal.
The previous actions are performed cyclically until the reset signal is asserted again. These
actions match precisely the BM specification.
The proposed SERAD-Click-TMR implementation is unpublished at the moment of this
dissertation and requires further testing.
4.2 RHBD Architectures
The original SERAD [7] is a soft error resilient asynchronous architecture implemented in
both SERAD-Click implementations. This section provides an overview of the architectures,
highlighting some similarities and discussing their differences.
Figure 4.2 illustrates the SERAD-Click-TMR.
Figure 4.2: SERAD-Click-TMR architecture
In the SERAD-Click-TMR architecture, the delay line $t_{ctrl}$ controls both the $clk_i$ and
$smp_i$ pulse widths. The sum of this delay element with the external delay line $t_{comb}$
must be matched to the critical path of the logic stage. However, the high phase of $clk_i$
must also be as long as the $Cmp_i$ critical path. This way, there will be enough time to
outvote any hit on the datapath.
Figure 4.3 illustrates the SERAD-Click-EDL architecture and the interaction with
other controller stages, the datapath, and the SERAD Error Detecting Logic (SERAD-EDL).
Instead of master-slave flip-flops, SERAD uses lower-overhead radiation hardened (RH)
latches (e.g., [75]). SETs originating in the combinational datapath are detected by the
EDL module and are mitigated by stalling the pipeline until the data is re-sampled. Note
that SERAD-Click-TMR does not implement an EDL module.
Figure 4.3: SERAD-Click-EDL architecture
The main difference between SERAD-Click-EDL and SERAD-Click-TMR is how they
protect against SETs that arise in the combinational logic. SERAD-Click-EDL uses time
redundancy. Under normal operation, the RH latches become transparent only after their
inputs are supposed to be stable. Consequently, when the clock is high and the latches are
transparent, any transition at the input of the latches is caused by an SET. Such transitions
are identified by the SERAD-EDL, which triggers the controller to re-sample the datapath
latches until the inputs of the RH latches are stable throughout the transparency window.
This means that the high phase of the clock must be longer than the maximum SET duration,
placing a practical limit of less than 1-2ns on the duration of SETs that SERAD-Click-EDL
can tolerate.
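The reason the clock high phase must exceed the maximum SET duration can be illustrated
with a deliberately simplified Python model. The sketch assumes an SET is a single pulse,
and that it goes undetected only when it is already active when the latch opens and still
active when the latch closes, so that no transition falls inside the transparency window.
The function name and all time values are hypothetical.

```python
def set_escapes_detection(win_open: float, win_len: float,
                          set_start: float, set_len: float) -> bool:
    """True if the SET spans the whole transparency window.

    In that case the EDL sees no transition while the latch is transparent
    and a wrong value would be captured (simplified single-pulse model).
    """
    win_close = win_open + win_len
    set_end = set_start + set_len
    return set_start < win_open and set_end > win_close


# A 1.5ns SET can cover a 1.0ns window completely and escape detection.
long_set = set_escapes_detection(0.0, 1.0, -0.2, 1.5)

# A 0.8ns SET can never cover a 1.0ns window, whatever its start time.
short_set = any(set_escapes_detection(0.0, 1.0, s / 10.0, 0.8)
                for s in range(-20, 20))

print("1.5ns SET can escape:", long_set)
print("0.8ns SET can escape:", short_set)
```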
For applications that require predictable performance or in which SETs can have longer or
even unbounded duration, SERAD-Click-TMR is more suitable. Spatial redundancy is used,
and TMR replication of the combinational logic is applied. This approach rejects SETs of
any duration in the datapath. Consequently, this approach is expected to require more area
than SERAD-Click-EDL due to the TMR combinational logic implementation.
Despite major differences in their datapaths, the two radiation-hardened approaches share
similarities in their control paths. Both controllers implement the same 2-phase
handshaking [3] with neighboring controllers. SERAD-Click-TMR simplifies the controller
logic, using just the generalized SERAD approach and the same Click template described in
Section 2.1.2.2, without requiring re-sampling logic. SERAD-Click-EDL implements a protocol
that can be described with the BM diagram shown in Figure 4.4. Here, every arc is labeled
with the associated "input/output" transitions ('+' indicates a rise and '−' a fall). The
controller moves to the next state only when all inputs change to the specified values. The
sequence of actions is as follows:
• On rst+, all output signals are set to low.
• When L.req+ signal is asserted by the neighboring input controller and R.ack- is
received from the neighboring output controller, the finite state machine (FSM) tran-
sitions from state 1 to state 2 asserting clk+ output.
• As the pwc+ signal is received, clk is set to low, transitioning from state 2 to state 3.
• During state 3, if nErr goes high and pwc falls, the output signals R.req and L.ack
are asserted, transitioning to state 4 and indicating there was no SET in the datapath.
On the other hand, if Err goes high and pwc goes low, indicating the occurrence of
an SET, the FSM reaches state 5. Later, when Err returns to low, clk is set high again,
and the FSM transitions back to state 2. These last two state transitions implement
re-sampling.
• Once the FSM is in state 4, as pwc goes low, the FSM is prepared for the next
handshaking. Because of the 2-phase handshaking, the paths from state 6 to
state 1 implement the same functionality as the paths from state 1 to state 6 but with
opposite phases of the handshaking signals L.req and R.ack.
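The first handshaking phase described in the list above can be walked through with a small
table-driven Python sketch. It covers only states 1 to 5 and follows the textual description
(in particular, clk rises again when Err returns low); it is not the BM specification
itself, and the event strings and the run helper are hypothetical.

```python
# (state, input burst) -> (next state, output changes); first phase only.
TRANSITIONS = {
    (1, "L.req+ R.ack-"): (2, {"clk": 1}),
    (2, "pwc+"):          (3, {"clk": 0}),
    (3, "nErr+ pwc-"):    (4, {"R.req": 1, "L.ack": 1}),  # no SET detected
    (3, "Err+ pwc-"):     (5, {}),                        # SET detected
    (5, "Err-"):          (2, {"clk": 1}),                # re-sample
}


def run(events):
    """Apply a list of input bursts starting from state 1 (after rst+)."""
    state = 1
    outputs = {"clk": 0, "R.req": 0, "L.ack": 0}  # all outputs low after reset
    trace = [state]
    for event in events:
        state, changes = TRANSITIONS[(state, event)]
        outputs.update(changes)
        trace.append(state)
    return trace, outputs


clean_trace, clean_out = run(["L.req+ R.ack-", "pwc+", "nErr+ pwc-"])
set_trace, set_out = run(["L.req+ R.ack-", "pwc+", "Err+ pwc-", "Err-",
                          "pwc+", "nErr+ pwc-"])
print("error-free:", clean_trace, clean_out)
print("with SET:  ", set_trace, set_out)
```

In the second run the controller visits state 5 and returns to state 2 once before
completing the handshake, which corresponds to one re-sampling of the datapath.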
Figure 4.4: SERAD-Click-EDL BM specification
A timing diagram of an error-free handshaking communication followed by an error
detection and correction process is shown in Figure 4.5. Each L.req transition indicates that
there is new data to be propagated in the datapath. As there is no SET detected, denoted by
the signal nErr going high, the controller proceeds to flip the signal R.req, indicating that
the data can propagate to the output neighboring stage. At the same time, L.ack is asserted.
Finishing the handshaking, the L.req signal is received. For simplicity, signals L.ack and
R.ack are omitted from the diagram. As mentioned before, in the following cycle, an SET
occurs, denoted by signal Err going high, so the controller re-samples the datapath until no
SET is detected.
4.2.1 Hardening of the Click Controller
As previously pointed out, one of the Click constraints is that input handshake signals must
remain stable during the high phase of the clock. However, this cannot be guaranteed in
hostile environments, such as those in space applications, where prolonged exposure
to ionizing particles and radiation increases the probability of soft errors. Thus, a
fault tolerance technique must be applied to the Click template.
Figure 4.5: SERAD-Click-EDL controller timing diagram (signals: clk_i, smp_i, R.req_i,
L.req_i, nErr_i, Err_i)
Since Click uses both combinational and sequential logic, the RHBD Click must provide
protection against both SEU and SET faults. Ideally, RHBD Click should remain compliant with
the original Click features, such as compatibility with static timing analysis and
standard-cell libraries, and should not result in high area and power penalties. Therefore,
based on the RHBD techniques and the related discussions presented in Chapter 2, our proposed
RHBD Click controllers rely on spatial redundancy by triplicating combinational and sequential
logic and adding a Majority gate to outvote any SET or SEU that can arise.
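The outvoting performed by a majority gate over triplicated logic can be illustrated with a
few lines of Python. This is a generic bit-level sketch of majority voting, not the gate-level
implementation used in the controllers; the helper names are hypothetical.

```python
def majority(a: int, b: int, c: int) -> int:
    """Two-out-of-three majority of single-bit values."""
    return (a & b) | (a & c) | (b & c)


def vote_with_single_upset(value: int, upset_index: int) -> int:
    """Triplicate a bit, flip one copy (SET or SEU), and vote."""
    copies = [value, value, value]
    copies[upset_index] ^= 1
    return majority(*copies)


# Every single-copy upset is outvoted, for both logic values.
results = [vote_with_single_upset(v, i) == v for v in (0, 1) for i in range(3)]
print("all single upsets masked:", all(results))
```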
4.2.1.1 Guard Gate applied in SERAD-Click templates
The GG behaves similarly to an inverted C-element, where the output only changes when
both its inputs have the same value. When both inputs are equal, it acts as an inverter.
In addition to the inverted output, the other main difference between a GG and a C-element,
shown in Figure 4.7, is that the GG does not include a keeper connected to its output,
as illustrated in Figure 4.6. Removing the keeper reduces the area and removes an SEU fault
target. However, it also implies the gate is subject to degradation due to leakage currents,
and its output might enter a high impedance state. The output signal is also more susceptible
to noise, but this can be managed during place-and-route by keeping these signals short and
increasing the wire spacing requirements around them.
Figure 4.6: Guard Gate [76]
Figure 4.7: Müller C-element [77]
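The logical behavior of the GG described above can be captured in a short Python model. It
is an idealized sketch: the floating output is assumed to hold its previous value
indefinitely, so the leakage-driven degradation discussed in this section is not modeled.
The class name is hypothetical.

```python
class GuardGate:
    """Idealized guard gate: inverts when inputs agree, otherwise holds."""

    def __init__(self, q: int = 1):
        self.q = q  # current output value

    def update(self, a: int, b: int) -> int:
        if a == b:
            self.q = 1 - a  # equal inputs: behaves as an inverter
        # different inputs: output node floats and keeps its previous value
        return self.q


gg = GuardGate()
trace = [gg.update(a, b) for a, b in [(0, 0), (1, 0), (1, 1), (0, 1)]]
print(trace)
```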
More specifically, the GG voltage degradation can be a problem when the output voltage
reaches the lower or upper slew thresholds defined in the cell library. Usually, these points
are between 70-80% of $V_{DD}$ and 30-20% of $V_{DD}$ [78] for the upper and lower points,
respectively. However, the GG cell designed and incorporated in the proprietary 130nm
technology only reaches this critical point after dozens of microseconds, whereas the longest
SET duration reported in the literature is smaller than ten nanoseconds [79].
The SPICE simulation showing the degradation occurring when the GG inputs are different
is presented in Figure 4.8. The simulation time is 2.5 milliseconds, and $V_{DD}$ is 1.2V. The
simulation first shows the output degrading from a strong zero and later from a strong one.
The measurements were taken from the first degradation using the lower slew threshold of
20% of $V_{DD}$ (240mV) and indicate that our GG can hold the output state for at least 88.44us.
Figure 4.8: Spice simulations showing GG output degradation (signals A, B and Q in V versus
time from 0 to 2.5ms; markers at 912.0us and 1.00044ms, 88.44us apart)
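The figures quoted above can be cross-checked with a few lines of arithmetic. The Python
sketch below uses the two marker times read from Figure 4.8 and takes 10ns as an upper
bound for the SET duration, since the text only states that the longest reported SET is
smaller than that; the variable names are hypothetical.

```python
VDD = 1.2                        # supply voltage in V
threshold_v = 0.20 * VDD         # lower slew threshold, 20% of VDD

t_inputs_differ = 912.0e-6       # s, marker where the GG inputs start to differ
t_threshold_hit = 1.00044e-3     # s, marker where Q reaches the threshold
hold_time_s = t_threshold_hit - t_inputs_differ

longest_set_s = 10e-9            # s, upper bound on reported SET duration [79]
margin = hold_time_s / longest_set_s

print(f"threshold = {threshold_v * 1e3:.0f} mV")
print(f"hold time = {hold_time_s * 1e6:.2f} us")
print(f"hold time is at least {margin:.0f} times the longest SET")
```

The hold time therefore exceeds the longest reported SET duration by more than three orders
of magnitude.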
4.2.1.2 RHBD Click Approach
The generalized RHBD approach for Click based controllers is illustrated in Figure 4.10,
following the same generic Click description presented in Figure 2.3.
Figure 4.9: Click controller (signals: rst, clk, L.req, L.ack, R.req, R.ack; delay lines $t_{ctrl}$ and $t_{comb}$)
For complex Click implementations that require more than one FF, the same RH strategy can be used, except that one pair of cross-coupled FFs should be used to replace each FF in the original design. More specifically, the steps for achieving a generic RHBD method for Click-based controllers are listed below:
1. Duplicate the combinational logic and delay lines
2. Duplicate sequential elements (FFs)
3. Cross-couple equivalent FFs outputs with GGs
4. Cross-couple equivalent FFs clock inputs with GGs
5. Add GGs for “voting” equivalent control outputs
6. Set the delay of the delay lines accounting for extra GGs
The combinational block duplication also includes the duplication of the external delay line $t_{comb}$. It results in two redundant communication channels that run side by side from one Click controller to another. This prevents SETs in the handshake channels from affecting the controller behavior. After duplication, pairs of cross-coupled GGs are inserted for each FF enable signal and each FF output signal. In this arrangement, the combinational and sequential parts of the controllers are isolated from each other, in the sense that whenever these two parts interact, both duplicated blocks must agree on the resulting value.
Figure 4.10: Generalized RH-Click controller diagram (duplicated controller combinational logic C1 and C2, flip-flops FF1 and FF2, cross-coupling GGs, delay lines $t_{comb}$ and $t_{ctrl}$, and duplicated L.req, L.ack, R.req, R.ack channels)
For instance, combinational blocks C1 and C2 will resolve individually when to enable their FF. The GGs will only propagate the enable signal to the corresponding FFs once both blocks present the same value. The same will occur when the values stored in the FFs must affect the state of the combinational blocks or update the FF values. The remaining control outputs are also coupled with GGs (i.e., clk), and in the last step the delay lines ($t_{comb}$ and $t_{ctrl}$) are designed to account for the propagation delays of the controller, including the GGs, thus attenuating the performance impact of the proposed solution.
It is important to address how the RHBD Click recovers from SEUs. While GGs prevent SETs from being propagated to sensitive parts of the controller, they do not prevent an upset in the state of an FF. In this case, there is the chance that at some moment the FFs will be storing different values. For instance, assume an initial state of (0,0) for the FFs of Figure 4.10. When a request arrives and the FFs are enabled, the new state will be (1,1) since the GG acts as an inverter, and the expected state for the next request is (0,0). However, before that, an upset occurs, and the state values become (0,1). Now the cross-coupled GGs have different inputs, but the output is still holding the last valid state, which is (0,0). Unless a new state update takes more time than the GG degradation, when the request arrives, the controller will self-correct the FFs to the next valid state.
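The recovery scenario above can be traced with a small behavioral model. This is a sketch under simplifying assumptions, not the circuit of Figure 4.10: the pair of cross-coupled GGs is collapsed into a single `GuardGate` object, output degradation over time is ignored, and the names `GuardGate` and `clock_ffs` are hypothetical.

```python
from typing import Tuple


class GuardGate:
    """Behavioral guard gate: inverts when inputs agree, holds otherwise."""

    def __init__(self, initial_output: int) -> None:
        self.output = initial_output

    def update(self, a: int, b: int) -> int:
        if a == b:
            self.output = 1 - a   # inputs agree: act as an inverter
        return self.output        # inputs differ: keep the last value


def clock_ffs(gg: GuardGate, ffs: Tuple[int, int]) -> Tuple[int, int]:
    """On a request, both FFs load the value currently driven by the GG."""
    voted = gg.update(*ffs)
    return (voted, voted)


gg = GuardGate(initial_output=1)      # FFs reset to (0,0), so the GG drives 1
state = (0, 0)
state = clock_ffs(gg, state)          # first request
after_first_request = state           # (1, 1)
held = gg.update(*state)              # GG now drives 0, the next valid value

state = (0, 1)                        # SEU flips the first FF
held_after_upset = gg.update(*state)  # inputs differ: output is held at 0
state = clock_ffs(gg, state)          # next request reloads both FFs
print(after_first_request, held, held_after_upset, state)
```

In this model the upset FF is overwritten on the next request because the held GG output, not the corrupted FF value, feeds both FFs.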
4.2.2 SERAD-Click Controller Design
The design of Click-based controllers is presented in this section, one for the TMR and the other for the EDL [7] architectures. The controllers share the same handshake protocol and have a similar interface. The main difference is that the TMR version does not interact with the datapath’s error detection mechanism, so it implements a standard two-phase handshaking protocol. However, the different design choices, target applications, and time constraints of each architecture result in two different controller designs.
4.2.2.1 SERAD-Click-EDL Controller
SERAD’s Click EDL implementation proposed in this paper is presented in Figure 4.11. The template performs the following sequence of actions:
Figure 4.11: SERAD-Click-EDL controller implementation (signals: rst, nErr, Err, clk, L.req, L.ack, R.req, R.ack; gates G1 to G4; flip-flops HS-FF and Lack-FF; delay line $t_{ctrl}$)
• Initially, the outputs of all gates and FFs are set to low by the reset signal.
• When L.req+ is received, indicating new data is available, the AND gate G1 transitions to 1 and the handshaking AND-OR gates (namely G1, G2, and G3) resolve to 1, generating a rising transition on the HS-FF’s clock input and the controller’s clk output signal via the OR gate G4.
• Since the data pin of the HS-FF is tied to the FF’s inverted output, the $t_{ctrl}$ delay line makes the AND gate G1 transition to 0, creating a falling transition on the HS-FF’s clock input and effectively implementing the pwc.
• If nErr+ is received, indicating that there was no SET detected, a rising transition on the Lack-FF’s clock input is generated. This asserts R.req to the neighboring output controller and L.ack to the neighboring input controller, signaling that the latched data in the datapath is error-free and that the controller is ready to receive new data.
• If the EDL signal Err+ is received, indicating the occurrence of an SET, the OR gate G4 output rises, raising clk and beginning the re-sampling action on the datapath. However, because the EDL is designed such that Err falls soon after clk rises, the re-sampling clk pulse can be shorter than the $t_{ctrl}$ delay. To solve this problem, a special asymmetric delay line is added to the combinational path between the EDL Err output and the controller. This delay line is similar to the one described in [7] but with input and output inverted to provide a fast rising transition and a slow, controlled falling transition, enabling a uniform clk pulse.
The behavior of the next cycle is similar. The AND gate G2 transitions to 1, accommodating the inverted phases of the handshake signals. This behavior repeats indefinitely until a subsequent assertion of rst restarts the system.
Note that in SERAD-Click-EDL, the pwc signal in the BM diagram illustrated in Figure 4.4 is effectively implemented using two different delay lines. For cycles where no SET was detected, the pwc signal is implemented by the $t_{ctrl}$ symmetric delay line. On the other hand, when an SET is detected, pwc is provided by the combinational path from the Err signal and the special asymmetric delay line shown in Figure 4.3. Moreover, the sample signal to the Q-Flop within the SERAD EDL is generated by simply delaying the clock (see [7]). Despite differences with SERAD-Click-TMR, the RHBD version of the SERAD-Click-EDL controller is obtained by applying the same RHBD methodology, as described in Section 4.2.1.
4.2.3 Design Flow
A custom design flow using a proprietary 130nm technology was used to synthesize the
different controllers and conduct fault simulations. The design flow consists of commercial
EDA tools that are integrated to generate the asynchronous netlist. Design Compiler from
Synopsys is used for synthesis, and the Incisive Functional Safety Simulator (IFSS) from
Cadence is used for the fault simulations.
A set of TCL and Shell scripts is used to synthesize a synchronous FEDC [80] circuit and replace the global clock with AFEDC [81] and SERAD controllers. In the first step, a synchronous RTL description of the FEDC is synthesized for a given clock period. In the next step, the synchronous netlist generated in the previous step is converted to an asynchronous bundled-data netlist, where the global clock signal is replaced by asynchronous controllers, one for each pipeline stage, and the required delay lines are inserted.
The conversion flow is currently limited to the specific pipelined test cases. It is thus not capable of converting arbitrary synchronous RTL descriptions, and does not support the variety of error detecting schemes associated with AFEDC and SERAD. This prevents fair comparisons regarding power and performance of the overall RHBD schemes but does not affect the comparison of controller area and, most importantly, the robustness of the proposed RHBD Click controllers to transient faults. The design flow can also generate a TMR version of both controller variants as a baseline for comparison.
4.2.4 SERAD-Click-EDL Controller Evaluation
SERAD-Click-EDL was compared against the AFEDC Click template [82]. The controllers were verified and evaluated using a linear 5-stage pipeline version of the 32-bit non-restoring array divider presented in [83]. This case study represents a realistic scenario for extracting area results and exploring the robustness to SETs and SEUs of each controller. All subsequent analyses are focused exclusively on the controllers and do not take into account the differences in datapath and error detection logic. Consequently, our comparisons focus on the proposed RHBD method for Click-based controllers and not on comparisons between SERAD and AFEDC.
4.2.4.1 SERAD-Click-EDL Area Evaluation
Table 4.1 shows the area results for the five controllers used in the non-restoring divider, one for each pipeline stage. This area focuses on the logic needed in the controller and thus does not include the area of the external and internal delay lines. As described in Section 4.2.3, the design flow cannot optimize the delay lines for arbitrary controllers, and thus, for this test, all delay optimizations were disabled.
The total area shown in Table 4.1 accounts for the combinational logic and sequential logic of five controllers. The GG area is included in the combinational logic, each cell consuming 9.450 µm².
We note that, as expected, the area more than doubles when comparing the unprotected version with the radiation-hardened (RH) design. However, the proposed solutions show significantly less area overhead than the equivalent TMR implementation, SERAD-Click-EDL being 45% smaller than SERAD-Click-TMR and AFEDC-RH 38% smaller than AFEDC-TMR.
Table 4.1: Comparison of Click controllers in terms of area
Controller          Comb. (µm²)  Seq. (µm²)  Total (µm²)  Incr. (%)
SERAD-Unprotected         79.38       64.26       143.64          -
SERAD-TMR                495.18      192.78       687.96       379%
SERAD-RH                 253.26      128.52       381.78       166%
AFEDC-Unprotected        119.07       64.26       183.33          -
AFEDC-TMR                614.25      192.78       807.03       340%
AFEDC-RH                 370.44      128.52       498.96       172%
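The percentages in Table 4.1 and the TMR comparison quoted in the text can be reproduced from the total areas alone. The sketch below is a plain recomputation; the helper names `increase_pct` and `saving_pct` are hypothetical, and the increase is assumed to be measured relative to the unprotected controller of the same family.

```python
# Total controller areas in square micrometres, copied from Table 4.1.
AREA_UM2 = {
    "SERAD": {"Unprotected": 143.64, "TMR": 687.96, "RH": 381.78},
    "AFEDC": {"Unprotected": 183.33, "TMR": 807.03, "RH": 498.96},
}


def increase_pct(area: float, baseline: float) -> float:
    """Area increase of a protected controller over the unprotected one."""
    return 100.0 * (area - baseline) / baseline


def saving_pct(rh_area: float, tmr_area: float) -> float:
    """How much smaller the RH controller is than its TMR counterpart."""
    return 100.0 * (tmr_area - rh_area) / tmr_area


for family, areas in AREA_UM2.items():
    base = areas["Unprotected"]
    print(
        f"{family}: TMR +{increase_pct(areas['TMR'], base):.0f}%, "
        f"RH +{increase_pct(areas['RH'], base):.0f}%, "
        f"RH vs TMR -{saving_pct(areas['RH'], areas['TMR']):.0f}%"
    )
```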
4.2.4.2 Fault Verification
The fault experiments were executed with the IFSS tool from Cadence that is integrated into the design flow. The IFSS automatically creates a fault list of all possible SETs and SEUs for the specified control modules, excluding the remaining modules, such as the datapath detection mechanism. Then a random sequence of faults for injection is generated. Next, a fault-free simulation is executed to create the gold patterns for comparison. Once this is complete, the fault simulation is performed. Based on functional strobes and checker strobes that must be specified for each controller, the tool provides a final fault report. For the SET duration, a synthetic log-normal distribution ranging from 100ps to 2000ps, with 100ps steps, was used. This range takes into account that, despite being rare, long SETs can occur depending on the target technology [79, 84], yet these SETs are well below the duration limits that the GGs can handle.
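One way to picture the SET duration model is a log-normal draw snapped to the 100ps grid and limited to the stated range. The text does not report the distribution parameters or how out-of-range samples were handled, so `MU`, `SIGMA`, and the clamping step below are assumptions made purely for illustration; this is not the IFSS configuration.

```python
import math
import random

STEP_PS = 100
MIN_PS, MAX_PS = 100, 2000
# Hypothetical shape parameters: the text does not report mu or sigma.
MU = math.log(300.0)
SIGMA = 0.6
GG_HOLD_PS = 88.44e6   # GG hold time of 88.44 us from Figure 4.8, in ps


def sample_set_duration_ps(rng: random.Random) -> int:
    """Draw one SET duration on the 100 ps grid, clamped to [100, 2000] ps."""
    raw = rng.lognormvariate(MU, SIGMA)
    snapped = round(raw / STEP_PS) * STEP_PS
    return min(MAX_PS, max(MIN_PS, snapped))


rng = random.Random(2022)
durations = [sample_set_duration_ps(rng) for _ in range(1000)]
print(min(durations), max(durations), GG_HOLD_PS / max(durations))
```

Whatever parameters are chosen, the longest sample is bounded by 2000ps, which stays far below the GG hold time measured earlier.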
The evaluation demonstrates the flexibility and effectiveness of the proposed RHBD
method and quantifies the area overhead.
4.3 Radiation Hardened By Design Mutex
The mutual exclusion element (mutex) [85] is the core component of arbiters in asynchronous circuits and multi-clock domain digital systems [86]. A mutex helps control the mutually exclusive access to a single resource from two or more independent requestors. If contention between requests occurs, metastability can occur within the mutex as it decides which request to grant. Before the mutex grants access, any metastability must have resolved, which can take unbounded time. This unique behavior makes this circuit unsuitable for a triple modular redundant (TMR) solution: after arbitration, two out of the three modules will agree, but an SET in any mutex node can change the decision agreement, momentarily or permanently, inducing an SEU. This change in decision does not happen in a typical TMR solution, in which the module suffering an SET or SEU would always be out-voted. In the context of arbitration, this change in decision could lead to a system deadlock.
Naqvi et al. [87] proposed an RHBD arbiter cell that uses the tree arbiter cell (TAC) of Yakovlev et al. [88]. The RHBD arbiter applies modular redundancy to the mutex’s surrounding handshaking circuitry, not to the mutex cell itself. Hence, the mutex is still susceptible to SET and SEU events. The RHBD arbiter’s grant-request generator circuit has two variants. The first variant tolerates SETs at the handshaking C-elements, but is susceptible to SEUs immediately after an arbitration event. The second variant implements multiple handshaking C-elements in a tree topology, incurring significant area overhead, and may still be susceptible to deadlock under certain conditions.
Jang et al. proposed a mutex [89] based on a dual interlocked cell (DICE) [90] with modular redundancy that incurs a significant area penalty. Set-Reset (SR) latches [50] are connected in a daisy-chain configuration, generating a double modular mutex arrangement. Any SET event will restore automatically to the correct previous state via the self-correcting property of DICE.
In contrast, the proposed mutex circuit does not rely on modular redundancy. Rather, it incorporates noise-tolerant features of Schmitt-Trigger circuits, guard gates, and SR latch behavior control. The noise rejection helps prevent glitch propagation, the SR latch control prevents SEU events, and the guard gates prevent any SET (glitch) propagation. The proposed circuit is SET and SEU tolerant. Compared to the baseline mutex, it has low area and performance overhead. On the other hand, compared to the state-of-the-art DICE mutex, it provides better area and performance with similar SET and SEU tolerance.
4.3.1 Related work
This section reviews and analyzes the baseline [85] and DICE [89] mutexes. Naqvi’s work [87] is excluded from the detailed analysis because its underlying mutex cell is not RHBD.
4.3.1.1 Baseline mutual exclusion element
The 2-input mutual exclusion element (baseline mutex), illustrated in Figure 2.15.b, consists of an SR latch and a metastability filter [3]. The filter ensures that metastability in the SR latch does not propagate to downstream logic. The mutex controls access to a single resource from two independent requestors. If the time between requests ($R_0$ and $R_1$) is large, the mutex’s decision is straightforward, granting access to the earlier request. However, when the time between requests is shorter than the time needed for the SR latch stabilization, metastability may occur at the output of the SR latch, as it decides which request should be granted, but cannot propagate through the filter. The critical nodes of a mutex are along the paths $R_0$, $X_0$, $G_0$ and $R_1$, $X_1$, $G_1$, as they can impair the functioning of the baseline mutex if an SET event occurs. During a request-grant operation, the baseline mutex can exhibit two modes of operation. One is deterministic operation, comprising transition paths starting from state "a" to "b" or "c", then going back to "a". The second is the arbitration mode, comprising transitions starting from state "a" to "d.1" or "d.2", then returning to "a". These two modes are repeatedly executed, one or the other, returning to the idle state "a" after each request-grant operation.
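The deterministic mode of the baseline mutex can be illustrated with a zero-delay, discrete model of a cross-coupled NAND latch followed by an idealized filter. This is a sketch under assumptions: the filter is reduced to "grant when the own latch node is low and the other is high", the class name `BaselineMutex` is hypothetical, and metastability is not modeled, so simultaneous requests are resolved by evaluation order rather than by analog resolution.

```python
from typing import Tuple


def nand(a: int, b: int) -> int:
    return 1 - (a & b)


class BaselineMutex:
    """Discrete sketch: NAND SR latch (X0, X1) plus an idealized filter."""

    def __init__(self) -> None:
        self.x0 = 1   # idle state: both latch nodes high, no grant
        self.x1 = 1

    def step(self, r0: int, r1: int) -> Tuple[int, int]:
        # Re-evaluate the cross-coupled NANDs until the latch settles.
        for _ in range(8):
            new_x0 = nand(r0, self.x1)
            new_x1 = nand(r1, new_x0)
            if (new_x0, new_x1) == (self.x0, self.x1):
                break
            self.x0, self.x1 = new_x0, new_x1
        # Idealized filter: a grant needs one node low and the other high.
        g0 = int(self.x0 == 0 and self.x1 == 1)
        g1 = int(self.x1 == 0 and self.x0 == 1)
        return g0, g1


mutex = BaselineMutex()
requests = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
grants = [mutex.step(r0, r1) for r0, r1 in requests]
print(grants)
```

In this trace the earlier request keeps its grant while the later one waits, and the pending request is granted only after the first one is withdrawn.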
4.3.1.2 State-of-the-Art Mutex design
A DICE-based mutex is presented in patent [89]. The core of the design, four SR latches arranged in a ring as illustrated in Figure 4.12, implements a form of modular redundancy. The input and output ports are replicated, creating two parallel requests and grants for each logical input/output. Thus, the equivalent baseline mutex inputs $R_0$, $R_1$ become $R_{0a}$, $R_{0b}$, $R_{1a}$, $R_{1b}$ and the output nodes from the SR latch ring structure are now $X_{0a}$, $X_{0b}$, $X_{1a}$, $X_{1b}$. Because of the redundant SR latch arrangement and increased capacitance at intermediate nodes, these nodes are considered SET tolerant compared to the baseline mutex. An $R_0$ ($R_1$) request takes the form of both $R_{0a}$ and $R_{0b}$ ($R_{1a}$ and $R_{1b}$) rising. To grant $R_0$ ($R_1$), outputs $X_{0a}$ and $X_{0b}$ ($X_{1a}$ and $X_{1b}$) need to go low exclusively. In the SR ring, the NAND gates labelled "n" have three instead of the normal two inputs and each is paired with a 2-input NAND gate labelled "p". Both drive a common node. This configuration avoids contention between pairs of latches, ensuring their output grants agree. In the case of an SET event, the arrangement allows the SET to propagate at most one downstream logic stage, but the modular redundancy prevents this propagation from altering the mutex output. The metastability filter arrangement consists of four structures, one per output. Each structure includes two 2-input voltage controlled inverters that, in turn, drive a 2-input C-element that provides the mutex’s output nodes $G_{0a}$, $G_{0b}$, $G_{1a}$, $G_{1b}$. The metastability filter structures are not depicted in Figure 4.12. One concern with this design is that the non-hardened C-elements used at the outputs are susceptible to SETs that can generate an SEU when their inputs differ. This can happen if there is some arbitration delay imbalance between cross-coupled mutexes. Thus, the DICE mutex critical nodes are found at the outputs, namely nodes $G_{0a}$, $G_{0b}$, $G_{1a}$, $G_{1b}$.
Figure 4.12: Arrangement of SR latches
4.3.2 Novel RHBD Mutex Cell
The proposed RHBD mutex is shown in Figure 4.13. It is a modified baseline mutex at its core with surrounding circuits that provide RHBD. In particular, the SR latch implements special 3-input NAND gates, and the metastability filter is given Schmitt Trigger capabilities. We added a feedback path from each metastability filter output to the SR latch inputs. The purpose of these feedback structures is to provide SEU tolerance. In addition, a guard gate [49] is included after the output of each metastability filter, to avoid SET propagation to downstream logic. Because of the RHBD strategies applied, there are no nodes susceptible to SET or SEU events.
Figure 4.13: Proposed RHBD mutex
4.3.2.1 3-input NAND gate implementing Schmitt Trigger circuitry
To take advantage of input noise rejection, a special 3-input NAND gate that includes Schmitt Trigger circuitry was developed. Noise margin at a node can be measured in terms of the node’s critical charge, defined as the node capacitance times $V_{dd}/2$. Many previous works [52] assumed that an SET amplitude over $V_{dd}/2$ crosses the noise margin of the subsequent gate. By incorporating ST circuitry at the output node of the 3-input NAND gate, the noise margin can exceed $V_{dd}/2$, and SET robustness is thus improved. As illustrated in Figure 4.14, the PMOS network of the NAND gate consists of three inverting Schmitt Trigger PMOS networks, one for each input A, B and C. These include transistors $MP_0$ to $MP_8$. The NMOS network, on the other hand, uses only one inverting Schmitt Trigger NMOS network. The key observation is that the single NMOS network can be controlled by both inputs A and B (transistors $MN_0$, $MN_1$ and $MN_3$). Inputs A and B require higher noise rejection. For example, the $R_0$-$X_0$ 3-input NAND receives $R_0$ on input A, $X_1$ on input B and $rG_1$ on input C. A and B receive critical nodes of the standard mutex. In this case, they are susceptible only to SET events that reach the node’s threshold switching voltage. Input C, on the other hand, is located at the bottom of the NMOS stack ($MN_2$), receiving the inverted feedback path from $G_1$.
Figure 4.14: Special 3-input NAND implementing Schmitt Trigger circuitry
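The benefit of a switching threshold above $V_{dd}/2$ can be illustrated with a behavioral comparison between a plain inverter and an inverting Schmitt Trigger. The sketch below is not a model of the gate in Figure 4.14: the threshold values `V_HIGH_TH` and `V_LOW_TH` are hypothetical, and the glitch is an idealized sampled waveform.

```python
from typing import List

V_DD = 1.0
# Hypothetical hysteresis thresholds: the text does not give their values.
V_HIGH_TH = 0.7 * V_DD
V_LOW_TH = 0.3 * V_DD


def plain_inverter(samples: List[float]) -> List[int]:
    """Memoryless inverter that switches at V_DD / 2."""
    return [0 if v > V_DD / 2 else 1 for v in samples]


def schmitt_inverter(samples: List[float], initial_output: int = 1) -> List[int]:
    """Inverter with hysteresis: the switching threshold depends on the state."""
    out = initial_output
    result = []
    for v in samples:
        if out == 1 and v > V_HIGH_TH:
            out = 0
        elif out == 0 and v < V_LOW_TH:
            out = 1
        result.append(out)
    return result


# A 0.6 V glitch on a low input exceeds V_DD / 2 but not the upper threshold.
glitch = [0.05, 0.05, 0.60, 0.60, 0.05, 0.05]
real_transition = [0.05, 0.95, 0.95]
print(plain_inverter(glitch))
print(schmitt_inverter(glitch))
print(schmitt_inverter(real_transition))
```

The plain inverter propagates the glitch, while the hysteresis model ignores it and still responds to a full-swing transition.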
4.3.2.2 Modified SR latch with feedback path control
The modified SR latch consists of two cross-coupled 3-input NAND gates implementing Schmitt Trigger (ST) circuitry to increase the noise margin at its inputs. In particular, the third input of each NAND ST is driven by the corresponding feedback output of the inverter Schmitt Trigger of the metastability filter. For example, in the case of the $R_0$-$X_0$ 3-input NAND ST gate, the connected inverter Schmitt Trigger originates at node $G_1$. These connections reinforce the NAND gate whose output is low such that an SET on any of the NAND-gate inputs has no effect, preventing any SEU event. This topology provides the desired SR latch functionality with tailored noise rejection, SEU tolerance, and low area overhead.
4.3.2.3 Guard Gate implementation
Guard gates (GG) were proposed in [49] and compared to 2-input inverting C-elements that have the same basic functionality in [13]. In particular, when both GG inputs are equal, the gate just acts as an inverter. But, unlike a C-element, a GG does not implement a back-to-back inverter (keeper) structure at the output. Instead, it relies on output capacitance to hold state and provide the desired level of SET tolerance. This capacitance is achieved by means of transistor sizing or the addition of an explicit capacitor. The GGs synchronize the request and grant signals after a request-grant operation. More precisely, $R_0$ ($R_1$) and $G_0$ ($G_1$) are fed into one GG to provide a latched and inverted $G_0$ ($G_1$) output signal. We add a second symmetric GG to drive the inverted $G_1$ ($G_0$) output. The delay between the arrival of the inputs at the GG provides temporal masking. Moreover, to increase the temporal masking effect, the $R_0$ ($R_1$) input path down to the GG may include buffers. Careful place and route and layout considerations are required to optimize this path delay.
4.3.3 Experiments and Results
Because our intent is to compare and contrast different mutex templates at the transistor level using SPICE simulations, a custom design flow using the PTM-MP 20nm HSPICE library models [91], based on BSIM-CMG modeling [92], was developed to synthesize the baseline, DICE, and proposed mutexes. A common test setup with homogeneous transistor sizing of L=30nm, W=30nm for NMOS and L=30nm, W=60nm for PMOS transistors was used on all designs. Additionally, $V_{dd}$ = 1V and the input $R_0$ ($R_1$) pulse amplitude was fixed to 0.95V at high and 0.05V at low. Period duration was variable, depending on the experiment. Rise and fall times were set to 10% of the period duration.
The methodology presented in flowcharts in [93] and the results presented in [94] were used to develop independent SET current sources for simulation [41] to imitate the strike effect on nodes and the subsequent voltage transient. Additionally, following the methodology described in [94], each SET current model was developed following proper boundary conditions to avoid over-voltage at the output of the CMOS gate. Experiments using a chain of two inverters driving FO1 and FO5 circuits served as a calibration setup to match the results presented in [94]. In Figure 4.15, current shapes and transient voltages on the affected node for LETs with values 2, 4 and 7 MeV-cm²/mg are shown. We consider only SET events that propagate voltage disturbances into the next downstream logic stage (where the voltage disturbance is greater than $V_{dd}/2$). All of the previously mentioned LETs generate SET events on a node driving an FO1 fan-out structure. Only the SET generated by 7 MeV-cm²/mg was able to generate an SET on a node that drives an FO5 fan-out structure. For this reason, this LET value is selected for further experiments. The testbench uses two independent voltage sources, each implementing a 250 picosecond (ps) period square wave and driving a 2-stage inverter chain that feeds the inputs of the mutex under test. At the output of the mutex, FO5 fan-out circuits were connected to test different output loads. The independent current source (Is) SET for 7 MeV-cm²/mg was selected to test SET mitigation. Simulation results of this experiment are shown in Table 4.2.
Figure 4.15: Iset current model for LET values 2, 4 and 7 MeV-cm²/mg and voltage upset at the intermediate node. Inset: Design under test.
All nodes on the baseline and DICE’s output nodes are susceptible to SETs and propagate the voltage disturbance. The opposite situation occurs on the proposed mutex. Additionally, the baseline is susceptible to SEU if an SET occurs on the $R_0$ ($R_1$) input or $X_0$ ($X_1$) nodes. As an example, Figure 4.16 shows waveforms of SETs at the output nodes. For each mutex, voltage waveforms are superimposed. The continuous line represents the voltage at the mutex output node and the dashed line the voltage at one subsequent FO5 node. Note that the SET is clearly shown for the baseline and DICE mutexes. For the proposed mutex, the voltage disturbance was minimal and not propagated downstream.
Table 4.2: Single 7 MeV-cm²/mg LET event in nodes
Node           Baseline [85]  DICE [89]  Proposed
$R_0$/$R_1$    SET            No *       No
$X_0$/$X_1$    SET            No **      No
$G_0$/$G_1$    SET            SET ***    No ****
Note: For the DICE mutex:
* Input nodes are $R_{0a}$, $R_{0b}$, $R_{1a}$, $R_{1b}$.
** Intermediate nodes are $X_{0a}$, $X_{0b}$, $X_{1a}$, $X_{1b}$.
*** Output nodes are $G_{0a}$, $G_{0b}$, $G_{1a}$, $G_{1b}$.
For the proposed mutex, **** output nodes are $G_0$, $G_1$.
To test the proposed mutex against SEU, an SET event on node $X_0$ was generated after arbitration. For this specific case, the simulation resolved arbitration granting request $R_0$. The SET is mitigated at the input C of the $R_1$-$X_1$ NAND gate. Although the controlling feedback path is slower by one logic stage, the SET at node $X_0$ was effectively contained and not propagated. Voltage node waveforms on $R_0$, $X_0$, $G_0$ and $rG_1$ (feedback $G_0$ to input C at the $R_1$-$X_1$ NAND) are shown in Figure 4.17. Similar experiments were conducted on the baseline and DICE mutexes, showing that for the baseline the SET at node $X_0$ is propagated downstream to node $X_1$. Note that $X_0$ and $X_1$ node voltages disturbed beyond $V_{dd}/2$ may lead to an SEU. On the other hand, DICE suffered a voltage perturbation at $X_{0a}$ but the subsequent $X_{1a}$ node did not reach $V_{dd}/2$, thus no SET propagation occurred.
To further test SEU tolerance, an SET pulse with a 100ps duration was applied (as 100ps is a typical SET duration for advanced technologies [95]). The longer Iset was configured using a burst of consecutive 7 MeV-cm²/mg LETs and the source ($R_0$) period was increased to 350ps. Figure 4.18 shows that the transition signal is delayed by the SET duration, and the SET was not propagated, yielding minimal voltage disturbance. A brief summary of our experimental results can be found in Table 4.3, which includes measurements of the latency from request to rising grant signal, area, and SET and SEU tolerance at different mutex nodes. The baseline design exhibits the shortest latency, due to low node capacitance, but also the lowest tolerance to SETs. The proposed circuit required 1.42 times longer to complete a request-to-grant event. This was expected and is in fact much shorter than the DICE mutex, which required more than twice the baseline latency (2.28 times).
Figure 4.16: Waveforms for Iset 7 MeV-cm²/mg and SET events at outputs and at the subsequent logic stage
These results show the tradeoff between node capacitance and SET resiliency, as the increased capacitance improves SET resiliency but slows down the circuit. Transistor area was found to be 8.7 times greater for DICE than for the baseline mutex. Thus, SEU tolerance can be provided by means of modular redundancy, with a high area penalty. In contrast, the area of the proposed circuit is 0.58 times the area of the DICE mutex, demonstrating that RHBD robustness can be achieved by focusing on the vulnerable circuit conditions. Moreover, the DICE mutex has an Achilles’ heel on its outputs: the decision to drive the outputs using C-elements makes it vulnerable to SET and SEU events, which can lead to system deadlock. Experimental results on the proposed mutex demonstrated that SETs are filtered at inputs and outputs and no SEU was generated even for the typical SET event duration. Additionally, avoiding an SET on each node in the proposed mutex can mitigate multiple strikes of the same intensity that in turn can generate multiple SET events on the baseline implementation. Finally, although the power was not measured, it can be estimated to be proportional to transistor area.
Figure 4.17: Waveforms for Iset 7 MeV-cm²/mg at proposed mutex’s node $X_0$ after arbitration
Figure 4.18: Proposed mutex waveforms for Iset of duration 100ps at node $X_0$ after arbitration
Table 4.3: Comparison to baseline and state-of-the-art circuits
Mutex template                   Baseline [85]  DICE [89]           Proposed
Request-to-grant latency (ps)    25.6           58.5 (2.28x Std.)   36.5 (1.42x Std.)
Transistor area (µm²)            16.2           140.4 (8.7x Std.)   82.8 (5.1x Std.)
SET tolerant at internal nodes   No             Yes                 Yes
SET tolerant at outputs          No             No                  Yes
SR latch SEU tolerant            No             Yes                 Yes
SEU tolerant at outputs          No *           No                  Yes
Note: * Does not apply
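The overhead factors quoted in Table 4.3 and in the surrounding text follow directly from the latency and area rows. The sketch below recomputes them from the table values only; the helper name `ratio_to` is hypothetical, and the comparison uses a tolerance because the reported factors are given to two or three significant figures.

```python
from typing import Dict

# Values copied from Table 4.3.
LATENCY_PS = {"Baseline": 25.6, "DICE": 58.5, "Proposed": 36.5}
TRANSISTOR_AREA_UM2 = {"Baseline": 16.2, "DICE": 140.4, "Proposed": 82.8}


def ratio_to(reference: str, table: Dict[str, float]) -> Dict[str, float]:
    """Express every entry of a table row as a multiple of the reference."""
    return {name: value / table[reference] for name, value in table.items()}


latency_vs_baseline = ratio_to("Baseline", LATENCY_PS)
area_vs_baseline = ratio_to("Baseline", TRANSISTOR_AREA_UM2)
area_vs_dice = ratio_to("DICE", TRANSISTOR_AREA_UM2)

for name in ("DICE", "Proposed"):
    print(
        f"{name}: latency {latency_vs_baseline[name]:.3f}x, "
        f"area {area_vs_baseline[name]:.3f}x baseline, "
        f"{area_vs_dice[name]:.3f}x DICE"
    )
```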
Chapter 5
RHBD Asynchronous Flows
This chapter explains in detail the framework that implements RHBD templates from Chisel specifications. The explanation starts by introducing the dataflow implementation, then the foundational BD flow [96], and finally the RHBD flows, which are special modifications of the BD flow. The BD flow is based on the general Chisel flow, presented in [56] and [2].
The architecture style used by asynchronous flows is the Dataflow implementation, which reflects the dataflow itself, constructing circuits based on pipelines, forks, and rings. This is accomplished by connecting modules and channels like nodes and edges in a directed graph. Additionally, the directed graph can have cycles, but each cycle iteration is broken by the Click controller itself. A dataflow circuit is a hardware implementation of a Dataflow network where each instruction execution is determined by the availability of input arguments. Control signals and data encode the state of the circuit. The control flows along with the data, in fine-grained steps of execution, as described in [97].
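The firing rule described above, where a node executes only when all of its input arguments are available, can be sketched in software as a graph of FIFO channels and nodes. This is an illustrative abstraction, not the Chisel BD package: the `Channel`, `Node`, and `run` names are hypothetical, and handshaking, delay lines, and channel capacity are not modeled.

```python
from collections import deque
from typing import Callable, Deque, List


class Channel:
    """FIFO edge of the dataflow graph carrying tokens between nodes."""

    def __init__(self) -> None:
        self.tokens: Deque[int] = deque()


class Node:
    """Graph node that fires only when every input channel holds a token."""

    def __init__(self, inputs: List[Channel], output: Channel,
                 op: Callable[..., int]) -> None:
        self.inputs = inputs
        self.output = output
        self.op = op

    def can_fire(self) -> bool:
        return all(ch.tokens for ch in self.inputs)

    def fire(self) -> None:
        args = [ch.tokens.popleft() for ch in self.inputs]
        self.output.tokens.append(self.op(*args))


def run(nodes: List[Node]) -> None:
    """Fire enabled nodes until no node has all of its inputs available."""
    progressed = True
    while progressed:
        progressed = False
        for node in nodes:
            if node.can_fire():
                node.fire()
                progressed = True


a, b, c, s, out = (Channel() for _ in range(5))
adder = Node([a, b], s, lambda x, y: x + y)
multiplier = Node([s, c], out, lambda x, y: x * y)

a.tokens.append(2)
b.tokens.append(3)
run([adder, multiplier])
stalled_output = list(out.tokens)   # multiplier waits: channel c is empty

c.tokens.append(4)
run([adder, multiplier])
final_output = list(out.tokens)
print(stalled_output, final_output)
```

No central scheduler decides when the multiplier runs; it simply stalls until its second operand arrives, which is the contrast with a centralized FSM.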
Dataflow circuits have been successfully implemented in high speed specialized hardware required in digital signal processing, network routing, graphics processing and telemetry [98].
A primary motivation for using dataflow networks as a hardware representation is that throughput and latency analysis and slack matching (balancing of FIFO channels) can be performed. This is extremely difficult in an FSM representation due to a state-explosion problem, as described in [9].
On the other hand, the standard methodology to model hardware is using finite state machines (FSM). An FSM is semantically close to synchronous hardware implementations. A transition on the FSM corresponds to a tick of the system clock. In contrast to dataflow, the FSM encodes states in dedicated status registers. This kind of implementation assumes a centralized FSM that is responsible for the sequence and scheduling of each block execution.
5.1 Original Bundled Data (BD) Flow
The BD flow has been developed by the Niobium Microsystems Hardware Team [96]. The flow, depicted in Figure 5.1, comprises a specific Chisel design file, the Chisel compiler, an intermediate representation (IR) circuit and the Verilog compiler. The outcome of the flow itself is the RTL Verilog file.
The obtained RTL Verilog file is fed into an in-house Synthesis and Place and Route (P&R) flow implemented using commercial EDA tools with custom scripts.
5.1.1 BD modules and channels
The BD flow supports BD channels to interconnect asynchronous modules and a BD inter-
faces to connect I/O ports.
113
The traditional BD channel is a two wire bus that incorporates a delay line on the
Request wire and another delay line in the Acknowledge wire. Both delay lines are indepen-
dent from each other and the delay is configurable. In contrast to the BD channels, the BD
interface do not implement delay lines.
The BD modules supported by the flow are the basic asynchronous modules that sup-
port building linear pipelines and rings. The list of modules includes token generator (TG),
tokenbucket(TB),bufferpipeline(BP),copy(C),merge(M),andsplit(S)controlmodules.
Every module has two versions: slackfull and slackless, except from the TG and the TB,
that have only the slackfull version.
Theslackfull version incorporates in the datapath a bank of registers, effectively imple-
menting a slack equal to one per BD stage [3]. On the other hand, the slackless version
lacks in the datapath the bank of registers or there is no datapath at all. Both possibilities
implement a slack equal to zero per BD stage. The slackless versions are required for control
BDchannels,wherethetokendoesnotrequireadatapayload. Thissetofmodulesconforms
the BD Scala package that is used in the BD Chisel designs.
Figure 5.1: Bundled data flow
5.1.2 BD Chisel Design file
The flow input is a Chisel design described in the Scala language that incorporates elements
from the predefined BD Scala package. The design can comprise one or multiple BD
modules, forming a complex design, interconnected by BD channels and implementing
BD interfaces at I/O ports. Generally, a complex design is described using multiple BD
Chisel Design files.
5.1.3 Chisel Compiler
The Chisel compiler incorporates a C++ simulator for RTL debugging and generation.
Initially, the compiler produces a C++ class for each Chisel design, with a C++ interface
that includes clock-low and clock-high methods to control sequential elements [56]. The
outcome of this compiler is an Intermediate Representation (IR) RTL file.
5.1.4 Intermediate Representation (IR) Circuit file
This is the input file for the Verilog compiler. The IR is human-readable and
provides a powerful intermediate representation for compiler transformations and
analysis, while also providing a way to debug and visualize the transformations.
The IR representation aims to be lightweight, keeping a low level of abstraction while
retaining the expressiveness, types, and extensions of the high-level language.
A complete IR explanation can be found in [99].
5.1.5 Verilog Compiler
This is a Hardware Compiler Framework (HCF) that uses a hardware intermediate
representation, FIRRTL (Flexible Intermediate Representation for RTL), to transform a
general-purpose IR circuit into application-specific Verilog RTL by applying consecutive
transformations that include analyzing, simplifying, replacing, adding, renaming, and
transforming the circuit [100]. The FIRRTL library collects the transformations and controls
how each one is applied within the HCF. The Verilog compiler starts by applying successive
transforms on the IR circuit. After each transformation, a modified IR circuit is generated,
providing simplification and optimization until the final IR circuit is provided to the
Verilog RTL emitter.
5.1.6 RTL Verilog file
The outcome of the Verilog compiler is an application-specific Verilog RTL file that
represents the final hardware circuit and is ready to be plugged into the subsequent
synthesis flow.
5.2 Novel RHBD flows
Two RHBD flows are presented, the Timing Resilient (TR) and the Triple Modular
Redundant (TMR) flows. Both are based on the BD Chisel flow and implement a highly
modified BD Scala package. The main idea is to reuse the BD Chisel design as input to
either of the RHBD flows to obtain a specific RHBD RTL Verilog.
5.2.1 RHBD flow setup
Some aspects need to be considered to reuse an existing BD Chisel design in an RHBD
flow. The development of an RHBD Scala library was required to provide the input RHBD
Chisel Design file.
• First, the RHBD channel is a redesigned BD channel that is now a four-wire bus, with
double Request wires and double Acknowledge wires, following the SERAD template
(Section 2.6). Each wire incorporates a delay line; the two delay lines in the Request
channel are independent but configured with the same delay. The same configuration
applies to the Acknowledge channel.
• Second, the RHBD modules double the number of input and output ports and follow
the SERAD-Click controller template (Section 4.1). The RHBD modules comprise the
token generator (TG), token bucket (TB), buffer pipeline (BP), copy (C), merge (M)
and split (S) control modules, in slackfull and slackless versions. Specific versions were
developed for the Timing Resilient (TR) and the Triple Modular Redundancy (TMR)
flows. Basically, the RHBD Scala library provides a 1-to-1 module and channel
equivalence to the BD Scala library.
• Third, although no extra transforms are required for the TR flow, a specific FIRRTL
transform is required for the TMR flow. Specifically, this transform triplicates the
registers in the datapath and adds a 3-input majority gate in the IR circuit, following
the RHBD TMR strategy described in Section 5.2.3.
• Fourth, a set of special components described in Scala, which provide the TR strategy
in the datapath, was added to the RHBD Chisel library. These components
follow the RHBD TR strategy described in Section 5.2.2.
In the next sections, the specific modifications to obtain RHBD on the BD flow are explained.
5.2.2 Timing Resilient (TR) flow
Figure 5.2 depicts the TR flow. First, a BD Chisel design is fed into the module
renaming algorithm, described in Algorithm 1, where the RHBD-TR Scala library replaces
the instances of the BD Scala library.
Figure 5.2: Timing resilient flow
Second, the obtained RHBD-TR Chisel Design file lacks the TR components required
in the datapath, namely the transition detector, asymmetric C-elements, OR-tree and the
EDL logic, described in Section 5.2.2. The EDL datapath insertion algorithm, described
in Algorithm 2, performs the required connection and instance creation.
Algorithm 1: Chisel module renaming
Input: BD Chisel design,
set X = BD Scala library,
set Y = RHBD Scala library, rhlabel: TR/TMR
Output: Partial RHBD (TR or TMR) Chisel design
1 Open file(filename.scala) in read mode
2 Open file(filename+rhlabel.scala) in write mode
3 for each line in the input file do
4     for each module that matches an element in set X do
5         Replace module by matching module in set Y
6     Append the edited line to the file filename+rhlabel.scala
7 Close files
8 return filename+rhlabel.scala file
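The renaming step of Algorithm 1 can be sketched in Python as a line-by-line textual substitution. This is a minimal sketch under stated assumptions, not the script used in the flow: the module names in the mapping (`BufferPipeline`, `BufferPipelineTR`, `Split`, `SplitTR`) are hypothetical placeholders for entries of the BD and RHBD Scala libraries, and the function works on strings rather than files to stay self-contained.

```python
import re


def rename_modules(source, mapping):
    """Replace every BD module name found in `source` by its RHBD counterpart.

    `mapping` plays the role of sets X and Y in Algorithm 1: keys are BD
    module names, values are the matching RHBD names (hypothetical here).
    """
    if not mapping:
        return source
    # Match whole identifiers only, longest names first, so that a name that
    # is a prefix of another (or is already renamed) is left untouched.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    renamed_lines = [
        pattern.sub(lambda m: mapping[m.group(1)], line)
        for line in source.splitlines()
    ]
    return "\n".join(renamed_lines)


# Hypothetical BD -> RHBD-TR mapping and a two-line Chisel fragment.
bd_to_tr = {"BufferPipeline": "BufferPipelineTR", "Split": "SplitTR"}
design = "val bp = Module(new BufferPipeline(32))\nval s = Module(new Split(32))"
print(rename_modules(design, bd_to_tr))
```

Matching on whole identifiers keeps the substitution safe to run twice on the same text, since an already renamed module no longer matches any key of the mapping.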
At this point the original BD Chisel file has been converted into a complete RHBD-TR
Chisel design, and it is ready to continue through the BD flow, obtaining at the end the
RHBD-TR Verilog RTL file.
Algorithm 2: EDL datapath insertion
Input: Output file from Algorithm 1, set X = RHBD Scala library
Output: RHBD-TR Chisel design containing EDL datapath insertion
1 Open file(filename.scala) in read mode
2 Open file(filename+rhlabel.scala) in write mode
3 for each line in the input file do
4     Append the read line to the file filename+rhlabel.scala
5     for each slackfull module that matches an element in set X do
6         Add line connecting the EDL input bus to the datapath bus
7         Add line connecting the EDL interface to the Controller’s interface
8 Close files
9 return filename+TR.scala file
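A rough Python sketch of the insertion step in Algorithm 2 follows. It only illustrates the idea of adding an EDL instance and two connections after each slackfull module instance. Everything specific in it is an assumption made for illustration: the `EdlDatapath` class name, the `io.data`, `io.edl` and `io.ctrl` port names, the `_edl` instance suffix, and the set of slackfull module names are all hypothetical and do not come from the RHBD Scala library.

```python
import re

# Matches a Chisel instance declaration such as `val bp = Module(new Foo(...))`.
INSTANCE_DECL = re.compile(r"^(\s*)val\s+(\w+)\s*=\s*Module\(\s*new\s+(\w+)")


def insert_edl(source, slackfull_modules):
    """Copy every line and add EDL wiring after each slackfull instance."""
    out = []
    for line in source.splitlines():
        out.append(line)  # every input line is preserved
        match = INSTANCE_DECL.match(line)
        if match and match.group(3) in slackfull_modules:
            indent, instance = match.group(1), match.group(2)
            edl = instance + "_edl"  # hypothetical naming scheme
            # Hypothetical EDL component and port names, for illustration only.
            out.append(f"{indent}val {edl} = Module(new EdlDatapath())")
            out.append(f"{indent}{edl}.io.data := {instance}.io.data")
            out.append(f"{indent}{instance}.io.edl <> {edl}.io.ctrl")
    return "\n".join(out)


src = "  val bp = Module(new BufferPipelineTR(32))\n  val c = Module(new CopyTRSlackless())"
print(insert_edl(src, {"BufferPipelineTR"}))
```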
5.2.3 Triple Modular Redundant (TMR) flow
First, a BD Chisel design is fed into the module renaming algorithm, described in Algorithm 1,
where the RHBD-TMR Scala library replaces the instances of the BD Scala library.
Second, the obtained RHBD-TMR Chisel Design file has the RHBD-TMR controller path
ready, but the TMR strategy still has to be applied in the datapath.
Figure 5.3: Triple Modular Redundant flow
The TMR annotation insertion algorithm, described in Algorithm 3, adds the anno-
tation required on every register in the datapath.
The original TMR transform, presented in [101], was modified in order to accommodate
the triple modular register and the majority gate, following the RHBD-TMR strategy
described in Section 4.2.
Algorithm 3: TMR datapath annotation insertion
Input: Output file from Algorithm 1, set X = RHBD Scala library
Output: RHBD-TMR Chisel design containing TMR annotation
1 Open file(filename.scala) in read mode
2 Open file(filename+rhlabel.scala) in write mode
3 for each line in the input file do
4     Append the read line to the file filename+rhlabel.scala
5     for each register instantiated in the datapath do
6         Append an extra line including "tmr.annotation(" + registername + ")"
7 Close files
8 return filename+TMR.scala file
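The annotation pass of Algorithm 3 can be sketched in Python as follows. This is a minimal sketch, assuming that datapath registers appear as `val name = Reg(...)`, `RegInit(...)`, `RegNext(...)` or `RegEnable(...)` declarations on a single line; the detection rule is an assumption for illustration, while the emitted `tmr.annotation(name)` text is copied from step 6 of the algorithm.

```python
import re

# Assumed shape of a register declaration in the Chisel source.
REGISTER_DECL = re.compile(r"^(\s*)val\s+(\w+)\s*=\s*Reg(?:Init|Next|Enable)?\s*\(")


def insert_tmr_annotations(source):
    """Copy every line and add one annotation line after each register."""
    out = []
    for line in source.splitlines():
        out.append(line)
        match = REGISTER_DECL.match(line)
        if match:
            indent, register_name = match.groups()
            # Same text that step 6 of Algorithm 3 appends to the output file.
            out.append(f"{indent}tmr.annotation({register_name})")
    return "\n".join(out)


chisel_src = "  val data = Reg(UInt(32.W))\n  io.out := data"
print(insert_tmr_annotations(chisel_src))
```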
At this point the original BD Chisel file has been converted into a complete RHBD-TMR
Chisel design with annotations.
Annotations are meta-information added to the Chisel input file that later, in the Verilog
compiler, triggers the specific TMR transform, responsible for the instantiation of TMR
components in the datapath at the IR circuit level.
After the TMR transform is executed, each register previously annotated in the Chisel
design has been replaced by its TMR counterpart. At the end of the Verilog compilation,
the RHBD-TMR Verilog RTL file is obtained.
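The effect of replacing a register by three copies and a 3-input majority gate can be illustrated with a small behavioral model in Python. This is only a sketch of the generic TMR voting principle, not the FIRRTL transform itself; the class and method names are hypothetical.

```python
def majority(a, b, c):
    """Bitwise 3-input majority: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


class TmrRegister:
    """Behavioral model of a triplicated register with a majority voter."""

    def __init__(self, value=0):
        self.copies = [value, value, value]

    def write(self, value):
        self.copies = [value, value, value]

    def upset(self, copy_index, bit):
        # Model a single-event upset by flipping one bit in one copy.
        self.copies[copy_index] ^= 1 << bit

    def read(self):
        return majority(*self.copies)


reg = TmrRegister(0b1010)
reg.upset(copy_index=2, bit=0)  # one copy now holds 0b1011
print(bin(reg.read()))          # the voted value is still 0b1010
```

A single upset in any one copy is masked because the other two copies still agree on every bit.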
Chapter 6
Application - Case Study
This chapter describes the case study implementing novel templates and testing the novel
design flow. Specifically a network on chip inside a fully homomorphic encryption (FHE)
Accelerator is described.
6.1 Fully Homomorphic Encryption (FHE) Accelerator
An FHE accelerator is an application-specific architecture for fully homomorphic encryption
(FHE) computations.
The FHE accelerator presented is based on the BASALISC FHE architecture [102]. Figure
6.1 shows the FHE accelerator block diagram. This implementation comprises processing
elements (PE) and supporting units, described as follows:
Supporting units:
• AXI modules are responsible for transferring data between the external memory and the
CTB, and instructions from the external instruction memory to the Instruction Queue.
• Instruction Queue, which implements a buffer for TCU instructions.
• Traffic Control Unit (TCU), unit that decodes and issues instructions to other compo-
nents.
• Cipher-text Buffer (CTB). This unit stores cipher-text inputs and outputs for the PE
units.
Figure 6.1: FHE accelerator block diagram, adapted from [102].
Processing elements:
• Multiply accumulate (Mac PE) unit, responsible for multiplication, addition and
accumulation operations.
• Permutation PE (PNoC), a switched network that interconnects CTB memory modules
to PEs in a specific permutation order.
• Number Theoretic Transform PE (NTT), which computes an array of cipher-texts by
implementing NTT butterfly operations.
6.1.1 FHE accelerator components
The main components are described below.
6.1.2 AXI modules
These modules implement the physical interfaces between the external memory and the
internal FHE submodules CTB and Instruction Queue. For a description of AXI protocols,
see [103].
6.1.2.1 Instruction Queue
This is a first-in, first-out (FIFO) register that stores pre-fetched instructions in a queue,
which optimizes the instruction fetch operation.
6.1.3 MAC PE
This is an array of multiply-accumulate units. Each unit computes the product of two
coefficients and adds that product to an accumulator. Additionally, it can perform one
multiplication or addition operation per cycle. An accumulation operation requires two cycles.
6.1.3.1 Traffic Control Unit (TCU)
This unit is responsible for executing a sequence of instructions provided by the SoC via the
instruction interface. These instructions determine how the TCU manages and dispatches
control signals to orchestrate operations in the CTB, indicating what addresses to store or
forward; NTT arithmetic operation instructions; and tag-routing constants to the PNoC.
6.1.3.2 Number Theoretic Transform (NTT)
The Number Theoretic Transform unit processes an array of FHE coefficients at a time. It
implements an internal butterfly network interconnecting an array of special NTT multipliers
that perform specific cipher-text operations. After the NTT operations on the cipher-text
coefficients, the unit forwards the cipher-text result back to the CTB via the write PNoC.
6.1.3.3 Cipher-text Buffer (CTB)
The CTB has two independent interfaces. The read/write (R/W) external interface
interacts with the SoC, and is responsible for storing cipher-text coefficients in the internal
buffer and for forwarding results from the buffer to the external memory.
On the other side, a bridge interface with the PNoC forwards and receives data
generated in the NTT unit.
The CTB is divided into many memory banks that can be accessed in parallel. Each
bank is assigned a unique address (sender/destination identification) that allows clash-free
communication between PEs and CTB memory banks.
6.1.3.4 Permutation NoC (PNoC)
This unit is composed of a write-permutation unit that interfaces the PE units to the CTB
memory modules and a read-permutation unit that in turn interfaces the CTB memory modules
to the PE units. The Permutation NoC is explained in detail in the next section.
A major requirement in the design-space exploration is the scalability of the FHE
architecture. The number of PEs can range from 2 to 512. The data width is set to 32 bits,
but this parameter is also scalable.
6.1.4 RHBD Permutation NoC Application
The RHBD Asynchronous NoC design is based on hybrid redundancy, combining the RHBD spatial
and temporal solutions explained in Section 2.4. The architecture implementation reflects
the dataflow itself and therefore the algorithm that is implemented. The NoC follows the
indirect topology and implements deterministic routing as described in Section 2.9.2. This
can be accomplished by describing the design in a Chisel program interconnecting modules
and channels like nodes and edges in a directed graph.
The Omega-network [104] topology is used to provide communication parallelism between
PEs and memory blocks, implementing the interconnect NoC that transfers data between the
memory units and the PEs. These units re-order vector elements of ciphertexts according to
predefined programmable patterns. A pair of Omega-network NoCs, one per data-flow
direction, was implemented to interface the memory unit.
The write-permutation unit interfaces the PE units to the memory unit, and the
read-permutation unit interfaces the memory unit to the PE units.
These permutation units are necessary to accommodate the conflict-free routing addressing
scheme, explained in Section 2.9.3.2. Each permutation unit reorders the input array
based on coefficients (a, b, c), calculating a destination array.
The write permutation unit passes data from the PEs to the
memory unit. The write permutation unit can perform the same permutation operation as
the read unit, but it can additionally perform any permutation of the form i → (i ∗ a + b)
XOR c, for any a (being an odd number), b, and c, all three being smaller than size and
greater than zero. Note that size is a power of 2. For testing, the formula can be written as
i → ((i ∗ a + b) mod size) XOR c.
The read permutation unit passes data from memory units to the PEs. Data comes
from the memory banks in a scrambled ordering. The read permutation is a special case
of the write permutation, where a = 1 and b = 0, therefore the same design is reused to
implement both. One instance will have those inputs fixed appropriately to simplify the
logic, but there will be only one permutation module in the design that will be used in both
cases.
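The permutation formula used by both units can be checked with a short Python sketch. It evaluates i → ((i ∗ a + b) mod size) XOR c for every sender index and returns the destination array. The function names are hypothetical; the parameter checks follow the constraints stated in the text and in Table 6.1 (size a power of 2, a odd, c between 0 and size − 1).

```python
def permute_index(i, a, b, c, size):
    """Destination index of sender index i: ((i * a + b) mod size) XOR c."""
    return ((i * a + b) % size) ^ c


def destination_array(a, b, c, size):
    """Destination index for every sender index 0 .. size - 1."""
    if size < 2 or size & (size - 1):
        raise ValueError("size must be a power of 2")
    if a % 2 == 0:
        raise ValueError("a must be odd")
    if not 0 <= c < size:
        raise ValueError("c must lie in 0 .. size - 1")
    return [permute_index(i, a, b, c, size) for i in range(size)]


# General (write) case, and the read case with a = 1 and b = 0.
write_perm = destination_array(a=3, b=1, c=2, size=8)
read_perm = destination_array(a=1, b=0, c=3, size=4)
print(write_perm)  # [3, 6, 5, 0, 7, 2, 1, 4]
print(read_perm)   # [3, 2, 1, 0]
```

Because a is odd and size is a power of 2, multiplication by a is invertible modulo size, and XOR with a constant below size maps 0 .. size − 1 onto itself, so every destination index appears exactly once.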
6.1.4.1 Block Diagram
For simplicity, below we describe the implementation for size = 4, depicted in Figure 6.2. The
design implements asynchronous channels, tag generator units and routers; the latter two are
referred to as cores. The cyclic logical left shift is applied at the output of the asynchronous
channels, i.e., from Tswitches to switches: the channel inputs keep the natural array order
(0,1,2,3) but are connected to the switches following the cyclic logical left shift order.
Figure 6.2: Permutation block diagram
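The inter-column wiring can be sketched in Python under one assumption: that the cyclic logical left shift means rotating the binary channel index left by one bit, which is the usual perfect-shuffle wiring of an Omega network. Under that reading, for size = 4 the first switch receives channels 0 and 2, which agrees with the S0 and S2 example given for the 2x2 tile in Section 6.1.4.2. All function names are hypothetical.

```python
def rotate_left(index, size):
    """Rotate the log2(size)-bit binary representation of index left by one."""
    bits = size.bit_length() - 1
    msb = (index >> (bits - 1)) & 1
    return ((index << 1) & (size - 1)) | msb


def shuffle_wiring(size):
    """ports[p] is the channel that drives input port p of the next column."""
    ports = [None] * size
    for channel in range(size):
        ports[rotate_left(channel, size)] = channel
    return ports


def switch_inputs(size):
    """Pair consecutive ports: switch k gets ports 2k (upper) and 2k + 1 (lower)."""
    ports = shuffle_wiring(size)
    return [(ports[2 * k], ports[2 * k + 1]) for k in range(size // 2)]


print(shuffle_wiring(4))  # [0, 2, 1, 3]
print(switch_inputs(4))   # [(0, 2), (1, 3)]
```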
Note that a special NoC implementation case is size equal to two, where just one switch
connects the inputs straight to the outputs. This special case is easily managed in
the Chisel code by implementing the corresponding conditional statement. For greater size
values, the design scales up following this guideline:
• Number of columns: log2(size).
• Number of cores: size/2 ∗ log2(size).
• Number of pattern generators: size/2.
• Cyclic logical left shift performed at inputs of cores.
• Number of possible pipelines: size ∗ 2^(log2(n)), where log2(n) is the number of columns
(for size = 256, 8 columns).
• xa, xb, xc distribution tree: just one level for size <= 32. For size >= 64, a second
level (branches) is implemented; each branch is connected to 16 tag generators (TFG).
Those requirements are managed in the Chisel code by describing in Scala the conditions
required for the HDL implementation.
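The scaling guideline can be turned into a small Python helper that computes the structural counts for a given size. This is an illustrative sketch only: the function and key names are hypothetical, and it covers just the column, core, pattern generator and distribution tree rules listed above.

```python
def noc_dimensions(size):
    """Structural counts of the permutation NoC for a power-of-2 size."""
    if size < 2 or size & (size - 1):
        raise ValueError("size must be a power of 2, at least 2")
    columns = size.bit_length() - 1  # log2(size)
    return {
        "columns": columns,
        "cores": (size // 2) * columns,
        "pattern_generators": size // 2,
        # One distribution level up to size 32, two levels from size 64 on.
        "tree_levels": 1 if size <= 32 else 2,
    }


print(noc_dimensions(4))
print(noc_dimensions(256)["columns"])  # 8
```

For size = 2 the helper returns a single column with a single core, which matches the special case of one switch described above.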
See Table 6.1 for a parameter list and Table 6.2 for specific Permutation NoC I/O ports.
Table 6.1: Table with parameters
Name | Default Value | Legal Value | Description
size | 256 | Any power of 2 greater than or equal to 2 | Number of elements in input and output arrays
a | 1 | Positive odd integer smaller than size | Input (constant a)
b | 0 | Positive integer smaller than size | Input (constant b)
c | 3 | Integer with value ranging from 0 to size − 1 | Input (constant c)
dWidth | 32 | Positive fixed value | Datapath width
aWidth | log2(size) | log2(size) | a data width
bWidth | log2(size) | log2(size) | b data width
cWidth | log2(size) | log2(size) | c data width
Notes: - log2(size) − 1 if we hard-wire the LSB to 1.
- Size can range from 2 to 256.
- For initial testing, size = 4.
Table 6.2: Permutation NoC I/O ports
Channel Name | Direction | Description
clock | Input | Not used internally. It can be connected to a constant 0 or 1
reset | Input | Active-high reset signal
xa hs req | Input | Bundled-data request for xa bits. Width: 1 bit
xa hs ack | Output | Bundled-data acknowledge for xa bits. Width: 1 bit
xa bits | Input | Multiplier a coefficient input data. Width: aWidth
xb hs req | Input | Bundled-data request for xb bits. Width: 1 bit
xb hs ack | Output | Bundled-data acknowledge for xb bits. Width: 1 bit
xb bits | Input | Addend b coefficient input data. Width: bWidth
xc hs req | Input | Bundled-data request for xc bits. Width: 1 bit
xc hs ack | Output | Bundled-data acknowledge for xc bits. Width: 1 bit
xc bits | Input | XOR c coefficient input data. Width: cWidth
xs sindex hs req | Input | Bundled-data request for xs sindex bits. Width: 1 bit
xs sindex hs ack | Output | Bundled-data acknowledge for xs sindex bits. Width: 1 bit
xs sindex bits | Input | Input data from sender. Width: dWidth
ys dindex hs req | Output | Bundled-data request for ys dindex bits. Width: 1 bit
ys dindex hs ack | Input | Bundled-data acknowledge for ys dindex bits. Width: 1 bit
ys dindex bits | Output | Output data to destination dindex. Width: dWidth
Notes: - sindex = Sender index.
- dindex = Destination index, (i ∗ a + b) XOR c.
- Range = 0 to size − 1.
6.1.4.2 Sub Blocks
The main core is the data-permutation portion of the logic, which is implemented using
2-input, 2-output (2x2) routers placed using the Omega-network topology described in Section
2.9.3.
The permutation core is an Omega-network of 2x2 routers that are used to shuffle the
input data array using the pattern described above into the output data array. Table 6.3
describes the input and output ports of the router core in detail.
The routing information is conveyed through the input channel and concatenated with
the payload. Each router looks at the MSB portion of the routing information field and
decides to swap (if the MSB is a one) or to forward straight and then shifts the routing
information down one bit before forwarding to the next layer. The two channels arriving at
the node should agree on the MSB otherwise an internal error is generated. This generated
error is for verification purposes only.
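The routing decision of a single 2x2 router can be modeled in a few lines of Python. This is a behavioral sketch only, assuming each token is a (tag, payload) pair and that shifting the routing information amounts to dropping the inspected MSB; the function name and the use of an exception for the verification-only error are illustrative choices, not part of the Chisel design.

```python
def route_2x2(upper, lower, tag_width):
    """Route two (tag, payload) tokens through one 2x2 router.

    Returns (x, y): the tokens leaving on the upper and lower output ports,
    each with a tag that is one bit narrower than on input.
    """
    if tag_width < 1:
        raise ValueError("tag_width must be at least 1")
    (tag_a, data_a), (tag_b, data_b) = upper, lower
    msb_a = (tag_a >> (tag_width - 1)) & 1
    msb_b = (tag_b >> (tag_width - 1)) & 1
    if msb_a != msb_b:
        # Mirrors the internal error raised for verification purposes.
        raise ValueError("inputs disagree on the routing MSB")
    remaining = (1 << (tag_width - 1)) - 1  # mask that drops the MSB
    token_a = (tag_a & remaining, data_a)
    token_b = (tag_b & remaining, data_b)
    # MSB = 1 swaps the two tokens, MSB = 0 forwards them straight.
    return (token_b, token_a) if msb_a else (token_a, token_b)


print(route_2x2((0b10, "A"), (0b11, "B"), tag_width=2))  # swapped
print(route_2x2((0b01, "A"), (0b00, "B"), tag_width=2))  # straight
```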
The tag forward generator (TFG) receives constants a, b, c and generates the routing
pattern based on sender indexes (sindex). The constants need to be fed each sending
cycle to the first-column routers, called Tswitches.
During implementation, i is an internal parameter that is set by the Chisel generator
to match the Omega-network index data channel (this index is hardwired at instantiation).
Example: a 2x2 tile receives S0 and S2, thus the sender index for tag_a is 0 and for tag_b
is 2.
Each router drops the MSB in the tag routing. This way, the required side-band carrying
the TAG information becomes narrower as it moves forward towards the destination in the
network.
Table 6.3: Router core I/O channels
Channel Name | Direction | Description
a hs req | Input | 1-bit Request handshaking signal
a hs ack | Output | 1-bit Acknowledge handshaking signal
a bits | Input | 32-bit input data to the tile
tag a hs req | Input | 1-bit Request handshaking signal
tag a hs ack | Output | 1-bit Acknowledge handshaking signal
tag a bits | Input | (log2(size) − column index)-bit input data to the tile
b hs req | Input | 1-bit Request handshaking signal
b hs ack | Output | 1-bit Acknowledge handshaking signal
b bits | Input | 32-bit input data to the tile
tag b hs req | Input | 1-bit Request handshaking signal
tag b hs ack | Output | 1-bit Acknowledge handshaking signal
tag b bits | Input | (log2(size) − column index)-bit input data to the tile
x hs req | Output | 1-bit Request handshaking signal
x hs ack | Input | 1-bit Acknowledge handshaking signal
x bits | Output | (32 + log2(size) − column index − 1)-bit output data to the next tile, upper out port
y hs req | Output | 1-bit Request handshaking signal
y hs ack | Input | 1-bit Acknowledge handshaking signal
y bits | Output | (32 + log2(size) − column index − 1)-bit output data to the next tile, lower out port
Notes: - Field (MSB, MSB − log2(size) − column index) is routing.
- Field (MSB − log2(size) − column index, 0) is data.
6.1.5 Application Results
Both RHBD flows were applied to the BD Permutation NoC Design file using the same set
of parameters, described in Table 6.1, with size set to 4. The layouts of both RHBD
implementations are depicted in Figure 6.3.
Figure 6.3: TMR and TR layout images obtained for Permutation NoC size = 4
The resulting layouts show that the timing resilient design is 2.46x smaller than the more
traditional TMR counterpart.
Chapter 7
Conclusions
This chapter summarizes the major contributions of the research and discusses potential
future work.
Summary of the achieved contributions.
• Developed an optimized bundled-data resilient handshaking protocol, together with its
template implementation and correctness verification, called Blade Open.
• A novel asynchronous design template, Blade-OC, was developed. This template
enables the overlapping of consecutive latch pulses, increasing the size of the timing
resilience window and design flexibility. Additionally, its maximum achievable resiliency
window was compared analytically against existing asynchronous timing-resilient
templates.
• A bundled-data soft-error-resilient template implemented using the Click asynchronous
template [37].
• Development of RHBD templates and cells, specifically SERAD-Click-TR [13],
SERAD-Click-TMR and the RHBD Mutex cell [14].
• A network-on-chip description in HCL as an interesting and challenging asynchronous
case study.
For future work, we suggest the following topics for further investigation based on this
dissertation.
• To develop routing arbitration implementing the RHBD Mutex cell.
• To develop an RHBD Q-flop cell following an approach similar to the one followed in the
RHBD Mutex development.
• To test new asynchronous templates in different applications aimed at high
performance (e.g., Blade templates, QDI).
• To explore other RHBD strategies, like Error Detection and Correction (EDC),
following the TMR approach.
• To reuse synchronous designs by converting them into RHBD asynchronous counterparts.
References
[1] “Chisel: Hardware construction language,” https://www.chisel-lang.org/chisel3/docs/
introduction.html, 2019, online; Accessed January 28th 2022.
[2] A. Izraelevitz, J. Koenig, P. Li, R. Lin, A. Wang, A. Magyar, D. Kim, C. Schmidt, C. Markley,
J. Lawson, and J. Bachrach, “Reusability is firrtl ground: Hardware construction languages,
compiler frameworks, and transformations,” in 2017 IEEE/ACM International Conference
on Computer-Aided Design (ICCAD), 2017, pp. 209–216.
[3] P. A. Beerel, R. O. Ozdag, and M. Ferretti, A Designer’s Guide to Asynchronous VLSI.
Cambridge University Press, 2010.
[4] N. N. Mahatme, B. Bhuva, N. Gaspard, T. Assis, Y. Xu, P. Marcoux, M. Vilchis,
B. Narasimham, A. Shih, S. . Wen, R. Wong, N. Tam, M. Shroff, S. Koyoma, and A. Oates,
“Terrestrial SER characterization for nanoscale technologies: A comparative study,” in 2015
IEEE International Reliability Physics Symposium, April 2015, pp. 4B.4.1–4B.4.7.
[5] D. Mavis and P. Eaton, “Soft error rate mitigation techniques for modern microcircuits,”
in 2002 IEEE International Reliability Physics Symposium. Proceedings. 40th Annual (Cat.
No.02CH37320), 2002, pp. 216–225.
[6] M. Cannizzaro, S. Beer, J. Cortadella, R. Ginosar, and L. Lavagno, “Saferazor:
Metastability-robust adaptive clocking in resilient circuits,” IEEE Transactions on Circuits and Systems I:
Regular Papers, vol. 62, no. 9, pp. 2238–2247, Sep. 2015.
[7] S. A. Aketi, S. Gupta, H. Cheng, J. Mekie, and P. A. Beerel, “Serad: Soft error resilient
asynchronous design using a bundled data protocol,” IEEE Transactions on Circuits and
Systems I: Regular Papers, vol. 67, no. 5, pp. 1667–1677, 2020.
[8] B. Fritz, A. Steininger, V. Simek, and V. S. Veeravalli, “Setup for an experimental study of
radiation effects in 65nm CMOS,” in 2017 Euromicro Conference on Digital System Design
(DSD), 2017, pp. 329–336.
[9] S. Tripakis, R. Limaye, K. Ravindran, and G. Wang, “On tokens and signals: Bridging the
semantic gap between dataflow models and hardware implementations,” in 2014 Interna-
tional Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation
(SAMOS XIV), July 2014, pp. 51–58.
[10] “ACT - An Open-Source Design Flow for Asynchronous Circuits,”
http://avlsi.csl.yale.edu/act/lib/exe/fetch.php?media=gomactech2019.pdf, 2019, online; Accessed January 28th 2022.
[11] A. Lines, P. Joshi, R. Liu, S. McCoy, J. Tse, Y.-H. Weng, and M. Davies, “Loihi asynchronous
neuromorphic research chip,” in 2018 24th IEEE International Symposium on Asynchronous
Circuits and Systems (ASYNC), 2018, pp. 32–33.
[12] M. Herrera, T. Wang, and P. A. Beerel, “Blade-oc asynchronous resilient template,”
in 2018 28th International Symposium on Power and Timing Modeling, Optimization and
Simulation (PATMOS), July 2018, pp. 147–154.
[13] F. A. Kuentzer, M. Herrera, O. Schrape, P. A. Beerel, and M. Krstic, “Radiation hard-
ened click controllers for soft error resilient asynchronous architectures,” in 2020 26th IEEE
International Symposium on Asynchronous Circuits and Systems (ASYNC), 2020, pp. 78–85.
[14] M. Herrera and P. Beerel, “Radiation hardening by design techniques for the mutual
exclusion element.” [Online]. Available: https://dl.acm.org/doi/10.1145/3526241.3530310
[15] I. E. Sutherland, “Micropipelines,” Commun. ACM, vol. 32, no. 6, pp. 720–738, Jun. 1989.
[16] J. Cortadella, A. Kondratyev, L. Lavagno, and C. Sotiriou, “Desynchronization: Synthesis
of asynchronous circuits from synchronous specifications,” IEEE Trans. on CAD, vol. 25,
no. 10, pp. 1904–1921, Oct 2006.
[17] I. J. Chang, S. P. Park, and K. Roy, “Exploring asynchronous design techniques for
process-tolerant and energy-efficient subthreshold operation,” IEEE JSSC, vol. 45, no. 2, pp. 401–410,
Feb 2010.
[18] N. Jayakuma, R. Garg, B. Gamache, and S. Khatri, “A PLA based asynchronous
micropipelining approach for subthreshold circuit design,” in DAC, 2006, pp. 419–424.
[19] J. Liu, S. Nowick, and M. Seok, “Soft mousetrap: A bundled-data asynchronous pipeline
scheme tolerant to random variations at ultra-low supply voltages,” in ASYNC, May 2013,
pp. 1–7.
[20] Y. Chen, R. Manohar, and Y. Tsividis, “Design of tunable digital delay cells,” in 2017 IEEE
Custom Integrated Circuits Conference (CICC), April 2017, pp. 1–4.
[21] T. Lin, K. Chong, J. S. Chang, and B. Gwee, “An ultra-low power asynchronous-logic in-situ
self-adaptive v
dd
system for wireless sensor networks,” IEEE Journal of Solid-State Circuits,
vol. 48, no. 2, pp. 573–586, Feb 2013.
[22] M. Imai, T. V. Chu, K. Kise, and T. Yoneda, “The synchronous vs. asynchronous noc routers:
an apple-to-apple comparison between synchronous and transition signaling asynchronous
designs,” in 2016 Tenth IEEE/ACM International Symposium on Networks-on-Chip (NOCS),
Aug 2016, pp. 1–8.
[23] K. Yun, D. Dill, and S. Nowick, “Synthesis of 3D asynchronous state machines,” in ICCD,
Oct 1992, pp. 346–350.
[24] A. Peeters, F. te Beest, M. de Wit, and W. Mallon, “Click elements: An implementation
style for data-driven compilation,” in Asynchronous Circuits and Systems (ASYNC), 2010
IEEE Symposium on, May 2010, pp. 3–14.
[25] K. A. Bowman, J. W. Tschanz, N. S. Kim, J. C. Lee, C. B. Wilkerson, S.-L. L. Lu, T. Karnik,
and V. K. De, “Energy-efficient and metastability-immune resilient circuits for dynamic vari-
ation tolerance,” IEEE Journal of Solid-State Circuits, vol. 44, no. 1, pp. 49–63, Jan 2009.
[26] J. Cortadella, M. Lupon, A. Moreno, A. Roca, and S. S. Sapatnekar, “Ring oscillator clocks
and margins,” in IEEE Symposium on Asynchronous Circuits and Systems (ASYNC), May
2016, pp. 19–26.
[27] D. Sokolov, J. Murphy, A. Bystrov, and A. Yakovlev, “Design and analysis of dual-rail circuits
for security applications,” IEEE Transactions on Computers, vol. 54, no. 4, pp. 449–460, 2005.
[28] A. Saifhashemi, D. Hand, P. A. Beerel, W. Koven, and H. Wang, “Performance and area
optimization of a bundled-data intel processor through resynthesis,” in IEEE Symposium on
Asynchronous Circuits and Systems (ASYNC), May 2014, pp. 110–111.
[29] D. Hand, M. T. Moreira, H.-H. Huang, D. Chen, F. Butzke, Z. Li, M. Gibiluka, M. Breuer,
N. L. V. Calazans, and P. A. Beerel, “Blade – a timing violation resilient asynchronous
template,” IEEE Symposium on Asynchronous Circuits and Systems (ASYNC), pp. 21–28, May 2015.
[30] G. Heck, L. Heck, A. Singhvi, M. Moreira, P. Beerel, and N. Calazans, “Analysis and opti-
mization of programmable delay elements for 2-phase bundled-data circuits,” in VLSID, Jan
2015, pp. 321–326.
[31] S. Beer, M. Cannizzaro, and J. Cortadella, “Metastability in better-than-worst-case designs,”
IEEE Symposium on Asynchronous Circuits and Systems (ASYNC), pp. 101–102, May 2014.
[32] M. Fojtik, D. Fick, and Y. Kim, “Bubble razor: Eliminating timing margins in an ARM
Cortex-M3 processor in 45 nm CMOS using architecturally independent error detection and
correction,” IEEE Journal of Solid-State Circuits, vol. 48, no. 1, pp. 66 – 81, Nov 2012.
[33] M. Cannizzaro, S. Beer, J. Cortadella, R. Ginosar, and L. Lavagno, “Saferazor:
Metastability-robust adaptive clocking in resilient circuits,” IEEE Trans. on Circuits and Systems, vol. 62,
pp. 2238–2247, 2015.
[34] F. U. Rosenberger, C. E. Molnar, T. J. Chaney, and T.-P. Fang, “Q-modules: Internally
clocked delay-insensitive modules,” IEEE Trans. Computers, vol. 37, pp. 1005–1018, 1988.
[35] D. Hand, H.-H. Huang, B. Cheng, Y. Zhang, M. T. Moreira, M. Breuer, N. L. V. Calazans,
and P. A. Beerel., “Performance optimization and analysis of blade designs under delay
variability,” IEEE Symposium on Asynchronous Circuits and Systems (ASYNC), pp. 61–68,
May 2015.
[36] M. Waugaman and W. Koven, “Sharp - a resilient asynchronous template,” IEEE Symposium
on Asynchronous Circuits and Systems (ASYNC), pp. 21–24, May 2017.
[37] A. Peeters, F. T. Beest, M. D. Wit, and W. Mallon, “Click elements: An implementation
style for data-driven compilation,” IEEE Symposium on Asynchronous Circuits and Systems
(ASYNC), pp. 3–14, May 2010.
[38] D. Munteanu and J.-L. Autran, “Modeling and simulation of single-event effects in digital
devices and ics,” IEEE Transactions on Nuclear Science, vol. 55, no. 4, pp. 1854–1878, 2008.
[39] P. Dodd and L. Massengill, “Basic mechanisms and modeling of single-event upset in digital
microelectronics,” IEEE Transactions on Nuclear Science, vol. 50, no. 3, pp. 583–602, 2003.
[40] F. Wrobel, L. Dilillo, A. D. Touboul, V. Pouget, and F. Saigné, “Determining realistic
parameters for the double exponential law that models transient current pulses,” IEEE Transactions
on Nuclear Science, vol. 61, no. 4, pp. 1813–1818, 2014.
[41] M. Andjelkovic, A. Ilic, Z. Stamenkovic, M. Krstic, and R. Kraemer, “An overview of the
modeling and simulation of the single event transients at the circuit level,” in 2017 IEEE
30th International Conference on Microelectronics (MIEL), 2017, pp. 35–44.
[42] Q. Zhou and K. Mohanram, “Gate sizing to radiation harden combinational logic,” IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 1,
pp. 155–166, 2006.
[43] Q. Zhou, M. R. Choudhury, and K. Mohanram, “Tunable transient filters for soft error rate
reduction in combinational circuits,” in 2008 13th European Test Symposium, 2008, pp. 179–
184.
[44] Bainbridge and S. Salisbury, “Hardening of self-timed circuits against
glitches,” US patent application 11/849,312, Apr. 22, 2010. Available:
https://patents.google.com/patent/ep2239848a2/ko.
[45] O. Schrape, M. Andjelković, A. Breitenreiter, A. Balashov, and M. Krstić, “Design concept
for radiation-hardening of triple modular redundancy tspc flip-flops,” in 2020 23rd Euromicro
Conference on Digital System Design (DSD), 2020, pp. 616–621.
[46] S. A. Aketi, J. Mekie, and H. Shah, “Single-error hardened and multiple-error tolerant
guarded dual modular redundancy technique,” in 2018 31st International Conference on
VLSI Design and 2018 17th International Conference on Embedded Systems (VLSID), 2018,
pp. 250–255.
[47] S. Mitra, M. Zhang, S. Waqas, N. Seifert, B. Gill, and K. S. Kim, “Combinational logic soft
error correction,” in 2006 IEEE International Test Conference, 2006, pp. 1–9.
[48] J.Cazeaux,D.Rossi,M.Omana,C.Metra,andA.Chatterjee,“Ontransistorlevelgatesizing
for increased robustness to transient faults,” in 11th IEEE International On-Line Testing
Symposium, 2005, pp. 23–28.
140
[49] A.Balasubramanian,B.Bhuva,J.Black,andL.Massengill,“RHBDtechniquesformitigating
effectsofsingle-eventhitsusingguard-gates,” IEEE Transactions on Nuclear Science,vol.52,
no. 6, pp. 2531–2535, 2005.
[50] J.CharlesH.RothandL.L.Kinney, Fundamentals of Logic Design, Sixth Edition. Cengage
Learning, 1980, ch. Latches and Flip-Flops.
[51] Y. Zhang, L. S. Heck, M. T. Moreira, D. Zar, M. A. Breuer, N. L. V. Calazans, and P. A.
Beerel, “Testable mutex design,” IEEE Transactions on Circuits and Systems I: Regular
Papers, vol. 63, no. 8, pp. 1188–1199, 2016.
[52] I. Filanovsky and H. Baltes, “CMOS schmitt trigger design,” IEEE Transactions on Circuits
and Systems I: Fundamental Theory and Applications, vol. 41, no. 1, pp. 46–49, 1994.
[53] K. Yun, B. Lin, D. Dill, and S. Devadas, “BDD-based synthesis of extended burst-mode
controllers,” Computer-Aided Design of Integrated Circuits and Systems, vol. 17, no. 9, pp.
782–792, Sep 1998.
[54] D. Borrione, R. Piloty, D. Hill, K. Lieberherr, and P. Moorby, “Three decades of hdls. ii.
conlan through verilog,” IEEE Design Test of Computers, vol. 9, no. 3, pp. 54–63, 1992.
[55] D. Rich, “The evolution of systemverilog,” IEEE Design Test of Computers, vol. 20, no. 4,
pp. 82–84, 2003.
[56] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Aviˇ zienis, J. Wawrzynek, and
K.Asanovi´ c, “Chisel: Constructinghardwareinascalaembeddedlanguage,” in DAC Design
Automation Conference 2012, 2012, pp. 1212–1221.
[57] “SCALA programming language,” https://github.com/scala, 2019, online; Accessed January
28th 2022.
[58] “FIRRTL: Flexible Internal Representation for RTL, available: https://www.chisel-
lang.org/firrtl/,” 2021.
[59] “Specificationforthefirrtllanguage,available: https://raw.githubusercontent.com/chipsallia
nce/firrtl/master/spec/spec.pdf,” 2021.
[60] L.BeniniandG.DeMicheli,“Networksonchip: anewparadigmforsystemsonchipdesign,”
inProceedings 2002 Design, Automation and Test in Europe Conference and Exhibition,2002,
pp. 418–419.
[61] M. Amde, T. Felicijan, A. Efthymiou, D. Edwards, and L. Lavagno, “Asynchronous on-chip
networks,” IEE Proceedings-Computers and Digital Techniques, vol. 152, no. 2, pp. 273–283,
2005.
[62] D. Micciancio, “A first glimpse of cryptography’s holy grail,” Commun. ACM, vol. 53, no. 3,
p. 96, mar 2010. [Online]. Available: https://doi.org/10.1145/1666420.1666445
141
[63] “Gartner top strategic technology trends for 2022,” https://www.gartner.com/en/
information-technology/insights/top-technology-trends, 2021, online; Accessed January 28th
2022.
[64] M. S. Riazi, K. Laine, B. Pelton, and W. Dai, “Heax: An architecture for computing on
encrypted data.” [Online]. Available: http://arxiv.org/abs/1909.09731v2
[65] R. Fuhrer, B. Lin, and S. Nowick, “Symbolic hazard-free minimization and encoding of asyn-
chronous finite state machines,” in ICCAD, Nov 1995, pp. 604–611.
[66] K. Sakallah, T. Mudge, and O. Olukotun, “Analysis and design of latch-controlled syn-
chronous digital circuits,” IEEE Trans. on CAD, vol. 11, no. 3, pp. 322–333, Mar 1992.
[67] I. Sutherland, “Micropipelines,” Communications of the ACM, vol. 32, no. 6, pp. 720–738,
June 1989.
[68] I. J. Chang, S. P. Park, and K. Roy, “Exploring asynchronous design techniques for process-
tolerant and energy-efficient subthreshold operation,” J. Solid-State Circuits, vol. 45, pp.
401–410, 2010.
[69] R. M. Fuhrer, B. Lin, and S. M. Nowick, “Symbolic hazard-free minimization and encoding
of asynchronous finite state machines,” in Proceedings of IEEE International Conference on
Computer Aided Design (ICCAD), 1995, pp. 604 – 611.
[70] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective. Addison
Wesley Publishing Company Incorporated, 2011.
[71] A. P. Hurst and R. K. Brayton, “The advantages of latch-based design under process varia-
tion,” in The International Workshop on Logic Synthesis (IWLS), June 2006.
[72] J. Cortadella, A. Kondratyev, L. Lavagno, and C. P. Sotiriou, “Desynchronization: Synthesis
of asynchronous circuits from synchronous specifications,” IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, vol. 25, pp. 1904–1921, 2006.
[73] G. Zhang and P. A. Beerel, “Stochastic analysis of bubble razor,” 2014 Design, Automation
and Test in Europe Conference and Exhibition (DATE), pp. 1–6, 2014.
[74] “Plasma CPU,” http://opencores.org/project,plasma, 2021, online; Accessed January 28th
2022.
[75] R. Naseer and J. Draper, “The DF-dice storage element for immunity to soft errors,” in
Midwest Symposium on Circuits and Systems, vol. 1, Aug 2005, pp. 303–306.
[76] A. Balasubramanian, B. L. Bhuva, J. D. Black, and L. W. Massengill, “RHBD techniques
for mitigating effects of single-event hits using guard-gates,” IEEE Transactions on Nuclear
Science, vol. 52, no. 6, pp. 2531–2535, Dec 2005.
142
[77] D. E. Muller and W. S. Bartky, “A theory of asynchronous circuits,” in Proceedings of
International Symposium on the Theory of Switching. Harvard University Press, 1959, p.
204–243.
[78] J. Bhasker and R. Chadha, Static Timing Analysis for Nanometer Designs: A Practical
Approach, 1st ed. Springer Publishing Company, Incorporated, 2009.
[79] J. R. Ahlbin, N. M. Atkinson, M. J. Gadlage, N. J. Gaspard, B. L. Bhuva, T. D. Loveless,
E. X. Zhang, L. Chen, and L. W. Massengill, “Influence of n-well contact area on the pulse
width of single-event transients,” IEEE Transactions on Nuclear Science, vol. 58, no. 6, pp.
2585–2590, Dec 2011.
[80] M. Krstic, S. Weidling, V. Petrovic, and E. S. Sogomonyan, “Enhanced architectures for soft
error detection and correction in combinational and sequential circuits,” Microelectronics
Reliability, vol. 56, pp. 212–220, 2016.
[81] F. A. Kuentzer and M. Krstic, “Soft error detection and correction architecture for asyn-
chronous bundled data designs,” IEEE Transactions on Circuits and Systems I: Regular
Papers, vol. 67, no. 12, pp. 4883–4894, 2020.
[82] D. J. Barnhart, T. Vladimirova, M. N. Sweeting, and K. S. Stevens, “Radiation hardening by
design of asynchronous logic for hostile environments,” IEEE Journal of Solid-State Circuits,
vol. 44, no. 5, pp. 1617–1628, 2009.
[83] D. Marienfeld, E. S. Sogomonyan, V. Otcheretnij, and M. Gossel, “A new self-checking and
code-disjoint non-restoring array divider,” in 12th IEEE International On-Line Testing Sym-
posium (IOLTS’06), July 2006, p. 6 pp.
[84] J. M. Benedetto, P. H. Eaton, D. G. Mavis, M. Gadlage, and T. Turflinger, “Digital single
eventtransienttrendswithtechnologynodescaling,” IEEE Transactions on Nuclear Science,
vol. 53, no. 6, pp. 3462–3465, Dec 2006.
[85] C. Mead and L. Conway, Introduction to VLSI Systems. Addison Wesley Publishing Com-
pany Incorporated, 1980, ch. System timing (C. L. Seitz).
[86] D. Kinniment, Synchronization and Arbitration in Digital Systems. John Wiley & Sons,
Ltd., 2008, ch. Part I, Circuits.
[87] S.R.Naqvi, A.Steininger, andJ.Lechner, “AnSETtoleranttreearbitercell,” in 2013 IEEE
19th International Symposium on Asynchronous Circuits and Systems, 2013, pp. 31–39.
[88] A. Yakovlev, A. Petrov, and L. Lavagno, “A low latency asynchronous arbitration circuit,”
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 2, no. 3, pp. 372–
377, 1994.
[89] A. J. M. Wonjin Jang, Christopher D. Moore, “SEU tolerant arbiter,” U.S. Patent WO
2010/080921 A2, Jul. 7, 2010. [Online]. Available: https://patents.google.com/patent/
WO2010080921A2/ar
143
[90] R. Naseer and J. Draper, “The DF-dice storage element for immunity to soft errors,” in 48th
Midwest Symposium on Circuits and Systems, Aug 2005, pp. 303–306 Vol. 1.
[91] “Predictive technology model, arizona state university, USA,” http://ptm.asu.edu/latest.
html, 2022, online; Accessed January 28th 2022.
[92] “BSIM (Berkeley short-channel IGFET model) group, University of California, Berkeley,
USA,” http://bsim.berkeley.edu/?page=BSIMCMG, 2022, online; Accessed January 28th
2022.
[93] D. A. Black, W. H. Robinson, I. Z. Wilcox, D. B. Limbrick, and J. D. Black, “Modeling of
single event transients with dual double-exponential current sources: Implications for logic
cell characterization,” IEEE Transactions on Nuclear Science, vol. 62, no. 4, pp. 1540–1549,
2015.
[94] Y. M. Aneesh, S. R. Sriram, K. R. Pasupathy, and B. Bindu, “An analytical model of single-
eventtransientsindouble-gateMOSFETforcircuitsimulation,” IEEE Transactions on Elec-
tron Devices, vol. 66, no. 9, pp. 3710–3717, 2019.
[95] B. Narasimham, V. Ramachandran, B. L. Bhuva, R. D. Schrimpf, A. F. Witulski, W. T.
Holman, L. W. Massengill, J. D. Black, W. H. Robinson, and D. McMorrow, “On-chip
characterization of single-event transient pulsewidths,” IEEE Transactions on Device and
Materials Reliability, vol. 6, no. 4, pp. 542–549, 2006.
[96] “Niobium Microsystems homepage,” https://niobiummicrosystems.com, 2022, online;
Accessed January 28th 2022.
[97] S. A. Edwards, R. Townsend, M. Barker, and M. A. Kim, “Compositional dataflow circuits,”
ACM Trans. Embed. Comput. Syst., vol. 18, no. 1, pp. 5:1–5:27, Jan. 2019. [Online].
Available: http://doi.acm.org/10.1145/3274280
[98] L. Josipovi´ c, R. Ghosal, and P. Ienne, “Dynamically scheduled high-level synthesis,” in
Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate
Arrays, ser. FPGA ’18. New York, NY, USA: ACM, 2018, pp. 127–136. [Online]. Available:
http://doi.acm.org/10.1145/3174243.3174264
[99] “The LLVM Compiler Infrastructure Project,” http://llvm.org, 2019, online; Accessed Jan-
uary 28th 2022.
[100] A.Izraelevitz,J.Koenig,P.Li,R.Lin,A.Wang,A.Magyar,D.Kim,C.Schmidt,C.Markley,
J. Lawson, and J. Bachrach, “Reusability is firrtl ground: Hardware construction languages,
compiler frameworks, and transformations,” in 2017 IEEE/ACM International Conference
on Computer-Aided Design (ICCAD), 2017, pp. 209–216.
[101] “FIRRTLTransformTutorial,” https://github.com/ucb-bar/firrtl-transform-tutorial/, 2020,
online; Accessed January 28th 2022.
144
[102] R. Geelen, M. V. Beirendonck, H. V. L. Pereira, B. Huffman, T. McAuley, B. Selfridge,
D. Wagner, G. Dimou, I. Verbauwhede, F. Vercauteren, and D. W. Archer, “Basalisc:
Flexible asynchronous hardware accelerator for fully homomorphic encryption.” [Online].
Available: http://arxiv.org/abs/2205.14017v1
[103] “ARM - An introduction to AMBA AXI,” https://developer.arm.com/documentation/
102202/0300/AXI-protocol-overview, 2021, online; Accessed January 28th 2022.
[104] “Omega network,” https://en.wikipedia.org/wiki/Omega network, 2021, online; Accessed
January 28th 2022.
145
Abstract
The complex interrelationships between technology, design, and fabrication require that radiation hardness be considered when circuits will be used in low-orbit or deep-space environments. Unlike traditional synchronous circuits, asynchronous circuits do not use a global clock signal to control the update of the state registers, and their potential benefits include soft-error tolerance with low overhead. The traditional method of designing digital circuits, including asynchronous circuits, uses a hardware description language (HDL) such as Verilog, which includes variable assignments, explicit notation for expressing concurrency, and control-flow structures. However, HDLs have been compared to assembly languages because of their low level of abstraction: they explicitly describe the structure of the circuit in detail. In contrast, Chisel is a modern hardware construction language (HCL) that provides a more abstract description of digital circuits and is part of a synthesis framework that automatically produces lower-level register-transfer level (RTL) circuit descriptions.
This dissertation proposes a synthesis framework aimed at generating novel Radiation Hardened by Design (RHBD) asynchronous circuits from Chisel specifications using novel timing- and soft-error-tolerant control templates and an RHBD mutual exclusion element to support arbitration. Specifying these designs in Chisel enables scalability and reusability by reducing the non-recurring engineering costs of design, test, and verification. We demonstrate the utility of this flow on a design case study of an asynchronous network on chip (ANoC) that is part of an accelerator for fully homomorphic encryption. Two different RHBD templates, the traditional Triple Modular Redundant (TMR) template and the timing-resilient template, were used to explore different ANoC designs. The resulting layouts show that the timing-resilient design is 2.46x smaller than its more traditional TMR counterpart. Our flow highlights the potential benefits of asynchronous circuits, not just for soft-error tolerance, but also for modularity, reusability, and composability.
Asset Metadata
Creator
Herrera Buitrago, Moises Fernando (author)
Core Title
Radiation hardened by design asynchronous framework
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Computer Engineering
Degree Conferral Date
2022-08
Publication Date
07/12/2022
Defense Date
06/13/2022
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
asynchronous circuits,Chisel,homomorphic encryption,network on chip,OAI-PMH Harvest,radiation hardening by design,Scala,Verilog,VLSI
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Beerel, Peter (committee chair), Dimou, Georgios (committee member), Nakano, Aiichiro (committee member), Nuzzo, Pierluigi (committee member)
Creator Email
herrerab@usc.edu,mfherreradi@gmail.com
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC111371069
Unique identifier
UC111371069
Legacy Identifier
etd-HerreraBui-10826
Document Type
Dissertation
Rights
Herrera Buitrago, Moises Fernando
Type
texts
Source
20220713-usctheses-batch-952 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu