MICROARCHITECTURE DESIGN SPACE EXPLORATION FOR MULTI-FLUXON SFQ CPUS
by
Haipeng Zha
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
May 2025
Copyright 2025 Haipeng Zha
Acknowledgements
During my long journey toward my Ph.D., I have received immense help from many people. First, I would
like to thank my advisor, Professor Murali Annavaram. I could never have achieved these accomplishments
without his support. In this new research area, he guided me to find the right path in my research. He
used his expertise to help me overcome all the challenges I encountered. Beyond academia, he also cared
about our lives and treated me and other students as family. I cannot express enough gratitude to him.
Next, I would like to thank Professor Peter A. Beerel. He recommended me to Prof. Annavaram, which
initiated my Ph.D. journey. I would like to thank him and Professor Mengyuan Li for providing invaluable
feedback on my research as members of my thesis defense committee.
I would also like to thank all my collaborators for the research outcomes we have achieved together. I
would also like to thank all my lab mates for being friends in academia and in everyday life. Additionally,
I would like to thank Yang Zhang, Yue Kun, and Mingye Li for both inspiring my research and sharing
moments of private life.
Beyond academia, I am extremely grateful to my best friend, Shichun "Huzi" Hu. We have spent most
of our time together during our life at USC, and our friendship will endure forever.
Last but not least, I would like to thank my parents. They have been my greatest support throughout
this long journey toward my Ph.D. I could never have become who I am without them.
Table of Contents

Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
Chapter 2: Background
    2.1 SFQ Logic
    2.2 Clock Distribution
    2.3 DRO Memory Cell
    2.4 NDRO Memory Cell
    2.5 Splitters and Mergers
    2.6 Path Balancing
Chapter 3: Multi Fluxon Storage and Its Implications for Microprocessor Design
    3.1 Introduction
    3.2 Context of Microprocessor Design
    3.3 DRO Cell With Multi Fluxon Storage
        3.3.1 Design and Functionality
        3.3.2 Simulation Results
        3.3.3 Related Work
    3.4 Register File Design
        3.4.1 Single Memory Cell for the Register File
        3.4.2 Register File With HC-DRO Cells
            3.4.2.1 Register File Write Operation
            3.4.2.2 Register File Read Operation
    3.5 Branch Predictor Design
        3.5.1 Branch Predictor Circuit Block
            3.5.1.1 Reading the Branch Prediction Value
            3.5.1.2 Incrementing the Prediction State
            3.5.1.3 Decrementing the Prediction State
        3.5.2 Results
            3.5.2.1 Verilog Simulation Results
            3.5.2.2 Architecture Simulation Results
    3.6 Conclusion
Chapter 4: HiPerRF: A Dual-Bit Dense Storage SFQ Register File
    4.1 Introduction
    4.2 Clock-Less NDRO Register File
        4.2.1 Read Port Design
        4.2.2 Reset Port Design
        4.2.3 Write Port Design
        4.2.4 Read and Write Operation
        4.2.5 Timing
    4.3 HiPerRF: HC-DRO RF with NDRO capabilities
        4.3.1 HC-DRO Read and Write Circuits
        4.3.2 LoopBuffer for Non-Destructive Readout
        4.3.3 Read Port and Write Port Design
        4.3.4 Timing
    4.4 Multi-Bank HiPerRF
        4.4.1 Port Design
        4.4.2 Timing
    4.5 Evaluation
        4.5.1 Hardware Performance
        4.5.2 Simulation Results
        4.5.3 Impact of Wire Delay
    4.6 Related Work
    4.7 Conclusion
Chapter 5: SuperBP: Design Space Exploration of Perceptron-Based Branch Predictors for Superconducting CPUs
    5.1 Introduction
    5.2 Background
        5.2.1 Recap of Perceptron Branch Predictor
        5.2.2 Recap of Hashed Perceptron Branch Predictor
    5.3 Related Work
    5.4 NDRO Baseline Branch Predictor
        5.4.1 Perceptron Weight Storage Design
        5.4.2 Training Unit Design
        5.4.3 Inference Unit Design
        5.4.4 Optimization: Using a 3-bit Adder for Efficient Inference
        5.4.5 Hashed Perceptron Design
    5.5 SuperBP: HC-DRO Perceptron Branch Predictor
        5.5.1 Perceptron Weight Storage Design
        5.5.2 Inference LoopBuffer
        5.5.3 Training LoopBuffer
        5.5.4 Eliminating the Decoder and Encoder Circuits
        5.5.5 Training Unit Design
        5.5.6 Inference Unit Design
        5.5.7 Hashed Perceptron Design
    5.6 Evaluation Methodology
    5.7 Results
        5.7.1 Perceptron and Hashed Perceptron
        5.7.2 Hardware Performance
        5.7.3 Detailed Performance Evaluations
    5.8 Conclusion
Chapter 6: SF-QIQ: An SFQ Issue Queue Design
    6.1 Introduction
    6.2 Background
        6.2.1 α-Cell
        6.2.2 Issue Queues in Out-of-Order Execution
    6.3 Challenges of Designing an SFQ Issue Queue
        6.3.1 Selection Logic
            6.3.1.1 SHIFT Based Approach
            6.3.1.2 RANDOM Fill Approach
            6.3.1.3 CIRCULAR Queue Approach
        6.3.2 Wakeup Logic
    6.4 SF-QIQ Design
        6.4.1 CAM-Based Wakeup Logic Design
            6.4.1.1 Design Alternative
        6.4.2 Shift-Based CIRC Port Design
            6.4.2.1 Initialization
            6.4.2.2 Write Operation
            6.4.2.3 Issue an Instruction
            6.4.2.4 Squash Operation
        6.4.3 Selection Logic Design
            6.4.3.1 CIRC with Correct Order
        6.4.4 HC-DRO Payload RAM
    6.5 Evaluation
        6.5.1 Hardware Performance
            6.5.1.1 JJ Count
            6.5.1.2 Power Consumption
        6.5.2 Software Performance
            6.5.2.1 IQ Size and Selection Logic Delay
    6.6 Conclusion
Chapter 7: Future Work
Chapter 8: Conclusions
Bibliography
List of Tables

3.1 Comparison among different memory cells in the context of register file design
3.2 Estimate of JJ count for register file
4.1 Total JJ count and the percentage over the baseline design
4.2 Static power and the percentage over the baseline design
4.3 Readout delay and the percentage over the baseline design
4.4 Readout delay, loopback latency with PTL delay
5.1 Total JJ count and the saving % over the baseline
5.2 Breakdown of JJs with the weight table of (a) 16×8 (b) 128×32
5.3 Total static power consumed by BP only (µW) and the saving % over the baseline
6.1 The JJ count across different issue queue designs
6.2 The static power (mW) and relative overhead across different issue queue designs
List of Figures

2.1 Schematic of the DRO cell
2.2 Schematic of the NDRO cell
2.3 Schematic of (a) Splitter and (b) Merger Cell
2.4 (a) Path unbalanced SFQ logic (b) Path balanced SFQ logic
3.1 A traditional five-stage pipelined CPU
3.2 HC-DRO and parameters
3.3 A simple parallel access register file
3.4 Implementation of a row in the RF
3.5 Two-bit branch predictor with C3DRO cells
3.6 Verilog simulation result of the branch predictor system
3.7 SFQ microprocessor performance with or without a branch predictor
4.1 NDRO register file design
4.2 Schematic of an AND Gate
4.3 (a) DEMUX built with combinational logic (b) DEMUX built with NDROC (c) 1-to-4 DEMUX with NDROC
4.4 (a) NDRO write circuit (b) Dynamic AND timing
4.5 32×32 bits NDRO register file timing
4.6 HiPerRF design
4.7 (a) HC-WRITE design (b) HC-CLK design (c) HC-READ design (d) state machine diagram of the counter
4.8 Timing of HiPerRF
4.9 Timing of dual-banked HiPerRF
4.10 Dual-banked HiPerRF design
4.11 CPI overhead over baseline (NDRO RF) of different RISC-V and SPEC 2006 benchmarks for different designs
4.12 Placement and routing results of HiPerRF
5.1 Perceptron branch predictor
5.2 Hashed Perceptron branch predictor
5.3 NDRO perceptron branch predictor
5.4 NDROC-based DEMUX design
5.5 Write port and DAND gate
5.6 NDRO training unit design
5.7 NDRO inference unit design (a) Original (b) Optimized
5.8 HC-DRO perceptron branch predictor
5.9 HC-CLK circuit
5.10 HC-DRO training unit
5.11 HC-DRO inference unit
5.12 Multiplication and 2's complement translation
5.13 (a) State machine of the counter (b) HC-DRO serial adder
5.14 3-bit HC-DRO serial adder
5.15 SuperBP and hashed variant MPKI
5.16 MPKI for both designs
5.17 MPKI for NDRO and SuperBP design with 30K JJs and Reduction in MPKI (black label)
5.18 SuperBP IPC over NDRO-30k JJs
6.1 (a) α-DRO (b) α-AND
6.2 Shift logic
6.3 (a) Collapsible issue queue (b) Non-collapsible issue queue
6.4 CAM-based Wakeup Logic
6.5 RAM-based Wakeup Logic
6.6 SF-QIQ Design
6.7 SFQ CAM-based Wakeup Logic
6.8 CAM-based Wakeup Logic with Threshold Gate
6.9 Shift-based CIRC Port
6.10 Selection Logic Design (a) High Priority (b) Low Priority (c) Overview of the Whole Design
6.11 Selection Tree Logic
6.12 (a) Reversed Order (b) Fix Order Mechanism
6.13 Selection logic with correct order
6.14 CPI overhead over baseline (SHIFT) of different SPEC CPU2017 benchmarks for different designs (a) 32-entry IQ (b) 64-entry IQ
6.15 Relative IPC with different IQ sizes
Abstract
Single Flux Quantum (SFQ) superconducting technology provides significant power and performance benefits in the era of diminishing CMOS scaling. SFQ CPUs can also help scale quantum computing technologies, as SFQ circuits can be integrated with qubits due to their amenability to a cryogenic environment.
Recent advances in design automation tools and fabrication facilities have brought SFQ-based computing
to the forefront. SFQ technology is constrained by the number of Josephson Junctions (JJs) integrated into
a single chip, and prior works focused more on JJ-efficient SFQ datapath designs. However, to successfully build an SFQ-based CPU, many control-intensive microarchitecture structures need to be designed
to be JJ-efficient. We focus on the following significant challenges in SFQ CPU design: (1) Unlike CMOS,
SFQ technology lacks efficient on-chip memory design, such as SRAM. Hence, this thesis first tackles the
challenge of improving the density of SFQ memory cells. We propose a new type of multi-fluxon SFQ
storage cell, called HC-DRO (high capacity destructive readout cell), which stores multiple bits in a single
SFQ memory cell. By storing up to 3 SFQ pulses in a single SFQ memory cell, HC-DRO increases the total
density and JJ efficiency of on-chip memory. (2) Even with a dense HC-DRO cell, there is a significant JJ
overhead in designing microarchitecture structures such as register files in the CPU. Since HC-DRO cells
have a destructive readout property, it is essential to provide a non-destructive property for a register file
read while exploiting the density of the HC-DRO cell. We designed a register file called HiPerRF that uses
HC-DRO cells while still preserving the non-destructive readout property necessary for CPU register files.
HiPerRF shows performance comparable to a traditional non-destructive readout (NDRO)-based register file design while being highly JJ-efficient. (3) SFQ CPUs will have deep pipelines, and hence, a good branch predictor is necessary for high performance. We designed the microarchitecture for an SFQ branch predictor
called SuperBP. SuperBP is built on top of a perceptron branch predictor design that was proposed in CMOS
CPUs. SuperBP provides unique circuit and microarchitecture innovations that enable the branch predictor to operate directly on the multi-bit state encoded in HC-DRO cells. The branch predictor does not need
any decoding of the multi-bit information, thereby creating an efficient branch predictor design for SFQ
CPUs. (4) High-performance CPUs also need an efficient instruction scheduling logic. However, designing
instruction issue queues for SFQ CPUs is challenging due to out-of-order entry and exit of instructions
from the issue queues. We designed SF-QIQ that tackles the out-of-order allocation and deallocation in
issue queues. We hope the broad array of microarchitecture innovations provided in this thesis forms the blueprint for future SFQ CPU designs.
Chapter 1
Introduction
Rapid Single Flux Quantum (RSFQ) devices introduced by Likharev et al. [43] have gained traction as one
of the promising technologies to augment CMOS-based computing. Single Flux Quantum (SFQ) technology uses quantized voltage pulses in digital data generation, reproduction, amplification, memorization,
and processing. In particular, applications and kernels that demand substantial compute density and/or
need to operate at extremely low power are well suited for SFQ-based computing. Some examples are
computing in space applications with extremely limited power and iterative linear algebraic computations
in machine learning. The SFQ technology is based on superconducting devices called Josephson Junctions
(JJs). These devices work at a low temperature with a short switching time (∼1 ps) and little switching energy dissipation (∼10⁻¹⁹ J) [47]. JJ-based SFQ circuits have been demonstrated to operate at frequencies up to 770 GHz [12]. Recent works focused on efficient SFQ logic circuit realizations, such as designing ALUs and other digital structures that are necessary to build CPUs [14, 18, 21]. An early realization of a simple 8-bit bit-serial CPU has even been prototyped [2]. The current SFQ technology is
roughly equivalent to a "150 nm" CMOS node. The SCE technology road map [13] predicts that by 2028,
we will have a "90 nm" equivalent node. Theoretical estimations of the maximum density of SFQ-based
circuits utilizing the geometric inductance of a wire suggest a density of approximately 10⁷ JJ/cm². These projections indicate that building an SFQ-based CPU is within reasonable reach.
However, many challenges still need to be resolved when building a modern SFQ-based CPU. One
challenge is on-chip memory. In current-generation microprocessors, on-chip memory has a wide range of
uses that go beyond cache storage. Large physical register files, branch predictor history tables, instruction
issue queues, and prefetching buffers are performance-critical structures that rely on large memory availability. SFQ memories are built as flip-flop-like designs, and hence, memory density is quite low compared
to CMOS SRAM. SFQ currently provides two different memory cell designs: a Destructive ReadOut (DRO)
cell and a Non-Destructive ReadOut (NDRO) cell (more details in Chapter 2). Each DRO or NDRO cell in
current designs stores a single pulse. The content stored in a DRO cell can be read only once. Hence, for most CPU microarchitecture memory structures, such as register files, NDRO cells are the first choice since these structures need to preserve their content after a read for future reads. However, NDRO cells
are JJ-intensive, which heavily restricts the on-chip memory size.
This thesis tackles the challenges of microarchitecture design in building SFQ CPUs. We address these challenges through a holistic collection of techniques ranging from improved memory cell density to SFQ-specific design optimizations. The thesis makes four specific contributions.
1. We proposed the High Capacity Destructive ReadOut (HC-DRO) cell [34]. HC-DRO cells can hold up to three SFQ pulses, which means they can store 2 bits of information in one memory cell, thereby providing an opportunity to double the memory density. More design details are included in Chapter 3.
2. SFQ-based CPU research currently focuses more on datapath design [22, 15, 72, 27, 40]. Though
these designs help improve the overall performance of an SFQ-based CPU, many critical microarchitecture structures still need to be designed for SFQ CPUs. We embark on using HC-DRO cells for register file design. We built a register file based on HC-DRO cells called HiPerRF [85]. HC-DRO
cells provide only destructive readout capability. Namely, each value can be read only once. However, CPU register file contents are read multiple times in any program, and hence, a destructive
readout complicates the register file design. HiPerRF provides the non-destructive property using
a loopback write mechanism, thereby preserving the higher density of HC-DRO cells without compromising the multi-read demands of a register file. HiPerRF reduces the JJ count of the register file
design, after accounting for all the peripheral access circuitry costs, by 56.1% and reduces the static
power by 46.2%. Furthermore, HiPerRF reduces the JJ count by 16.3% even when considering an
entire in-order RISC-V CPU core. More design details are included in Chapter 4.
3. SFQ-based CPUs are gate-pipelined designs due to the inherent nature of the magnetic pulse storage
and movement (more details in Chapter 2). Deeply pipelined CPUs suffer from a big branch misprediction penalty, which means that a good branch predictor is necessary. To tackle this challenge,
we proposed SuperBP [86]. SuperBP is a perceptron branch predictor that also uses the HC-DRO
cells to store perceptron weights. The naive integration of HC-DRO with SFQ logic is inefficient
as HC-DRO cells store multiple fluxons in a single cell, which requires a decoding step on a read
and an encoding step on a write. SuperBP presents novel inference and prediction update circuits
for the Perceptron predictor that can directly operate on the native 2-bit HC-DRO weights without
decoding and encoding, thereby reducing the JJ use. SuperBP reduces the JJ count by 39% compared
to the NDRO-based design. We also evaluate the performance of Perceptron and its hashed variants
with the HC-DRO cell design using a range of benchmarks. Our evaluation shows that for a given
JJ count, the basic Perceptron variant of SuperBP provides better accuracy than the hashed variant.
The hashed variant uses multiple weight tables, each of which needs its own access decoder, and decoder
designs in SFQ consume a significant number of JJs. Thus, the hashed variant of SuperBP wastes the
JJ budget for accessing multiple tables, leaving a smaller weight storage capacity, which compromises prediction accuracy. The basic Perceptron variant of SuperBP improves prediction accuracy
by 13.6% over the hashed perceptron variant for an exemplar 30K JJ budget. More design details are
included in Chapter 5.
4. High-performance CPUs are designed using out-of-order (OOO) execution paradigms. OOO CPUs,
however, require efficient instruction issue queues. Instructions exit issue queues in the order their operands become ready, and hence, designing such a random entry/exit queue is challenging in SFQ technology. We design SF-QIQ, an instruction issue queue design based on CAM-based wakeup logic and a non-collapsible circular queue (CIRC). We compared different design choices, such as CAM or RAM wakeup logic and collapsible versus non-collapsible issue queues. We demonstrate how to build CAM-based wakeup logic, a shift-based CIRC port design that avoids a costly decoder, and a simple, fast, position-based selection logic. We also showed how to fix the ordering error in the CIRC issue queue with minimal hardware cost and how to use HC-DRO cells to further reduce the JJ cost in the issue queue design. SF-QIQ achieves performance similar to an ideal issue
queue with a more complicated selection logic.
The next chapter will provide some background knowledge of SFQ technology. Then, the following
chapters will provide more details on HC-DRO, HiPerRF, SuperBP, and SF-QIQ.
Chapter 2
Background
2.1 SFQ Logic
In CMOS technology, data is represented as voltage levels. For example, "1" is represented as a high voltage
level, and "0" is represented as a low voltage level. However, SFQ logic uses magnetic pulses to represent
"1" and "0". The magnetic pulse is stored in the form of a single quantum flux or fluxon. If a memory cell
stores a fluxon, it means it stores a "1". If a memory cell stores nothing, it means it stores a "0". Once the
fluxons are read from the memory cells, they are transmitted between logic gates in the form of SFQ pulses.
The existence of an SFQ pulse represents a "1", and the absence of a pulse represents a "0". However, most
SFQ logic gates need a clock input to perform their logic function once the input pulses arrive; hence, SFQ logic uses gate-level clocking. Since each logic gate has a clock and works like a flip-flop, each logic gate can be treated as a pipeline stage. As a result, SFQ CPUs will have deep pipelines.
2.2 Clock Distribution
Each gate in SFQ logic requires a clock, which imposes significant overhead. Clock distribution is an active
research area, and several works aim to reduce clock overheads [71, 33, 69]. In particular, using the dynamic
SFQ (DSFQ) technology [57], researchers successfully designed gates that do not need a clock. Instead,
these gates are self-timed along with their self-resetting property. This thesis uses this technology in
certain places to reduce the clock distribution demands, but a complete clocking analysis is outside the
scope of this thesis.
2.3 DRO Memory Cell
In SFQ technology, the Destructive ReadOut (DRO) cell [43] is one of the most important cells. It is a basic
memory cell that stores the SFQ fluxon. It can also be used as a buffer cell for path balancing [36] in a
circuit (more details on path balancing later). DRO cells are also sometimes known as D-flipflops.
Figure 2.1: Schematic of the DRO cell
Figure 2.1 shows the schematic of a DRO cell. It receives an SFQ pulse at input D. If it does not already
have a fluxon (SFQ pulse) stored in it, it stores the fluxon in the superconducting loop J1-L2-J2. Otherwise, the incoming pulse is dissipated through the buffer junction J0. Once we read from the DRO cell by sending a pulse to input CLK, the superconducting loop is reset and releases an SFQ pulse at the output Q. Since the loop is reset after every read, the read is destructive for a one-bit DRO cell.
2.4 NDRO Memory Cell
The Non-Destructive ReadOut (NDRO) cell [61] is another important memory cell in SFQ technology. Unlike DRO cells, NDRO cells keep the data after a read operation. An NDRO cell works similarly to a CMOS D flip-flop with reset. Figure 2.2 shows the schematic of a regular NDRO cell. Once it receives a pulse from the input IN, it stores the fluxon in the loop J3-L5-J7-J10. If the J3-L5-J7-J10 loop already has a fluxon stored in it, the incoming IN pulse is dissipated through the junction J2. A pulse arriving at the input RESET causes the fluxon stored in the loop to be dissipated through J7. If there is no fluxon stored in the loop, the RESET pulse is dissipated through J5. The pulse from the input CLK, which is essentially a read operation, triggers a pulse on the output OUT only if the J3-L5-J7-J10 loop has a fluxon, and the stored fluxon stays as is. Thus, the NDRO cell keeps the read operation non-destructive.
Figure 2.2: Schematic of the NDRO cell
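To make the behavioral contrast between the two cells concrete, the following minimal Python sketch (ours, not from the thesis; the class and method names are purely illustrative) models both cells at the pulse level:

```python
class DRO:
    """Destructive readout: a CLK pulse releases the stored fluxon and resets the loop."""
    def __init__(self):
        self.fluxon = False

    def d(self):                  # pulse on input D
        self.fluxon = True        # a second pulse dissipates through J0 (no effect)

    def clk(self):                # read: True means a pulse appears at Q
        out = self.fluxon
        self.fluxon = False       # loop resets, so the read destroys the state
        return out


class NDRO:
    """Non-destructive readout: a CLK pulse samples the loop without resetting it."""
    def __init__(self):
        self.fluxon = False

    def in_(self):                # pulse on input IN
        self.fluxon = True        # a second pulse dissipates through J2

    def reset(self):              # pulse on RESET dissipates the stored fluxon via J7
        self.fluxon = False

    def clk(self):                # read: the stored fluxon stays as is
        return self.fluxon


dro, ndro = DRO(), NDRO()
dro.d()
ndro.in_()
assert dro.clk() and not dro.clk()     # DRO yields its pulse only once
assert ndro.clk() and ndro.clk()       # NDRO survives repeated reads
```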
2.5 Splitters and Mergers
Because a logic gate generates only a single pulse, it is not possible to drive two SFQ gates with one SFQ pulse. Unlike CMOS, where one output can fan out to multiple gate inputs, an SFQ pulse must be explicitly split at every fan-out point. Thus,
to drive two SFQ gates, a splitter [43] is required that reproduces the input pulse. Figure 2.3a shows the
schematic of a splitter. Once the splitter receives a pulse on its input A, it generates an SFQ pulse on each of its two outputs (B and C).
In SFQ logic, a merger [43] makes it possible for two SFQ pulses to drive the same pin. Figure 2.3b shows the schematic of a merger. When two pulses A and B arrive too close in time, there will be only one pulse on the output C. In this case, the earlier one triggers a single pulse at output C, and the later one is dissipated through J3 (for A) or J4 (for B).
Figure 2.3: Schematic of (a) Splitter and (b) Merger Cell
2.6 Path Balancing
As described before, each SFQ logic gate is controlled by a clock. To ensure a correct result, all signals at the inputs of a gate must be synchronized: the pipeline depth should be the same from any primary input to any primary output. We call this synchronizing procedure Path Balancing [42]. The most common way to balance paths is to add DFFs (DRO cells) to the paths with shorter pipeline depths.
Figure 2.4: (a) Path unbalanced SFQ logic (b) Path balanced SFQ logic
Figure 2.4 shows an example to compute Y=A·B·C. While the CMOS design in Figure 2.4a faces no timing challenges, adapting the same design in SFQ will cause failures. Assume A, B, and C all have a logic "1" pulse in the same cycle. In the first cycle, AND1 has both inputs at "1" and will generate a pulse in the next cycle. However, AND2 has one input at "1" (C) and one input at "0" (the A·B result has not yet arrived), so it generates a "0" output. In the second cycle, AND2 again has one input at "1" (the A·B result) and one input at "0" (C's pulse has already passed). As a result, AND2 cannot generate a correct output "1". To solve this issue, a DFF is added to the shorter path to balance the paths, as shown in Figure 2.4b. The input pulse C is delayed by one cycle and arrives at AND2 in the same cycle as the output pulse of AND1. As a result, AND2 generates the correct output "1" in the third cycle.
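The same failure and its fix can be reproduced with a small cycle-by-cycle Python sketch (ours, not from the thesis), in which every clocked AND gate contributes one cycle of latency:

```python
def simulate(balance_c):
    """Toy gate-pipelined model of Y = A*B*C; each clocked AND adds one cycle."""
    A, B, C = [1, 0, 0], [1, 0, 0], [1, 0, 0]   # each input pulses once, in cycle 0
    and1 = [0, 0, 0]                            # AND1 output (A & B), one cycle late
    c_in = [0, 0, 0]                            # C as seen at AND2's second input
    y = [0, 0, 0]                               # AND2 output (final result Y)
    for t in range(2):
        and1[t + 1] = A[t] & B[t]
        c_in[t + 1 if balance_c else t] = C[t]  # the DFF delays C by one cycle
    for t in range(2):
        y[t + 1] = and1[t] & c_in[t]            # AND2 is itself one cycle late
    return y

print(simulate(False))  # [0, 0, 0]: the C pulse and the A*B pulse never align
print(simulate(True))   # [0, 0, 1]: the correct "1" appears in the third cycle
```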
Chapter 3
Multi Fluxon Storage and Its Implications for Microprocessor Design
3.1 Introduction
One of the impediments to the broader adoption of SFQ designs in microprocessors is the on-chip memory
capacity. SFQ memories are essentially built as flip-flop-like designs, and hence, memory density is quite
low compared to SRAM. In current-generation microprocessors, on-chip memory has a wide range of uses
that go beyond cache storage. Large physical register files, branch predictor history tables, and prefetching
buffers are performance-critical structures that rely on large memory availability.
SFQ currently provides two different reliable memory designs: a Destructive ReadOut (DRO) cell and a
Non-Destructive ReadOut (NDRO) cell. Since SFQ cells are based on magnetic flux pulses, once a memory
cell is read, by default, the pulse is released into the following block, and the memory cell is reset. Additional effort must be expended to restore the pulse once it is read. As such, NDRO cells are usually 3X
larger than DRO cells.
Each DRO or NDRO cell in current designs stores a single pulse. The presence or absence of a pulse is
treated as single-bit binary information. In this work, we propose a mechanism to double the density of
the DRO memory cell by allowing each cell to store up to three SFQ pulses without increasing the footprint
of the DRO cell. This design requires a new readout circuit to decode the two-bit value stored in the SFQ
cell. Similarly, a new write circuit design is needed to encode the two-bit value in the cell.
We first present the dual-bit storage architecture, along with the read circuit and write circuit design. We call this architecture HC-DRO (high capacity DRO). As a demonstration of how HC-DRO can be incorporated in current microprocessor designs, we design a 2-bit branch predictor that uses a single column of HC-DRO cells to store the branch history. We then design a register file architecture that uses the HC-DRO cells to store the content. The goal of these two demonstrations is to show how HC-DRO can be integrated into a traditional pipelined microprocessor design. Hence, the emphasis is on architecture integration rather than claiming the obvious performance advantages of higher-density memory. Later, we present innovative microarchitecture structures such as register files and branch predictors in Chapters 4 and 5.
The primary contributions of this chapter are as follows:
• We present the HC-DRO cell design and show how the cell data can be encoded and decoded using
a flow-clocking scheme [23].
• We present a register file design that uses HC-DRO cells. We show how data must be encoded into
2-bit chunks for storage in HC-DRO cells. We also show the decoding process to read the compact
HC-DRO representation into a wider data format.
• We present a 2-bit branch predictor that uses HC-DRO cells for storing the branch history. We
present the read operation that decodes the 2-bit counter value stored in a single HC-DRO cell. We
also present the encoding process that updates the counter value.
• Both the branch prediction counters and register files must preserve their content after a read operation. Since our design uses HC-DRO cells that destroy the content after each read, we present
a unique approach to restore the value after each read. We use a single NDRO cell that is shared
across an entire column of HC-DRO cells to enable a low-cost approach to preserve the HC-DRO
cell contents.
3.2 Context of Microprocessor Design
In this section, we provide a brief overview of the architecture of the 5-stage microprocessor, which is our
initial target to integrate the HC-DRO design. We use the notion of a pipeline stage as it is used in CMOS
designs for simplicity of explanation. In SFQ designs, each such pipeline stage is, in turn, pipelined at a
gate level. Since SFQ-based microprocessor designs are in their infancy, we decided to integrate our design
within a simpler in-order 5-stage pipeline that uses a 2-bit branch predictor [64]. Our design is illustrated
in Figure 3.1.
The first stage is the fetch stage, which reads an instruction using the program counter (PC) register as
the address pointer. Modern microprocessors access a branch predictor using a hash of the PC register to
load the next PC speculatively. Note that at the time of fetch, it is not even known whether an instruction is
a branch instruction or otherwise. The determination of the instruction type is made only after decoding.
Note that even in-order microprocessors require a branch predictor to fetch and decode from the predicted
path. Hence, waiting until the decode to access the branch predictor is impractical since even CMOS-based
microprocessors may require a few cycles before an instruction is decoded, and SFQ-based designs may
require even more cycles to reach the decode stage due to gate-level pipelining.
The 2-bit branch predictor design consists of 2 bits of storage per entry. The 2-bit storage acts as a
saturating up-down counter. The branch prediction entry is incremented if the branch is taken; otherwise, the entry is decremented. In our processor design, every cycle, the branch predictor and a branch target buffer are accessed concurrently using the current PC. If the branch is predicted not taken, the PC
is incremented by 4 (all our instructions are 4 bytes wide). If the branch is predicted taken, the next PC is fetched from the branch target buffer.
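A rough functional sketch of this fetch-stage flow (ours, not the thesis's hardware; `counters` and `btb` are hypothetical stand-ins for the 2-bit counter table and the branch target buffer):

```python
def next_pc(pc, counters, btb, table_bits=10):
    """Speculative next-PC selection using a 2-bit counter table and a BTB."""
    index = (pc >> 2) & ((1 << table_bits) - 1)   # simple hash of the current PC
    predict_taken = counters[index] >= 2          # counter states 2 and 3 mean taken
    if predict_taken and pc in btb:
        return btb[pc]                            # next PC filled from the BTB
    return pc + 4                                 # not taken: instructions are 4 bytes
```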
Instructions from the predicted PC are speculatively fetched and decoded, and sometimes, the register
operands may be speculatively read. But no execution is permitted until the predicted path is resolved.
The branch predictor is read once for every instruction fetched, and the counter is updated when a branch instruction is resolved. The counter value must be preserved after a read so that it may be incremented or decremented later.

Figure 3.1: A simple representation of a traditional five-stage pipelined CPU. Components modified for HC-DRO usage are shown in 3D boxes
Once an instruction completes the decode, it may access the register file to read the data during the
execution stage. Any load/store operations access memory in the memory stage. All other instructions write the computed results back to the register file. The register read process must preserve the data after it is read, since the register file is based on HC-DRO (not NDRO) cells.
3.3 DRO Cell With Multi Fluxon Storage
We now describe our HC-DRO design that stores multiple fluxons in a single superconducting loop.
3.3.1 Design and Functionality
The Destructive ReadOut (DRO) cell [43] of single flux quantum (SFQ) technology is one of the most important building blocks for superconducting circuits and can be used as a memory element for storing SFQ pulses. It is also used as a buffer cell for synchronizing the signals [36] in a circuit. The DRO cell is also sometimes known as an RS-flipflop or D-flipflop. Figure 3.2a shows the schematic of a regular DRO cell that receives an SFQ pulse at input D and stores it in the superconducting loop J1-L2-J2 if it does not already have a fluxon (SFQ pulse) stored in it. If the J1-L2-J2 loop already has a fluxon stored in it, the incoming pulse is dissipated through the buffer junction J0. The stored fluxon is read by input CLK, which resets the superconducting loop, subsequently resulting in an SFQ pulse at the output Q. Each cell read is destructive since the loop is reset after each read. In the current design, each cell stores at most a single fluxon, and hence, each cell acts as single-bit storage.
Figure 3.2: (a) Schematic of a DRO cell. Parameters for (i) C2DRO: L1 ∼ 6 pH, L2 ∼ 20 pH, L3 ∼ 4 pH, J1 ∼ 115 µA, J2 ∼ 111 µA, J3 ∼ 80 µA; (ii) C3DRO: L1 ∼ 6 pH, L2 ∼ 28 pH, L3 ∼ 4 pH, J1 ∼ 115 µA, J2 ∼ 85 µA, J3 ∼ 80 µA. J0 is removed for HC-DRO cells. (b) HC-Write cell: encoder circuit to convert binary format bits [B1 B0] into a stream of fluxons. Circled S and circled M represent splitter and merger cells, respectively. (c) HC-CLK cell: circuit producing three SFQ pulses by taking a single SFQ pulse. (d) HC-Read cell: SFQ binary counter using T1-flipflops to convert SFQ pulses into binary format [B1 B0]. (e) JSIM simulation result of the C3DRO cell
The primary innovation of our work is to enable the superconducting loop to store more than a single fluxon; we call such cells high-capacity DRO (HC-DRO) cells. An HC-DRO cell has the same circuit schematic as a DRO cell, but the buffer junction J0 is removed. The area consumed by an HC-DRO cell is at most equal to that of a regular DRO cell, yet it can store more than one fluxon. In the rest of the chapter, an HC-DRO cell that can store two fluxons or three fluxons will be called C2DRO or C3DRO, respectively.
The required modifications to go from a regular DRO to a high-capacity DRO are: (i) increase the inductance of L2 and modify the critical currents of J1 and J2 appropriately; (ii) remove J0, as it prevents the cell from storing more than one pulse. The modifications in terms of the device parameters are given in the description of Figure 3.2. As expected, the drawbacks of these designs are some reductions in margin and frequency of operation. However, we contend that the reduced frequency of operation is not a major impediment since the operational frequency is not the limiting factor in the design.
Two challenges must be dealt with for these HC-DRO cell designs. First, the write operation must
encode 2 bits of information at once into a single cell by generating multiple fluxons. Second, on a read
operation, multiple fluxons must be read out and transformed into an equivalent 2-bit value for processing.
HC-Write operation: The C2DRO and C3DRO cells can have memory states 0,1,2 and 0,1,2,3, respectively, representing the number of fluxons they can store. For a write operation, the desired number of fluxons (in the form of SFQ pulses) needs to be applied at input D, which will get stored in the J1-L2-J2 loop. However, writing more fluxons than the HC-DRO capacity pushes out the extra fluxon, resulting in an SFQ pulse at output Q. For interfacing a C3DRO cell with a regular circuit that uses the binary format to represent information, the circuit in Figure 3.2b must first be used to convert the binary-format data. Let us assume that an ALU computes an 8-bit value that is going to be written to a register. The 8-bit value must first be split into groups of two bits (starting from the LSB), and each two-bit group must be stored in a single HC-DRO cell. Each two-bit value is used to generate a sequence of SFQ pulses whose count is equal to the number represented by the two bits. The HC-Write encoder circuit splits the MSB of the two bits into two pulses and merges them along with the LSB pulse, using different delay values to make sure that the pulses do not overlap. The splitter logic is shown as S, and the merger logic is shown as M in the figure.
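As a small software analogy for this encoding step (our sketch, not the thesis's circuit), the function below slices a word into 2-bit groups starting from the LSB; each group's value is the number of SFQ pulses the HC-Write circuit would emit toward one C3DRO cell:

```python
def hc_write_encode(value, width=8):
    """Split a binary word into 2-bit groups (LSB group first); each group's
    value is the number of fluxons (0-3) to store in one C3DRO cell."""
    assert 0 <= value < (1 << width) and width % 2 == 0
    return [(value >> i) & 0b11 for i in range(0, width, 2)]

print(hc_write_encode(0b10110110))  # [2, 1, 3, 2]: four cells, LSB group first
```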
HC-Read operation: For reading high-capacity DRO cells without adding extra JJs and inductances to the regular DRO structures, every read operation requires applying multiple clock pulses at the input CLK to extract the stored fluxons. C2DRO and C3DRO cells require two and three pulses, respectively, to be applied at input CLK for a single read operation. The output is a series of SFQ pulses indicating the number of stored fluxons. As such, the read operation essentially translates into generating multiple read clock signals successively to extract the fluxons stored in a loop, which is done by the HC-CLK circuit shown in Figure 3.2c. The total number of fluxons extracted is counted to generate the 2-bit decoded value, which can be produced with the traditional 2-bit binary counter [48] of the HC-Read cell shown in Figure 3.2d.
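The read side inverts the encoding: three CLK pulses per C3DRO cell release the stored fluxons, and a 2-bit counter tallies them. A matching sketch (again ours), reusing `hc_write_encode` from the previous example for a round-trip check:

```python
def hc_read_decode(fluxon_counts):
    """Reassemble a binary word from per-cell fluxon counts (LSB group first).
    In hardware, each count is produced by the 2-bit binary counter of the
    HC-Read cell tallying the pulses released by three successive CLK pulses."""
    value = 0
    for i, count in enumerate(fluxon_counts):
        assert 0 <= count <= 3             # a C3DRO cell holds at most three fluxons
        value |= count << (2 * i)
    return value

assert hc_read_decode(hc_write_encode(0b10110110)) == 0b10110110  # round trip
```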
3.3.2 Simulation Results
We implemented the HC-DRO cell along with the associated HC-Write and HC-Read blocks. Figure 3.2e
shows the JSIM simulation waveforms of input pin D, clock pin CLK, output pin Q, and the current in
inductance L2, showing evidence of stored fluxons. Clocking of the cell is shown after storing 0, 1, 2, and
3 fluxons with three pulses used for reading. Note that the current flowing in inductance L2 increases as
the number of input fluxons increases. For every reading event, the number of fluxons stored in the cell is
read and results in SFQ pulses at output Q. When more than 3 pulses are given as input, it is also shown
that the extra fluxon is pushed out of the cell, which results in an SFQ pulse at the output. The C3DRO
cell functions as usual after pushing out the extra pulse since we can still see there are three SFQ pulses at
output Q after the reading operation.
3.3.3 Related Work
Following the introduction of our HC-DRO design, recent studies have extended its core multi-fluxon
storage concept to diverse superconducting circuit architectures.
Jardine and Fourie [28] proposed a superconducting neuron design integrating Rapid Single Flux Quantum (RSFQ) and Quantum Flux Parametron (QFP) technologies. Their architecture introduced a bidirectional multi-fluxon storage circuit for synaptic weight storage, capable of retaining four fluxons. This
design enables eight distinct input flux levels to the QFP, equivalent to approximately 3-bit weighting resolution. While their data encoding scheme differs from HC-DRO’s methodology, the shared foundation in
multi-fluxon manipulation directly aligns with our framework’s core innovation.
Ucpinar et al. [78] further advanced this paradigm through their multiflux non-destructive readout
(M-NDRO) cell. Inspired by HC-DRO’s multi-fluxon architecture, their design utilizes three stored fluxons
to represent 2-bit information while enabling repeated readout operations without flux loss—a critical
enhancement for scalable memory applications. Notably, the team has progressed to fabrication-stage
validation, demonstrating the practical viability of HC-DRO-derived multi-fluxon systems.
These efforts collectively highlight HC-DRO’s role as a foundational platform for superconducting
circuit innovation, particularly in multi-bit fluxon storage and non-destructive memory applications.
3.4 Register File Design
In this section, we describe how HC-DRO cells form the basic building blocks of a higher-density register
file in an SFQ microprocessor. Though there have been several attempts at demonstrating high-frequency SFQ microprocessors [2, 8, 59], the demonstrations could not compete with their CMOS counterparts due to the unique challenges faced by superconducting electronics [35]. One of those challenges is building compact register files and cache memories. Many of the current SFQ-specific register file (RF) designs [70, 37] are bit-serial designs. These designs chain several DRO cells into a shift register (typically 32 DRO cells to represent a 32-bit shift register), and the bits are written to/read from this chain. Such a design leads to a significant increase in access latency. More critically, these designs are not compatible with a register file designed with our innovative HC-DRO cell: two bits must be stored in a single HC-DRO cell, and bit-serial operation is not well suited for such compact storage.
In this section, we present an RF design without bulky cells and with simple bit-parallel read and bit-parallel write operations. This design forms the foundation for designing a register file that can exploit the HC-DRO cells. For the description, we chose an 8×8-bit RF; the design can be extended to any number of rows, with each row containing any number of bits.
3.4.1 Single Memory Cell for the Register File
Figure 3.3: A simple parallel access register file (a) logical design with inputs and outputs (b) JJ-efficient
parallel access register file using SFQ-specific DAND gate
Figure 3.3 presents one simple memory cell design for a parallel register file. The memory cell in the figure is, in our context, an HC-DRO cell that can store two bits. To write data into a desired row of the RF, an AND gate is added to the memory cell, with its inputs being the address decoder output (AD) and the input bits to be written (WB). The AND gate is clocked by the Write CLK (WC) signal, which makes sure that the data is written to the desired decoded row when a write signal is given.
Figure 3.4: Implementation of a row in the RF: WBi's represent the bits to be written; WCi's are write channels; RCi's represent the read channels; RBi's represent the read bits; RADi and WADi represent the read and write address decoder outputs of the i-th row, respectively. Circled S and circled M represent splitter and merger cells, respectively.
We present a more area-efficient parallel register file design that reduces the JJ count. The need for Write CLK (WC) distribution can be eliminated by using the Dynamic AND (DAND) gate instead of the clocked AND gate in Figure 3.3a. The design and implementation of the DAND gate can be found in [57]. A practical implementation of a single memory cell in the RF can be seen in Figure 3.3b, with the Write address decoder output (WAD) used as one of the inputs to the DAND gate and the Read address decoder output (RAD) used as the read clock.
Using the newly designed parallel multi-bit access mechanism, we built a microprocessor register file
using HC-DRO cells. We now describe how the register file is accessed (write and read operations).
3.4.2 Register File With HC-DRO Cells
The design of the 8-bit register using the RF cell from Figure 3.3b is shown in Figure 3.4. This design has one read port and one write port, and read and write operations can proceed in parallel. One of the important blocks for any memory structure is a decoder that decodes the given address to access the associated memory cell(s). For the RF design, any decoder structure [21, 79] with a minimal JJ count can be used. Below, the register file design is described in terms of its read and write functionality.
3.4.2.1 Register File Write Operation
WB7 to WB0 in Figure 3.4 represent the 8 bits to be written into the register file, which are sent through HC-WRITE cells to convert them into streams of pulses. These converted pulses are placed on the write channels, WC3 to WC0. We need only half as many write channels as data bits since each pair of binary bits is converted into one stream of pulses. The write channels supply the information to be stored in all the rows of the RF. A write decoder output goes to each of the rows, and only one of the WADs is activated at a time based on the address loaded into the decoder, which selects a single row of the RF. Because of the DAND operation, as shown in Figure 3.3b, the bits placed on the write channels are written into the activated row of memory cells, while the other rows lose the information after the hold time of the DAND gate.
3.4.2.2 Register File Read Operation
Similar to the WAD, one of the read address decoder outputs (RAD) is split and given to each cell in the
attached row based on the address loaded into the read decoder. Since we need to read the data in the
C3DRO cell, three pulses will be generated by using a circuit similar to the one shown in Figure 3.2c and
be given to the select input of the decoder, which will result in three pulses at the activated RAD instead of
one. The activated RAD will clock all the cells in a row, and their output will be placed on the read channels
(RC3 to RC0), from which the read bits are decoded using the HC-READ circuit shown in Figure 3.2d. One HC-READ circuit per channel converts the serial pulses into binary format, and the result is taken as output to feed other parts of the microprocessor.
It is important to note that only a single HC-WRITE and HC-READ circuit is necessary for a column of RF cells. For instance, in a 32 × 8-bit RF, only four HC-WRITE and four HC-READ circuits are needed across all 32 registers. As such, the costs of the HC-WRITE and HC-READ circuits are amortized across the entire register file.
The comparison of RF structures using DRO cells, NDRO cells, and C3DRO cells is shown in Table 3.1.
In Table 3.2, we demonstrate the area benefits of using HC-DRO cells in a larger 8×8-bit register file and
a 32×32-bit register file. The JJ count of an 8×8-bit register file is about 56% of the traditional DRO RF.
Also, note that as the size of the register file grows, the overheads of HC-READ and HC-WRITE circuits
are amortized. Hence, for a 32×32-bit register file, the HC-DRO design uses less than 50% of the JJs of the DRO register file design.
Memory Cell               DRO              NDRO                      C3DRO
JJ count                  4                9                         3
Storage capability        1 fluxon         1 fluxon                  3 fluxons
Can read more than once   No               Yes                       No
No. of pulses to read     One read pulse   One read pulse            Three read pulses
Reset required            No               Yes                       No
Additional circuitry      Baseline         Reset pulse               One HC-CLK cell, one
                                           distribution to           HC-WRITE per write
                                           all cells                 channel, and one HC-READ
                                                                     cell per read channel

Table 3.1: Comparison among different memory cells in the context of register file design
3.5 Branch Predictor Design
Any microprocessor design has an instruction set specific to it, and one of the important instruction classes is the branch in its various forms. Branch prediction is an important technique in computer architecture to improve the effective latency of instructions in a microprocessor [25]. A two-bit branch predictor is one of the most successful early branch predictors used to predict whether a branch instruction is taken or not taken. The two-bit branch predictor can be represented as a finite-state machine (FSM) with
four states, as shown in Figure 3.5a. A two-bit counter is used to represent the different states, as shown in the figure. In general, whenever a branch is not taken, the counter is decremented, and whenever the branch is taken, the counter is incremented; these counters are saturating and hence neither overflow nor underflow. To predict a branch instruction at a specific address, if the two-bit counter value corresponding to a hash of that address is 0 or 1, the prediction is not taken; if it is 2 or 3, the prediction is taken. In the section below, the design of a two-bit branch predictor circuit block using C3DRO cells is explained.

Register File size &         8×8 RF                              32×32 RF
Memory Cell Type             DRO        NDRO       C3DRO         DRO          NDRO         C3DRO
1-bit RF cells (Memory       576        896        256           9216         14336        4096
cells + Dynamic ANDs)        (256+320)  (576+320)  (96+160)      (4096+5120)  (9216+5120)  (1536+2560)
RAD distribution             168        168        72            2976         2976         1440
WAD distribution             168        168        72            2976         2976         1440
Write channel                168        168        72            2976         2976         1440
Read channel                 280        280        150           4960         4960         2528
Reset distribution           -          189        -             -            3069         -
Peripheral circuitry         -          -          140           -            -            512
Total JJ count               1360       1869       762           23104        31293        11456

Table 3.2: Estimate of Josephson junction count of different components for 8×8-bit and 32×32-bit sizes of register file
Figure 3.5: (a) FSM representation of a two-bit branch predictor (b) Two-bit branch predictor system design with C3DRO cells
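For reference, the FSM in Figure 3.5a is the classic two-bit saturating counter; a minimal C++ sketch of its predict/update behavior (names are illustrative, not from our implementation) is shown below.

```cpp
#include <cstdint>

// Classic two-bit saturating counter: states 0-3, predict taken in 2/3.
struct TwoBitCounter {
    std::uint8_t state = 0;  // 0 = strongly not taken ... 3 = strongly taken

    bool predictTaken() const { return state >= 2; }

    void update(bool taken) {
        if (taken  && state < 3) ++state;  // saturating increment
        if (!taken && state > 0) --state;  // saturating decrement
    }
};
```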
3.5.1 Branch Predictor Circuit Block
The design of a two-bit branch predictor circuit system using the four states of C3DRO cells for branch instructions is shown in Figure 3.5b. In this figure, we assume the branch predictor has 8 entries, and a 3-bit hash of the PC register accesses each entry. Because of the pulse-based logic, note that a 3-to-8 decoder circuit can also be used as a 1-to-8 demultiplexer (1-to-8 DEMUX) circuit with three address bits loaded into it. In general, the branch predictor system will also have address mapping similar to cache mapping [52], and the larger the number of branch predictor cells, the better the prediction accuracy. The branch predictor circuit system design is explained below in terms of its functionality; specifically, how to implement the FSM presented in Figure 3.5a, with the different numbers of fluxons stored in a C3DRO cell representing the different states of the FSM.
3.5.1.1 Reading the Branch Prediction Value
The reading mechanism will be explained using the 8-BP cell system shown in Figure 3.5b. To read a cell
in the block, the corresponding 3-bit hashed PC value will be given to the ADD port, which will go to
both the DEMUX cells in the circuit, and then read pulses are given to READ port. The read pulses will be
split and given to the SET pin of the NDRO and the READ pin of the corresponding C3DRO. The NDRO
will be set, and if the C3DRO has at least one pulse stored in it, there will be output pulse(s), which are split between the READ pin of the NDRO and the OUT port. Since the
NDRO is set, the C3DRO cell will be restored to the earlier value through the DEMUX cell, as the OUT pin
of the NDRO gives the same number of pulses as it receives at the READ pin. If there is no pulse stored
in the corresponding C3DRO cell, there will be no OUT pulse, and the NDRO will not be read to restore
the C3DRO cell. The branch prediction value is always read with two SFQ pulses for the following reasons: (1) fully reading the C3DRO cell requires three pulses, so one pulse is not enough; (2) three pulses are not required either, because the prediction decision is taken when the cell is in state 2 or 3. Knowing that the cell has reached state 2 is enough to predict taken, and there is no need to check further whether the cell is in state 3. If the cell state turns out to be 0 or 1 after reading with two pulses, the prediction is not taken.
3.5.1.2 Incrementing the Prediction State
Whether the prediction is taken or not taken, if the branch should have been taken, the state of the cell needs to be incremented. Note that the maximum state is 3, and it cannot be incremented further. Once the branch decision turns out to be taken, a pulse is sent to the TAKEN port, where it is split and sent to the RESET pin of the NDRO (resetting it) and to the SET pin of the selected C3DRO cell. If the C3DRO's state is 2 or smaller, its state is incremented by one. If the C3DRO cell is in state 3, a pulse is pushed out through its OUT pin, which is sent to the READ pin of the now-reset NDRO and to the OUT port, where it is ignored since this is an incrementing operation. Because the NDRO is reset by the pulse from TAKEN, no unwanted pulse is generated that could create an infinite loop between a state-3 C3DRO cell and a set NDRO.
3.5.1.3 Decrementing the Prediction State
If the actual branch outcome turns out to be not taken, regardless of the prediction, the state of the corresponding cell needs to be decremented. Note that the smallest state of the cell is 0, and it cannot be decremented further. Once the branch decision turns out to be not taken, a pulse is sent to the NOT_TAKEN port. It is split and sent to the RESET pin of the NDRO, which resets it, and to the READ pin of the corresponding C3DRO cell, which decrements its state by one if at least one fluxon is stored in it. Any pulse released by the C3DRO is split between the OUT port and the READ pin of the NDRO and is ignored, as described in Section 3.5.1.2.
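Taken together, Sections 3.5.1.1 to 3.5.1.3 implement the FSM with the fluxon count of the selected C3DRO cell as the state; the following behavioral C++ sketch (the NDRO restore loop and DEMUX are abstracted away, and all names are hypothetical) summarizes the intended behavior.

```cpp
// Behavioral model of one C3DRO branch-predictor cell: the FSM state is
// the number of fluxons (0..3) stored in the cell.
struct C3droBpCell {
    int fluxons = 0;

    // Read with two pulses: prediction is taken iff at least two fluxons
    // are stored; the NDRO/DEMUX loop restores the drained fluxons, so
    // the state is unchanged by the read.
    bool readPrediction() const { return fluxons >= 2; }

    // TAKEN: a SET pulse adds a fluxon; at state 3 the overflow pulse is
    // pushed out and ignored, so the state saturates at 3.
    void taken() { if (fluxons < 3) ++fluxons; }

    // NOT_TAKEN: a READ pulse releases one fluxon (the released pulse is
    // ignored at OUT and at the reset NDRO), decrementing the state; at
    // state 0 nothing is released.
    void notTaken() { if (fluxons > 0) --fluxons; }
};
```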
Figure 3.6: Verilog simulation result of the branch predictor system: (a) Incrementing the state and reading (b) Decrementing the state and reading
3.5.2 Results
3.5.2.1 Verilog Simulation Results
A Verilog simulation netlist is built for the branch predictor system shown in Figure 3.5b. All the functions of the system are verified in simulation with different combinations of inputs. Figure 3.6 shows the results of this functional verification. The timing of the circuit is not discussed here; the simulation serves only as functional verification, since the focus is on the design with HC-DRO cells.
The functionality of the branch predictor block shown in Figure 3.6a is described here: (1) Bits 101 are loaded as the address bits (ADD), which selects the C3DRO at position 101 of the DEMUX; (2) the branch instruction mapped to address 101 is taken four times consecutively, so the TAKEN signal is given four times, and since the capacity is only 3 pulses, the fourth pulse is pushed out, showing a pulse on OUT; (3) note that both DEMUX cells are still selected at 101. To read the cell, two pulses are given at the READ port, which produces two pulses at the OUT port; (4) these two pulses are written back (restored) using the NDRO and DEMUX, which is evidenced by reading the same cell again, which again produces two pulses at the OUT port.

Figure 3.7: Relative execution time of different branch predictors compared to not having a branch predictor
Figure 3.6b shows the result of decrementing the state of the C3DRO cell at address 101 and reading the
state of the cell. It is a continuation of Figure 3.6a in time, and the cell is in state 3, holding three fluxons.
It can be seen that two pulses are given at the port NOT_TAKEN, and each pulse releases a corresponding
pulse at the OUT port. After releasing the two pulses, the cell state is read by sending two pulses to the
READ port, and it produces only one pulse at the OUT port, verifying the functionality of decrementing.
Later, to empty the remaining fluxon in the cell, one more pulse is given at the NOT_TAKEN port, and
then the cell state is read, which produces no pulse at the OUT port.
3.5.2.2 Architecture Simulation Results
A full-core simulator is written in C++, based on a traditional five-stage pipeline CPU and the RISC-V ISA simulator Spike [68]. To show the advantage of a two-bit branch predictor, two branch prediction configurations are compared: (1) no branch predictor, where all instructions after a branch instruction are stalled until the branch instruction finishes executing; (2) a 2-bit branch predictor, where 128 2-bit branch predictors are used, indexed by a 5-bit PC hash.
Figure 3.7 shows the relative performance of the different branch predictors while executing different benchmarks. All benchmarks are provided by the RISC-V test GitHub repository [54]. The relative performance is computed by dividing the total execution cycles with each branch predictor by the total execution cycles without a branch predictor. The left bar shows the baseline of this comparison. The right bar shows that the 2-bit branch predictor built with HC-DRO cells achieves a 4.89% to 28.87% speedup compared to not having any branch predictor.
3.6 Conclusion
Building a dense and reliable memory structure is a challenge in single flux quantum technology. In this
context, a destructive readout cell with a storage capacity of two fluxons and three fluxons (HC-DRO) is
built, and the nominal parameter values are presented here. The advantages of having the multiple fluxon
storage capability are presented by the design of register files and the design of the branch predictor circuit
in the context of a 5-stage pipeline microprocessor. For the register file, all the peripheral circuits required
for using HC-DROs as basic memory cells are also designed. The basic read and write operations are
explained with an efficient implementation in Josephson junction count. The JJ count is reduced by 50%
with the use of HC-DROs for a 32×32-bit register file. A branch prediction circuit system is built using the HC-DROs, which resides in the Instruction Fetch (IF) stage of a 5-stage pipeline. It supports all the prediction and update operations needed by a microprocessor in a modern computer architecture. The Verilog simulation results of the branch prediction system verify our design and its functionality.
Chapter 4
HiPerRF: A Dual-Bit Dense Storage SFQ Register File
4.1 Introduction
Although we briefly showcased that HC-DRO cells can be used in the register file design in Chapter 3, there
are some challenges that we must tackle in order to build a CPU register file successfully. The inability to
retain data after a single read makes HC-DRO (or any DRO) cells challenging for register files. The use
of NDRO cells is quite expensive in terms of JJ counts (7X more JJs are needed for 2-bit NDRO compared
to a single HC-DRO cell). Given that the size of the physical register file has a significant performance
impact [16], we propose a solution that relies on HC-DRO cells for high density while at the same time
supporting the need for reading each register multiple times. Our design is based on the intuition that
only a few registers are actively read in any given time window. Hence, one needs to preserve the non-destructive read property only for those active registers. As such, we augment a large HC-DRO register file with a small victim NDRO buffer that helps the readout data recycle back to the original register in a lazy manner outside of the critical path, thereby providing the non-destructive read property. Prior work [22] has used a rotating shift register where the value of each bit is pushed back from the tail of the shift register to the head. However, that work is centered on designing a rotating shift register and does not address the architectural challenges of incorporating destructive readout cells into an architecturally feasible register file design.
Given that the architectural challenges of building a register file in a CPU pipeline are substantial, we
design HiPerRF to tackle the numerous challenges. This chapter will discuss these challenges and propose
solutions that allow SFQ-based CPU designs to exploit HC-DRO cells for register files.
The primary contributions of this chapter are as follows:
• We present the design of HiPerRF, which is a register file built with HC-DRO cells. We present the
design enhancements to read and write multiple SFQ pulses without perturbing the rest of the SFQ
CPU pipeline. Since HC-DRO cells lose data after each read, we present an approach to restore the
value after each read. We use a set of NDRO cells shared across an entire column of DRO cells to
enable a low-cost approach to preserve the HC-DRO cell contents.
• We present a dual-banked design of HiPerRF. The dual-banked HiPerRF, accompanied by a static scheduling algorithm, reduces port contention and increases performance.
• We implemented the design using detailed cell-level libraries and a hybrid pipeline/gate-level simulation to evaluate the area and power impacts. Given that JJ counts are the primary design limiter in SFQ CPUs, we quantified the JJ reduction benefits of HiPerRF when integrated into an in-order RISC-V CPU. Our results also show that when considering the whole CPU, the HiPerRF design provides a 16.3% reduction in the JJ count.
4.2 Clock-Less NDRO Register File
This section describes our NDRO-based register file design, which acts as a strong baseline. We first
discuss how we design our innovative clock-less read, write, and reset ports of the NDRO baseline register
file using SFQ logic gates. The goal here is to demonstrate how to eliminate the need for clock distribution
in SFQ register file design, setting aside the memory density improvement of HC-DRO cells. In the next
section, we show how the clock-less NDRO register file is enhanced with the dense HC-DRO cells to create
the HiPerRF design.
Figure 4.1 shows the design, with one read port and one write port. Each rectangular box
with an IN and OUT is one register entry. The figure shows a set of such register entries that together form
the register file. At the output end, an explicit merger gate (marked as M) merges the output. Since there
is a single read port, a single register is read enabled, which produces an output, while all other registers
do not place any pulses on the output. The output from the single register whose read is enabled will then
be sent as R_DATA. The figure shows three blocks: read and write ports that are similar to CMOS design
and an SFQ-specific reset port.
Figure 4.1: NDRO register file design
4.2.1 Read Port Design
A register read request is transformed into a read enable signal for the corresponding register. This transformation may be done using a demultiplexer (DEMUX) to decode the read address. CMOS designs may use
combinational gates to achieve DEMUX functionality, as shown in Figure 4.3a. However, in SFQ designs,
implementing such a combinational design is prohibitively expensive.
Figure 4.2 shows the schematic of an AND gate for a combinational design. It costs 12 JJs. To build the DEMUX, we have to split the input signal (IN) and the select signal (SEL) across the two AND gates. The NOT gate costs 10 JJs and also needs a clock signal. As such, in SFQ technology, the size of the logic gates, the additional clock signals, and the required mergers and splitters make the combinational DEMUX design very large. A 1-to-2 combinational DEMUX needs a total of about 50 JJs.
Figure 4.2: Schematic of an AND Gate
Rather than a combinational design, we built a demultiplexer using a Non-Destructive ReadOut cell
with Complementary output (NDROC), which was proposed in prior work [70, 2]. Figure 4.3b shows the
NDROC block diagram. The select signal (SEL) is connected to the SET input of the NDRO cell. If the SEL
signal receives a clock pulse (a value of 1), then the NDROC cell’s SET pin is activated. When a pulse to
the clock pin (CLK) is provided, it then outputs that pulse on the Q0 output (OUT0), and the complement
is sent to the Q1 output. Thus, the NDROC can be repurposed to act as a 1-to-2 DEMUX. Note that in the
above description, the SEL signal is driven by the source register number. The source register could be an
architected register encoded in the instruction if no renaming is done. Otherwise, the source register is
the physical register number.
The NDROC-based 1-to-2 DEMUX costs only 33 JJs [82], which is about 60% of the combinational
design based on AND gates. However, for this design to work correctly, the RESET signal needs to be
asserted after each DEMUX operation. Recall that the NDROC cell preserves any prior "1" deposited in it by a select signal. We need to clear that "1" if we want to write a "0" for the next selection.
Figure 4.3: (a) DEMUX built with combinational logic (b) DEMUX built with NDROC (c) 1-to-4 DEMUX with NDROC
Since most practical ISA designs have many registers, it is necessary to build a 1-to-n DEMUX using
NDROCs. The proposed hierarchical tree structure DEMUX is shown in Figure 4.3c. The SEL[1] signal is
connected to the first NDROC cell to either activate the top bank or bottom bank of the register file. The
SEL[0] signal selects one of the two registers in either the top or the bottom bank in the second step. The
SEL[0] signal must be split to drive the two NDROCs. The outputs of the first-level NDROC are connected
to the CLK pin of the second-level NDROC.
To activate a register in a read operation, we connect the register number’s select bits to the appropriate
level of the NDROC tree. Then, a single clock pulse is provided as the read enable pulse. The read enable
pulse then traverses through the NDROC tree and will trigger the pulse on the OUT port corresponding
to the register number.
To design a read port using the above design, we connect the read enable (REN) signal (generated from
a decoder stage) to the starting input (IN) of the DEMUX and the read address (R_ADDR) (register number)
to the SEL pins. Each of the output pins (OUT0∼OUTN) is connected to the corresponding NDRO register
entry. Since each NDRO register entry may have a 32/64-bit value, the corresponding OUT pin must be
split 32/64 times in a tree hierarchy again to read the register’s entire width. These splitters are omitted in
Figure 4.1 for simplicity.
In this design, no clock distribution is needed since the read enable and read address signals provided
as inputs act as triggers that move the fluxons through various gates to trigger the appropriate register to
be read eventually. Using this design, we eliminate the need for clock distribution in the read port design.
In fact, for the entire register file design, we use the enable signal as the trigger without the need for an
explicit clock, thereby making our design scalable and robust to clock skew.
Finally, the output port merger block connects the output of each NDRO register to the output port of
the register file.
4.2.2 Reset Port Design
In a CMOS design, it is possible to overwrite existing memory cell content with a new value. However, in an SFQ memory cell that holds an existing SFQ pulse (equivalent to storing a "1"), it is not possible to replace it with a "0": writing no pulse does not remove the existing SFQ pulse from the cell. Hence, every write operation must first reset all the bits in a register entry to "0" before the new write can be performed.
Every write operation to an NDRO register must be preceded by a reset so that the existing pulses in
each memory cell are dissipated. Thus, a new reset port is necessary to perform this function. Since the
reset port is used prior to writing the data, the reset port uses the destination register’s address (W_ADDR)
to access the register. It also uses a special RESET_ENABLE signal that is sent as input to the DEMUX of
the reset port. The DEMUX circuit moves the RESET_ENABLE pulse through the gates in the port to
eventually select the destination register. The reset pulse reaches the selected register’s reset pins so that
the content of the selected register is set to "0."
4.2.3 Write Port Design
Once the reset operation is completed, the write port is activated to perform the write operation. The write
port has three inputs: write enable (WEN), write address (W_ADDR), and write data (W_DATA). As with the read and reset ports, the W_ADDR is decoded using the DEMUX design that we described. The write
data is split and sent to all the NDRO registers. In order to write the data to the correct NDRO registers,
we add Dynamic AND (DAND) gates [57] between the write data and the SET pin of the NDRO register
as shown in Figure 4.4a.
DAND gates do not use clock signals to control their timing; instead, they rely on their hold time. Figure 4.4b shows the timing of the DAND gate. If the two inputs arrive within the hold-time window, a pulse is produced on the output; otherwise, the inputs do not generate any output. By using DAND gates, we avoid clocked AND gates and reduce the complexity of the design. When the WEN signal and W_DATA arrive within the hold-time window, a pulse arrives at the corresponding NDRO register's SET pin and writes the data into the NDRO register. Thus, the benefit of using DAND in the write port is that it eliminates the clocking and splitter cells that would otherwise be needed for clock distribution.
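Behaviorally, the DAND gate acts as a coincidence detector over its hold-time window; a minimal C++ sketch of this behavior (hypothetical names, hold time as a parameter) is:

```cpp
#include <cmath>

// Behavioral model of a Dynamic AND (DAND) gate: an output pulse is
// produced only when the two input pulses land inside the hold window.
bool dandFires(double tInputA, double tInputB, double holdWindowPs) {
    return std::fabs(tInputA - tInputB) <= holdWindowPs;
}
```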
4.2.4 Read and Write Operation
Read operation: To initiate a read operation on the register file, the decoder extracts R_ADDR and sends
that address along with REN (read enable) pulse to the register file. The REN pulse traverses the gates
selected by the R_ADDR bits to move the register data out to the execution stage.
Figure 4.4: (a) NDRO write circuit (b) Dynamic AND timing
Write operation: First, the write operation must reset the NDRO register entry. The write operation
sends the WEN and W_DATA to write the corresponding NDRO registers. A reset of all the NDROCs
follows the write operation before the next operation can begin.
4.2.5 Timing
The register read and write operation control signals (REN, WEN, and RESET) are generated in the decode
stage of the pipeline. While these three signals originate at the same time, they must be delivered with appropriate delays. Based on our detailed device modeling simulations (more details on modeling to follow),
the NDROC gate in the current SFQ technology can receive two successive enable signals on its input (IN)
with a 53ps delay. That means two read enable, write enable, or reset signals must be at least 53ps apart.
This delay is the cumulative Hold_RESET + Critical_RESET-to-SET + Setup_SET delay of the gates in the
register file ports. The propagation delay, which is the time it takes for an SFQ pulse on the IN to reach the OUT, is about 24ps, much less than 53ps. Hence, the NDROC tree DEMUX can be fully pipelined at a cycle time of 53ps.

Figure 4.5: 32×32-bit NDRO register file timing

For any write operation, the reset signal has to precede the write. Our device-level simulation measured the critical time [19] between the RESET signal and when data can be sent on the input IN of a register. This delay separating the WEN from the RESET is 10ps, less than the 53ps delay needed for the DEMUX access. Based on these considerations, the clock cycle of the NDRO register file is 53ps, while the different control signals within a clock, such as RESET followed by WEN, are separated by 10ps.
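These constraints can be summarized as a simple legality check on the control-signal schedule; the sketch below uses the 53ps and 10ps values from the text, with hypothetical helper names.

```cpp
// Timing constants from the text: 53ps between successive enables into
// the NDROC tree, and 10ps between RESET and the following WEN.
constexpr double kCyclePs      = 53.0;
constexpr double kResetToWenPs = 10.0;

// Legality check for one write slot: successive enables respect the
// DEMUX cycle time, and WEN trails RESET by at least the critical time.
bool scheduleIsLegal(double tEnablePrev, double tEnableNext,
                     double tReset, double tWen) {
    return (tEnableNext - tEnablePrev) >= kCyclePs &&
           (tWen - tReset) >= kResetToWenPs;
}
```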
The timing for an example instruction sequence is shown in Figure 4.5. During the execution of one
instruction, there is at most one register write operation and at most two source register read operations.
The figure shows a sequence of instructions labeled Inst 0 ... Inst x+1. Let us assume that the write-back operation from Inst 0 overlaps with the source read operation of Inst x. Thus, the write from the old instruction (the write-back of R1 from Inst 0) and the two read operations from the current instruction (R1 and R3 from Inst x) will contend for register file access. Inst x also has a read-after-write dependency on Inst 0 here.
Given this scenario, our preferred design option is to initiate the write operation of R1 before the read
operation so that we can ensure internal forwarding. The write operation first initiates a RESET operation.
After RESET, the WEN (write enable) signal pulse is provided with a delay, which is based on the critical
time between RESET and IN of the R1 register. Then, the instruction initiates the REN (read enable) pulse.
The second read operation for source register R3 in the second cycle is initiated concurrently with the
RESET and WEN operation of Inst 1.
4.3 HiPerRF: HC-DRO RF with NDRO capabilities
HiPerRF uses HC-DRO cells in place of NDRO cells to improve register file density. In Chapter 3, we proposed that HC-DRO cells [34] can be used in register file design for density. However, we did not show how to tackle the unwanted destructive readout property in designing a CPU register file, and we also did not account for the circuit timing requirements and the various challenges associated with scheduling a register file's read/write operations. In this section, we show how to resolve all these issues.
Figure 4.6 shows the HiPerRF design. This design assumes that HiPerRF is a self-contained unit and that the rest of the CPU operates on each bit of information separately. Namely, even though the design stores 2 bits in one cell, they are read out as at most three pulses, decoded using an HC-READ circuit (described below), and fed to the rest of the logic. HiPerRF differs from the baseline NDRO register file in four major components. The first is the obvious replacement of NDRO cells with HC-DRO cells to store data. The second is the absence of a reset port. The third is a new output port design that enables HiPerRF to retain the data across reads. The last is the addition of the HC-CLK, HC-WRITE, and HC-READ circuits that decode and encode the two-bit HC-DRO storage into up to 3 separate pulses and vice versa.
Figure 4.6: HiPerRF design
4.3.1 HC-DRO Read and Write Circuits
To read and write the HC-DRO cells correctly, we need to design the HC-DRO-specific read and write
circuits based on the initial design shown in the previous Chapter 3.
HC-WRITE circuits: Each HC-DRO cell encodes two bits of information into 0 to 3 pulses. We need
an HC-WRITE circuit to convert the two bits of information generated by an ALU into up to 3 pulses
for storage in HC-DRO. The designed HC-WRITE circuit is shown in Figure 4.7a. The write circuit uses
Josephson Transmission Lines (JTL), represented as diamonds with J in Figure 4.7a. JTL is an SFQ design
element that allows the fluxon to pass through it with delay. For instance, in the figure, when two pulses
arrive at B0 (LSB) and B1 (MSB), the pulse starting at B0 goes through the two merge cells to produce the
first pulse on the OUT. The pulse from B1 travels through three JTLs horizontally and goes through the
splitter and merge cell to generate the second pulse. The split pulse from B1 will go through the vertical
JTL path and eventually become the third pulse. The JTLs act as delay elements to create the minimum
required separation to store two consecutive pulses into the HC-DRO cell; in our current design, this delay
is about 10ps due to the requirement of the setup and hold time [87] of HC-DRO cells.
HC-CLK circuits: To read HC-DRO cells, we need to send three consecutive pulses to the input pin
to read out all the fluxons stored inside. For instance, the REN signal that eventually reaches an operand
register from the DEMUX port must generate three pulses to read each HC-DRO cell. The HC-CLK circuit
is used to duplicate one SFQ pulse into three pulses. The circuit is shown in Figure 4.7b. Same as before,
the design uses JTLs to create three pulses that meet the required timing restrictions without any explicit
clock signals.
HC-READ Circuits: Reading an HC-DRO cell may produce 0 to 3 consecutive pulses at the input pin. These serial pulses need to be translated into normal one-bit logic as two parallel pulses. The proposed HC-READ design used in this chapter is built using two one-bit counters [48] to form a two-bit counter. The design is shown in Figure 4.7c, and the state machine diagram of the counter is shown in Figure 4.7d. After counting the pulses, the circuit generates a 2-bit output as two parallel pulses on B1 and B0.
Figure 4.7: (a) HC-WRITE design (b) HC-CLK design (c) HC-READ design (d) State machine diagram of the counter
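Behaviorally, the HC-READ counter accumulates the 0 to 3 serial pulses modulo four and exposes the count as two parallel bits; a minimal C++ sketch of this two-bit counter (names illustrative) is:

```cpp
// Behavioral model of HC-READ: a two-bit counter (two cascaded one-bit
// counters) accumulates the 0-3 serial pulses and presents the count as
// two parallel bits, B1 (MSB) and B0 (LSB).
struct HcRead {
    int count = 0;

    void pulse() { count = (count + 1) & 0x3; }  // wraps like a 2-bit counter

    bool b1() const { return (count >> 1) & 1; }
    bool b0() const { return count & 1; }
};
```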
4.3.2 LoopBuffer for Non-Destructive Readout
The output port design of HiPerRF provides the non-destructive readout capabilities for HC-DRO cells.
We add a LoopBuffer to the output port, a set of NDRO cells that enables the restoration of register entry
data after a read operation.
Read operation flow: The CLK input of the NDRO cell is connected to the output pulses produced
from an HC-DRO register. When an instruction wants to read a source register, the LoopBuffer’s NDRO
cell is first set to 1; namely, a single pulse is stored in the NDRO cell prior to the start of the register read
operation. Each of the 2-bit source register values (at most three pulses stored in a single HC-DRO cell)
arrive at the CLK pin of the LoopBuffer’s NDRO cell. The incoming pulses from the HC-DRO register
exit the LoopBuffer as output pulses. For instance, if the HC-DRO cell has an encoded value of "10", it
will generate two output pulses, triggering the LoopBuffer NDRO to produce two output pulses. If the
HC-DRO cell has an encoded value of "11", it will generate three output pulses, triggering the NDRO to
produce three output pulses.
Those output pulses go through the splitter, and one branch restores the data to the source register
while the other branch is sent to HC-READ to decode into two-bit values for operation by the ALU.
Write operation: In HiPerRF, when an instruction enters the write-back stage, its destination register write operation is divided into two steps. The first step reads the destination register's content to erase it using the LoopBuffer: the LoopBuffer is first reset to zero, and the current content of the destination register is read into the LoopBuffer, where it dissipates. Then, the write-back of the new value follows normally, since the destination register has been cleared.
The LoopBuffer design is based on the intuition that only a few registers are actively read at any
given time window. Hence, one needs to preserve the non-destructive read property only for those active
registers. As such, our design allows a large HC-DRO register file to share a small victim NDRO buffer
that helps the readout data to recycle back to the original register.
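The read and write flows through the LoopBuffer can be summarized behaviorally as follows; this C++ sketch abstracts away the timing and port scheduling and uses hypothetical names.

```cpp
// Behavioral sketch of a HiPerRF entry with LoopBuffer restore: the
// register value is modeled as a fluxon count (0..3).
struct HiPerRfEntry {
    int pulses = 0;  // fluxons stored in the HC-DRO cell

    // Read: the three read pulses drain the cell (destructive readout);
    // the set LoopBuffer NDRO echoes the drained pulses back through the
    // write port, restoring the value off the critical path.
    int readAndRestore() {
        int drained = pulses;      // pulses released to the output port
        pulses = 0;                // destructive readout empties the cell
        int loopBuffer = drained;  // LoopBuffer NDRO re-emits the pulses
        pulses = loopBuffer;       // loopback write restores the register
        return drained;            // same pulses also feed HC-READ / ALU
    }

    // Write: the old value is first drained into a reset LoopBuffer
    // (where it dissipates), and only then is the new value written.
    void write(int newPulses) {
        pulses = 0;          // read into reset LoopBuffer = erase
        pulses = newPulses;  // HC-WRITE stores the new pulse count
    }
};
```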
Figure 4.8: Timing of HiPerRF
4.3.3 Read Port and Write Port Design
The read and write ports operate similarly to the NDRO baseline design. The one difference is the addition of HC-CLK circuits between the DEMUXes and the HC-DRO cells. The HC-CLK generates three pulses for the read enable and write enable signals. This enables us to read each HC-DRO cell using the enable
pulses generated from HC-CLK. The use of NDRO cells in LoopBuffer provides interesting optimization
opportunities. As explained above, the NDRO cell can be reset to erase register content. We use this
property to use a single read port to work as a reset port as well. Thus, the need for a reset port is
eliminated in the HiPerRF design.
Unlike the NDRO register file’s write port, the HiPerRF’s write port needs to accept data both from the
regular register write operations and the LoopBuffer. Hence, a new merger gate is added at the write port,
as shown in Figure 4.6. An HC-WRITE is added between the input and the merger to encode the data for
HC-DRO storage.
4.3.4 Timing
Similar to the NDRO RF, the bottleneck in HiPerRF is also the NDROC of the DEMUX. The gap between
two REN signals and two WEN signals is also 53ps, which will be the cycle time. The control pulse timing of
HiPerRF is shown in Figure 4.8. At the start of executing Instruction X, a write operation of the destination
register is initiated. The write operation first generates a REN pulse (to reset the register). The REN pulse
passes through HC-CLK to generate three pulses (each 10ps apart in our design due to the requirement of
the setup and hold time of HC-DRO cells, as shown in three rectangular pulses in the figure). The WEN
pulse follows this reset operation in the second clock, which passes through HC-CLK to generate three
pulses. Concurrent with the WEN pulse, the first source read operation is initiated with a REN pulse in
the second clock (for clarity, the three-pulse sequence is shown as just one pulse). The loopback write
for this read operation is initiated in the third cycle, as the dashed arrow shows. In the third clock, the
second source register’s REN pulse is initiated, followed by a loopback write operation of source 2 in the
fourth clock. During the fourth clock, the second instruction’s destination register’s write operation is also
initiated, and the process repeats every three cycles. Note that in this timing sequence, the write operation
of register R1 is unable to forward the data to Inst X. Hence, Inst X has to go through the full source register
read operation.
The loopback write brings one more issue to the forefront: Read-After-Read (RAR) hazards. If Inst x reads R3 twice (R2 = R3 + R3), the second read operation in cycle three will read out nothing, since the data has not come back yet. In this case, the second R3 operand should be duplicated from the first read operation rather than being read from the RF. Note that for precise exception handling, the loopback write operation cannot be optimized away even if the same register is about to be overwritten anyway.
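A minimal sketch of the RAR check implied by this rule (assuming register numbers as operands; names hypothetical) is:

```cpp
// RAR handling: when both sources name the same register (R2 = R3 + R3),
// the second read would race the loopback write, so the scheduler reuses
// the first readout instead of issuing a second register-file read.
bool needsOperandDuplication(int src1, int src2) {
    return src1 == src2;
}
```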
4.4 Multi-Bank HiPerRF
For performance reasons, we considered adding an extra read port to HiPerRF. However, in the HiPerRF
design, adding a read port requires performing the two concurrent loopback operations as well, which
require two write ports. Hence, the HiPerRF design essentially requires read and write ports to scale
together. Based on our design estimate, a 32x32 bits HiPerRF with two read ports and two write ports
costs nearly triples the JJ counts due to a superlinear increase in the merger, splitter, and other peripheral
circuitry needed to support two ports. An area-efficient solution is a banked register file. By splitting
HiPerRF into two banks, we can achieve two read ports and two write ports without the superlinear growth
in the peripheral circuitry. The banked design also reduces the DEMUX depth, speeding up access to the
register file. Since each bank has only half the registers, the banked design removes one merger and one
splitter (about 10ps time) from the loopback path, saving loopback timing. Thus, banking not only reduces
port contention but also reduces loopback path delays.
Figure 4.9: Timing of dual-banked HiPerRF
Figure 4.10: Dual-banked HiPerRF design
4.4.1 Port Design
The dual-banked HiPerRF design is shown in Figure 4.10. Each bank has its own read port, write port, and output port. Although the number of DEMUXes is doubled, each DEMUX is only half the size of the one in the single-bank HiPerRF design above. Hence, the main JJ overhead is the extra LoopBuffer. We omit the splitters for illustration simplicity. The rest of the circuits remain the same as in the HiPerRF design described in Section 4.3.
4.4.2 Timing
Similar to the HiPerRF design, the main bottleneck is still the DEMUX. The cycle time remains at 53ps
here. The timing is shown in Figure 4.9. However, since we may perform two read operations in the same
cycle, the scheduling will be different here. We split the register file into two banks based on the parity of
the register number: registers with odd register numbers belong to Bank 0, and the rest belong to Bank 1.
At the beginning of cycle 2, since Inst x needs to read registers from different banks, we can send two read signals, one to each bank. Cycles 1 and 3 are reserved for the write-back reset operations (Inst 1 and Inst 3). Similarly, we send the read signal of R4 at the beginning of cycle 4. However, unlike HiPerRF, we do
not send the next read signal in the next cycle. Instead, we reserve this for the write-back’s reset operation
of Inst 2. The second read operation starts at the beginning of cycle 6. As a result, if the instruction is
reading two registers from different banks, it only takes two cycles. However, if the instruction is reading
two registers from the same bank, it takes four cycles.
In comparison, HiPerRF needs three cycles for all instructions. With this schedule, we utilize the two read ports without increasing the complexity of the scheduling unit. Similar to HiPerRF, reading the same register twice (R2 = R3 + R3) requires duplicating the readout result.
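The resulting read-latency rule under parity banking can be captured in a few lines; the sketch below encodes the 2-cycle/4-cycle costs described above, with hypothetical names.

```cpp
// Bank assignment by register-number parity (odd -> Bank 0, per the text).
int bankOf(int reg) { return (reg % 2 == 1) ? 0 : 1; }

// Register-read latency under the static schedule: different banks read
// in parallel (2 cycles); same-bank reads serialize around the reserved
// write-back reset slots (4 cycles).
int readCycles(int src1, int src2) {
    return (bankOf(src1) != bankOf(src2)) ? 2 : 4;
}
```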
4.5 Evaluation
The evaluation of HiPerRF focuses on two aspects: hardware design evaluation and software simulation
results for measuring application-level performance. For each of the above-described register file designs,
we built Verilog netlists using the publicly available cell libraries [61]. We successfully verified the functionality and timing with different combinations of inputs, including accounting for wire delays post-place-and-route. We then integrated the Verilog model of our register file designs within the RISC-V Sodor core CPU to measure the overall JJ count reduction for the whole chip.
4.5.1 Hardware Performance
The hardware performance is evaluated in two parts: JJ count and static power. The total JJ count is
calculated by using the SFQ cell library provided by [61]. For the dynamic AND gate and NDROC, the data
is derived from [57] and [82]. The static power for the entire register file design is also derived from the
cell libraries. Note that in SFQ designs, the limiting design factor is the number of JJs that can be integrated
with current fabrication technologies. Hence, measuring savings in terms of JJ counts is more appropriate
than measuring the chip area. Nonetheless, the register file size is about 20% of the total CPU design area
using NDRO cells, and it is reduced with HiPerRF by various amounts based on the size of the register file.
                             Total JJ Count             Percentage of Baseline
Register File Size (bits)    4×4     16×16    32×32     4×4      16×16    32×32
NDRO RF (Baseline Design)    784     9850     36722     100%     100%     100%
HiPerRF                      695     5195     16133     88.65%   52.74%   43.93%
Dual-banked HiPerRF          736     5626     17094     93.88%   57.12%   46.55%

Table 4.1: Total JJ count and the percentage over the baseline design
                             Static Power (µW)               Percentage of Baseline
Register File Size (bits)    4×4       16×16     32×32       4×4      16×16    32×32
NDRO RF (Baseline Design)    170.73    1997.49   7262.17     100%     100%     100%
HiPerRF                      149.16    1220.05   3911.00     87.37%   61.08%   53.85%
Dual-banked HiPerRF          148.47    1289.89   4077.88     87.00%   64.58%   56.15%

Table 4.2: Static power and the percentage over the baseline design
Table 4.1 shows the JJ count for each of the described register file designs for three different register file sizes. The data includes the JJ counts for splitters, mergers, and any necessary JTLs for the register file access. The first row shows the baseline, an NDRO-based register file. The second row shows the HiPerRF design. The third row shows a dual-banked HiPerRF design. The 4×4, 16×16, and 32×32-bit HiPerRF designs save about 11%, 47%, and 56% of the JJs, respectively, compared to the baseline design. The extra JJs required to implement the HC read and write circuits are amortized by the density advantage of HC-DRO cells. Hence, the relative advantage of HiPerRF grows as the size of the register file increases in the future.

                             Readout Delay (ps)          Percentage of Baseline
Register File Size (bits)    4×4     16×16    32×32      4×4       16×16     32×32
NDRO RF (Baseline Design)    77      144      177.5      100%      100%      100%
HiPerRF                      122.8   187.8    220.3      159.48%   130.42%   124.11%
Dual-banked HiPerRF          94.8    159.8    192.3      123.12%   110.97%   108.33%

Table 4.3: Readout delay and the percentage over the baseline design
Table 4.2 shows the static power for each design and the percentage of the baseline design. As expected,
the static power is a function of the number of JJs used in a design. Hence, the HiPerRF with 32×32 bits
consumes about 46% less static power compared to the baseline design, reflecting the reduction in JJ count.
Note that these results did not include the benefits associated with reduced cooling power due to reduced
static power. Heat extraction is a major power cost and may lead to two orders of magnitude more energy consumption [13].
Full Chip Benefit: To quantify the benefits at the chip level, we synthesized the RISC-V Sodor in-order core with HiPerRF using the qPalace tool [20] to get the JJ count. The Sodor core has five main parts: ALU, Register File (RF), Control and Status Registers (CSR), control path, and front end. The total JJ count of these CPU components using the baseline NDRO register file design is 139,801. When the register file is replaced with HiPerRF, including all the additional overheads of the read/write and clock circuits, the total JJ count reduces to 117,039. Thus, the total JJ reduction is 16.3%.
We also analyzed the readout delay, which is a critical performance metric. Table 4.3 shows the readout delay for each design and the percentage of the baseline design. The 4×4, 16×16, and 32×32-bit HiPerRF designs actually increase the readout delay due to the need to write the data into the LoopBuffer before the data is made available to the ALU. The increase in delay is about 24% for the larger register file. However, the dual-banked design reduces the readout delay overhead to 8% by reducing the long latency of accessing the NDROC. We expect that as the register file size grows, the relative gains in power and JJ count reduction of HiPerRF grow, because the overhead circuitry costs are better amortized. Moreover, even the readout delay overhead will eventually match the baseline at larger sizes.
4.5.2 Simulation Results
To analyze the performance improvement at the application level, we designed an SFQ-based gate-level
CPU simulator. The ISA we chose to simulate is RISC-V 32I, which is the basic RISC-V ISA. The simulator
is based on the RISC-V ISA Simulator Spike [68] and written in C++. While the basic simulator uses a
function-level pipeline, our enhanced simulator implements a gate-level pipeline. Such a gate-level pipeline
model is necessary, for instance, to enable pipelined execution within the read and write ports that use a
long chain of NDROC cells. Our simulator uses the notion of a macro clock to simulate the fetch-decodeexecute-writeback pipelines, and it uses micro clocks to simulate the gate-level pipelining functions of
SFQ. As we discussed before, our simulator supports the internal forwarding within the register file when
appropriate without violating timing. In our current simulator design, an external cache at 77K is interfaced
with the SFQ design, which is the usual practice for interfacing larger memories [80]. As such, all memory
references are satisfied from the 77K memory. However, recent advances in new materials such as three-terminal JJs, magnetic JJs, ferromagnetic JJs, and spin-based JJs have brought memory integration closer to SFQ [63, 56, 6, 5]; modeling such memories is outside the scope of our simulation capability.
Figure 4.11: CPI overhead over baseline (NDRO RF) of different RISC-V and SPEC 2006 benchmarks for
different designs
To get the depth of each gate-level pipelined stage, we synthesized the RISC-V Sodor in-order core
Verilog code by using the qPalace tool [20], which synthesizes SFQ designs built from the cell libraries.
We measured the gate-level pipeline depth of each functional block from the synthesis results. Based on
the synthesis result from qPalace, the Sodor design has a worst-case gate-level cycle time of 28ps, which
is about half of the 53ps needed for HiPerRF. Hence, we used a cycle time of 28ps for each gate, and each
read or write operation takes two cycles. As such, the readout delay shown in Table 4.3 is translated into
the corresponding number of cycles. These readout delays are then used as input to determine the stall
cycles for handling dependencies. The performance results below account for any write port contention between loopback writes and the traditional write-back path. This is accounted for as described in Section 4.3.4: we issue instructions only every three cycles, with one of those cycles reserved for the write-back operation so that it does not conflict with the loopback. Similarly, we also account for the latency in the dual-banked HiPerRF described in Section 4.4.2. Hence, by design, our approach statically eliminates write port contention between write-back and loopback, and the scheduling costs are modeled accurately in the simulator.
The benchmarks we used are from the RISC-V repository [54] and SPEC CPU 2006 [66]. Due to ISA and simulator limitations, we could only run mcf, sjeng, libquantum, and specrand; the limitations include a lack of full support for floating-point instructions and cross-compilation failures with the GCC compiler. Due to the
extremely slow gate-level simulations, each benchmark is simulated for 24 hours. SFQ simulators need to
do detailed gate-level pipeline simulations to load the SFQ pulse state into the simulator properly. Hence,
fast-forwarding using pinpoints [50] and other advanced simulation strategies need a careful redesign,
which is part of our future work.
Figure 4.11 shows the CPI overhead for each benchmark and the average CPI overhead across all benchmarks, compared to the NDRO register file baseline. It is worth noting that in SFQ-based designs, gate-level pipelining poses significant challenges for read-after-write (RAW) hazards. The execution stage of the RISC-V core is 28 stages deep. Hence, any two instructions with a RAW dependency in a short window will stall in the execution stage. We believe that current compiler optimizations may place RAW dependencies somewhat closer to each other to exploit data-forwarding capabilities in traditional CPU pipelines. However, SFQ-based CPUs require quite the opposite: spreading RAW-dependent instructions as far apart as possible. As a result, the average cycles per instruction measured in our modified RISC-V core
gate-level simulator is about 30 cycles averaged across all the benchmarks. Furthermore, the compiler
we used is not optimized for accessing multiple banks. Hence, for the dual-banked design, we also run
simulations that consider the ideal situation in which all instructions read the two source registers from
different banks.
The CPI of HiPerRF is about 9.8% worse than the baseline. As discussed earlier, HiPerRF's design goal is to use fewer JJs without significantly impacting performance. HiPerRF is expected to have somewhat lower cycle-level performance than the baseline design because the LoopBuffer is in the critical path, and any subsequent instruction that wants to read the same source register may have to wait for the loopback write before accessing the register file. Dual-banked HiPerRF reduces this overhead to 3.6%. This improvement is due to less port contention and a shorter readout delay. In the ideal situation, the CPI overhead is only 2.3%, which is almost as good as the baseline design while saving 53% of the JJs in the register file and 16.3% of the JJs at the overall chip level.
4.5.3 Impact of Wire Delay
There are two types of wiring in RSFQ technology: Passive microstrip Transmission Lines (PTL) and
Josephson Transmission Lines (JTL). Our design uses PTLs for all the wires, and we use JTLs only when
there is a need to induce delays, as was the case with the HC-READ circuit. We performed the placement and routing of our design using Cadence Innovus with the library extracted from the open-source qPalace tool [20]. Figure 4.12 shows the fully placed and routed results of HiPerRF. The white areas in Figure 4.12 show the LoopBack path of HiPerRF. The longest delay on the LoopBack path is only 4.6ps, which is much smaller than the decoder latency (53ps). Although the LoopBack path looks long in the visual illustration in Figure 4.6, this path is quite short in reality after placement and routing are done. On average, across multiple circuits placed and routed with qPalace, the wire length between two gates is 262µm.
Furthermore, per the qPalace-derived data, the delay of a PTL is 1ps per 100µm, so the average wire delay between two gates is 2.62ps. Based on these data, the readout delay accounting for all wire delays is shown in Table 4.4. With detailed wire delays included, the overall readout latency overhead increases by about 1% compared to the baseline; hence, the CPI performance impact is at most 1%.
                        NDRO RF (Baseline Design)   HiPerRF   Dual-banked HiPerRF
Readout Delay (ps)      216.8                       270.1     236.8
Loopback Latency (ps)   –                           108.4     93.7

Table 4.4: Readout delay, loopback latency with PTL delay
Figure 4.12: Placement and routing results of HiPerRF
4.6 Related work
While we have discussed related work alongside specific design choices throughout the text, in this section we focus on a couple of prior works on superconducting register file designs.
Fujiwara [22] proposed a shift register file built with DRO cells. Their design is a basic single-bit shift register design in which the data moves from the tail to the head of the shift register. They used their design only for functional verification, without considering the architectural and performance implications for a CPU.
However, the implications of trying to build an HC-DRO register file design with loopback capabilities are
significant. (1) HiPerRF is designed to function with the 2-bit DRO cell, which is not considered in prior
work. As such, HiPerRF design must also handle multi-bit decoding and encoding circuits, which are novel
and entirely outside of the scope of [22]. (2) HiPerRF design also carefully considers the timing challenges
of scheduling loop back and write back data arriving at the write port. (3) We also demonstrate how read
ports can be repurposed as reset ports to remove data from an HC-DRO register before a new write is
allowed. This design substantially simplifies the cost of the port design since we eliminate many of the
splitter cells that are latency-sensitive. (4) We compared our design with the NDRO design, evaluated it, and showed the actual JJ count benefits with only a small performance penalty, which becomes negligible with the dual-bank option.
Dorojevets and Chen [15] proposed a Reciprocal Quantum Logic NDRO-based storage. They described
the design details and extended their design to different memory designs, such as register files and cache.
However, their design is fundamentally limited to using NDRO cells to store register data. Hence, they do
not consider the possibility of reducing register file design size using HC-DRO cells as the primary storage
elements.
4.7 Conclusion
Superconducting devices based on Josephson Junctions are an important device technology that needs to
be explored in the context of a microprocessor design. The development of single flux quantum (SFQ)
devices and the logic design tools and techniques around SFQ have been gaining traction with significant
research investments. In this chapter, we explored how SFQ-based designs can be used to implement an
area-efficient register file in the context of a modified RISC-V in-order core processor. Since memory is a
premium resource in SFQ designs, we propose HiPerRF, which uses High Capacity Destructive ReadOut
(HC-DRO) cells and yet supports the multiple read property critical for any microprocessor register file.
We designed the HiPerRF and proposed multiple enhancements to reduce the access latency of HiPerRF.
Using gate-level synthesis tools and gate-level simulations, we demonstrate that HiPerRF saves nearly
56% of the JJ count. We executed a range of benchmarks on a pipelined CPU design with gate-level detail to show that HiPerRF pays only about a 10% performance penalty due to the LoopBuffer. We
also demonstrate how to use a dual-banked design to reduce the port contention and reduce performance
penalties of accessing DEMUX trees that are embedded in the read, write, and reset ports of SFQ register
files.
Chapter 5
SuperBP: Design Space Exploration of Perceptron-Based Branch
Predictors for Superconducting CPUs
5.1 Introduction
As we look into future SFQ CPU designs, our research community needs to explore efficient branch predictors for SFQ CPUs, which will, in turn, enable speculative execution microarchitectures. One consequential
limitation when exploring branch predictor designs for the SFQ technology is the Josephson Junction (JJ)
count that can be integrated into a single chip in the current fabrication process [75]. Hence, predictor
designs must consider the total JJ counts accounting for both the predictor storage and the logic used to
access and update the predictors.
In this chapter, we explore two variants of the well-known perceptron-based branch predictor [30] for
SFQ CPUs. We design the original perceptron [30] and the hashed perceptron variant [76]. While these
predictor implementations in CMOS designs are well known, there is a plethora of challenges in implementing them in SFQ designs. For instance, implementing demultiplexer (DEMUX) designs in SFQ technology is JJ-intensive. Hence, designs that need a large number of DEMUX circuits will compromise the size of the perceptron weight storage. Thus, a careful JJ-neutral comparison is critical for properly evaluating branch predictor performance.
To this end, this chapter aims to design, implement, and refine a perceptron-based predictor using
SFQ-based logic cells. First, we design a baseline predictor that uses non-destructive readout (NDRO) cells
for weight storage. NDRO cells enable the weights to be preserved even after reading the cell but come
at the expense of nearly 7x the JJs compared to destructive readout (DRO) cells. We then present SuperBP (superconducting branch predictor), which uses high-capacity DRO (HC-DRO) cells to increase weight table
storage capacity without increasing JJ counts. Each HC-DRO cell is designed to store two bits using the
same number of JJs as DRO cells. Thus, HC-DRO cells provide a unique opportunity to double the size
of the branch predictor. The 2-bit values are encoded as up to 3 pulses into a single cell. In the previous
chapters, we used HC-DRO cells to design register files that required the use of decoder and encoder
circuits to transform pulse counts into 2 bits (MSB and LSB) of data, which is then processed by execution
units. Adding decoder and encoder circuits reduces the JJ efficiency of the HC-DRO-based branch predictor
design. SuperBP presents a series of novel SFQ circuit-level design adaptations that directly operate on the
multiple pulses from the HC-DRO cells and store them back into the perceptron weight tables without
any decoding and encoding overheads. Additionally, HC-DRO cells suffer from the destructive readout
property. Thus, it is critical to restore the values of a predictor entry after each read. We present a simple
loopback buffer design to retain the values even after an entry is read from the history tables.
We implemented the original perceptron and the hashed perceptron variant designs [30, 76], using all
the innovations mentioned above. The hashed perceptron predictor uses multiple weight tables, and each
table needs a decoder to access the weight entry. Unfortunately, the decoder design in SFQ technology
requires a substantial number of JJs. Hence, the need for multiple decoders leads to JJ overheads, reducing the JJ availability for weight storage. Thus, for a given JJ budget, our results show that the original
perceptron predictor outperforms the hashed perceptron.
The primary contributions of this chapter are as follows:
• We present an optimized NDRO-based perceptron branch predictor design as a baseline. As SFQ-based branch predictors have not been proposed in the literature, our goal here is to highlight some
of the unique challenges in building this baseline, such as path balancing and reset port requirements.
This design uses NDRO memory cells and SFQ logic gates to achieve perceptron functionality.
• We then present the design of SuperBP, which is the original perceptron branch predictor built with
HC-DRO cells. We first describe the design of the inference and training units of the predictor that
can directly compute on multi-bit cells without extra encoding and decoding circuits. These circuits
are then integrated to create a fully operational perceptron predictor that can be included in an SFQ
CPU.
• We then present the design of the hashed perceptron version of SuperBP.
• We evaluate the MPKI (branch predictor misses per thousand instructions) of the NDRO-based perceptron, SuperBP, and hashed perceptron SuperBP for a given JJ budget using a range of benchmarks: SPEC CPU 2017, and mobile and server application traces from the 5th Championship Branch Prediction [10]. Compared to the NDRO baseline, SuperBP can reduce the MPKI by 13.6% for the same JJ
count.
• We evaluate the IPC improvements of SuperBP and show that for the same JJ budget, SuperBP shows
up to 10% higher performance than the hashed version.
5.2 Background
5.2.1 Recap of Perceptron Branch Predictor
Perceptron branch predictor uses a single perceptron weight table to predict branch outcomes [30], as
shown in Figure 5.1.
Figure 5.1: Perceptron branch predictor
In the perceptron weight table, there are N rows. In each row, there are (n + 1) signed integer weights (w0, w1, ..., wn), where n is the length of the branch history. To predict a branch, we choose a row of weights based on the branch PC address and perform an element-wise weighted summation with the branch history, as shown in the equation below. The branch history is represented as (x1, x2, ..., xn), where xi is -1 when the branch associated with that history bit is not taken and +1 when it is taken. The prediction result y is computed as

y = w_0 + \sum_{i=1}^{n} x_i w_i    (5.1)
When y ≥ 0, the branch is predicted as taken, and when y < 0, the branch is predicted as not taken.
Later, when the branch is resolved and its direction t is known, t is set to -1 for a not-taken branch and +1 for a taken branch. The branch outcome t is used to train the perceptron and update the weights.
The training algorithm is shown in Algorithm 1. θ is the threshold that controls the training process.
Algorithm 1 Perceptron BP Training Algorithm
if sign(yout) ≠ t or |yout| ≤ θ then
    for i = 0 to n do
        wi = wi + xi·t
    end for
end if
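To make the prediction and training steps concrete, the following is a minimal Python sketch of Equation 5.1 and Algorithm 1 at the functional level. The table size N, history length n, and threshold theta are illustrative parameters, and the weight saturation discussed later in this chapter is omitted here.

# Functional model of the perceptron predictor (Equation 5.1 / Algorithm 1).
# This is a behavioral sketch, not the SFQ circuit implementation.
N = 64          # number of weight-table rows (illustrative)
n = 16          # global history length (illustrative)
weight_table = [[0] * (n + 1) for _ in range(N)]   # each row: w0..wn

def predict(pc, history):
    """Return (taken?, y, row); history[i] is -1 or +1."""
    row = pc % N                                   # row selection by branch PC
    w = weight_table[row]
    y = w[0] + sum(x * wi for x, wi in zip(history, w[1:]))
    return y >= 0, y, row                          # y >= 0 means predicted taken

def train(row, history, taken, y, theta=4):
    """Algorithm 1: update on misprediction or when |y| <= theta."""
    t = 1 if taken else -1
    if (y >= 0) != taken or abs(y) <= theta:
        w = weight_table[row]
        w[0] += t                                  # bias weight (x0 = 1)
        for i, x in enumerate(history, start=1):
            w[i] += x * t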
5.2.2 Recap of Hashed Perceptron Branch Predictor
The perceptron branch predictor uses only one weight table, so its performance may suffer from aliasing. To further improve performance, the hashed perceptron branch predictor was proposed [76]. Instead of using a single weight table, the hashed perceptron predictor has multiple weight tables, and each table is indexed with a different hash function, as shown in Figure 5.2. Since it utilizes varied path information through multiple weight tables, the hashed perceptron predictor (and other variants) shows performance improvements [62, 29]. In this work, we design both a perceptron and a hashed perceptron BP to evaluate their performance under a limited JJ budget.
Figure 5.2: Hashed Perceptron branch predictor
5.3 Related Work
There is a plethora of branch predictor designs for CMOS CPUs [45, 41, 84]. In this chapter, we focus on
perceptron and hashed perceptron branch predictors [30, 76]. The primary goal of this chapter is to design
predictors using HC-DRO cells without the need for encoding and decoding circuits. Thus, our predictor
design operates natively on the HC-DRO cell storage, significantly reducing JJ counts.
In Chapter 4, we have demonstrated using HC-DRO cells in the SFQ register file design context.
HiPerRF purely used HC-DRO cells to store data in a denser format in an SFQ register file. In HiPerRF
design, a decoder circuit must follow each register read to translate the multi-fluxon pulses into 2-bit values. Similarly, each register write is preceded by an encoder circuit that translates a 2-bit value into an
equivalent number of pulses to store. In this chapter, we treat the HC-DRO data as a native data type
where all the pulses stored in an HC-DRO cell are processed simultaneously. We thus eliminate the need
for encoding and decoding the data, making the branch predictor more latency and JJ-efficient.
In Chapter 3, we gave a high-level overview of a 2-bit branch predictor [65] built with 2-bit fluxon storage and described its operation. The design uses two special ports called
TAKEN and NOT_TAKEN ports. We described the 2-bit update mechanism as sending a pulse to either of
these ports depending on the branch outcome and then incrementing/decrementing the 2-bit value. The
description provided focuses primarily on showcasing a potential use case of the HC-DRO cells. We did not
address key challenges in operating the saturating 2-bit counter. For instance, during the training phase,
the 2-bit prediction counter has to be decremented when the branch is not taken. However, decrementing
an HC-DRO cell releases one fluxon stored in the cell. Unlike in CMOS, a stray pulse must be properly
dissipated in SFQ. Otherwise, these pulses may move through the circuit, wreaking havoc on its operation.
In particular, the stray pulse may appear as a "taken branch" prediction for the fetch controller. This is just
one specific example of how a functional description of a predictor is inadequate for making the predictor
work in the SFQ regime.
This chapter targets perceptron and hashed perceptron predictors instead of the 2-bit branch predictor.
We aim to design a fully operational perceptron predictor, including accurately handling multiple pulses
in an HC-DRO cell without encoding/decoding and designing efficient predictor update circuits that operate directly on the HC-DRO pulses. As such, this chapter presents a detailed circuit-level design and
implementation of the predictor using HC-DRO cells. It proposes to use HC-DRO cells not just for storage
but to treat the 2-bit value as a native data representation within the microarchitecture. This work also
proposes non-intuitive data representations and circuit implementations that work on sign+magnitude
representation of predictor storage values. We also provide detailed circuit+architecture level evaluations
to help guide the microarchitecture progression of SFQ branch predictor designs.
5.4 NDRO Baseline Branch Predictor
While designing a perceptron predictor in CMOS circuits is well studied, its design with the SFQ logic
family presents unique challenges. This section describes our baseline NDRO-based perceptron branch
predictor design. Our goal here is to design the inference and training units considering the constraints of
SFQ logic. Hence, the focus is on highlighting how an NDRO branch predictor can be designed: specifically,
the need for a reset port, the need for path balancing, and the need for concurrent detection of overflow,
which is uniquely needed for SFQ-based perceptron. We will also describe the hashed perceptron design
at the end of this section. Figure 5.3 shows the NDRO perceptron design. This design includes three main
components: perceptron weight storage, training, and inference units.
Figure 5.3: NDRO perceptron branch predictor
5.4.1 Perceptron Weight Storage Design
Each rectangular box in the perceptron weight storage consists of a row of NDRO cells storing one perceptron weight entry. Each row has 3(n + 1) bits, where n is the length of the global branch history. We
use 3 bits (one sign bit and two weight value bits) to represent each weight. Keeping the weight magnitude at a multiple of 2 bits allows a direct comparison against the HC-DRO design, since each HC-DRO cell stores 2 bits.
There are three ports in the weight storage: a read, a reset, and a write port. Each port takes an enable signal and a hashed branch address as input. To read, reset, or write the target weight entry, a demultiplexer (DEMUX) is necessary to decode the address into a one-hot representation. In SFQ technology, the most common way to design a DEMUX is using Non-Destructive ReadOut cells with Complementary output (NDROC), which was proposed in prior work [70, 2]. A DEMUX built with NDROC cells requires only 60% of the JJs of an equivalent design built from NOT and AND gates [82].
Figure 5.4 shows the NDROC DEMUX tree design. Each box represents an NDROC cell. After setting the NDROCs with hashed branch address S, the enable signal IN will travel through the clock pins
and arrive at one of the outputs of the bottom-level NDROC cells as a one-hot representation. Notice
that the clock pins are re-purposed in this design to pass the enable signal for generating the one-hot address. Hence, clock distribution is eliminated in the NDROC-based DEMUX compared to the AND-based
DEMUX.
Figure 5.4: NDROC-based DEMUX design
Even with these optimizations, it is important to note that a read/write port is JJ-intensive. To access a
predictor table of N entries, one needs 1-to-N DEMUX trees. Such a tree uses (N-1) NDROC cells and up
to 2N splitters, consuming a significant amount of JJs. Hence, designs that require fewer access ports may
be beneficial under stringent JJ limits.
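The JJ cost of such a port can be estimated with a small model. The sketch below counts the NDROC cells and splitters of a 1-to-N DEMUX tree as described above; the per-cell JJ counts jj_ndroc and jj_splitter are placeholder parameters, since the real values come from the specific cell library in use.

# Rough JJ-cost model of a 1-to-N NDROC DEMUX tree: (N-1) NDROC cells
# plus up to 2N splitters.  jj_ndroc / jj_splitter are placeholders to be
# taken from the target SFQ cell library.
def demux_tree_jj(n_entries, jj_ndroc, jj_splitter):
    ndroc_cells = n_entries - 1
    splitters = 2 * n_entries          # upper bound from the text
    return ndroc_cells * jj_ndroc + splitters * jj_splitter

# A table with separate read, reset, and write ports pays this cost three
# times, which is why port logic dominates the JJ budget.
for ports in (1, 3):
    print(ports, "port(s):", ports * demux_tree_jj(64, jj_ndroc=12, jj_splitter=3))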
The write port is designed using Dynamic AND (DAND) gates [57], as proposed in [34], to avoid clock distribution. The DAND gates take WEN and W_DATA as input. The detailed design and timing are shown in Figure 5.5. Only when WEN_0[0] and W_DATA[0] arrive close enough in time (within the hold time) can the DAND gates generate the pulses needed for writing into the NDRO cells. The advantage
of this design is that the port is essentially a clock-less design, thereby reducing the clock distribution
overheads.
Figure 5.5: Write port and DAND gate
Unlike the CMOS storage design, a reset port is necessary for this SFQ weight storage to work properly.
A CMOS flip-flop storing a "1" can be overwritten with a "0", but an NDRO cell cannot be overwritten.
To overwrite a "1" with a "0", we need to reset the NDRO cell using the reset pin. Hence, a reset port is
necessary to perform a correct write operation. Before overwriting the weight entry with updated weights,
we need to reset the target entry first. After reading out from the weight storage, the weights will be sent
to either the inference unit or training unit through a row of NDROC 1-to-2 DEMUX.
5.4.2 Training Unit Design
As we described in Algorithm 1, the training unit is responsible for updating the weights. The weights are updated using the equation wi = wi + t·xi. Since t and xi are both -1 or +1, the weight wi only needs to be incremented or decremented by one, which we implement with a saturating counter design. Based on this saturation
requirement, we designed an NDRO-based SFQ circuit as shown in Figure 5.6. The circuit consists of an
increment/decrement circuit and an independent overflow detection circuit. The increment/decrement
circuit and the overflow detection circuit are designed in behavioral Verilog and synthesized with the
qPalace tool [20], which is an open-source SFQ synthesis tool.
Figure 5.6: NDRO training unit design
The increment/decrement circuit has two inputs: the weight read from the weight storage, wi, and the control signal INC. When t·xi = +1, INC is 1, and when t·xi = −1, INC is 0. When INC=1 (or 0), the
increment/decrement circuit will add one to (or subtract one from) wi and get wi_new. This process is a
regular counter’s logic when the weight is not saturated.
However, when wi = +3 and INC=1 or wi = −4 and INC=0, we should keep the weight the same as
before. We use an overflow detection circuit to detect possible overflow. The overflow detection circuit
also takes wi and INC as input. When wi = +3 and INC=1 or wi = −4 and INC=0, SEL will be 0 and its complement ~SEL will be 1; otherwise, SEL will be 1 and ~SEL will be 0. We can use SEL and ~SEL with two AND gates to select the correct new weight. Since SEL and ~SEL cannot be 1 simultaneously, we can safely combine the
output of the two AND gates with an SFQ merger circuit.
Notice that we detect the overflow simultaneously with computation rather than after the computation
because we want to make the whole training unit have the shortest delay. Also, path-balancing DRO cells
for wi are added to match the delay of the increment/decrement circuit. After the updated weight is
computed, wi_updated, the value is written back to the weight storage through the write port, but with a
reset operation prior to the write. In total, this design needs (n + 1) training units, where n is the length
of the global branch history.
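A behavioral sketch of this training unit is given below, assuming the 3-bit 2's complement weight range of -4 to +3. The overflow check mirrors the independent overflow detection circuit; in hardware, both paths are evaluated concurrently and the SEL/~SEL AND gates pick the result.

# Behavioral model of the NDRO training unit: a 3-bit saturating counter.
def train_weight(wi, inc):
    """wi in [-4, +3]; inc=True increments, inc=False decrements."""
    overflow = (wi == 3 and inc) or (wi == -4 and not inc)  # SEL = 0 case
    wi_new = wi + 1 if inc else wi - 1                      # counter path
    # The AND gates plus merger select between the two precomputed values.
    return wi if overflow else wi_new

assert train_weight(3, True) == 3      # saturate at +3
assert train_weight(-4, False) == -4   # saturate at -4
assert train_weight(2, True) == 3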
5.4.3 Inference Unit Design
According to Equation 5.1, the inference unit needs to multiply xi and wi, then accumulate. Figure 5.7a
shows an NDRO-based inference unit design when n = 7, where n is the length of the global branch
history. There is one sign extension circuit and seven multiplication circuits for inference. When xi = −1,
the multiplication circuit will generate −wi using the 2’s complement formula. Otherwise, the circuit will
only do a sign extension. Although wi only has three bits, when wi = −4 and xi = −1, the product wi·xi = 4 needs 4 bits to represent. That is why the proposed design needs to do a sign extension for w0, and the output of
the multiply circuits is 4 bits.
After computing wi·xi, the accumulation operation is performed. The proposed design uses an adder-tree structure built from Kogge–Stone adders. The adder tree has log2(n + 1) levels, which is 3 here; L1 to L3 in Figure 5.7a form the adder tree. Each level's adder has one more bit than the previous level to avoid overflow. Hence, yout here has 7 bits. The sign bit is used as the branch prediction result, 0 for taken and 1 for not taken. However, we still need the remaining bits of yout. After the branch result comes out, even if the branch prediction is correct, we still need to compare yout with θ (= 4) to determine whether the weights need to be updated according to Algorithm 1.
Figure 5.7: NDRO inference unit design. (a) Original. (b) Optimized.
5.4.4 Optimization: Using a 3-bit Adder for Efficient Inference
In the inference unit, sign extension is needed because when wi = −4 and xi = −1, the product wi·xi requires 4 bits to represent. Under all other situations, the computation only needs three bits to represent wi·xi. The extra bit leads to a much larger adder-tree design. If we limit wi·xi to 3 bits, the total JJ cost of the Kogge–Stone adder tree is reduced by around 30%. To enable this limit, the proposed design restricts the range of wi to -3 to +3, which limits the range of wi·xi. With the new wi range, wi·xi can only be -3 to +3, which needs only 3 bits to represent. This modification requires an updated training unit design, which modifies the overflow circuit: when wi = +3 and INC=1 or wi = −3 and INC=0, SEL will be 0 and ~SEL will be 1, as shown in Figure 5.6. In the inference unit, there is no longer a sign extension function at level 1. Instead, we need to add path-balancing DRO cells for w0. Each adder in the adder tree has one fewer bit, as shown in Figure 5.7b.
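The effect of this restriction can be checked with a short functional sketch of the optimized inference data path for n = 7: with wi limited to [-3, +3], every product wi·xi also lies in [-3, +3] and fits in 3 bits, and the adder tree widens by one bit per level. The function name and inputs are illustrative.

# Behavioral model of the optimized inference unit for n = 7.
def infer(weights, history):
    """weights: 8 values in [-3, +3]; history: 7 values in {-1, +1}."""
    products = [weights[0]] + [w * x for w, x in zip(weights[1:], history)]
    assert all(-3 <= p <= 3 for p in products)   # every product fits in 3 bits
    level = products
    while len(level) > 1:                        # log2(8) = 3 adder-tree levels
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]                              # y_out; its sign is the prediction

y = infer([3, -3, 2, 0, -1, 3, -2, 1], [1, -1, 1, 1, -1, -1, 1])
print("prediction:", "taken" if y >= 0 else "not taken")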
5.4.5 Hashed Perceptron Design
The weight table structure is one main difference between perceptron and hashed perceptron BP design.
In hashed perceptron BP design, each row of the weight tables only has 3 bits. The typical design will
have 4 or 8 weight tables. Each table will have its own training unit. The inference unit needs to be
connected to all weight tables. For some hashed perceptron algorithms, the branch history xi may not
be used during the training and inference, but it will not fundamentally affect the training and inference
unit design. For example, in Figure 5.7b, we can remove the top-level circuits of the optimized inference
unit and directly connect wi to the L1 adders. However, the hashed perceptron design requires a separate decoder circuit to access each weight table. Each decoder needs its own NDROC-based DEMUX. While these independent DEMUX circuits can be easily scaled, each consumes a significant number of JJs.
5.5 SuperBP: HC-DRO Perceptron Branch Predictor
While NDRO cells have the desirable property of keeping the weights after each access, NDRO designs
consume a significant number of JJs compared to DRO cells. This section presents the design details of
SuperBP, a predictor design that uses HC-DRO cells to reduce JJ count.
In Chapter 4, we used HC-DRO cells to increase the register file density [85]. In HiPerRF design, we
only considered the HC-DRO cells as normal storage cells and used a decoder circuit that expands the 2-bit storage information into up to 3 consecutive SFQ pulses, which are then fed to the execution units for normal operation. Similarly, while writing data into HC-DRO cells, we used an encoder to encode a 2-bit value into up to 3 pulses. In HiPerRF design, the decoder and encoder circuits were placed on the
critical path of read and write operations, respectively. However, SuperBP takes an integrated storage
and computation approach to branch predictor design. Here, we propose a novel approach to perform the
perceptron predictor computations directly on the HC-DRO storage cells without decoding the data. For
simplicity of presentation, we first discuss the design in the context of the original perceptron predictor.
At the end of this section, we describe the changes needed for the hashed perceptron variant of SuperBP.
The SuperBP design is shown in Figure 5.8. Like the NDRO design, SuperBP consists of three main
parts: perceptron weight storage, training unit, and inference unit. Our training and inference unit designs
directly operate on the HC-DRO cells, eliminating the need for extra encode and decode circuits.
Figure 5.8: HC-DRO perceptron branch predictor
5.5.1 Perceptron Weight Storage Design
The HC-DRO perceptron weight storage design replaces the NDRO cells with HC-DRO and DRO cells to
improve JJ usage. Each row has 3(n + 1) bits, where n is the length of the global branch history. Since an
HC-DRO can store 2 bits of information, we use one DRO cell to store the sign bit and one HC-DRO cell
to store the 2-bit weight values. Each entry will have n + 1 DRO cells and n + 1 HC-DRO cells. However,
using DRO and HC-DRO cells as storage leads to several issues.
The first issue is that HC-DRO cells need three consecutive clock pulses to be fully read out. Since
there are at most 3 magnetic pulses in a single HC-DRO cell, we may need 3 consecutive clock pulses to
push the pulses out. The generation of 3 consecutive clock pulses is done by the HC-CLK design shown
in Figure 5.9, which duplicates the read enable (REN) pulse into three copies.

Figure 5.9: HC-CLK circuit

This HC-CLK design takes a single input clock pulse and moves it through 3 different paths to generate 3 clock pulses that are equally
spaced. The HC-CLK design uses splitters (labeled S in the figure), mergers (M), and Josephson transmission lines (JTLs, labeled J), none of which require any clock distribution. A JTL passes SFQ pulses with a small delay.
The second issue is that DRO and HC-DRO cells cannot keep the data after being read. Hence, after
each inference request, a row of weights that are read will be lost, which is a substantial hurdle to retraining
the weights. To counter this challenge, we adapted the LoopBuffer design from Chapter 4. A LoopBuffer
was attached to an HC-DRO register file in HiPerRF. In HiPerRF design, the LoopBuffer was a single-entry NDRO register storage that read a register and then recycled the content after the data was sent
to the execution units. Thus, by backing up a large HC-DRO register file with a single NDRO register,
the overall design enabled the register file to be read multiple times while preserving the HC-DRO cells’
density advantage.
5.5.2 Inference LoopBuffer
In SuperBP, we modified the basic LoopBuffer design to match the branch predictor functionality. In
particular, we have an inference and training Loopbuffer, as shown in Figure 5.8, and each has a different
purpose. When a weight is read for inference, it needs to be preserved and to be written back to the
weight storage. Hence, the Inference LoopBuffer is set to "1", and the Training LoopBuffer is reset to "0"
first. Then, the weight storage is read through the read port. Once the pulses are read from the HC-DRO
cells, they will arrive at the clock pins of both LoopBuffers. Since the Training LoopBuffer is set to "0", it
will not generate any output. The readout pulses will pass through the inference LoopBuffer since it is set
to "1". These pulses are then duplicated into two copies with splitters. One copy arrives at the input of the
inference unit and then produces the prediction result.
As discussed in the next section, our inference unit is designed to operate directly on multiple pulses
without decoding HC-DRO data into the most and least significant bits. This optimization reduces the
need for an intervening decoding circuit on the inference critical path. Another copy of the weight goes
through the write-back path and then is written into the original weight entry. This is how we keep the
data after reading from the DRO and HC-DRO cells for inference.
5.5.3 Training LoopBuffer
The second LoopBuffer, the training LoopBuffer, serves a very different purpose. The training LoopBuffer
acts as a custom reset port for the branch predictor. Recall that in SFQ technology, a write operation cannot
erase an existing magnetic flux. Hence, writing a "0" on an existing "1" is impossible unless the existing
value is reset first. The training LoopBuffer provides a very efficient reset port functionality without
needing an expensive port design. In a branch predictor, the weight is updated only during training.
When updating the weight during the training cycle, the inference LoopBuffer is reset to "0", and the
training LoopBuffer is set to "1". Then, the weight storage is read. Since only the training LoopBuffer is
"1", the pulses will arrive at the training unit. As discussed in the next section, our training unit design
operates on multiple pulse inputs rather than decoding the pulses first. This optimization lets us directly
feed multiple pulses from the weight storage to the training unit. The training unit will produce the new
weight. Notice that this training LoopBuffer does not have the "loopback" path. Instead, we only send the
updated weight back to the weight storage directly from the training circuit. Thus, the training LoopBuffer
utilizes the destructive readout property to avoid resetting the weight storage entry before writing into it.
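The read protocol implied by the two LoopBuffers can be summarized with the following behavioral sketch. The class and method names are illustrative; the point is that the same destructive read either restores the entry (inference) or deliberately leaves it empty so the training result can be written without a prior reset.

# Behavioral sketch of the inference/training LoopBuffer protocol.
class WeightStorage:
    def __init__(self, n_entries):
        self.rows = [None] * n_entries

    def read_for_inference(self, idx):
        row = self.rows[idx]           # destructive readout empties the entry
        self.rows[idx] = row           # loopback write-back restores it
        return row                     # the second copy feeds the inference unit

    def read_for_training(self, idx):
        row, self.rows[idx] = self.rows[idx], None   # entry left empty
        return row                     # pulses go straight to the training unit

    def write_trained(self, idx, new_row):
        assert self.rows[idx] is None  # no reset needed: the read cleared it
        self.rows[idx] = new_row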
5.5.4 Eliminating the Decoder and Encoder Circuits
Next, we tackle the design of the training and inference units. Recall that HC-DRO cells store 2 bits of information as multiple pulses. Hence, to enable traditional logic operations, we used extra encode and decode circuits in Chapters 3 and 4 to transform 0-3 pulses into 2-bit values and vice versa. However, SuperBP uses a novel design that eliminates the cost of data decoding and encoding. As
we describe next, our training and inference unit can operate on the pulses read from the HC-DRO cells
without encoding and decoding.
5.5.5 Training Unit Design
HC-DRO cells store at most three fluxon pulses, which can be treated as two bits of information. However,
when it comes to a branch predictor design, each weight can either be incremented by one or decremented
by one. In other words, the weight update process does not write arbitrary new values; it only adds or removes one pulse.
To take advantage of this special property, we use a signed-magnitude representation of the weight (as
opposed to a 2’s complement representation). When the sign bit is "0", if the HC-DRO cell stores zero, one,
two, or three pulses, the weight stored here is +0, +1, +2, or +3. When the sign bit is "1", if the HC-DRO
cell stores zero, one, two, or three pulses, the weight stored here is -0, -1, -2, or -3. The range is -3 to +3,
the same as our optimized NDRO design’s weight range. Notice that both +0 and -0 mean the weight is
0. We can see that the HC-DRO cells here store the absolute value of wi. Even though signed-magnitude representation is less compact, it allows for an efficient training process.
Figure 5.10 shows the design of the training unit. SIGN and VALUE are the sign bit and weight absolute
value bits read from the weight storage. Since the pulses read from the HC-DRO are 0 to 3 consecutive
pulses, there is only one pin for the weight's absolute value. Control signals INC and DEC are decided by t·xi: when t·xi = +1, INC is 1 and DEC is 0; when t·xi = −1, INC is 0 and DEC is 1.

Figure 5.10: HC-DRO training unit

The HC-DRO cell H1 in this training circuit will separate the weight absolute value pulses into three consecutive cycles since
its clock pin is connected to the global clock signal.
Increment the weight’s absolute value: When SIGN=0 and INC=1 (add one to a positive number)
or SIGN=1 and INC=0 (subtract one from a negative number), we need to add one to the weight’s absolute
value. The XOR gate XOR1 will generate an extra pulse and send it to the weight absolute value path. We
call this pulse "G-pulse" for short. G-pulse will become part of the weight’s absolute value, and we successfully add one to the weight’s absolute value. When the weight’s absolute value is 3, H1 will release three
pulses in three consecutive cycles. G-pulse will arrive at the weight’s absolute value path simultaneously
with the third pulse released from H1. The OR gate on this path can merge these two pulses to prevent
overflow.
Decrement the weight’s absolute value: When SIGN=0 and DEC=1 (subtract one from a positive
number), or SIGN=1 and DEC=0 (add one to a negative number), we need to subtract one from the weight’s
absolute value. The XOR gate XOR2 will generate a pulse to destroy one pulse from the weight absolute
value path. We call this pulse "D-pulse" for short. Whether the weight’s absolute value is 1, 2, or 3, H1
will always release one pulse at the very first cycle. This first pulse will arrive at the XOR gate XOR3
simultaneously with the destroy pulse. Then XOR3 will destroy the first pulse, and we subtract one from
the weight’s absolute value successfully.
Encoding +0 and -0: When decrementing the weight’s absolute value, if the value is 0, we are computing +0 − 1 = −1 or −0 + 1 = +1, so we need to change the absolute value to 1 and flip the sign bit.
Since the weight’s absolute value is 0, H1 will not release anything. D-pulse will generate a pulse at the
output of XOR3. It means the weight’s absolute value becomes 1. Meanwhile, H1 did not release anything,
so the NOT gate will generate a pulse. This pulse will arrive at the AND gate with D-pulse. We can use
the AND gate output to flip the sign bit at XOR gate XOR4. As a result, +0 − 1 becomes -1, and −0 + 1
becomes +1.
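The complete update behavior, including the ±0 corner case, can be summarized functionally as follows; sign is the DRO bit (1 meaning negative), value is the pulse count in the HC-DRO cell, and the min() models the OR-gate saturation at 3.

# Functional model of the HC-DRO training unit on sign+magnitude weights.
def train_sign_magnitude(sign, value, inc):
    """sign: 0 = positive, 1 = negative; value: 0..3 pulses; inc models INC/DEC."""
    increment_magnitude = (sign == 0) == inc   # XOR1 condition from the text
    if increment_magnitude:
        # G-pulse adds one pulse; at value == 3 the OR gate merges the
        # colliding pulses, saturating the magnitude instead of overflowing.
        return sign, min(value + 1, 3)
    if value == 0:
        # +0 - 1 = -1 or -0 + 1 = +1: the D-pulse passes through XOR3 while
        # the NOT/AND pair flips the sign bit at XOR4.
        return 1 - sign, 1
    # Otherwise the D-pulse cancels the first released pulse at XOR3.
    return sign, value - 1

assert train_sign_magnitude(0, 0, False) == (1, 1)   # +0 - 1 = -1
assert train_sign_magnitude(1, 0, True) == (0, 1)    # -0 + 1 = +1
assert train_sign_magnitude(0, 3, True) == (0, 3)    # saturate at +3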
5.5.6 Inference Unit Design
Similar to the NDRO inference unit design, we first need to compute the products wi·xi before summing them. Instead
of decoding the data read from the HC-DRO cells, we directly operate on them. Figure 5.11 shows the
inference unit design when n = 7, where n is the length of the global branch history.
Figure 5.11: HC-DRO inference unit
Multiplication and 2’s complement: We need to represent the number as 2’s complement to add
positive and negative numbers. However, we stored the sign and absolute value of wi in the weight storage, so we need to translate them first. With careful design, we successfully merge the multiplication wi·xi and the 2's complement translation into the same circuit. We use the label M&T to represent this circuit in
Figure 5.11. Figure 5.12 shows this design in detail. VALUE is the absolute value of the readout weight.
SIGN is the sign bit of the readout weight. SIGN_X is the sign of xi ("0" is + and "1" is -).
Figure 5.12: Multiplication and 2’s complement translation
This circuit is divided into two parts. The upper part is used to compute the results: SIGN_OUT will be the sign bit of wi·xi, and VALUE_OUT will be the corresponding lower 2 bits of the 2's complement of wi·xi in serial pulse form. For example, if wi·xi = −3, the 2's complement of -3 is 101, so SIGN_OUT will be 1, and VALUE_OUT will have one pulse over three cycles. If wi·xi = +2, the 2's complement of +2 is 010.
So SIGN_OUT will be 0, and VALUE_OUT will have two pulses in three cycles. The VALUE_OUT pulses
may not be consecutive, but this can be tolerated by our serial adder design described in the next part. The
lower part is used to generate a correct form of 0. When the input is -0 or +0, the upper circuit cannot
generate a correct 0 in 2’s complement. So, if the NDRO in the lower part does not detect any pulses from
the absolute value input, it will not generate any pulse. Hence, the AND gate at the output can generate a
correct 0, which is no pulses. Two NDRO cells in this design need to be reset after each computation.
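Functionally, the M&T circuit computes the mapping sketched below: the output is the sign bit of wi·xi plus the lower two 2's complement bits expressed as a pulse count, with ±0 forced to an all-zero output by the lower NDRO/AND stage. This is a behavioral sketch of the truth table, not the circuit itself.

# Functional model of the multiplication + 2's complement (M&T) circuit.
# sign: weight sign bit; value: 0..3 pulse magnitude; sign_x: sign of x_i.
def m_and_t(sign, value, sign_x):
    if value == 0:
        return 0, 0                  # +0 and -0 both map to a clean zero
    negative = sign ^ sign_x         # sign of the product w_i * x_i
    if negative:
        # 3-bit 2's complement of -v is 8 - v: sign bit 1, lower bits 4 - v.
        return 1, 4 - value
    return 0, value                  # positive: lower bits equal v itself

assert m_and_t(1, 3, 0) == (1, 1)    # -3 -> 101: sign 1, one pulse
assert m_and_t(0, 2, 0) == (0, 2)    # +2 -> 010: sign 0, two pulses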
HC-DRO serial adder design: We designed a new adder, which can directly add two numbers
read from the HC-DRO cells. The design is shown in Figure 5.13(b). This design is based on a one-bit
counter [48], which is represented as COUNT in the figure. The state machine of the counter is shown in
Figure 5.13(a). Since the number of the pulses read from one HC-DRO cell represents the number stored
in this HC-DRO cell, if we count the total number of the pulses read from two HC-DRO cells, we will get
the sum of the two numbers stored in the HC-DRO cells.
Figure 5.13: (a) State machine of the counter (b) HC-DRO serial adder
Assume HC-DRO A has two fluxons and HC-DRO B has three fluxons. We only read one fluxon each
cycle. In the first cycle, both A and B release a pulse. We connect an AND gate directly to the higher
bit counter H to add two to the counter. After the first cycle, the counter result is 010. In the second
cycle, both A and B release a pulse. H receives another pulse, flips to 0, and generates a carry pulse. We
temporarily store this carry pulse in the DRO cell. After the second cycle, the counter result is 100. In the
third cycle, only B releases a pulse. We connect an XOR gate directly to the lower bit counter L to add one to the counter. After the third cycle, the counter result is 101. After releasing and counting all pulses, we
can read from DRO and two counters. The result will be parallel pulses, 101. Using this counter design,
we complete the decoding and adding simultaneously. As we count the total number of pulses, the adder
operates correctly, even for the non-consecutive input pulses.
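The counting behavior can be reproduced with the model below: in each readout cycle, two simultaneous pulses (the AND gate) add 2 via the higher-bit counter, while a single pulse (the XOR gate) adds 1 via the lower-bit counter, exactly as in the example. This is a cycle-level sketch, not the gate netlist.

# Functional model of the HC-DRO serial adder: count pulses from two cells.
def serial_add(a, b):
    """a, b: pulse counts (0..3) stored in two HC-DRO cells."""
    total = 0
    for cycle in range(3):                 # at most 3 readout cycles
        pa = 1 if cycle < a else 0         # each cell releases one pulse per
        pb = 1 if cycle < b else 0         # cycle until its fluxons run out
        if pa and pb:
            total += 2                     # AND gate -> higher-bit counter
        elif pa or pb:
            total += 1                     # XOR gate -> lower-bit counter
    return total                           # read out from the counters + DRO

assert serial_add(2, 3) == 5               # the "101" example from the text
assert all(serial_add(a, b) == a + b for a in range(4) for b in range(4))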
Adder tree design: Figure 5.11 shows the adder tree for the summation. The first level of the adder
tree is the proposed serial adder. Since the input is 3 bits and the output will be 4 bits, we need four
counters here. The design is shown in Figure 5.14. Notice that the fourth counter is connected to an OR
gate instead of an AND gate. We use this OR gate to do sign extension so that the 4-bit output will have
a correct sign bit. After the first-level computation, all the numbers are in the form of parallel pulses. We
can use the same Kogge–Stone adder tree as the NDRO design for the rest of the summation.
Figure 5.14: 3-bit HC-DRO serial adder
5.5.7 Hashed Perceptron Design
Similar to the NDRO baseline, the weight table is the main difference between perceptron and hashed
perceptron BP design. In the hashed perceptron BP design, each row of the weight tables only has 3
bits (one DRO cell and one HC-DRO cell). The typical design will have 4 or 8 weight tables. Each table will
have its own training unit. The inference unit needs to be connected to all weight tables. For some hashed
perceptron algorithms, the branch history xi may not be used during training and inference, but it will not
fundamentally affect the training and inference unit design. For example, in Figure 5.11, we can remove
the top level of multiplication and 2’s complement translation circuits in the optimized inference unit and
directly connect wi to L1 adders. However, our 3-bit HC-DRO serial adder design will remain intact. The
main challenge is building separate indexing circuits to access a given weight table entry. Each indexing
circuit needs a DEMUX design, which consumes additional JJs.
5.6 Evaluation Methodology
JJ Count: Some of the largest demonstrated SFQ chips currently have about 72K JJs [26]. In the near
future, integrating a million JJs on a chip is expected to be feasible [13]. In our evaluations, we decided
to vary the JJ budget allocated to the branch predictor to be under 10% of the total chip budget. Thus, we
evaluated both perceptron and hashed perceptron SuperBP under the same JJ budget, ranging from 30k JJs
to 90k JJs.
Hashed Perceptron Choice: The hashed indexing functions of the hashed perceptron branch predictor have been shown to affect the overall performance. Hence, we chose five different indexing functions from [76]. We compared the performance of the hashed perceptron with these varying hash functions against the traditional perceptron branch predictor. The hashed indexing functions are shown below, followed by a functional sketch. Notice that i represents the ith weight table, and past_branch_pc[0] = current_branch_pc.
Hashed 1: past_branch_pc[i] mod #weights
Hashed 2: ((past_branch_pc[2i] << 1) ⊕ history_bit[2i + 1]) mod #weights
Hashed 3: (past_branch_pc[2i] ⊕ past_branch_pc[2i + 1]) mod #weights
Hashed 4: (history_segment[i] ⊕ past_branch_pc[i]) mod #weights
Hashed 5: (history_segment[i] ⊕ current_branch_pc) mod #weights
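For reference, the five schemes can be written down functionally as below. The lists past_pc, history_bit, and history_segment are simplified stand-ins for the folded-history structures of [76]; the exact folding is omitted here.

# Sketch of the five hashed indexing functions; past_pc[0] is the current
# branch PC.
def index(scheme, i, past_pc, history_bit, history_segment, n_weights):
    if scheme == 1:
        return past_pc[i] % n_weights
    if scheme == 2:
        return ((past_pc[2 * i] << 1) ^ history_bit[2 * i + 1]) % n_weights
    if scheme == 3:
        return (past_pc[2 * i] ^ past_pc[2 * i + 1]) % n_weights
    if scheme == 4:
        return (history_segment[i] ^ past_pc[i]) % n_weights
    if scheme == 5:
        return (history_segment[i] ^ past_pc[0]) % n_weights
    raise ValueError(scheme)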
Performance Simulator: We built an SFQ-based gate-level pipelined CPU simulator to analyze the end-to-end application time of different SuperBP designs. The ISA we chose is RISC-V RV32I. The simulator is based on the RISC-V ISA Simulator Spike [68] and written in C++. Our simulator takes an execution trace as input to simulate the overall performance of a gate-level pipelined in-order core. To get the depth of
each gate-level pipelined stage, we synthesized an open-source 32-bit in-order CPU [55] with the qPalace
tool [20]. The qPalace tool uses the SFQ cell library and supports path-balancing, which can generate the
correct gate depth.
Benchmarks: We selected a wide range of applications from different benchmark suites. We evaluated
mcf, leela, xz, deepsjeng (sjeng), nab, lbm, parest, and namd from the SPEC CPU 2017 benchmark [67]. All the benchmarks are from the SPECrate suite and tested with the ref dataset. In addition
to SPEC, we use representative microbenchmarks from the RISC-V repository [54] to evaluate the performance of SuperBP: Vector Addition (vvadd), Median Filter (median), Multiply (intmul), Sparse Matrix-Vector Multiplication (spmv), and Dhrystone (dhstone). We also used all 440 mobile and server traces from the 5th Championship Branch Prediction competition [10]. For SPEC and RISC-V benchmarks,
due to slow simulation speed, we functionally skipped the first 10 billion instructions. Then, we gathered
the branch information for the next 1 billion instructions for each benchmark. These are then used to
evaluate the predictor performance.
5.7 Results
5.7.1 Perceptron and Hashed Perceptron
We first compare the performance in terms of MPKI (misses per 1000 instructions) for perceptron SuperBP
and hashed perceptron SuperBP for varying JJ budgets. We ran simulations with different design parameters (number of weights, number of tables in hashed perceptron, and global history length) for a given
JJ budget and picked the design with the best performance for both predictors. The average MPKI of the
perceptron and hashed perceptron designs are shown in Figure 5.15. These are the averaged MPKI results
across all the benchmarks for the best-performing predictor design.
Figure 5.15: SuperBP and hashed variant MPKI (average MPKI vs. hardware budget in JJ count)

The MPKI results show that the perceptron SuperBP shows better accuracy than any of the hashed perceptron SuperBP variants across all JJ budgets. This outcome is because the access port DEMUX trees consume most of the JJs in the hashed perceptron SuperBP. In the hashed perceptron SuperBP design, the DEMUX costs around 70∼85% of the JJs, and the actual weight storage only costs 3∼4% of the JJs. However, in the perceptron
SuperBP design, the weights use up to 12% of the JJ budget, while only about 50% of the budget is used
to index a given entry of the weight table. Hence, the perceptron SuperBP has around 3 times the weight
storage capacity of the hashed perceptron SuperBP. This additional weight storage leads to tracking more
of the history, improving the prediction accuracy. Thus, in a JJ-constrained environment, which most SFQ
designs currently are, accounting for the table access overheads eats into the performance advantages of
the hashed perceptron predictor. In the rest of this section, we only evaluate SuperBP perceptron across
different designs.
5.7.2 Hardware Performance
We evaluate the hardware performance in terms of JJ count and static power. We built Verilog netlists using the
publicly available cell libraries [61] for both the NDRO and HC-DRO-based perceptron branch predictor.
Furthermore, we have successfully verified the functional correctness and timing with different inputs.
We calculated the total JJ count and static power using the SFQ cell library provided by [61]. As for the
dynamic AND gate, which is not provided in the library, we derived the data from [57] and [82].
JJ Count: Table 5.1 shows the JJ count of the NDRO and SuperBP perceptron and hashed perceptron predictors for different sizes. The size of the branch predictor is represented using the notation N × (n + 1), where N is the number of entries in the perceptron weight storage, and n is the length of the global history. For the hashed perceptron, n + 1 is the number of weight tables. The data includes the JJ counts for splitters, mergers, and any splitters necessary for clock distribution. The first row shows the JJ count of the NDRO-based design. The second row shows the JJ count of the SuperBP design and its percentage JJ saving over the NDRO-based design. The third row shows the JJ count of the hashed perceptron predictor and its percentage JJ saving over the NDRO-based design; a negative saving means using more JJs than the NDRO-based design. Due to the overheads associated with addressing multiple tables in the hashed perceptron predictor, its JJ count is consistently higher than the NDRO baseline and significantly higher than the perceptron predictor JJ count for a given predictor configuration. When the size is 16×8, the hashed SuperBP uses double the JJs of the perceptron SuperBP while achieving similar accuracy.

Size                        16×8             32×16            64×32             128×32
NDRO-baseline               20516            62222            205035            369634
SuperBP (Saving %)          13079 (36.25%)   39040 (37.26%)   125784 (38.65%)   225874 (38.89%)
Hashed SuperBP (Saving %)   26127 (-27.35%)  96670 (-55.36%)  367708 (-79.34%)  713628 (-93.06%)

Table 5.1: Total JJ count and the saving % over the baseline
(a) 16×8
                     Access Port   Weight Table   Output Port   Training   Inference
NDRO-based Design    39.10%        6.86%          13.31%        15.58%     25.15%
SuperBP              41.99%        6.85%          13.98%        9.52%      27.66%

(b) 128×32
                     Access Port   Weight Table   Output Port   Training   Inference
NDRO-based Design    60.13%        12.19%         17.50%        3.46%      6.72%
SuperBP              57.80%        12.69%         19.12%        2.21%      8.17%

Table 5.2: Breakdown of JJs with the weight table of (a) 16×8 (b) 128×32
The breakdown of JJ counts across different components of the branch predictor is shown in Table 5.2
for the two endpoints of our design space explorations. The 16×8 SuperBP costs 13,079 JJs, as shown in
Table 5.1, while the NDRO design for the same branch predictor dimensions consumes 20,516 JJs. Interestingly, the division of JJs across the various parts of the branch predictor design remains roughly the
same between the NDRO and SuperBP. However, SuperBP uses nearly 36% fewer JJs to achieve identical
accuracy to the NDRO design.
In Table 5.2b, we show the same breakdown statistics for the 128×32 branch predictor configuration.
Again, the division of JJ counts across different predictor circuit components remains roughly the same,
but SuperBP uses 39% fewer JJs to achieve the same accuracy.
Power Consumption: Table 5.3 shows the static power for each predictor design. Since static power is a
function of the number of JJs used in a design, a 64×32 SuperBP consumes around 35.41% less static power
than the NDRO-based design. Note that the higher static power consumption of the NDRO design also leads to higher cooling costs, since the additional power draw must be extracted as heat. In the results
shown here, we did not count the additional cooling power needed for NDRO-based designs.
Size                  16×8              32×16             64×32              128×32
NDRO-based Design     4411.38           12944.26          41425.26           73543.98
SuperBP (Saving %)    2940.97 (33.33%)  8517.93 (34.20%)  26754.61 (35.41%)  48341.53 (34.27%)

Table 5.3: Total static power consumed by BP only (µW) and the saving % over the baseline
Impact on Latency: As for the latency, although it takes some time for the LoopBack write to update the
weight after each inference, the weight storage is only accessed when there is a branch instruction. The
weight LoopBack update will finish before the next branch instruction for such designs. For the inference unit, the differences between the NDRO-based design and SuperBP are the wi·xi multiplication unit and the first-level 3-bit adders. In the NDRO-based design, the gate depth of the multiplication unit is 3, and the gate
depth of a 3-bit adder is 5. In SuperBP design, the gate depth of the multiplication and 2’s complement
translation is 4, and the gate depth of a 3-bit adder is 2. However, it takes the SuperBP design two more cycles to process all three serial signals. As a result, the NDRO-based design and SuperBP have the same inference gate latency (3 + 5 + adder tree versus 4 + 2 + 2 + adder tree).

Figure 5.16: MPKI for the NDRO-based design and SuperBP vs. hardware budget (JJ count)
Sensitivity to JJ Budget: We simulated both the NDRO-based design and SuperBP and compared the MPKI of both designs with the same JJ budget. We evaluated all benchmarks with JJ budgets from 30k to 90k.
For a fair comparison, we ran the simulation multiple times for each hardware budget with different N
(number of entries in the weight storage) and n (global history length). Figure 5.16 shows the MPKI for
both the NDRO-based design and SuperBP at the optimal size. From this figure, we can see that SuperBP
outperforms the NDRO-based design consistently.
5.7.3 Detailed Performance Evaluations
We performed performance simulations to measure end-to-end application performance. We used 30k JJ
budget designs for this simulation. Figure 5.17 shows each benchmark’s absolute MPKI and MPKI reduction
percentage. For the CBP traces, we show the average for mobile and server traces. lbm shows a 41% MPKI reduction. The average MPKI reduction is 13.6%.
Figure 5.17: MPKI for NDRO and SuperBP design with 30K JJs and reduction in MPKI, across SPEC CPU 2017, RISC-V microbenchmarks, and CBP traces
Figure 5.18: SuperBP IPC improvement over NDRO (30k JJs), across SPEC CPU 2017 and RISC-V microbenchmarks
Figure 5.18 shows the IPC improvement for each benchmark. Note that CBP traces have branch instruction information and not all the instructions necessary to simulate performance. Hence, we did not
use CBP traces for IPC measurements. As expected, the benchmark’s overall execution time improvements
are roughly proportional to the MPKI reduction experienced by that benchmark. For instance, nab shows the largest IPC improvement (10.6%). This is because nab has a relatively high MPKI reduction as well as a
relatively high MPKI, both of which contribute to a considerable reduction in branch penalty. In contrast,
lbm has a low base MPKI; hence, branch misprediction is not a significant bottleneck for this benchmark.
Thus, the IPC improvement is minimal, as expected.
5.8 Conclusion
Efficient CPUs built using SFQ can enable broad adoption across many domains, from energy-efficient
data centers to qubit control. This work highlights challenges in building branch predictors using SFQ
technology. We proposed SuperBP, a novel predictor built with High Capacity Destructive ReadOut (HC-DRO) cells that operates on them directly without extra decoding and encoding circuits. We evaluated the performance of perceptron SuperBP and hashed perceptron SuperBP using representative benchmarks. Using gate-level synthesis tools and gate-level simulations, we showed that SuperBP saves 39% of the JJ count and reduces static power by 34%. We also evaluated SuperBP against the NDRO baseline under the same JJ budget. For example, when the JJ budget is
30k, our SuperBP design can reduce the MPKI by 13.6% on average.
Chapter 6
SF-QIQ: An SFQ Issue Queue Design
6.1 Introduction
Continuing the progression of designing microarchitecture blocks for integration into out-of-order (OoO) CPUs, this chapter focuses on building an issue queue for OoO CPUs. Issue queues (also called reservation stations) are the critical microarchitecture structures that enable OoO execution. Current SFQ fabrication processes provide much higher yield rates with fewer JJs per die; thus, designs that reduce JJ counts are desirable [75]. Hence, while designing the issue queue, we must consider the total JJ count needed to manage
the complex functionality of issue queues, including instruction storage, wakeup logic, and selection logic.
Building SFQ-based issue queues is a non-trivial challenge. Instructions exit the issue queue out of
order based on their source operand availability, which requires random access capability. In general,
supporting random access capability in SFQ technology is expensive. The most efficient random access mechanism
in SFQ technology uses a tree of JJ-intensive demultiplexers built with Non-Destructive ReadOut cells
with Complementary output (NDROC), which was proposed in prior work [70, 2]. Second, instruction
selection logic must support age-based selection to prioritize older instructions to execute. Prior works
have demonstrated that such an age-based issue is important to achieve high performance [58]. However,
tracking the age of instructions in issue queues with random entry and exit points for instructions can
complicate SFQ-based issue queues. Finally, the wakeup logic must identify all the consumers of newly
produced data, which requires supporting either a CAM or RAM-based search; this is again a difficult choice to make in SFQ designs.
In this chapter, we propose solutions to these challenges and design an SFQ instruction queue (SF-QIQ),
a circular issue queue with position-based selection. We designed SFQ logic circuits that are free of any
DEMUX-based access ports and designed wakeup logic based on SFQ-specific CAM designs. We designed
ready instruction selection logic that efficiently identifies the oldest ready instructions for execution.
The primary contributions of this chapter are as follows:
• We discuss different issue queue design choices, such as CAM and RAM-based wakeup logic and
collapsible versus non-collapsible issue queues. We evaluated the JJ counts of these various design
choices. Based on this comparison data, we designed SF-QIQ, an instruction issue queue design
based on CAM-based wakeup logic and the non-collapsible circular queue (CIRC).
• We demonstrate how to build CAM-based wakeup logic, shift-based CIRC port design without using
the JJ-intensive decoders, and a simple, fast, position-based selection logic. We also show how to fix
the order error in the CIRC issue queue with minimal hardware cost and how to use HC-DRO cells
further to reduce the JJ cost in the issue queue design.
• We evaluate the CPI of SF-QIQ and show that SF-QIQ achieves minimal performance degradation across different designs when compared with an ideal issue queue design.
6.2 Background
6.2.1 α-Cell
The DRO and HC-DRO memory cells cannot be overwritten with new values easily. For instance, if a DRO
cell stores a fluxon (to represent "1"), then writing a value of zero is not possible unless the existing fluxon
is removed. This process requires special reset logic.
Figure 6.1: (a) α-DRO (b) α-AND
To resolve this issue, a new type of SFQ cell is proposed, called α-cell [31]. By connecting the α-cell
to the DRO cell, as shown in Figure 6.1a, one can reset the internal state of the DRO cell. Note that the
α-cell does not produce any output pulse after the DRO cell is reset. The α-cell can be added to any logic
gate, such as an AND gate as shown in Figure 6.1b, to reset any pulses that are present in the storage cells
within the AND gate. Thus, the α-cell works like an asynchronous reset.
6.2.2 Issue Queues in Out-of-Order Execution
We assume that the reader is familiar with Tomasulo’s algorithm [77], which allows modern CPUs to execute instructions in an Out-of-Order (OoO) manner. However, we provide a brief overview of the issue
queue structure in OoO CPUs to make the chapter self-contained. The issue queue decouples the instruction decoding process from operand (and resource) availability. When an instruction is decoded and dispatched, it waits in the issue queue (IQ). An instruction is woken up when its dependencies are resolved and computation resources are ready. The wakeup logic requires either a CAM search or RAM-based logic, as we discuss in detail in the next section. After the wakeup logic provides source operands,
multiple waiting instructions may become ready. A selection logic then selects a subset of the ready instructions for execution. Next, we describe the challenges in implementing the wakeup and selection logic
in SFQ circuits.
6.3 Challenges of Designing an SFQ Issue Queue
6.3.1 Selection Logic
For every cycle, multiple instructions in the issue queue may become ready after their source operands
are ready. It is the responsibility of the instruction scheduling logic to select a subset (depending on the
issue width) of these ready instructions. Prior work [3] has proven that instruction scheduling is an NP-hard problem without knowledge of future instructions. As such, modern CPUs use heuristics to schedule
instructions. The oldest-first scheduling algorithm selects the instruction that is the oldest in program
order amongst all the ready instructions. The intuition is that older instructions will have a higher chance
of committing in the ROB and may hold a long dependency chain that blocks the critical path of execution [11].
The oldest-first scheduler relies on tracking the age of pending instructions in the issue queue. Since
instructions may leave the issue queue out-of-order, these instructions may create an empty slot ("hole")
in the issue queue. Tracking instruction age in the presence of holes may be done in CMOS in one of three
possible ways:
6.3.1.1 SHIFT Based Approach
In this approach, a collapsible issue queue [17] will shift all younger instructions to occupy the holes,
as shown in Figure 6.3a. This approach, called the SHIFT approach, physically
maintains the order in the issue queue and has the highest space efficiency. However, implementing this
logic in SFQ poses significant challenges. We highlight some of these challenges below.
First, compacting the issue queue by shifting instructions is infeasible in SFQ. In SFQ design, when an
instruction is moved from one entry (source) to another entry (destination), the source entry must be reset
to remove any data and control information stored in that entry. Recall that data cannot be erased without a reset port. Only then can the content of the neighboring entry be shifted into the source entry. Thus, each issue queue entry
must be augmented with α-cells to provide the reset capability, which significantly worsens the JJ count.
The second and more challenging issue is the delayed CAM search. To illustrate the issue, Figure 6.2
shows a selective shift logic, with DRO cells (corresponding to a source and destination issue queue entry)
and the selection MUX. Each time an issue queue entry is read, we need to use a selection MUX to decide
whether the content needs to go back to the original storage cell (read operation) or the content needs to
go to the next entry (shift operation). Since we need to selectively move only younger instructions into a
vacant slot, this selective shifting operation requires a MUX.
Figure 6.2: Shift logic
Thus, selective shifting takes at least as many cycles as the fastest MUX that can be designed. Due to
the gate-level pipelining, even with a one-gate SFQ MUX design, it will take two cycles to shift the content.
While the content leaves the DRO from source to destination entries, the source entry is no longer available
for the CAM query for at least the duration of the MUX since the content is neither in the source nor in
the destination DRO cell. A CAM query can only compare while the tags are in the DRO cells. If the MUX
has a depth of two or three gates, the performance will be worse. Note that in CMOS, since there is no
need for gate-level pipelining, the performance issues due to CAM search stalls are not a hurdle.
Finally, we also need to support random access to the SHIFT design. Since the shifting of issue queue
entries can be from any entry to a neighboring entry, random access to the issue queue is necessary. Supporting random access logic in SFQ circuits requires using a demultiplexer (DEMUX) tree. Prior works [85,
86] have shown that the most efficient way to design a DEMUX is to use NDRO with complementary output (NDROC) cells. However, NDROC cells are JJ-intensive and have multi-cycle access. The DEMUX can
only be accessed every two cycles in the best case [85], which means one can only issue every other cycle,
which in turn will significantly slow down the scheduling process logic. Hence, we need to avoid using
the DEMUX tree design since the issue queue is on the critical path of the OoO execution.
6.3.1.2 RANDOM Fill Approach
A random queue (RAND) [83] will fill the "holes" with new instructions as they enter the issue queue.
While this increases the space efficiency of the issue queue, the queue no longer preserves program order. To keep
the order of the instructions, an age matrix (AGE) [53, 58] is introduced. By using the age matrix, the
scheduler can issue the oldest ready instruction correctly.
However, one needs to access two different structures: the age matrix first to determine the relative
ordering of ready instructions, followed by access to the issue queue. Furthermore, the random queue also
needs random access capability to maintain the relative ordering of all the issue queue entries. Thus, many
of the challenges that we discussed regarding the SHIFT technique also apply to the RAND approach.
6.3.1.3 CIRCULAR Queue Approach
The third approach is to use a circular queue (CIRC) design [7], which leaves the "holes" as they are and
does not try to fill them until the head of the queue points to them. A CIRC design keeps the original order of
the instructions, but it reduces the space efficiency of the issue queue due to the non-collapsible nature
of issue queue fills, as shown in Figure 6.3b. We evaluated all three design choices in terms of JJ counts,
which is also an excellent proxy for design complexity. Given the simplicity of the CIRC logic in terms of
JJ counts, even at the expense of some space inefficiency, we eventually used the CIRC design for
SF-QIQ (more details in the evaluation section).
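The difference between the two fill policies can be seen in a minimal sketch (illustration only; the entry names are hypothetical):

# Illustration only: issuing i2 out of the middle of the queue.
def collapsible_issue(queue, idx):
    del queue[idx]           # SHIFT: younger entries compact the hole

def circular_issue(queue, idx):
    queue[idx] = None        # CIRC: the hole stays until the head reaches it

shift_q = ["i0", "i1", "i2", "i3"]
circ_q  = ["i0", "i1", "i2", "i3"]
collapsible_issue(shift_q, 2)
circular_issue(circ_q, 2)
print(shift_q)  # ['i0', 'i1', 'i3']        -- entry freed for a new instruction
print(circ_q)   # ['i0', 'i1', None, 'i3']  -- the "hole" wastes an entry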
Figure 6.3: (a) Collapsible issue queue (b) Non-collapsible issue queue
6.3.2 Wakeup Logic
The wakeup logic searches for dependent instructions and provides them their operands (or tags) when
a producer instruction completes its execution. When an older instruction finishes its execution, it
broadcasts its destination register tag to all entries in the issue queue. Each entry compares this destination tag with its source tags and sets the ready bit if the tags match. The wakeup logic for the issue queue
can be based on either CAM- or RAM-based circuits. Figure 6.4 shows the CAM-based wakeup logic.
Figure 6.4: CAM-based Wakeup Logic
The main drawback of CAM-based wakeup logic in CMOS is that the CAM search consumes significant dynamic power. One method to reduce power consumption is to use RAM-based wakeup logic [81,
60, 7]. As shown in Figure 6.5, the main concept of RAM-based wakeup logic is to store the issue queue
position of each dependent instruction in a RAM table, using the source register tag as the address. When
an older instruction (for example, R1=R2+R3) finishes its execution, it reads the RAM table and retrieves
the pointer to the issue queue entry of the instruction that depends on it. The corresponding issue queue
entry then sets its ready bit. The RAM operations consume less dynamic power than the CAM search.
However, since one RAM entry cannot hold all of a register's dependents, such as the R1 entries with the
gray background in the figure, an additional latency penalty is paid for the RAM-based wakeup process.
Figure 6.5: RAM-based Wakeup Logic
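As an illustration of this mechanism, the following sketch models the RAM table as a mapping from source register tag to dependent issue queue positions (a simplified behavioral model; a real RAM entry has a fixed capacity, which is what causes the latency penalty noted above, whereas this sketch uses an unbounded list):

# Simplified behavioral model of RAM-based wakeup (after [60]).
NUM_REGS = 128                                  # 7-bit register tags

ram_table = {r: [] for r in range(NUM_REGS)}    # reg tag -> [(iq_pos, slot)]
iq_ready = {}                                   # (iq_pos, slot) -> ready bit

def dispatch(iq_pos, slot, src_reg):
    ram_table[src_reg].append((iq_pos, slot))   # record the dependency

def complete(dest_reg):
    for iq_pos, slot in ram_table[dest_reg]:    # look up all dependents
        iq_ready[(iq_pos, slot)] = True         # set their ready bits
    ram_table[dest_reg].clear()

dispatch(iq_pos=0, slot=1, src_reg=1)  # an instruction waiting on R1
dispatch(iq_pos=3, slot=2, src_reg=1)  # a second dependent of R1
complete(dest_reg=1)                   # R1 = R2 + R3 finishes execution
print(iq_ready)                        # both dependents are woken up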
Additionally, in order to support RAM-based search, we have to provide random access while reading
from and writing to the RAM table, which in SFQ circuits requires a DEMUX tree. As discussed earlier,
since the DEMUX cannot be accessed every cycle, the RAM design can only update the RAM table every
other cycle, which in turn significantly slows down the wakeup logic. We evaluated both CAM- and
RAM-based wakeup designs and eventually selected the CAM-based implementation for SF-QIQ.
Prior RAM-based designs provide a separate data array to store the instruction payload, such as opcode
and destination tag. Since the payload is only required when the instructions are issued, it is a common
practice to decouple the payload RAM from the wakeup and selection logic [9, 49, 1]. That approach of
separating the tag search from the instruction payload can still be used for efficient SF-QIQ implementation.
6.4 SF-QIQ Design
Based on the challenges presented in the prior sections, we designed SF-QIQ to use the CIRC approach for
scheduling and a CAM-based search for wakeup logic. Figure 6.6 shows the SF-QIQ design. It consists of
four main parts: CAM-based wakeup logic, DEMUX-free write&reset port, selection logic, and HC-DRO
payload RAM.
Figure 6.6: SF-QIQ Design
6.4.1 CAM-Based Wakeup Logic Design
An SFQ content-addressable memory structure was proposed in prior work [46]. However, this design
has two main drawbacks. First, during a search operation, it uses positive and negative pulses to
differentiate "0" and "1"; generating a negative pulse requires extra circuits and is not compatible
with the existing SFQ cell library [61]. Second, since the search port is embedded in the cell, it is impossible
to compare multiple values at the same time through a single CAM search. For example, if an instruction
has two source operands, the CAM search needs to tag-match both source operands to find a
match. Thus, supporting multi-tag CAM is critical. As a result, SF-QIQ uses existing cells to build a custom
CAM-based wakeup logic specifically designed for CPU issue queues.
Figure 6.7: SFQ CAM-based Wakeup Logic
Figure 6.7 shows the SFQ-specific CAM circuit design for one issue queue entry as an exemplar. The
memory part of the CAM stores two source tags (labeled Src1 Tag and Src2 Tag), two ready bits, and one
valid bit. Since issue queues must compare the source tags with the broadcast tag every cycle, we need to
read from the storage arrays frequently. Hence, SF-QIQ could not use DRO cells for the source tags;
instead, the source tags are stored in NDRO cells. Every cycle, both source tags are read out for the CAM match.
Once the source tags are read from the memory, we compare them with as many broadcast tags as
the number of instructions executed per clock. In the figure, there are two broadcast tags (labeled Tag1
and Tag2), indicating that the CPU supports 2-wide instruction execution. The results of comparing
Src1Tag with Tag1 or Tag2 are merged and sent to a NOT gate. If at least one bit of
Src1Tag does not match the broadcast tag, at least one of the XOR gates will create a pulse ("1") and
prevent the NOT gate from generating any pulses. Broadcast of multiple tags is done through multiple
XOR-Merger-NOT structures, and the results are finally merged. If any of the tag comparisons succeed,
the corresponding ready bit is set. If both ready bits are set, a request signal is sent to the
scheduler logic.
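A bit-level sketch of this match path is shown below (a functional model of the XOR-Merger-NOT structure; the 7-bit tag width and the example tag values are assumptions for illustration):

# Functional model of the XOR-Merger-NOT match for one source tag.
def tag_match(src_tag, broadcast_tags, width=7):
    match = False
    for btag in broadcast_tags:
        # One XOR per bit; the merger ORs the mismatch pulses together.
        mismatch = any(((src_tag ^ btag) >> b) & 1 for b in range(width))
        # The NOT gate fires only if no mismatch pulse arrived.
        match = match or not mismatch
    return match

rdy1 = tag_match(src_tag=0x15, broadcast_tags=[0x15, 0x2A])  # True: Tag1 hits
rdy2 = tag_match(src_tag=0x03, broadcast_tags=[0x15, 0x2A])  # False: no hit yet
request = rdy1 and rdy2   # request fires only once both ready bits are set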
Notice that the valid bit circuit design is not path-balanced, and yet it is the correct design for the
wakeup logic. This circuit is custom-designed to ensure the entry is still valid before we set the ready bit.
The NDRO cannot be overwritten and needs to be reset deliberately. If we have a matched tag broadcast at
cycle N and reset the entry at cycle N+1 due to squash, the XOR-Merger-NOT structure will still generate
a match signal. By putting the valid check before setting the ready bit, we can guarantee the ready bit will
not be set unintentionally.
Note also that this design does not need α-cells to reset the XOR-Merger-NOT structure. The valid check
AND gate, however, has a one-cycle delay, so we use a DRO cell to duplicate the reset signal so that the
ready bit can be reset correctly without using an α-cell.
However, we do attach an α-cell to the request AND gate. Notice that we treat the valid check AND gate
and the request AND gate differently. The valid check AND gate takes input from a merger tree, which
means this input may arrive late in the clock cycle; if we attached an α-cell, the reset signal could arrive
too close to the input, creating a timing conflict. Both inputs of the request AND gate, in contrast, come
from the ready NDRO cells and arrive at the AND gate at the beginning of the cycle, so using the α-cell to
reset the AND gate at the end of the cycle does not create any timing conflict.
There is still one situation in which the ready bit will be set unintentionally. If we did not broadcast
any tag (tag=0) at cycle N and set the entry valid at cycle N+1 (write operation), the XOR-Merger-NOT
structure would still generate a match signal and pass the valid check. Since tag 0 is usually the zero
register and will not be broadcast, we can add a compare enable signal to the XOR-Merger-NOT structure
to eliminate this situation.
Thus, by understanding the intricacies of the SFQ technology and the microarchitecture design requirement of the wakeup logic, we used the technology-circuit-microarchitecture co-design principle to
build the wakeup logic.
6.4.1.1 Design Alternative
The delay of the merger tree increases as the SrcTag grows wider. Recall that the size of the tags
grows with the ROB/physical register file size. For wide source tags (more than 8 bits), we explored an
alternative design that uses a threshold gate [32] in place of the merger tree. A threshold gate can
have up to 32 inputs and generates an output pulse based on the threshold of the cell. For example, if the
threshold is 6 and at least 6 input pulses arrive, the gate produces an output.
Figure 6.8: CAM-based Wakeup Logic with Threshold Gate
Figure 6.8 shows the CAM design with the threshold gate. We replace the XOR gates with XNOR gates
and set the threshold of the gate to the width of the SrcTag plus one (for the compare enable). Each XNOR
gate produces a "1" when the corresponding bits of the broadcast tag and the SrcTag match. Thus, the
threshold gate is activated and produces an output pulse only when all the bit-wise inputs produce a match
while a compare enable pulse is present. The number of inputs of the threshold gate is highly customizable. This
design improves the overall stability of the CAM circuit when the tags are very wide. We evaluated
the possibility of using this alternate design; however, the current state-of-the-art threshold gates have a
long reset delay of more than 120 ps [32], which is significant for issue queue operation. Hence, the SF-QIQ
design did not adopt this approach after initial evaluation. Recent efforts in building faster, synchronous
threshold gates will likely make this design approach more time- and JJ-efficient.
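A functional sketch of this alternative is shown below (the 8-bit width and the threshold of 9, i.e., width plus compare enable, follow the example in Figure 6.8; the tag values are assumptions):

# Sketch of the XNOR + threshold-gate match for one source tag.
def threshold_match(src_tag, btag, compare_enable, width=8):
    # Each XNOR emits a pulse when the corresponding bits agree.
    agree_pulses = sum(1 for b in range(width)
                       if ((src_tag >> b) & 1) == ((btag >> b) & 1))
    pulses = agree_pulses + (1 if compare_enable else 0)
    threshold = width + 1          # all bits must agree, plus the enable pulse
    return pulses >= threshold

print(threshold_match(0xA5, 0xA5, compare_enable=True))   # True
print(threshold_match(0xA5, 0xA4, compare_enable=True))   # False: one bit differs
print(threshold_match(0xA5, 0xA5, compare_enable=False))  # False: enable missing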
6.4.2 Shift-Based CIRC Port Design
The SrcTag fields in the issue queue cannot be overwritten with new values until the old values are
explicitly reset. Hence, the issue queue nominally needs three ports: a write port to store new SrcTags
when a new instruction is dispatched, a read port, and a reset port. However, the SrcTag fields are
read every cycle for the CAM search operation, so there is no need for a dedicated read port; instead,
a global clock repeatedly reads out the SrcTag fields every cycle. Thus, we only need the write port
and the reset port to access the tag storage.
As we discussed in Section 6.3.1, we want to avoid using the DEMUX tree to generate the one-hot
address for both the write port and the reset port. We exploit the knowledge that SF-QIQ is a circular
queue design and designed a novel SFQ shift-register-based circuit to generate the one-hot address
without a demultiplexer. Figure 6.9 shows the design, which integrates the write and reset ports.
Figure 6.9: Shift-based CIRC Port
In this illustration, we assume the issue queue is 4 entries deep. The issue queue has two sets of shift
registers built with DRO cells: w0 ∼ w3 and rst0 ∼ rst3. An α-cell follows each DRO cell as a reset
port. We explain the operation using a simple scenario where 4 instructions (say i0, i1, i2, i3) are
placed in the issue queue.
6.4.2.1 Initialization
When the circuit is powered on, all DRO cells w0 ∼ w3 and rst0 ∼ rst3 have no pulses stored in their
cells, indicating that all DRO cells have a value "0". We initialize the DRO cells first. We send a write_init
pulse. This pulse sets the DRO w0 to "1". When w0 is set to "1," it indicates that the first available write
position in the issue queue for any new incoming instructions is IQ entry 0.
6.4.2.2 Write Operation
Let us assume i0 is the first instruction dispatched to the issue queue.
To write i0 into the issue queue, we send a write_en pulse. The write enable pulse triggers the
word line signal w_en[0], and the pulse is shifted to w1. This shift to w1 indicates that
the next instruction to enter the issue queue will use issue queue entry 1. w_en[0] is connected to the
SrcTag bit lines through dynamic AND gates [57]. Thus, the w_en[0] signal causes the SrcTag of
IQ entry 0 to be written with the instruction tag payload received from the dispatcher.
Note that the w_en[0] is also connected to the rst3 DRO cell. Thus, when w_en[0] generates a pulse,
not only is the SrcTag field written with the payload, but it also sets the rst3 DRO cell to "1".
When the next instruction i1 needs to be written into the issue queue, the above process repeats.
However, since the w1 DRO is set to "1," the process will write to issue queue entry 1 and also
set rst2 to "1".
Similarly, when i2 and i3 are written to the issue queue, we send two more write_en pulses. Each of
those pulses triggers a write to the next available issue queue entry and also sets the corresponding rst
DRO cell to "1". Thus, by the time all four instructions i0, i1, i2, and i3 enter the issue queue, all the
issue queue entries are occupied with four different instruction SrcTags. In the end, the write pulse
returns to w0, while rst0 ∼ rst3 are all set to "1". Since the IQ is now full, the instruction dispatch unit
should not send any new write_en pulses; it can only send a write_en pulse once i0 has left the IQ. Since
the issue queue is circular, even if i2 leaves the issue queue, we leave that entry as an empty hole until
w2 is set to "1".
6.4.2.3 Issue an Instruction
Now, consider that i0 is a branch instruction that is ready to be issued (by the selection logic described
below). The selection logic generates the grant[0] signal, which is merged into the rst_en[0] signal to
reset IQ entry 0. Hence, we do not need to send any additional reset signal to reuse issue queue entry
0 once i0 leaves the issue queue for execution.
6.4.2.4 Squash Operation
It turns out that in our example, i0 is a mispredicted branch instruction, and we need to clear i1, i2, and
i3 from the IQ. We can do a single bulk reset of i1, i2 and i3 as follows:
Step 1: reset all rst DRO cells. As we mentioned above, rst0 ∼ rst3 are all set to "1", so we need to
reset them first. We send a reset_rst signal to reset DRO rst0 ∼ rst3. Now that all rst DRO cells are
reset to "0", we should make them point to the IQ entry we want to reset first, which is IQ entry 3. We
send a write_en pulse. Now, the pulse stored in w0 will be copied into rst3 and shifted to w1.
Step 2: reset the IQ entry. Now, we can reset the IQ entry 3 by sending a reset_en pulse. The pulse in
rst3 will be shifted back to rst2, and we can generate a rst_en[3] pulse. rst_en[3] is connected to the reset
pins of IQ entry 3, so we can clear i3 properly. Next, we send another reset_en to clear i2. Meanwhile,
during these two reset operations, w3 and w2 are set to "1" as well. So, w1 ∼ w3 are all "1" now.
Step 3: reset the last entry and restore the w DRO cells. Now that we have cleared i3 and i2, there is
still i1 left to be cleared. Before the last reset operation, we need to restore the w DRO cells.
We send a write_rst pulse to reset w1 ∼ w3. Then, we perform the last reset operation by sending a
reset_en pulse. It generates a rst_en[1] pulse to reset IQ entry 1, and w1 is set to "1". Now, there
is only one pulse left in the w DRO cells, in w1. The CIRC port is now ready to receive a new write_en
pulse. Notice that although IQ entry 0 is empty since i0 left the IQ, the next available position is still IQ
entry 1 (w1), since we do not fill any "holes" in the CIRC design.
By building the circular issue queue from basic gates, we successfully designed the access port
without the traditional decoding trees used in random access logic. Our port is driven purely by
control pulses, without any addresses, which reduces the complexity of both the control logic and
the wiring.
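The pulse bookkeeping described above can be checked with a small behavioral model (a Python sketch for a 4-entry IQ, not a circuit netlist; it replays the write and squash walkthroughs from this section):

# Behavioral model of the shift-based CIRC port for a 4-entry IQ.
class CircPort:
    def __init__(self, n=4):
        self.n = n
        self.w = [False] * n        # w0..w3 DRO cells (write pointer)
        self.rst = [False] * n      # rst0..rst3 DRO cells (reset pointer)
        self.w[0] = True            # write_init pulse

    def write_en(self):
        i = self.w.index(True)
        self.w[i] = False
        self.w[(i + 1) % self.n] = True              # pulse shifts to w[i+1]
        self.rst[(self.n - 1 - i) % self.n] = True   # w_en[i] also arms a rst cell
        return i                                     # w_en[i]: entry i is written

    def reset_en(self):
        i = self.rst.index(True)    # in the squash flow only one rst pulse exists
        self.rst[i] = False
        self.rst[(i - 1) % self.n] = True            # pulse shifts back one slot
        self.w[i] = True                             # entry i is writable again
        return i                                     # rst_en[i]: entry i is cleared

    def reset_rst(self):
        self.rst = [False] * self.n                  # bulk reset of the rst cells

    def write_rst(self):
        self.w = [False] * self.n                    # bulk reset of the w cells

port = CircPort()
for _ in range(4):
    port.write_en()        # dispatch i0..i3; IQ is full, pulse back at w0
# i0 is a mispredicted branch: bulk-squash i3, i2, i1.
port.reset_rst()           # step 1: clear all rst cells...
port.write_en()            # ...then copy the w0 pulse into rst3
print(port.reset_en())     # 3 -> rst_en[3] clears i3
print(port.reset_en())     # 2 -> rst_en[2] clears i2
port.write_rst()           # step 3: clear the w cells before the last reset
print(port.reset_en())     # 1 -> rst_en[1] clears i1; only w1 holds a pulse now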
6.4.3 Selection Logic Design
We now describe how the selection logic picks instructions for execution based on age. Our selection logic
can handle four-input priority based on the position of the instruction within the issue queue, as
shown in Figure 6.10. To handle a larger number of issue queue entries, we use a tree structure, as shown
in Figure 6.11.
Since we actively read out the content of the CAM every cycle, once an instruction is ready, it
keeps sending a Req signal every cycle (see Figure 6.7). Rather than receiving and managing duplicate
Req signals, we capture the first Req signal and buffer it; once an issue queue entry is ready, we
need extra logic to filter out any repeat pulses. SF-QIQ uses a set of DRO cells with α-cells to store the Req
signal. The DRO with α-cell acts as a buffer that stores only the first Req pulse and prevents any future pulses
originating from the issue queue from propagating forward.
Once the request signals are read from the DRO, we will split them into two copies. Our goal in this
design is to select the two oldest ready instructions. Hence, one copy is sent to the high-priority selection
logic, and another copy is sent to the low-priority selection logic.
Figure 6.10: Selection Logic Design (a) High Priority (b) Low Priority (c) Overview of the Whole Design
Figure 6.11: Selection Tree Logic
We now explain the operation with an illustrative example. Let us assume that i0, i2, and i3 are ready
to be issued while i1 is waiting for some long-latency dependency to be resolved. The three ready pulses
are stored in D0, D2, and D3 and are read out by enabling the ASK signal, which activates the selection
process.
In the high-priority selection logic, once the request signals are read from the D0 ∼ D3, we use a set of
mergers to generate the REQ_HP signal. REQ_HP indicates that there is at least one ready instruction
among the four Req signal inputs received in the high-priority selection logic. REQ_HP signal is stored
in a DRO cell after the merger to enable stable circuit operation.
We use another column of DRO cells, H0 ∼ H3, to receive a copy of the Req signals once the ASK
signal is enabled. Notice that each of H0 ∼ H3 is connected to the outputs of their corresponding D0 ∼
D3 cells via a splitter. Hence, in our current example, H0, H2, and H3 will be set to "1" once the ASK
pulse is transmitted.
We then use the pulse generated from D0 as inputs to the α-cells of H1, H2, and H3. Thus, if there is
a pulse output from D0, it would automatically reset any inputs stored in H1, H2, H3. Similarly, we use
the pulse generated from D1 as inputs to the α-cells of H2, H3. Thus, if there is a pulse output from D1,
it would automatically reset any inputs stored in H2, H3. This process repeats. In our running example,
the D0 pulse would reset the Req[2] and Req[3] signals stored in H2, H3, which are lower priority than
the Req[0] signal. Recall that our selection logic follows the oldest first policy, and the D DRO cell output
pulses achieve that goal by only keeping the Req[0] signal alive in H0 while resetting the remaining lower
priority request signals.
The splitters and mergers delay the reset signals so that the H DRO cells can be reset while satisfying
the timing requirement [38]. We show all the splitters and mergers on this critical path in Figure 6.10.
Once the lower-priority requests are reset, at most one instruction is granted permission to execute.
The REQ_HP signal is used to generate the GRNT_HP (grant high priority) signal, which is sent to the
selection logic. Once GRNT_HP arrives, the H DRO cell with the highest priority releases its pulse as
the high-priority one-hot grant signal, grant_hp[0 ∼ 3]. In our example, only H0 will generate the
grant_hp[0] signal.
Note that GRNT_HP is not the same as REQ_HP in our design, because we need a tree
of selection logic circuits when there are more than 4 issue queue entries. Hence, the GRNT_HP signal
is routed to the correct tree node.
In the low-priority selection logic, once the request signals are read from the DRO D0 ∼ D3, we use a
clockless five-input majority gate [39] to generate the REQ_LP signal. It can generate the output without
using the clock and can self-reset after a short period. Similarly, we store the REQ_LP signal in a DRO
cell. REQ_LP indicates that there are at least two ready instructions among these four instructions.
We also use another row of DRO cells, L1 ∼ L3, to store the request signals. However, we do not need
to store D0, since it has the highest priority and will always be granted by the high-priority selection
logic. Similarly, we set and reset these DRO cells conditionally with a clockless AND [57] and a three-input
majority gate [39]. As a result, only L2 will store a pulse. By doing so, when the GRNT_LP (grant low
priority) signal arrives, the DRO cell with the second-highest priority (L2) releases its pulse as the
low-priority grant signal grant_lp[2].
Figure 6.10c shows how these two selection logic blocks are connected. We merge the grant_hp and grant_lp
signals to generate the grant signals that are used to read out the payload RAM and reset the corresponding
issue queue entries. In our running example, grant_hp[0] generates grant[0] and grant_lp[2] generates
grant[2]. Hence, i0 and i2 can be issued properly.
By having two sets of selection logic, we can select two instructions at the same time without introducing more delay. The connections of GRNT_HP, GRNT_LP, REQ_HP, and REQ_LP are shown
in Figure 6.11. The tree structure can select 2 instructions at the same time in a larger instruction window.
Once we generate the correct grant signals, we reset all the selection logic.
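Functionally, the two selection blocks compute the following (a behavioral sketch of the oldest-first policy, not the circuit itself; the request vector is from the running example):

# Functional sketch of the position-based selection: grant the oldest
# (high-priority) and second-oldest (low-priority) ready entries.
def select_two(req):
    """req[i] is True if IQ entry i has a buffered Req pulse;
    a lower index means an older instruction (before any wrap-around)."""
    n = len(req)
    grant_hp, grant_lp = [False] * n, [False] * n
    ready = [i for i, r in enumerate(req) if r]
    if len(ready) >= 1:            # REQ_HP: at least one ready request
        grant_hp[ready[0]] = True
    if len(ready) >= 2:            # REQ_LP: at least two ready requests
        grant_lp[ready[1]] = True
    # grant = grant_hp OR grant_lp; used to read the payload RAM and
    # reset the granted issue queue entries.
    return [h or l for h, l in zip(grant_hp, grant_lp)]

# Running example: i0, i2, i3 ready, i1 waiting.
print(select_two([True, False, True, True]))  # [True, False, True, False]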
6.4.3.1 CIRC with Correct Order
When the CIRC queue wraps past the last entry, the order is no longer given by position, as shown
in Figure 6.12a. To enforce the correct order, we can add one extra bit per entry to indicate that the order
is reversed, as shown in Figure 6.12b. While writing into the issue queue, if the tail counter is larger than
the head counter, we set this bit to "0"; otherwise, this bit is "1" by default. We can connect the reset
pin of the NDRO to the compare logic and connect the set pin to the reset signal. When the Req signal
is generated, we check whether the entry is in the reverse-order section; if so, Req_normal is still "1," and
the Req_fixed will be "0". These two signals are sent to two different selection logic blocks. Once the tail
counter becomes larger than the head counter again, we can set all the extra bits to "1".
Figure 6.12: (a) Reversed Order (b) Fix Order Mechanism
We can use another four-input position-based selection to select the correct instructions to issue, as
shown in Figure 6.13. The Req_fixed selection has a higher priority than the Req_normal selection. By
doing so, we can select two instructions with the correct priority at the same time.
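The request split can be sketched as follows (the head/tail counter convention here is an assumption for illustration; entries in the wrapped region are the older, pre-wrap instructions and therefore go to the higher-priority Req_fixed selection):

# Sketch of the order-fix request split between the two selection blocks.
def split_requests(req, head, tail):
    n = len(req)
    wrapped = tail <= head              # valid region wraps past entry n-1
    req_fixed, req_normal = [False] * n, [False] * n
    for i, r in enumerate(req):
        if not r:
            continue
        if wrapped and i >= head:
            req_fixed[i] = True         # older entries: higher-priority selection
        else:
            req_normal[i] = True
    return req_fixed, req_normal

# Figure 6.12a: entries 2,3 hold instructions n, n+1; entries 0,1 hold
# the younger n+2, n+3.
fixed, normal = split_requests([True, True, True, True], head=2, tail=2)
print(fixed)   # [False, False, True, True]  -- selected first
print(normal)  # [True, True, False, False]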
6.4.4 HC-DRO Payload RAM
In a RAM design, HC-DRO cells use far fewer JJs than NDRO cells, so we decided to use HC-DRO cells
for the payload RAM storage. The read addresses are the grant signals.
Figure 6.13: Selection logic with correct order
To correctly reset the payload RAM after a branch misprediction-related squash, we can either use the
LoopBuffer design described in Chapter 4 or the α-cell [38].
6.5 Evaluation
The evaluation of SF-QIQ focuses on two aspects: hardware performance and software performance.
6.5.1 Hardware Performance
As described in Section 6.3, we consider three possible issue logic options:
SHIFT, CIRC, and RAND with AGE, and two different wakeup logic options: CAM and RAM. Thus, one
can design the issue queue using six different configurations. SF-QIQ itself uses CIRC+CAM based on the
detailed analysis provided in Section 6.3. We did not include the payload RAM in our hardware evaluation
since it is the same across all these designs.
We built the SF-QIQ design as Verilog netlists using the publicly available cell library [61]. We successfully
verified the functionality and timing with different combinations of inputs, and then calculated
the total JJ count and power consumption using the SFQ cell library provided by [61].
The SHIFT-based issue logic's JJ count and power consumption could not be obtained precisely, since
the Verilog implementation of shifting and compacting entries requires peripheral circuits that
could only be modeled in software.
The RAM-based wakeup logic was built based on [60]. For the RAM design, we measure the
JJ count and power consumption of the access port and RAM storage. However, some of the peripheral
circuitry is not counted in the JJ count; hence, the RAM logic JJ counts are optimistic.
We consider two setups: (1) the register tags are 7 bits, hence 128 physical registers, and the issue
queue has 32 entries; (2) the register tags are 8 bits, hence 256 physical registers, and the issue queue
has 64 entries. For both setups, we can broadcast two finished tags at the same time to account for the
2-wide issue design.
6.5.1.1 JJ Count
Table 6.1 shows the JJ count for different IQ designs. The RAM-based wakeup logic requires more
than 3 times the JJs of CAM across all designs. The reason is that the RAM-based logic requires more
storage to hold all the dependency information. For example, in the RAM table of the 7-bit register tag and
32-entry issue queue design, we need to store around 224 B of information, whereas the CAM only needs
around 52 B. Meanwhile, the height of the RAM table is determined by the number of physical registers. In SFQ,
since a taller table needs a bigger DEMUX tree than a wider table, an H × w RAM costs more JJs than a
w × H RAM (H > w). Hence, the height of the RAM table makes the RAM design even bigger. Even
when we optimize the RAM-based wakeup logic with HC-DRO, this design still costs more JJs.
As for the CAM-based wakeup logic, the JJ cost of the different designs is roughly the same, since
more than 80% of the JJs are used in the CAM wakeup logic itself. Our DEMUX-free port design does,
however, cost fewer JJs. The CIRC-fix column shows the JJ cost of the CIRC design with correct-order
support; its JJ cost is also similar. However, when we consider the selection logic, the RAND+AGE design
doubles its JJ usage. This is because the RAND+AGE design needs a large memory to store the age matrix,
and its selection logic is a deep XOR tree, both of which consume many JJs.
(a) 7-bit register tag and 32-entry issue queue

                 |      RAND+AGE              |        CIRC                |       CIRC-fix
Wakeup Logic     | Wakeup  Selection  Total   | Wakeup  Selection  Total   | Wakeup  Selection  Total
CAM              |  50088    53789   103877   |  48264     4520    52784   |  49410     9336    58746
RAM              | 152249    53789   206038   | 152249     4520   156769   | 154099     9336   163435
RAM with HC-DRO  |  83308    53789   137097   |  83308     4520    87828   |  85158     9336    94494

(b) 8-bit register tag and 64-entry issue queue

                 |      RAND+AGE              |        CIRC                |       CIRC-fix
Wakeup Logic     | Wakeup  Selection  Total   | Wakeup  Selection  Total   | Wakeup  Selection  Total
CAM              | 111766   216125   327891   | 107899     9336   117235   | 110197    18968   129165
RAM              | 338501   216125   554626   | 338501     9336   347837   | 342207    18968   361175
RAM with HC-DRO  | 178401   216125   394526   | 178401     9336   187737   | 182107    18968   201075

Table 6.1: The JJ count across different issue queue designs
On the other hand, our position-based selection logic costs far fewer JJs. Although CIRC-fix requires
another set of position-based selection logic, the extra 10% total JJ cost is still acceptable.
Overall, among all the different design choice combinations, our SF-QIQ design, which uses the CAM-based
wakeup logic and position-based CIRC selection logic, consumes the least amount of JJs.
6.5.1.2 Power Consumption
Table 6.2 shows the static power for different IQ designs. In SFQ technology, unlike CMOS, static power
accounts for more than 90% of the total power; hence, we use static power as our metric. Since static
power is positively correlated with JJ count, our SF-QIQ design using CAM+CIRC consumes the least
static power: 17.55 mW for the 32-entry IQ and 38.95 mW for the 64-entry IQ. The RAM-based wakeup
logic and RAND+AGE selection logic both increase the total static power significantly. Meanwhile, our
CIRC-fix introduces only around 11% power overhead.
(a) 7-bit register tag and 32-entry issue queue

Wakeup Logic     | RAND+AGE          | CIRC              | CIRC-fix
CAM              | 32.15 (83.19%)    | 17.55 (0%)        | 19.62 (11.80%)
RAM              | 62.27 (254.81%)   | 48.14 (174.29%)   | 50.44 (187.39%)
RAM with HC-DRO  | 41.90 (138.77%)   | 27.77 (58.24%)    | 30.07 (71.35%)

(b) 8-bit register tag and 64-entry issue queue

Wakeup Logic     | RAND+AGE          | CIRC              | CIRC-fix
CAM              | 100.00 (156.76%)  | 38.95 (0%)        | 43.09 (10.64%)
RAM              | 167.13 (329.12%)  | 107.09 (174.96%)  | 111.69 (186.78%)
RAM with HC-DRO  | 119.38 (206.52%)  | 59.34 (52.36%)    | 63.94 (64.18%)

Table 6.2: The static power (mW) and relative overhead across different issue queue designs
6.5.2 Software Performance
We used the gem5 simulator [4, 44, 24] to evaluate the software performance. We used the O3CPU as the
base model and changed the machine width to two, using the same two issue queue configurations
described above. We also changed the delay of each stage to match the gate-level pipeline; the delay of
each stage is extracted from prior work [74, 73, 51]. As for benchmarks, we chose SPEC CPU2017 [67],
selecting the SPECrate Integer suite and testing with the ref dataset. Due to slow simulation speed, we
fast-forwarded the first 10 billion instructions and gathered the total cycles of running the next 1 billion
instructions for each benchmark.
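For reference, a minimal sketch of such a configuration is shown below (it assumes a gem5 version that exposes the DerivO3CPU class; newer releases use ISA-specific class names, and the delay value shown is illustrative rather than our exact setting):

# Sketch of a 2-wide O3 configuration; parameter names follow gem5's
# classic O3 CPU model.
from m5.objects import DerivO3CPU

cpu = DerivO3CPU()
cpu.fetchWidth = 2            # 2-wide machine
cpu.decodeWidth = 2
cpu.issueWidth = 2
cpu.commitWidth = 2
cpu.numIQEntries = 32         # 64 in the second configuration
cpu.numPhysIntRegs = 128      # matches the 7-bit register tags
# Stage-to-stage delays stretched to model the SFQ gate-level pipeline,
# with latencies taken from prior work [74, 73, 51].
cpu.issueToExecuteDelay = 5   # e.g., the 5-cycle 32-entry selection logic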
We compared four different designs: (1) baseline: a SHIFT design with our position-based selection
logic, ignoring the overhead of shifting in the issue queue; (2) the RAND+AGE design, for which we
designed the selection logic and used the corresponding delay (as described above, the RAND+AGE
design can only issue one instruction at a time); (3) the CIRC design; and (4) the CIRC design with the
correct order. Both (3) and (4) use position-based selection logic. We also accounted for the delay of the
DEMUX in (1) and (2).
Figure 6.14 shows the CPI overhead for each benchmark and the average CPI overhead across all benchmarks, compared to the SHIFT baseline. Since the baseline has maximum space efficiency and the fastest
selection, the CPI overhead is more than 20% on average for our CIRC and CIRC-fix designs. However, as
we described, we did not account for the overhead of shifting the instructions, and it is a big challenge to
realize such a design.
Figure 6.14: CPI overhead over baseline (SHIFT) of different SPEC CPU2017 benchmarks for different designs (a) 32-entry IQ (b) 64-entry IQ
When compared with the RAND+AGE design, since the AGE design can only issue one instruction at a
time, both CIRC and CIRC-fix outperform RAND+AGE significantly: they show more than 30% performance
improvement over RAND+AGE when the issue queue has 32 entries and more than 50% improvement when
the issue queue has 64 entries. Notice that when the IQ has 64 entries, CIRC-fix is slower than the CIRC
design. This is because the selection logic of CIRC-fix is one cycle slower than that of CIRC, and the order
wrap-around happens less frequently when the issue queue has more entries.
6.5.2.1 IQ Size and Selection Logic Delay
The latency of our position-based selection logic is related to the size of the issue queue. For example,
the selection logic of a 32-entry IQ has a 5-cycle delay, and the selection logic of a 64-entry IQ has a 6-cycle
delay. When we correct the order of the CIRC design, the extra selection circuit also introduces one extra
cycle of delay. Hence, having a bigger issue queue or fixing the order may not benefit the overall
performance. To find the best setup, we ran the software simulation with different queue sizes.
Figure 6.15: Relative IPC with different IQ sizes
Figure 6.15 shows the relative IPC. We use the 16-entry SHIFT design as the baseline, so its relative IPC
is 1. Even though we could not build the SHIFT design, it has the ideal space efficiency, so we used it to
isolate the impact of the selection logic delay. The SHIFT design peaks when the IQ has 32 entries and has
the worst IPC when the IQ has 128 entries. This shows that once the useful space reaches a certain point,
the performance becomes more sensitive to the delay of the selection logic. Similarly, the CIRC design
reaches its peak performance when the IQ has 64 entries. For the CIRC-fix design, however, the peak
performance is reached when the IQ has 128 entries, and when the IQ has 64 entries, CIRC-fix is even
worse than the CIRC design. This is because the order wrap-around happens with different frequency at
different sizes, and it is hard to predict the overall performance gain given both the order fixing and the
longer latency of the selection logic. Hence, while choosing the size of the IQ, we need to take both CIRC
and CIRC-fix into consideration.
6.6 Conclusion
High-performance CPUs are designed using out-of-order (OoO) execution paradigms. OoO CPUs, however, require efficient instruction issue queues. We designed SF-QIQ, an instruction issue queue
based on CAM-based wakeup logic and a non-collapsible circular queue (CIRC). We compared different
design choices, such as CAM versus RAM wakeup logic and collapsible versus non-collapsible issue queues. We
demonstrated how to build CAM-based wakeup logic, a shift-based CIRC port design without the heavy
decoder, and a simple, fast, position-based selection logic. We also showed how to fix the order error in
the CIRC issue queue with minimal hardware cost and how to use HC-DRO cells to further reduce the
JJ cost in the issue queue design. We showed that SF-QIQ has the lowest JJ count compared with other
issue queue designs and achieves minimal performance degradation when compared with the ideal issue
queue design.
Chapter 7
Future Work
All the microarchitecture designs proposed in this thesis are only some of the parts of a modern out-of-order
CPU, and there are still many challenges on the way to a complete SFQ CPU.
Modern out-of-order superscalar CPUs tend to have a machine width of 4 or more, which means we
need to access a single microarchitectural structure multiple times simultaneously and read multiple copies
of its information. As shown in Chapter 4, having multiple memory access ports increases the JJ cost
dramatically. Our solution was to divide the register file into two banks. However, not every storage
structure in the CPU is suitable for banking, and banking becomes more difficult as the machine width
grows. A possible solution is to design a random access decoder with a small JJ cost and a small delay,
which would also enable more powerful designs, such as the TAGE branch predictor.
Many critical building blocks are still missing from SFQ CPU design, such as the reorder buffer (ROB)
and the load-store queue (LSQ). More circuits related to the control path are also important: an SFQ CPU
is a deeply pipelined design, and efficient control path logic can easily improve the overall performance.
Successfully building all the parts of the CPU will pave the way to a fully functional out-of-order SFQ CPU.
Chapter 8
Conclusions
Single Flux Quantum (SFQ) superconducting technology has a considerable advantage over CMOS in
power and performance. Recently, significant developments in VLSI design automation tools have made it
feasible to design pipelined SFQ CPUs. Hence, it is important to explore the designs of different SFQ CPU
critical components.
On-chip memory is a big challenge for SFQ technology. We proposed the High Capacity Destructive
ReadOut (HC-DRO) cell to address this issue. An HC-DRO cell can hold up to three SFQ pulses, which
means it can store 2 bits of information in one memory cell, thereby providing an opportunity to double
the memory density. Building on HC-DRO cells, we designed a register file called HiPerRF. HC-DRO
cells provide only destructive readout capability; we showed how to give HiPerRF a non-destructive read
property using a loopback write mechanism, thereby preserving the higher density of HC-DRO cells
without compromising the multi-read demands of a register file. HiPerRF reduces the JJ count of the
register file while maintaining similar performance.
Deeply pipelined SFQ CPUs suffer from a large branch misprediction penalty, which makes a good
branch predictor necessary. To tackle this challenge, we proposed SuperBP, a perceptron branch predictor
that also uses HC-DRO cells to store the perceptron weights. We presented novel inference and prediction
update circuits for the perceptron predictor that operate directly on the native 2-bit HC-DRO weights
without decoding and encoding, thereby reducing JJ use. SuperBP reduces the JJ count by 39% compared
to the NDRO-based design, and under the same JJ budget, SuperBP has higher performance than its
hashed variant.
High-performance CPUs are designed using out-of-order (OoO) execution paradigms. OoO CPUs,
however, require efficient instruction issue queues. We designed SF-QIQ, an instruction issue queue
based on CAM-based wakeup logic and a non-collapsible circular queue (CIRC). We compared different
design choices, such as CAM versus RAM wakeup logic and collapsible versus non-collapsible issue queues. We
demonstrated how to build CAM-based wakeup logic, a shift-based CIRC port design without the
heavy decoder, and a simple, fast, position-based selection logic. We also showed how to fix the order
error in the CIRC issue queue with minimal hardware cost and how to use HC-DRO cells to further reduce
the JJ cost in the issue queue design. SF-QIQ achieves similar performance to an ideal issue queue with a
more complicated selection logic.
Bibliography
[1] Hideki Ando. “SWQUE: A mode switching issue queue with priority-correcting circular queue”. In:
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture. 2019,
pp. 506–518.
[2] Yuki Ando, Ryo Sato, Masamitsu Tanaka, Kazuyoshi Takagi, Naofumi Takagi, and Akira Fujimaki.
“Design and demonstration of an 8-bit bit-serial RSFQ microprocessor: CORE e4”. In: IEEE
Transactions on Applied Superconductivity 26.5 (2016), pp. 1–5.
[3] David Bernstein, Michael Rodeh, and Izidor Gertner. “On the complexity of scheduling problems
for parallel/pipelined machines”. In: IEEE Transactions on computers 38.9 (1989), pp. 1308–1313.
[4] Nathan L. Binkert, Bradford M. Beckmann, Gabriel Black, Steven K. Reinhardt, Ali G. Saidi,
Arkaprava Basu, Joel Hestness, Derek Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen,
Korey Sewell, Muhammad Shoaib Bin Altaf, Nilay Vaish, Mark D. Hill, and David A. Wood. “The
gem5 simulator”. In: SIGARCH Comput. Archit. News 39.2 (2011), pp. 1–7. doi:
10.1145/2024716.2024718.
[5] Norman O Birge and Manuel Houzet. “Spin-Singlet and Spin-Triplet Josephson Junctions for
Cryogenic Memory”. In: IEEE Magnetics Letters 10 (2019), pp. 1–5.
[6] Norman O Birge, Alexander E Madden, and Ofer Naaman. “Ferromagnetic Josephson junctions for
cryogenic memory”. In: Spintronics XI. Vol. 10732. International Society for Optics and Photonics.
2018, p. 107321M.
[7] Edward Brekelbaum, Jeff Rupley, Chris Wilkerson, and Bryan Black. “Hierarchical scheduling
windows”. In: 35th Annual IEEE/ACM International Symposium on Microarchitecture,
2002.(MICRO-35). Proceedings. IEEE. 2002, pp. 27–36.
[8] Paul Bunyk, Mike Leung, John Spargo, and Mikhail Dorojevets. “FLUX-1 RSFQ microprocessor:
Physical design and test results”. In: IEEE Trans. Appl. Supercon. 13.2 (2003), pp. 433–436.
[9] Alper Buyuktosunoglu, David Albonesi, Stanley Schuster, David Brooks, Pradip Bose, and
Peter Cook. “A circuit level implementation of an adaptive issue queue for power-aware
microprocessors”. In: Proceedings of the 11th Great Lakes symposium on VLSI. 2001, pp. 73–78.
[10] Championship Branch Prediction (CBP-5). https://jilp.org/cbp2016/.
[11] Dibei Chen, Tairan Zhang, Yi Huang, Jianfeng Zhu, Yang Liu, Pengfei Gou, Chunyang Feng,
Binghua Li, Shaojun Wei, and Leibo Liu. “Orinoco: Ordered Issue and Unordered Commit with
Non-Collapsible Queues”. In: Proceedings of the 50th Annual International Symposium on Computer
Architecture. 2023, pp. 1–14.
[12] W Chen, AV Rylyakov, Vijay Patel, JE Lukens, and KK Likharev. “Rapid single flux quantum T-flip
flop operating up to 770 GHz”. In: IEEE Transactions on Applied Superconductivity 9.2 (1999),
pp. 3212–3215.
[13] IEEE International Roadmap for Devices and Systems. Cryogenic Electronics and Quantum
Information Processing. Tech. rep. Institute of Electrical and Electronics Engineers, 2023.
[14] Mikhail Dorojevets, Christopher L Ayala, Nobuyuki Yoshikawa, and Akira Fujimaki. “8-bit
asynchronous sparse-tree superconductor RSFQ arithmetic-logic unit with a rich set of
operations”. In: IEEE transactions on applied superconductivity 23.3 (2012), pp. 1700104–1700104.
[15] Mikhail Dorojevets and Zuoting Chen. “Fast pipelined storage for high-performance
energy-efficient computing with superconductor technology”. In: 2015 12th International
Conference & Expo on Emerging Technologies for a Smarter World (CEWIT). IEEE. 2015, pp. 1–6.
[16] Keith I Farkas, Norman P Jouppi, and Paul Chow. “Register file design considerations in
dynamically scheduled processors”. In: Proceedings. Second International Symposium on
High-Performance Computer Architecture. IEEE. 1996, pp. 40–51.
[17] James A Farrell and Timothy C Fischer. “Issue logic for a 600-mhz out-of-order execution
microprocessor”. In: IEEE Journal of Solid-State Circuits 33.5 (1998), pp. 707–712.
[18] T Filippov, M Dorojevets, A Sahu, A Kirichenko, C Ayala, and O Mukhanov. “8-bit asynchronous
wave-pipelined RSFQ arithmetic-logic unit”. In: IEEE transactions on applied superconductivity 21.3
(2011), pp. 847–851.
[19] Coenrad J Fourie. “Extraction of DC-biased SFQ circuit verilog models”. In: IEEE Transactions on
Applied Superconductivity 28.6 (2018), pp. 1–11.
[20] Coenrad Johann Fourie, Kyle Jackman, Matthys M Botha, Sasan Razmkhah, Pascal Febvre,
Christopher Lawrence Ayala, Qiuyun Xu, Nobuyuki Yoshikawa, Erin Patrick, Mark Law, et al.
“ColdFlux superconducting EDA and TCAD tools project: Overview and progress”. In: IEEE
Transactions on Applied Superconductivity 29.5 (2019), pp. 1–7.
[21] K Fujiwara, H Hoshina, Y Yamashiro, and N Yoshikawa. “Design and component test of SFQ shift
register memories”. In: IEEE transactions on applied superconductivity 13.2 (2003), pp. 555–558.
[22] K Fujiwara, Y Yamashiro, Nobuyuki Yoshikawa, Y Hashimoto, S Yorozu, H Terai, and A Fujimaki.
“High-speed test of SFQ-shift register files using PTL wiring”. In: Physica C: Superconductivity 412
(2004), pp. 1586–1590.
[23] Kris Gaj, Eby G Friedman, and Marc J Feldman. “Timing of multi-gigahertz rapid single flux
quantum digital circuits”. In: J. VLSI Sig. Proc. Sys. for Sig., Image Vid. Tech. 16.2-3 (1997),
pp. 247–276.
[24] Andreas Hansson, Neha Agarwal, Aasheesh Kolli, Thomas F. Wenisch, and Aniruddha N. Udipi.
“Simulating DRAM controllers for future system architecture exploration”. In: 2014 IEEE
International Symposium on Performance Analysis of Systems and Software, ISPASS 2014, Monterey,
CA, USA, March 23-25, 2014. IEEE Computer Society, 2014, pp. 201–210. doi:
10.1109/ISPASS.2014.6844484.
[25] John L Hennessy and David A Patterson. Computer architecture: a quantitative approach. Elsevier,
2011.
[26] Quentin P Herr, Joshua Osborne, Micah JA Stoutimore, Harold Hearne, Ryan Selig, Jacob Vogel,
Eileen Min, Vladimir V Talanov, and Anna Y Herr. “Reproducible operating margins on a 72
800-device digital superconducting chip”. In: Superconductor Science and Technology 28.12 (2015),
p. 124003.
[27] Amol Inamdar, Sukanya S Meher, Benjamin Chonigman, Anubhav Sahu, Jushya Ravi, and
Deepnarayan Gupta. “50 GHz Operation of RSFQ Arithmetic Logic Unit Designed Using the
Advanced Design Flow and the Dual RSFQ/ERSFQ Cell Library”. In: IEEE Transactions on Applied
Superconductivity 33.5 (2023), pp. 1–8.
[28] M Ameen Jardine and Coenrad Johann Fourie. “Hybrid RSFQ-QFP superconducting neuron”. In:
IEEE Transactions on Applied Superconductivity 33.4 (2023), pp. 1–9.
[29] Daniel A Jiménez. “Multiperspective perceptron predictor”. In: 5th JILP Workshop on Computer
Architecture Competitions (JWAC-5): Championship Branch Prediction (CBP-5). 2016.
[30] Daniel A Jiménez and Calvin Lin. “Dynamic branch prediction with perceptrons”. In: Proceedings
HPCA Seventh International Symposium on High-Performance Computer Architecture. IEEE. 2001,
pp. 197–206.
[31] Mustafa Altay Karamuftuoglu and Massoud Pedram. “α-Soma: Single Flux Quantum Threshold
Cell for Spiking Neural Network Implementations”. In: IEEE Transactions on Applied
Superconductivity 33.5 (2023), pp. 1–5.
[32] Mustafa Altay Karamuftuoglu, Beyza Zeynep Ucpinar, Arash Fayyazi, Sasan Razmkhah,
Mehdi Kamal, and Massoud Pedram. “Scalable superconductor neuron with ternary synaptic
connections for ultra-fast SNN hardware”. In: Superconductor Science and Technology 38.2 (2025),
p. 025014.
[33] Naveen Katam, Alireza Shafaei, and Massoud Pedram. “Design of multiple fanout clock
distribution network for rapid single flux quantum technology”. In: 2017 22nd Asia and South
Pacific Design Automation Conference (ASP-DAC). IEEE. 2017, pp. 384–389.
[34] Naveen K Katam, Haipeng Zha, M Pedram, and M Annavaram. “Multi Fluxon Storage and its
Implications for Microprocessor Design”. In: Journal of Physics: Conference Series. Vol. 1559. 1. IOP
Publishing. 2020, p. 012004.
[35] Naveen Kumar Katam, Jamil Kawa, and Massoud Pedram. “Challenges and the status of
superconducting single flux quantum technology”. In: 2019 Design, Automation & Test in Europe
Conference & Exhibition (DATE). IEEE. 2019, pp. 1781–1787.
[36] Naveen Kumar Katam and Massoud Pedram. “Logic optimization, complex cell design, and
retiming of Single Flux Quantum circuits”. In: IEEE Transactions on Applied Superconductivity 28.7
(2018), pp. 1–9.
[37] Alex F Kirichenko, Anubhav Sahu, Timur V Filippov, Oleg A Mukhanov, Andriy V Dotsenko,
Mikhail Dorojevets, and Artur K Kasperek. “Demonstration of an 8× 8-bit RSFQ multi-port
register file”. In: 2013 IEEE 14th International Superconductive Electronics Conference (ISEC). IEEE.
2013, pp. 1–3.
[38] Yasemin Kopur, Beyza Zeynep Ucpinar, Mustafa Altay Karamuftuoglu, Sasan Razmkhah, and
Massoud Pedram. “AR-SFQ: Asynchronous Reset Library Using α-Cell Design”. In: arXiv preprint
arXiv:2501.09449 (2025).
[39] Gleb Krylov and Eby G Friedman. “Asynchronous dynamic single-flux quantum majority gates”.
In: IEEE Transactions on Applied Superconductivity 30.5 (2020), pp. 1–7.
[40] Souvik Kundu, Gourav Datta, Peter A Beerel, and Massoud Pedram. “qBSA: Logic design of a
32-bit block-skewed RSFQ arithmetic logic unit”. In: 2019 IEEE International Superconductive
Electronics Conference (ISEC). IEEE. 2019, pp. 1–3.
[41] Chih-Chieh Lee, I-CK Chen, and Trevor N Mudge. “The bi-mode branch predictor”. In: Proceedings
of 30th Annual International Symposium on Microarchitecture. IEEE. 1997, pp. 4–13.
[42] Mingye Li, Bo Zhang, and Massoud Pedram. “Striking a Good Balance Between Area and
Throughput of RSFQ Circuits Containing Feedback Loops”. In: IEEE Transactions on Applied
Superconductivity 33.5 (2023), pp. 1–6.
[43] Konstantin K Likharev and Vasilii K Semenov. “RSFQ logic/memory family: A new
Josephson-junction technology for sub-terahertz-clock-frequency digital systems”. In: IEEE
Transactions on Applied Superconductivity 1.1 (1991), pp. 3–28.
[44] Jason Lowe-Power, Abdul Mutaal Ahmad, Ayaz Akram, Mohammad Alian, Rico Amslinger,
Matteo Andreozzi, Adrià Armejach, Nils Asmussen, Srikant Bharadwaj, Gabe Black,
Gedare Bloom, Bobby R. Bruce, Daniel Rodrigues Carvalho, Jerónimo Castrillón, Lizhong Chen,
Nicolas Derumigny, Stephan Diestelhorst, Wendy Elsasser, Marjan Fariborz,
Amin Farmahini Farahani, Pouya Fotouhi, Ryan Gambord, Jayneel Gandhi, Dibakar Gope,
Thomas Grass, Bagus Hanindhito, Andreas Hansson, Swapnil Haria, Austin Harris,
Timothy Hayes, Adrian Herrera, Matthew Horsnell, Syed Ali Raza Jafri, Radhika Jagtap,
Hanhwi Jang, Reiley Jeyapaul, Timothy M. Jones, Matthias Jung, Subash Kannoth,
Hamidreza Khaleghzadeh, Yuetsu Kodama, Tushar Krishna, Tommaso Marinelli, Christian Menard,
Andrea Mondelli, Tiago Mück, Omar Naji, Krishnendra Nathella, Hoa Nguyen, Nikos Nikoleris,
Lena E. Olson, Marc S. Orr, Binh Pham, Pablo Prieto, Trivikram Reddy, Alec Roelke,
Mahyar Samani, Andreas Sandberg, Javier Setoain, Boris Shingarov, Matthew D. Sinclair, Tuan Ta,
Rahul Thakur, Giacomo Travaglini, Michael Upton, Nilay Vaish, Ilias Vougioukas,
Zhengrong Wang, Norbert Wehn, Christian Weis, David A. Wood, Hongil Yoon, and
Éder F. Zulian. “The gem5 Simulator: Version 20.0+”. In: CoRR abs/2007.03152 (2020). arXiv:
2007.03152. url: https://arxiv.org/abs/2007.03152.
[45] Scott McFarling. Combining branch predictors. Tech. rep. Citeseer, 1993.
[46] Mititada Morisue, Masayuki Kaneko, and Hiroo Hosoya. “A content addressable memory using
Josephson junctions”. In: IEEE Transactions on Applied Superconductivity 1.1 (1991), pp. 48–53.
[47] Oleg A Mukhanov. “Energy-efficient single flux quantum technology”. In: IEEE Transactions on
Applied Superconductivity 21.3 (2011), pp. 760–769.
[48] T Onomi, T Kondo, and K Nakajima. “High-speed single flux-quantum up/down counter for neural
computation using stochastic logic”. In: Journal of Physics: Conference Series. Vol. 97. 1. IOP
Publishing. 2008, p. 012187.
[49] Subbarao Palacharla, Norman P Jouppi, and James E Smith. Quantifying the complexity of
superscalar processors. Tech. rep. University of Wisconsin-Madison Department of Computer
Sciences, 1996.
[50] Harish Patil, Robert Cohn, Mark Charney, Rajiv Kapoor, Andrew Sun, and Anand Karunanidhi.
“Pinpointing representative portions of large intel® itanium® programs with dynamic
instrumentation”. In: 37th International Symposium on Microarchitecture (MICRO-37’04). IEEE. 2004,
pp. 81–92.
[51] Xizhu Peng, Qiuyun Xu, Taichi Kato, Yuki Yamanashi, Nobuyuki Yoshikawa, Akira Fujimaki,
Naofumi Takagi, Kazuyoshi Takagi, and Mutsuo Hidaka. “High-speed demonstration of bit-serial
floating-point adders and multipliers using single-flux-quantum circuits”. In: IEEE Transactions on
Applied Superconductivity 25.3 (2014), pp. 1–6.
[52] Michael D Powell, Amit Agarwal, TN Vijaykumar, Babak Falsafi, and Kaushik Roy. “Reducing
set-associative cache energy via way-prediction and selective direct-mapping”. In: Proceedings of
the 34th annual ACM/IEEE international symposium on Microarchitecture. IEEE Computer Society.
2001, pp. 54–65.
[53] Ronald P Preston, Roy W Badeau, Daniel W Bailey, Shane L Bell, Larry L Biro, William J Bowhill,
Daniel E Dever, Stephen Felix, Richard Gammack, Valeria Germini, et al. “Design of an 8-wide
superscalar RISC microprocessor with simultaneous multithreading”. In: 2002 IEEE International
Solid-State Circuits Conference. Digest of Technical Papers (Cat. No. 02CH37315). Vol. 1. IEEE. 2002,
pp. 334–472.
[54] Repository hosts unit tests for RISC-V processors. https://github.com/riscv/riscv-tests.
[55] RISC-V Core. https://github.com/ultraembedded/riscv.
[56] Valery V Ryazanov, Vitaly V Bol’ginov, Danila S Sobanin, Igor V Vernik, Sergey K Tolpygo,
Alan M Kadin, and Oleg A Mukhanov. “Magnetic Josephson junction technology for digital and
memory applications”. In: Physics Procedia 36 (2012), pp. 35–41.
[57] Sergey V Rylov. “Clockless Dynamic SFQ and Gate With High Input Skew Tolerance”. In: IEEE
Transactions on Applied Superconductivity 29.5 (2019), pp. 1–5.
[58] Peter G Sassone, Jeff Rupley, Edward Brekelbaum, Gabriel H Loh, and Bryan Black. “Matrix
scheduler reloaded”. In: ACM SIGARCH Computer Architecture News 35.2 (2007), pp. 335–346.
[59] Ryo Sato, Yuki Hatanaka, Yuki Ando, Masamitsu Tanaka, Akira Fujimaki, Kazuyoshi Takagi, and
Naofumi Takagi. “High-speed operation of random-access-memory-embedded microprocessor
with minimal instruction set architecture based on rapid single-flux-quantum logic”. In: IEEE
Trans. Appl. Supercon. 27.4 (2016), pp. 1–5.
[60] Toshinori Sato, Yusuke Nakamura, and Itsujiro Arita. “Revisiting direct tag search algorithm on
superscalar processors”. In: Workshop on Complexity-Effective Design. 2001.
[61] Lieze Schindler, Johannes A Delport, and Coenrad J Fourie. “The ColdFlux RSFQ cell library for
MIT-LL SFQ5ee fabrication process”. In: IEEE Transactions on Applied Superconductivity 32.2 (2021),
pp. 1–7.
[62] André Seznec. “Genesis of the O-GEHL Branch Predictor”. In: J. Instr. Level Parallelism 7 (2005).
[63] Serhii E Shafraniuk, Ivan P Nevirkovets, and Oleg A Mukhanov. “Modeling computer memory
based on ferromagnetic/superconductor multilayers”. In: Physical Review Applied 11.6 (2019),
p. 064018.
[64] James E Smith. “A study of branch prediction strategies”. In: Proceedings of the 8th annual
symposium on Computer Architecture. IEEE Computer Society Press. 1981, pp. 135–148.
[65] James E Smith. “A study of branch prediction strategies”. In: 25 years of the international symposia
on Computer architecture (selected papers). 1998, pp. 202–215.
[66] SPEC CPU2006. https://www.spec.org/cpu2006/.
[67] SPEC CPU2017. https://www.spec.org/cpu2017/.
[68] Spike, a RISC-V ISA Simulator. https://github.com/riscv/riscv-isa-sim.
[69] Ramy N Tadros and Peter A Beerel. “A robust and tree-free hybrid clocking technique for RSFQ
circuits-CSR application”. In: 2017 16th International Superconductive Electronics Conference (ISEC).
IEEE. 2017, pp. 1–4.
[70] Masamitsu Tanaka, Ryo Sato, Yuki Hatanaka, and Akira Fujimaki. “High-density
shift-register-based rapid single-flux-quantum memory system for bit-serial microprocessors”. In:
IEEE Transactions on Applied Superconductivity 26.5 (2016), pp. 1–5.
[71] Masamitsu Tanaka, Kensuke Takata, Takahiro Kawaguchi, Yuki Ando, Nobuyuki Yoshikawa,
Ryo Sato, Akira Fujimaki, Kazuyoshi Takagi, and Naofumi Takagi. “Development of bit-serial
RSFQ microprocessors integrated with shift-register-based random access memories”. In: 2015 15th
International Superconductive Electronics Conference (ISEC). IEEE. 2015, pp. 1–3.
[72] Guang-Ming Tang, Pei-Yao Qu, Xiao-Chun Ye, and Dong-Rui Fan. “Logic design of a 16-bit bit-slice
arithmetic logic unit for 32-/64-bit RSFQ microprocessors”. In: IEEE Transactions on Applied
Superconductivity 28.4 (2018), pp. 1–5.
118
[73] Guang-Ming Tang, Kazuyoshi Takagi, and Naofumi Takagi. “A 4-bit bit-slice multiplier for a 32-bit
RSFQ microprocessor”. In: 2015 15th International Superconductive Electronics Conference (ISEC).
IEEE. 2015, pp. 1–3.
[74] Guang-Ming Tang, Kensuke Takata, Masamitsu Tanaka, Akira Fujimaki, Kazuyoshi Takagi, and
Naofumi Takagi. “4-bit bit-slice arithmetic logic unit for 32-bit RSFQ microprocessors”. In: IEEE
Transactions on Applied Superconductivity 26.1 (2015), pp. 1–6.
[75] Swamit S Tannu, Poulami Das, Michael L Lewis, Robert Krick, Douglas M Carmean, and
Moinuddin K Qureshi. “A case for superconducting accelerators”. In: Proceedings of the 16th ACM
International Conference on Computing Frontiers. 2019, pp. 67–75.
[76] David Tarjan and Kevin Skadron. “Merging path and gshare indexing in perceptron branch
prediction”. In: ACM transactions on architecture and code optimization (TACO) 2.3 (2005),
pp. 280–300.
[77] Robert M Tomasulo. “An efficient algorithm for exploiting multiple arithmetic units”. In: IBM
Journal of research and Development 11.1 (1967), pp. 25–33.
[78] Beyza Zeynep Ucpinar, Yasemin Kopur, Mustafa Altay Karamuftuoglu, Sasan Razmkhah, and
Massoud Pedram. “Design of a superconducting multiflux non-destructive readout memory unit”.
In: arXiv preprint arXiv:2309.14613 (2023).
[79] IV Vernik, AF Kirichenko, OA Mukhanov, and TA Ohki. “Energy-efficient and compact ERSFQ
decoder for cryogenic RAM”. In: IEEE Trans. Appl. Supercon. 27.4 (2016), pp. 1–5.
[80] Fred Ware, Liji Gopalakrishnan, Eric Linstadt, Sally A McKee, Thomas Vogelsang,
Kenneth L Wright, Craig Hampel, and Gary Bronner. “Do superconducting processors really need
cryogenic memories? The case for cold DRAM”. In: Proceedings of the International Symposium on
Memory Systems. 2017, pp. 183–188.
[81] Shlomo Weiss and James E Smith. “Instruction issue logic for pipelined supercomputers”. In:
Proceedings of the 11th annual international symposium on Computer architecture. 1984, pp. 110–118.
[82] Yuki Yamanashi, Sotaro Nakaishi, Akira Sugiyama, Naoki Takeuchi, and Nobuyuki Yoshikawa.
“Design methodology of single-flux-quantum flip-flops composed of both 0-and π-shifted
Josephson junctions”. In: Superconductor Science and Technology 31.10 (2018), p. 105003.
[83] Kenneth C Yeager. “The MIPS R10000 superscalar microprocessor”. In: IEEE micro 16.2 (1996),
pp. 28–41.
[84] Tse-Yu Yeh and Yale N Patt. “Two-level adaptive training branch prediction”. In: Proceedings of the
24th annual international symposium on Microarchitecture. 1991, pp. 51–61.
[85] Haipeng Zha, Naveen Kumar Katam, Massoud Pedram, and Murali Annavaram. “HiPerRF: A
dual-bit dense storage SFQ register file”. In: 2022 IEEE International Symposium on
High-Performance Computer Architecture (HPCA). IEEE. 2022, pp. 415–428.
119
[86] Haipeng Zha, Swamit Tannu, and Murali Annavaram. “Superbp: Design space exploration of
perceptron-based branch predictors for superconducting cpus”. In: Proceedings of the 56th Annual
IEEE/ACM International Symposium on Microarchitecture. 2023, pp. 599–613.
[87] Bo Zhang and Massoud Pedram. “qSTA: A Static Timing Analysis Tool for Superconducting
Single-Flux-Quantum Circuits”. In: IEEE Transactions on Applied Superconductivity 30.5 (2020),
pp. 1–9.
120