Advanced cell design and reconfigurable circuits for single flux quantum technology
ADVANCED CELL DESIGN AND RECONFIGURABLE CIRCUITS FOR SINGLE FLUX QUANTUM TECHNOLOGY

by Naveen Kumar Katam

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2019

Copyright 2019 Naveen Kumar Katam

No eye has seen, no ear has heard, and no mind has imagined what God has prepared for those who love Him. (1 Cor. 2:9 in the Holy Bible)

Acknowledgements

I acknowledge the partial financial support of the Intelligence Advanced Research Projects Activity (IARPA), received through my advisor, which made my doctoral studies possible.

I would like to thank my Ph.D. advisor, Prof. Massoud Pedram, for his guidance, training, and encouragement during my doctoral degree pursuit at USC. I was introduced to an exciting research area and was given freedom in choosing the direction of my research. I would also like to thank Dr. Oleg Mukhanov at Hypres, who advised me on research and supervised me during two summer internships with access to the resources at Hypres. I would like to thank my other qualification and thesis committee members, Prof. Peter Beerel, Prof. Sandeep Gupta, Prof. Aiichiro Nakano, and Prof. Murali Annavaram, for providing valuable feedback on my research.

I have come across many good friends and colleagues during the 5 years of my Ph.D. study who were of great support to me, and I am very thankful for them. I sincerely thank Dr. Alireza Shafaei, Dr. Yanzhi Wang, Dr. Tiansong Cui, and Dr. Di Zhu, who were always available for advice. I cannot thank enough Ting-Ru Lin, Bo Zhang, Haolin Cong, Soheil Nazar Shahsavani, Ghasem Pasandi, and Shuang Chen for their help in research and discussions. I thank my dear colleagues and friends Luhao Wang, Ramy Tadros, Fanzhou Wang, Mahdi Nazemi, Krishna Giri Narra, and Kiran Matam, who are always there to discuss anything and to give company in difficult situations.
I would also like to thank Benjamin Gordon, Alyssa Gordon, Cole Umemura, Joel Mathew, and Gregory Christman, to name a few among many friends in the Los Angeles area who were with me to support and encourage me through the ups and downs of this Ph.D. struggle in a foreign land. I would like to express my sincere appreciation for my parents, who supported and loved me unconditionally all through my life in pursuing education. Finally, I would like to thank and acknowledge the Almighty God, who has not only provided me with eternal salvation but also with His favor to excel at different things in my life, especially the work during my doctoral studies leading to this dissertation.

Table of Contents

Acknowledgements
List of Figures
List of Tables
Abstract
Related Publications
1 Introduction
  1.1 Motivation
  1.2 Basics of SFQ Technology
    1.2.1 Josephson Junction
    1.2.2 Handling of SFQ pulses
  1.3 Differences with CMOS
    1.3.1 Timing Convention
    1.3.2 Logic Synthesis
    1.3.3 Clocking
    1.3.4 Interconnections
    1.3.5 Timing Characterization
    1.3.6 Fabrication
    1.3.7 Energy consumption
  1.4 Challenges
  1.5 Organization
2 SFQ circuit synthesis and Complex Cells
  2.1 Background
  2.2 Circuit Synthesis using ABC
    2.2.1 Cell Library specification and Circuit mapping
    2.2.2 Path-Balancing
    2.2.3 Splitter-Insertion for Fanout implementation
    2.2.4 Generation of Final SFQ netlist
  2.3 Complex Cells
    2.3.1 Cells with more than 2-inputs
    2.3.2 High-Fanout Splitters
    2.3.3 Analysis of Cells
  2.4 Results of Logic Synthesis with Complex Cells
  2.5 Conclusions
3 Timing Characterization of SFQ circuits for Static Timing Analysis
  3.1 Prior Work
  3.2 Background
    3.2.1 Measurement Procedure
  3.3 Timing Characterization
    3.3.1 LUT for Clock-to-Q delay
    3.3.2 LUTs for Setup and Hold times
    3.3.3 Generation of LUTs
    3.3.4 Delay Model for Non-clocked cells
    3.3.5 Delay model for PTL connections
    3.3.6 Impact of Bias Current
  3.4 Simulation Results
    3.4.1 Process corners
  3.5 Conclusions
4 Simulation Analysis of ERSFQ Circuits
  4.1 Simulation Setup
    4.1.1 ERSFQ Biasing Scheme
    4.1.2 Tolerance Comparison
  4.2 Simulation Analysis of ERSFQ Biasing
    4.2.1 Study on Bias Inductance
    4.2.2 Study on the Size of Feeding JTL
    4.2.3 Operational Bias Margins
    4.2.4 Effect of Feeding Clock Frequency
    4.2.5 Effect of Circuit Operation Frequency
  4.3 Conclusion
5 Reconfigurable SFQ Circuits: Superconducting Magnetic Field Programmable Gate Array
  5.1 Previous Work
    5.1.1 Proposed work
  5.2 Basics of Field Programmable Gate Array
    5.2.1 Overview of SFQ FPGA implementation
  5.3 Design and Details of SFQ FPGA Fabric
    5.3.1 NDRO-based Configurable Logic Block (CLB)
    5.3.2 Programmable Routing
    5.3.3 Magnetic CLB
    5.3.4 Switch Programming
    5.3.5 SFQ FPGA Operation
  5.4 SFQ FPGA fabric extensions
    5.4.1 SFQ FPGA for Asynchronous Wave-Pipelining
    5.4.2 SFQ FPGA with Multiple-Input Gates
  5.5 Results
    5.5.1 Implementation Estimations
    5.5.2 Circuit implementation example on FPGA fabric
    5.5.3 Discussion
    5.5.4 Implementation Considerations
  5.6 Energy Saving Techniques for Large regular ERSFQ circuits like FPGA
    5.6.1 Feeding Clock Choking: Dynamic Sleep Regime
    5.6.2 Current Recycling
    5.6.3 Current Recycling with Feeding Clock Choking
  5.7 Conclusion
6 Conclusion and Future Work
  6.1 Completed Work
    6.1.1 Circuit Synthesis
    6.1.2 Design of Complex Cells
    6.1.3 Timing Characterization of Standard Cells
    6.1.4 Design of Superconducting FPGA based on MJJs
  6.2 Future Work
    6.2.1 Standard Cell Library
    6.2.2 Building Static Timing Analysis Tool
    6.2.3 Building Optimal synthesis and placement tools for FPGA
    6.2.4 New Timing and Clocking techniques for the FPGA
    6.2.5 Building Current Recycling Layouts and Algorithms
Bibliography

List of Figures

1.1 Bit energy vs. gate delay comparison for different technologies [3]
1.2 (a) Josephson junction (SIS), (b) circuit symbol, and (c) the RCSJ model of a Josephson junction.
1.3 (a), (b) JTL circuits, (c) splitter circuit, and (d) its simulation result.
1.4 (a) SFQ D flip-flop (J1, J2 and L2 form an interferometer, whereas J0 and J3 mediate pulse propagation), (b) simulation results.
1.5 (a) A typical RSFQ cell, and (b) the standard timing protocol of RSFQ cells. The definitions of setup time, hold time, clock-to-Q time, and clock period are specified.
2.1 Logic synthesis flow in ABC from a circuit VHDL description to an SFQ circuit netlist, which will be further modified with clock-tree synthesis. An example of a full adder is given at different stages of circuit synthesis: (i) standard cell mapping, with logic levels mentioned for each cell; (ii) insertion of DFFs; (iii) retiming and splitter insertion.
2.2 Implementation of AND3 cell: (a) as a single cell and (b) with 2-input AND cells. For (b), a path-balancing DFF and clock-follow-data clocking are used.
2.3 Examples of complex cells: (a) schematic of A+BC stand-alone cell, (b) inter-connected A+BC cell.
2.4 Circuit schematics of stand-alone complex cells: (a) AND3 and (b) OR3.
2.5 Margins of multiple-input AND and OR cells and 1-to-2, 1-to-3, and 1-to-4 splitters.
2.6 Clock-to-Q delays of multiple-input AND and OR cells and input-to-output delays of 1-to-2, 1-to-3, and 1-to-4 splitters.
2.7 Placed and routed 8-bit CLA netlist synthesized with A+BC complex cell and 2-input cells. Each rectangle shows one cell in a row-based placement.
3.1 (a) SFQ D flip-flop (J1, J2 and L2 form an interferometer, whereas J0 and J3 mediate pulse propagation), (b) simulation results.
3.2 Circuit for clock-to-Q delay LUT generation.
3.3 Formatting timing parameters in .lib-like format for (a) input pin, (b) output pin.
3.4 Clock-to-Q delay values measured for an AND-OR connection over different bias current levels.
3.5 Interconnection of two SFQ cells with (a) JTL, (b) PTL.
3.6 Test circuit for the proposed timing characterization approach.
3.7 Clock-to-Q delay comparison for process corners between JSIM simulations and the NLDM approach for an OR gate with different loads (AND, OR, XOR).
4.1 (a) Example of the ERSFQ biasing scheme, with a JTL as the logic circuit. $L_{Bi}$, $L_{BFi}$ and $BJ_i$, $BJ_{Fi}$ represent bias inductors and bias (limiting) junctions for the logic circuit and the feeding JTL, respectively. (b) High-level abstraction of ERSFQ biasing.
4.2 (a) Schematic of D-Flipflop. (b) Margins of the circuit parameters in a D-Flipflop cell performing the same functionality in the context of a shift register circuit with ERSFQ (E:) and RSFQ (R:) biasing schemes. Parameters having the same margin values in both schemes are not shown here. The designed values are shown alongside the parameters.
4.3 (a) Variation in the delivered bias current and (b) the count of bias junction flips with bias inductance variation for an FJTL branch in ERSFQ circuit test benches.
4.4 Profiles of delivered bias current to both the feeding JTL and the logic circuit branches when $I_{SB}$ is varied for an 8-bit SR circuit with FJTLs of different JJ count.
4.5 Profiles of the bias bus voltage when $I_{SB}$ is varied for an 8-bit SR circuit with FJTLs of different JJ count.
4.6 Operational bias margins for SR and MUX circuits for different FJTL sizes. The number in the label beside each bar represents the count of JJs in the FJTL.
4.7 Minimum bias current at which SR and MUX circuits can still operate for different values of $f_{FJTL}$ with $f_{CKT}$ = 2 GHz. Best bias margins correspond to the minimum bias current supply.
4.8 Minimum bias current at which the SR circuit can still operate for different values of $f_{CKT}$ with $f_{FJTL}$ = 3.33 GHz.
5.1 Island-style architecture adaptation of SFQ FPGA with unidirectional and bidirectional data flow in horizontal and vertical directions, respectively. A Configurable Logic Block (CLB) gets its inputs from the routing network through the Vertical Connection Block (VCB), and its outputs are carried to the routing network through the Horizontal Connection Block (HCB). I/P: Input, O/P: Output, I/O: Input/Output.
5.2 (a) Program and store (PS) block implementation with NDROs and DFFs. A PS unit is shown in a dashed red rectangle, and a PS block is formed by serially connecting PS units. S2 represents a 1-to-2 splitter. Functional waveforms in Verilog HDL simulation: (b) Signals during programming mode: writing 0 1 0 1 (for PS units at positions 0 1 2 3). (c) Signals during reading mode. PS units at positions 0 and 2 do not produce an output pulse for the respective Read input.
5.3 Implementation of the LUT-based CLB for a 2-input gate using a decoder with DFFCs, a PS block with NDRO-based switches, and a 4-to-1 merger. DFFC: D-Flipflop with complementary outputs.
5.4 Implementation of the FS-based CLB for four 2-input SFQ gates using a PS block with NDRO-based switches, an actual implementation of gates, and a 4-to-1 merger.
5.5 Switch implementation with MJJ as limiting junction in ERSFQ biasing. (a) Circuit schematic and representational symbol for MJJ-based switch. $I_{c0} = 100\,\mu\mathrm{A}$; $I_{c1} = I_{c2} = I_{c3} = 200\,\mu\mathrm{A}$; $L_1 = L_2 = L_3 = 4\,\mathrm{pH}$. (b) Circuit simulation: result of switch output Q when $I_c$ of $MJ_0$ is $150\,\mu\mathrm{A}$ and $250\,\mu\mathrm{A}$, showing the blocking and the passage of the input pulse, respectively.
5.6 (a) Switch box implementation. Inputs and outputs are represented by red and green color labels, respectively. Dashed connection lines represent the programming of MJJ switches to let the pulse pass through them. (b-e) Representational figures: (b) 3-signal merger. (c) 2-signal merger. (d) 3-way splitter (S3) with attached switches at outputs. (e) 2-way splitter (S2) with switches. (f) Functional waveforms of Verilog HDL simulation of the switch box for the programmed switches shown in (a) with dashed connection lines.
5.7 The circuit implementation of a 2-way splitter with MJJ-based switches (Fig. 5.6(e)) used in FPGA subcircuits. BJ refers to a regular JJ that is used as a bias limiting junction in ERSFQ biasing and does not require programming. MJ refers to a magnetic JJ that will be used in switch implementation with programmable $I_c$.
5.8 Connection blocks (CB). (a) Vertical CB. (b) Horizontal CB.
5.9 MJJ-based magnetic CLBs: (a) LUT-based, (b) FS-based (triple-switch), (c) S4sw block: representation of 4-way splitter with switches.
5.10 Programming layer for MJJs on chip with current lines (access lines (AL)). (a) Programming unit of MJJ. HAL: horizontal AL; VAL: vertical AL. (b) MJJs are located near the intersections of the crossbar made by HALs and VALs used for programming MJJs. (c) Using external decoders to access specific MJJs out of all MJJs belonging to the FPGA fabric.
5.11 Clock pulse distribution to synchronous CLBs in SFQ FPGA.
5.12 FPGA implementation example: (a) A circuit block of an 8-bit ALU that contains all building blocks and the signal path from the inputs to the output of an asynchronous wave-pipelined ALU in [84]. (b) Synthesized (with all clocked cells), placed and routed ALU block on our proposed SFQ FPGA. The FPGA fabric grid is shown with dotted lines. Routing of top-to-bottom and bottom-to-top vertical tracks and of horizontal tracks is shown in blue, gold, and black lines, respectively.
5.13 (a) Implementation of the current recycling technique for ERSFQ logic circuits. MC represents magnetic coupling using inductors. Input and output pulses of a D-flipflop in the context of the shift register circuit of the upper block in current recycling with feeding clock choking (b) absent and (c) present in the lower block.

List of Tables

2.1 Effect of high-fanout splitter cells on decoder (DEC) designs. Multiple-input cells were used. The last column shows the percentage of cost reduction when high-fanout splitters are used during logic synthesis.
2.2 Effect of multiple-input cells on multiplexer (MUX) designs. SPLIT3, SPLIT4, and "A + BC" cells were not used. The last column shows the percentage of cost reduction when multiple-input cells are used during logic synthesis.
2.3 Results for placed and routed CLA circuit netlists.
3.1 AND cell clock-to-Q (C2Q) dependence on $J_{para}$ junction currents (Ref. Fig. 3.2).
3.2 Setup and hold times of SFQ cells.
3.3 Comparison of JSIM simulations vs. our NLDM approach for the clock-to-Q delay under different test cases of Fig. 3.6.
3.4 Comparison of input-to-output delay with JTL interconnection of n stages.
3.5 Comparison of delay with PTL interconnection (Ref. Fig. 3.5(b)).
3.6 Process corner definition.
5.1 Comparison of JJ count for CLBs.
5.2 NDRO-based switches vs. MJJ-based switches.
5.3 FS-based CLBs vs. LUT-based CLBs.
5.4 JJ count estimation for LUT-based CLBs with multiple inputs and single-switch FS-based CLBs with larger number of gates.
5.5 JJ count and area estimation of FPGA subcircuits.

Abstract

The demand for high computing performance and energy efficiency has been driving the development of semiconductor technology for decades. Complementary Metal-Oxide Semiconductor (CMOS) technology is the most widely used integrated circuit technology in today’s electronics. However, with increasing challenges to the physical scaling of CMOS devices and the conclusive end of Moore’s law in sight, there is a significant need to search for new device technologies and circuit fabrics that would allow performance and energy efficiency scaling to continue beyond end-of-scaling CMOS. In this context, superconductive digital electronics (SDE), especially Josephson junction (JJ)-based single flux quantum (SFQ) technology, has appeared as a very promising “beyond-CMOS” device technology, with a verified speed of 370 GHz for simple digital circuits and a switching energy per bit of $10^{-19}$ J at T = 4.2 K (liquid helium temperature). Though SFQ technology has a clear lead in terms of speed and power consumption, several technical challenges must be overcome for it to become a realistic option for realizing large-scale, high-performance, and energy-efficient computing systems of the future.
The main challenge is that the circuit fabrics and the architectures are different from those of current-day semiconductor technology; consequently, the development of efficient simulation and design automation techniques and tools for SFQ logic must be undertaken. The objective of this research is to design and develop circuit techniques for SFQ technology in the direction of realizing large-scale circuits with JJs.

In this work, (i) automated synthesis of digital synchronous rapid single-flux-quantum (RSFQ) circuits using a CMOS logic synthesis tool, achieved by adding some new features to the tool and modifying its existing features, and (ii) the design and analysis of complex cells and the use of these cells to synthesize relatively large RSFQ circuits are presented. Complex cells in this work include multiple-input AND and OR cells, high-fanout splitters, and the A + BC cell. Synthesis results show a significant reduction in the logical depth and the number of Josephson junctions when complex cells are used during technology mapping.

Another technique presented here deals with the timing characterization of single flux quantum (SFQ) logic cells to enable static timing analysis (STA) of circuits belonging to any SFQ logic family. The available methodologies and tools for performing timing analysis of SFQ circuits lack a load-dependent timing characterization method for calculating the context-dependent delay of cells, such as the nonlinear delay model (NLDM) used for CMOS circuits. Accordingly, a new timing characterization method for SFQ logic cells is presented, which relies on low-dimensional look-up tables (LUTs) to store the clock-to-output delay, setup, and hold times of clocked cells and the input-to-output delay of non-clocked cells in an SFQ standard cell library. The accuracy of the proposed LUT-based timing characterization method is compared against JSIM simulations, showing a maximum error of only 2.1% for the tested clocked cells with different loads.
Finally, a new architecture for superconducting field-programmable gate arrays (FPGAs) is presented that utilizes the latest developments in SDE, namely energy-efficient single flux quantum (ERSFQ) biasing and magnetic Josephson junctions (MJJs). Towards developing an SFQ-specific FPGA, new designs of FPGA subcircuits for both synchronous and asynchronous operation of SFQ circuits are presented in this work. MJJs are used as bias limiting junctions in ERSFQ biasing to implement programmable switches in various subcircuits of the proposed FPGA fabric. Designs of all FPGA subcircuits are developed and verified through circuit simulation. Verilog hardware description language (HDL) models are also developed for all FPGA subcircuits to facilitate large-scale FPGA simulations for the implementation of a desired circuit on the proposed FPGA fabric. This fabric can be utilized for implementing quantum control algorithms for quantum computing.

Related Publications

Naveen Kumar Katam and M. Pedram, "Timing Characterization for Static Timing Analysis of Single Flux Quantum Circuits," in IEEE Transactions on Applied Superconductivity, vol. 29, no. 6, pp. 1-8, Sept. 2019.

Naveen Kumar Katam, O. Mukhanov and M. Pedram, "Simulation Analysis and Energy-Saving Techniques for ERSFQ Circuits," in IEEE Transactions on Applied Superconductivity, vol. 29, no. 5, pp. 1-7, Aug. 2019.

Naveen Kumar Katam, Haolin Cong and M. Pedram, "Reconfigurable Logic Cell for Superconducting Magnetic Field Programmable Gate Array," submitted to Superconductive Electronics Conference (ISEC), 2019 17th International, July 2019.

Naveen Kumar Katam, J. Kawa, and M. Pedram, "Challenges and the Status of Superconducting Single Flux Quantum Technology," to appear in Proc. of DATE, Florence, Italy, Mar. 2019.

Naveen Kumar Katam and M. Pedram, "Logic Optimization, Complex Cell Design, and Retiming of Single Flux Quantum Circuits," in IEEE Transactions on Applied Superconductivity, vol. 28, no. 7, pp. 1-9, Oct. 2018.

Naveen Kumar Katam, O. Mukhanov and M. Pedram, "Superconducting Magnetic Field Programmable Gate Array," in IEEE Transactions on Applied Superconductivity, vol. 28, no. 2, pp. 1-12, March 2018.

Naveen Kumar Katam, A. Shafaei, and M. Pedram, "Design of complex rapid single-flux-quantum cells with application to logic synthesis," in Superconductive Electronics Conference (ISEC), 2017 16th International. IEEE, June 2017.

Naveen Kumar Katam, Alireza Shafaei, and Massoud Pedram, "Design of multiple fanout clock distribution network for rapid single flux quantum technology," in Design Automation Conference (ASP-DAC), 2017 22nd Asia and South Pacific, pp. 384-389. IEEE, 2017.

Naveen Kumar Katam, Oleg A. Mukhanov, and Massoud Pedram, "Superconducting magnetic field programmable gate array" (US Patent: 62/646,173).

Chapter 1 Introduction

1.1 Motivation

According to the 2013 International Technology Roadmap for Semiconductors (ITRS) report [1], current semiconductor technologies may reach a switching speed of up to 10 GHz by 2028, with a practical switching energy per bit of $10^{-15}$ J at T = 300 K. At present, the growth of microprocessor speed has almost stalled due to the high power dissipation in the chip and the difficulty of heat removal. At this pace of semiconductor technology, the future growth of electronics seems almost bleak. Due to this hard fact, ITRS’s driver was changed from device scaling to applications. In 2016, ITRS was transformed into the International Roadmap for Devices and Systems (IRDS), opening the door to non-semiconductor technologies. Currently, the first roadmaps are under development.

In this context, superconducting digital electronics (SDE), especially single flux quantum (SFQ) technology, seems very promising, with a verified speed of 370 GHz for simple digital circuits and a switching energy per bit of $10^{-19}$ J at T = 4 K [2].
A comparison of bit energy (in joules) and gate delay (in seconds) for various semiconductor and superconducting technologies is shown in Fig. 1.1. It can be observed that the superconducting technologies, RSFQ and energy-efficient SFQ, have an advantage in terms of both the energy consumption and the delay of a gate. SFQ information processing is uniquely energy efficient because it is a pulse-based logic. Compared with current technologies, SFQ circuits can be designed for clock speeds that are 10 times higher than any of their semiconductor counterparts [4].

Figure 1.1: Bit energy vs. gate delay comparison for different technologies [3]

A considerable body of research already exists on quantum algorithms, while researchers across the world are still working out the hardware for quantum computing. With superconducting qubits and the very low temperatures at which quantum computers are operated, SFQ technology is well suited for implementing the hardware for quantum computation. The circuit techniques and the fabric that are the focus of this thesis can potentially be used for the control circuitry of quantum computers.

1.2 Basics of SFQ Technology

1.2.1 Josephson Junction

Figure 1.2: (a) Josephson junction (SIS), (b) circuit symbol, and (c) the RCSJ model of a Josephson junction.

The active component in any superconductive technology is a two-terminal JJ whose electrical characteristics follow the equations below:

$$J_s(\varphi) = J_c \sin(\varphi) \tag{1.1}$$

$$\frac{\partial \varphi}{\partial t} = \frac{2\pi}{\Phi_0} V(t) \tag{1.2}$$

$$\int V(t)\,\partial t = \Phi_0 = \frac{h}{2e} \approx 2.07 \times 10^{-15}\ \mathrm{V\,s}\ (\mathrm{Tesla \cdot m^2}) \tag{1.3}$$

Equations 1.1 and 1.2 are called the current-phase relation and the voltage-phase relation, respectively. $J_s$ denotes the supercurrent density, $J_c$ the critical current density ($I_c$ is the critical current), $\varphi$ the phase of the junction, and $V(t)$ the voltage across the junction.
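As a quick numeric check of (1.3), the following Python sketch computes the flux quantum from the exact SI values of h and e and evaluates the current-phase and voltage-phase relations. The function names and the 2 ps pulse width are illustrative assumptions and are not taken from this work.

```python
import math

PLANCK_H = 6.62607015e-34            # Planck constant, J*s (exact SI value)
ELEMENTARY_CHARGE = 1.602176634e-19  # elementary charge, C (exact SI value)


def flux_quantum():
    """Single flux quantum Phi_0 = h / (2e), in V*s (equivalently Wb)."""
    return PLANCK_H / (2.0 * ELEMENTARY_CHARGE)


def supercurrent_density(phase, critical_density):
    """Current-phase relation (1.1): J_s = J_c * sin(phase)."""
    return critical_density * math.sin(phase)


def phase_rate(voltage):
    """Voltage-phase relation (1.2): d(phase)/dt = 2*pi*V / Phi_0, in rad/s."""
    return 2.0 * math.pi * voltage / flux_quantum()


def mean_pulse_voltage(pulse_width):
    """Average voltage of a pulse whose time integral equals Phi_0."""
    return flux_quantum() / pulse_width


if __name__ == "__main__":
    print(f"Phi_0 = {flux_quantum():.3e} V*s")
    # Assumed 2 ps pulse width, chosen only for illustration.
    print(f"mean pulse voltage = {mean_pulse_voltage(2e-12) * 1e3:.2f} mV")
```

The computed flux quantum is about $2.07 \times 10^{-15}$ V s, in agreement with (1.3). For a pulse a few picoseconds wide, the same relation implies an average amplitude on the order of a millivolt.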
If the current density through a junction exceeds $J_c$, the junction exits the superconducting state and enters the normal state, where a rather large voltage, denoted by $V(t)$, forms across the JJ. $\Phi_0$ is a single quantum of superconducting flux, whose value is calculated as shown in (1.3) [5]. When a JJ exits the superconducting state, forming the voltage $V(t)$, and then returns to the superconducting state, the junction experiences a so-called $2\pi$-leap. In SFQ logic, binary information is not represented by a voltage level as in conventional electronics, but by the absence or presence of a quantized voltage pulse of area $\Phi_0$, called an SFQ pulse [6]. These SFQ pulses, which are typically of picosecond duration, can be naturally generated, reproduced, amplified, memorized, and processed by circuits comprising JJs and pure inductors (a superconductor is a pure inductor) [7, 8].

Fig. 1.2(c) shows the resistively and capacitively shunted junction (RCSJ) model of a JJ, which is composed of a Josephson current (Josephson element), a normal current (resistance), and a displacement current (capacitance). The Josephson element is the desired JJ with the electrical characteristics given in (1.1) and (1.2), i.e., it is represented by a nonlinear current source. Resistance and capacitance are unavoidable parasitic elements. In the case of a weak-signal excitation [11], the Josephson element is represented by an equivalent inductance $L_j = L_{j0} / \cos\varphi$, where $L_{j0} = \Phi_0 / (2\pi I_c)$. This is a widely accepted electrical model for a JJ. An important parameter of a JJ is the Stewart-McCumber parameter, $\beta_c = \frac{2e}{\hbar} I_c R_N^2 C$. JJs are usually shunted with an external resistance to obtain a desirable $\beta_c$ value. All JJs in an SFQ circuit are typically shunted with a resistance that results in a $\beta_c$ value around one.

1.2.2 Handling of SFQ pulses

SFQ logic is pulse-based, and two main operations take place among RSFQ circuit elements: (i) transferring and (ii) storing of pulses.
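The RCSJ quantities of Section 1.2.1 can be made concrete with a short Python sketch. It evaluates the Josephson inductance and solves the Stewart-McCumber expression for the resistance that gives $\beta_c = 1$, using the identity $2e/\hbar = 2\pi/\Phi_0$. The junction values (100 µA critical current, 70 fF capacitance) and the function names are assumptions chosen only for illustration; they are not parameters of any process discussed in this work, and $R_N$ is treated here as the total resistance across the junction.

```python
import math

PHI_0 = 6.62607015e-34 / (2.0 * 1.602176634e-19)  # flux quantum h/(2e), V*s


def josephson_inductance(critical_current, phase=0.0):
    """L_j = L_j0 / cos(phase), with L_j0 = Phi_0 / (2*pi*I_c)."""
    return PHI_0 / (2.0 * math.pi * critical_current * math.cos(phase))


def stewart_mccumber(critical_current, resistance, capacitance):
    """beta_c = (2e/hbar) * I_c * R^2 * C, written with 2e/hbar = 2*pi/Phi_0."""
    return 2.0 * math.pi * critical_current * resistance ** 2 * capacitance / PHI_0


def resistance_for_beta(critical_current, capacitance, beta_c=1.0):
    """Resistance R that yields the requested beta_c for a given I_c and C."""
    return math.sqrt(beta_c * PHI_0 / (2.0 * math.pi * critical_current * capacitance))


if __name__ == "__main__":
    i_c = 100e-6   # assumed critical current, A
    cap = 70e-15   # assumed junction capacitance, F
    r = resistance_for_beta(i_c, cap)
    print(f"L_j0 = {josephson_inductance(i_c) * 1e12:.2f} pH")
    print(f"R for beta_c = 1: {r:.2f} ohm")
    print(f"check beta_c = {stewart_mccumber(i_c, r, cap):.3f}")
```

Because $\beta_c$ grows with the square of the resistance, halving the resistance lowers $\beta_c$ by a factor of four, which is why a modest external shunt is enough to bring a junction to $\beta_c$ near one.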
We will explain both operations using two basic components of RSFQ logic, i.e., the Josephson transmission line (JTL) and the D flip-flop (DFF).

Transfer of SFQ pulses: A key element of SFQ circuits is the JTL, which consists of several JJs that are DC-current biased with $I_b$ such that $I_b < I_c$. These JJs are connected in parallel to one another with series inductors in between, as shown in Fig. 1.3(a) and Fig. 1.3(b). The inductance value $L$ must be set such that $L < \Phi_0/I_c$. Referring to Fig. 1.3(a), an input pulse from input A triggers a $2\pi$-leap in J1; next, the resulting pulse developed across J1 triggers a $2\pi$-leap in J2 through L2. The process repeats until the pulse is transferred to the end of the JTL. To transfer a pulse from one location to two destinations, the SFQ pulse splitter is used (cf. Fig. 1.3(c)). JJ parameters and inductance values are chosen such that the $\Phi_0$ pulse is reproduced over an appropriate time-width (cf. Fig. 1.3(d)). Unlike in CMOS, splitters must be used in SFQ logic to fan out a source signal to different destinations.

Figure 1.3: (a), (b) JTL circuits, (c) splitter circuit, and (d) its simulation result.

Storing SFQ pulses: A DC SQUID (Superconducting Quantum Interference Device) with inductance $L$ ($L I_c \approx 1.6\,\Phi_0$) [6] is used as a memory element to conserve SFQ pulses. It is a quantizing SFQ loop [9], which is utilized in DFFs (cf. Fig. 1.4) and has two stable states: storing either zero (counterclockwise direction of $i_L$, known as state "0") or one SFQ pulse (clockwise direction of $i_L$, known as state "1"). The state of the loop depends on the input from D. The stored pulse in the loop can be read by using a clock (Clk) signal. Depending on the state of the loop, the arrival of a Clk pulse causes either J2 (if the state is "1") or J3 (if the state is "0") to leap. If J2 leaps, an output pulse is generated, the SFQ pulse stored in the loop is lost, and the loop is reset to the "0" state. If J3 leaps, no pulse is generated.

Figure 1.4: (a) SFQ D flip-flop (J1, J2, and L2 form an interferometer, whereas J0 and J3 mediate pulse propagation), (b) simulation results.

Figure 1.5: (a) A typical RSFQ cell, and (b) the standard timing protocol of RSFQ cells. The definitions of setup time, hold time, clock-to-Q time, and clock period are specified.

1.3 Differences with CMOS

Some of the differences between SFQ technology and CMOS are given below.

1.3.1 Timing Convention

The signaling protocol for RSFQ logic differs from that of conventional CMOS logic for two reasons [9]: (1) the "return to zero" nature of the signals (SFQ pulses), and (2) the intrinsic memory of SFQ loops (SQUIDs), which are part of most RSFQ gates. Any RSFQ logic gate can functionally be viewed as an implicit coupling of asynchronous logic with a DFF, which works based on the standard protocol shown in Fig. 1.5. A typical RSFQ cell is fed by one or more inputs $D_1$ to $D_n$ and the clock line. Each cell typically has two stable states and represents a finite state machine. Each clock pulse marks the boundary between consecutive compute cycles (clock periods). An input data pulse $D_i$ can arrive at any time during a clock period; the arrival (or non-arrival) of a pulse at $D_i$ is treated as "1" (or "0") during that clock period. Inputs cause changes in the internal state of the RSFQ cell, and these changes are conserved until the arrival of the Clk pulse. When the Clk pulse arrives, it resets the cell into the "0" state and produces an output pulse at $Q_{out}$ if and only if the internal state of the cell is "1". For example, in Fig. 1.4(b), after the arrival of the data pulse (D), the DFF's change of state is represented by the increase in the current through L2 ($I_{L2}$). With the arrival of the clock pulse, the current state is read and $I_{L2}$ is reduced to its previous value, producing an output pulse.

Every logic gate in RSFQ technology (with the exception of the trivial cells for splitting and merging and the JTL) is thus sequential and has to be clocked. Therefore, each gate has a setup time, a hold time, and a clock-to-Q delay instead of a propagation delay. This is similar to gate-level pipelining in CMOS logic. In the remainder of this document, all delays are measured from the peak of an input pulse to the peak of an output pulse (as shown in Fig. 1.5(b)), and the terms 'gate' and 'cell' are used interchangeably.

1.3.2 Logic Synthesis

Since SFQ logic cells are sequential, all the gates in a datapath circuit need to be synchronized with the other cells and the clock network. A basic way to synchronize a datapath is path balancing, i.e., inserting D flip-flops between a cell and each of its fanin cells to ensure that the input pulses arrive in the same clock cycle (i.e., the difference between the logic levels of the input signals to a logic cell is zero, where the logic level of cell $g_i$ is the length of the longest path, in terms of gate count, from any primary input of the circuit to the cell [10]). Beyond path balancing, new logic synthesis techniques need to be developed to obtain the full benefits of SFQ logic within its existing limitations. Circuit retiming can be used after the initial D flip-flop insertion to minimize the number of inserted flops. Similarly, since the fanout count is limited (typically to one), splitters must be inserted in the logic synthesis step.

1.3.3 Clocking

Due to the sequential nature of logic gates, nearly every logic gate in an SFQ circuit requires a clock pulse for its operation. Because of this, SFQ circuits can be considered as gate-level pipelined.
The distribution of the clock to every gate and the clock skew therefore become very important. CMOS circuits traditionally use equipotential clocking, which assumes that the worst-case propagation delay in the clock path is much smaller than the most critical data path delay in the circuit. In contrast, in SFQ circuits, the propagation delay through the clock distribution network is comparable to the worst-case data path delay. The reason is that the clock distribution network is typically composed of splitters and JTLs whose delays are comparable to the delays of logic cells. For this reason, flow clocking may be used for SFQ circuits, which allows several consecutive clock pulses to travel simultaneously on the clock distribution network. A few commonly used clocking schemes in SFQ circuits are H-tree, concurrent-flow, counter-flow, and clock-follow-data [11].

1.3.4 Interconnections

In CMOS circuits, two gates are connected by a short wire or by a long wire, both of which are typically modelled as distributed RC or RLC networks. The delay of such a wire scales quadratically with the wire length. A superconducting wire is electrically an inductor. There are two ways the gates can be interconnected in SCE, namely, by using (i) Josephson transmission lines (JTLs) or (ii) passive transmission lines (PTLs). JTLs contain active switching JJs and occupy active area on the chip while requiring bias currents, which adds to the power consumption of the overall circuit. Owing to the low loss and low dispersion of superconducting microstrip lines, PTLs transfer the (return-to-zero) SFQ pulses ballistically with a speed of $1.25 \times 10^{8}$ m/s [12]. Furthermore, they do not occupy active area of the chip, and their propagation delay scales linearly with the PTL length. However, a single PTL requires special SFQ cells, a PTL driver and a PTL receiver, to be attached at its two ends; these consume some space on the chip and add to the power consumption.
It is advisable to use JTLs for short connections and PTLs for long distances (or for short distances if JTL connections are not possible). In summary, local interconnects in SCE are more expensive than their CMOS counterparts, whereas global interconnects in SCE scale better with length than their CMOS counterparts.

1.3.5 Timing Characterization

The timing characterization techniques used for static timing analysis of traditional CMOS VLSI circuits cannot be directly applied to SFQ circuits. For CMOS standard cells, the delay and output transition time of a cell are characterized by the input capacitance of the driven cell (the output load) and the slew rate of the data signal. Each cell is associated with a 2D look-up table (LUT) of characterized delay values, with input capacitance and slew rate as the LUT key. This characterization is not applicable to SFQ circuits, which are pulse-based logic in which the output capacitance is not a good measure of the load. This implies that a new set of parameters must be found to enable accurate calculation of delay values for SFQ cells, e.g., the inductance of the first series-connected inductor of the load and the difference between the critical current and the quiescent current of the first driven JJ (see [13]).

1.3.6 Fabrication

The fabrication of SFQ circuits is comparatively simpler than that of their CMOS counterparts. However, the minimum feature sizes of SFQ technologies are larger than those of today's CMOS circuits (a typical minimum feature size for a fabricated SFQ circuit is 350 nm). Historically, SCE processes are defined by the critical current density ($J_c$) of the JJs. Niobium (Nb) is used for the metal layers that implement inductors and passive transmission lines. Silicon dioxide is used between metal layers for isolation. Aluminum oxide is used on a special layer to make the Josephson junction (Nb/AlO$_x$/Nb). A review of fabrication processes can be found in [14].
Currently, the foundries that are making good progress with superconductive circuits are Hypres, MIT-LL, AIST, ADP, and D-Wave Systems [15, 16, 17, 18].

1.3.7 Energy consumption

The circuit operation in SFQ logic depends on the manipulation of a single flux quantum ($\Phi_0$), which has an energy of $2 \times 10^{-19}$ J, or $5 \times 10^{3}\,k_B T \ln(2)$ at a temperature of 4 K [19]. Typically, dynamic power dissipation in SFQ circuits occurs when the JJs undergo a $2\pi$ phase change. The dynamic power consumed during this $2\pi$ phase slip can be calculated as $P_D = I_b \Phi_0 f$, where $I_b$ is the bias current through a junction and $f$ denotes the switching frequency. A typical JJ has a critical current of 150 μA and is biased at 70-75% of $I_c$. For a circuit operating at 20 GHz, an SFQ gate containing a few switching JJs consumes about 10 nW of dynamic power. If resistors are used for biasing, the bias voltage is $V_b$, and the total bias current is $I_b$, then the static power can be expressed as $P_S = I_b V_b$. Typically, for a $V_b$ of 2.6 mV, the power advantage of SFQ circuits is lost for large circuits because the static power consumed is a few orders of magnitude higher than the dynamic power consumption. Hence, logic circuits without resistor biasing are preferred. There are, however, energy-efficient versions of RSFQ logic that rely on AC biasing to completely eliminate static power dissipation.

1.4 Challenges

Though SFQ technologies seem very promising for the future, they will not have the same circuit density as CMOS circuits. An optimal circuit design approach has not been developed so far. The technology still needs an effective solution for data storage before a computer can be implemented.

Though CAD tools for semiconductor technology are very mature, they cannot be directly applied to the design of SFQ circuits. The main obstacles at the circuit level include: (i) different active components (i.e., the transistor in CMOS vs. the JJ in RSFQ), (ii) different basic passive components (i.e., capacitors in CMOS vs. inductors in RSFQ), and (iii) different passive and active interconnects (i.e., metal RC lines with buffers in CMOS vs. Josephson transmission lines and microstrip lines in RSFQ). Furthermore, at the logic and system levels, custom CAD tools for SDE are needed because of (i) a different suite of basic gates, (ii) different clocking schemes, (iii) the influence of one gate on another due to the magnetic nature of superconductive phenomena, which requires specific isolation techniques, and (iv) a different representation of information (bits). Unfortunately, the current SDE CAD tools are not sufficient for building large-scale digital circuits.

Some of the basic requirements for automation of the RSFQ technology are as follows: (i) a standard cell library with different parameters such as sizing, delay, energy, etc., (ii) a methodology to construct and optimize complex gates from interconnections of basic logic cells, (iii) well-defined synchronization schemes, and (iv) memory blocks. Other challenges specific to single flux quantum technology include (i) inductance extraction [20], (ii) bias current analysis [21, 22, 23], (iii) current recycling [24], (iv) flux trapping [25, 26], (v) Joule heating and cryo-cooling [27], (vi) signal routing and shielding, and (vii) availability of memory hardware. In view of these challenges for a technology with a promising future, this work focuses on overcoming some of these issues.
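As a numerical check of the power expressions in Section 1.3.7, the following minimal sketch evaluates $P_D = I_b \Phi_0 f$ and $P_S = I_b V_b$ for a single junction, using the values quoted there (150 μA critical current, 70% bias, 20 GHz, 2.6 mV). The function names are illustrative only, and the sketch assumes the junction switches once every clock cycle, which overstates the dynamic power of a junction with lower activity.

```python
# Per-junction dynamic vs. static power, using the values quoted in Section 1.3.7.
# Assumption: the junction switches once per clock cycle (worst-case activity).

PHI0 = 2.067833848e-15  # magnetic flux quantum in Wb (= V*s)

def dynamic_power(i_bias, freq):
    """P_D = I_b * Phi_0 * f, in watts."""
    return i_bias * PHI0 * freq

def static_power(i_bias, v_bias):
    """P_S = I_b * V_b for resistive biasing, in watts."""
    return i_bias * v_bias

I_C = 150e-6          # critical current (A)
BIAS_FRACTION = 0.70  # lower end of the 70-75% range
F_CLK = 20e9          # switching frequency (Hz)
V_B = 2.6e-3          # bias voltage (V)

i_b = BIAS_FRACTION * I_C
p_dyn = dynamic_power(i_b, F_CLK)
p_stat = static_power(i_b, V_B)

print(f"dynamic power per JJ: {p_dyn * 1e9:.2f} nW")
print(f"static power per JJ:  {p_stat * 1e9:.1f} nW")
print(f"static / dynamic:     {p_stat / p_dyn:.1f}")
```

With a few such junctions per gate, the dynamic figure is consistent with the quoted value of about 10 nW per gate. The static-to-dynamic ratio grows further when junctions switch less often than once per cycle, since only the dynamic term scales with switching rate.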
1.5 Organization

In Chapter 2, two main topics are presented: (i) a description of the algorithms and the methodology used to synthesize a synchronous SFQ circuit using a CMOS logic synthesis tool, by adding new features to the tool and modifying existing ones; and (ii) the concept of two kinds of complex cells, (a) stand-alone cells and (b) inter-connected cells, along with the advantage gained from these complex cells as shown by the results of synthesized SFQ netlists. The stand-alone complex cells presented include AND and OR gates with more than two inputs, high-fanout splitters, and an 'A+BC' cell. Decoder, multiplexer, and carry-look-ahead adder circuits are synthesized with complex cells, and the advantage in terms of JJ count and latency is presented.

In Chapter 3, results are presented for the timing characterization of single flux quantum (SFQ) logic cells, which enables static timing analysis (STA) of circuits belonging to any SFQ logic family (RSFQ, ERSFQ). The available methodologies and tools for timing analysis of SFQ circuits lack a load-dependent timing characterization method for calculating the context-dependent delay of cells, such as the nonlinear delay model (NLDM) used for CMOS circuits. Accordingly, this chapter presents a new timing characterization method for SFQ logic cells, which relies on low-dimensional look-up tables (LUTs) to store the clock-to-output delay, setup, and hold times of clocked cells and the input-to-output delay of non-clocked cells in an SFQ standard cell library. Although the delay of Josephson junction-based logic cells depends on many parameters, we were able to reduce this dependency to a small number of well-chosen parameters without much loss of accuracy. All LUTs are obtained from JSIM simulations for a given target process technology.
The accuracy of the proposed LUT-based timing characterization method is compared against JSIM simulations, which shows a maximum error of only 2.1% for the tested clocked cells with different loads.

In Chapter 4, a simulation study of the ERSFQ biasing scheme is carried out by building simulation test benches for both synchronous and asynchronous ERSFQ circuits. The study examines the optimum value of the biasing inductance, the influence of the feeding Josephson transmission line (FJTL) and the effect of its size, the effect of the feeding clock frequency, and the effect of the circuit operating frequency.

In Chapter 5, new designs of FPGA subcircuits for both synchronous and asynchronous operation of SFQ circuits are presented, as a step towards developing an SFQ-specific FPGA. Magnetic Josephson junctions (MJJs) are used as bias-limiting junctions in ERSFQ biasing to implement programmable switches in various subcircuits of the proposed FPGA fabric. Designs of all FPGA subcircuits are developed and verified through circuit simulation. Verilog HDL models are also developed for all FPGA circuit blocks to facilitate large-scale FPGA simulations of a desired circuit implemented on the proposed FPGA fabric. Designs of a few subcircuits with switches based on the non-destructive readout (NDRO) cell are also developed for comparison with their MJJ-switch-based counterparts. Programming of MJJ-based switches relies on the ability to control the critical current of MJJs externally. A recent implementation of an SFQ decoder is proposed for accessing individual MJJs through the current lines in a crossbar structure. The estimated area and power consumption are much better than those of previous attempts at designing an SFQ-specific FPGA.
An innovative clock-choking mechanism using magnetic Josephson junctions (MJJs) is also proposed for the FJTL of a current-recycling circuit block when there is no logic circuit activity, which would help eliminate the dynamic power consumed by the switching of bias junctions in a logic circuit.

Chapter 2

SFQ circuit synthesis and Complex Cells

In this chapter, the following topics are covered: (i) a description of the algorithms and the methodology used to synthesize a synchronous SFQ circuit using a CMOS logic synthesis tool, by adding new features to the tool and modifying existing ones; and (ii) the concept of two kinds of complex cells, (a) stand-alone cells and (b) inter-connected cells, along with the advantage gained from these complex cells as shown by the results of synthesized SFQ netlists. The stand-alone complex cells presented include AND and OR gates with more than two inputs, high-fanout splitters, and an 'A+BC' cell. Decoder, multiplexer, and carry-look-ahead adder circuits are synthesized with complex cells, and the advantages in terms of JJ count and latency are presented.

2.1 Background

Several past efforts describe a top-down circuit design methodology, starting from a behavioral description and ending with the physical layout of an SFQ circuit. Refs. [28, 29] and [30] present logic synthesis tools as part of their top-down methodologies for SFQ circuits, without detailing the software and algorithms used for mapping a behavioral netlist into an SFQ circuit netlist. Ref. [31] describes the steps required to convert a CMOS VHDL netlist to an SFQ netlist. However, it does not present the algorithms used for fanout check and fix or for gate-level pipelining. Ref. [32] uses ABC; however, its clocking and splitter-fanout implementation methodology, together with its logic synthesis, does not give rise to the best possible performance of SFQ circuits.
In this chapter, the algorithms and the methodology used for modifying ABC to generate an SFQ-specific netlist are presented.

Different cell libraries exist in the literature for designing an SFQ circuit [8, 33]. However, most RSFQ cell libraries are limited to 2-input logic cells (AND/OR/XOR), the D flip-flop (DFF), inverter, 1-to-2 splitter, and merger cells, which are sufficient to realize a synchronous SFQ circuit. Such a cell library does not guarantee the minimum logic depth and gate count, which are important factors in minimizing the latency, area, and power of SFQ circuits. To address this, the cell library is extended to support AND/OR cells with more than two inputs, splitters with more than two outputs, and a special cell that implements the commonly used Boolean function A + BC. We present the advantages of these cells by generating SFQ-specific netlists for several circuits with and without these complex cells.

Figure 2.1: Logic synthesis flow in ABC from a circuit VHDL description to an SFQ circuit netlist, which is further modified with clock-tree synthesis. The example of a full adder is given at different stages of circuit synthesis: (i) standard cell mapping, with the logic level marked for each cell; (ii) insertion of DFFs; (iii) retiming and splitter insertion.

2.2 Circuit Synthesis using ABC

ABC is a growing open-source software system, written in C, for the synthesis and verification of binary sequential logic circuits appearing in synchronous hardware designs [34]. Since a synchronous SFQ circuit is a fully gate-level pipelined structure, this CMOS synthesis tool fits well with the development of an SFQ logic/circuit synthesis tool. Although a top-down design methodology similar to that presented in various works such as [31] is followed, the proposed circuit synthesis approach shown in Fig. 2.1 is ABC-based, and the flow has some differences. A description of the flow is presented below.
2.2.1 Cell Library specification and Circuit mapping

A logic synthesis tool requires a cell library and a behavioral or functional specification of the circuit to be synthesized. The cell library given to the tool contains the list of gates available in the technology, along with their Boolean functions and other information such as delay and area. Since the focus here is on synchronous circuits, all the gates are considered sequential in the final SFQ netlist, and the clock-to-Q delay of a gate is entered as the delay of the gate. For the input circuit netlist, a Verilog HDL functional description of the circuit to be synthesized is given as input. A basic cell library contains 2-input AND, OR, and XOR gates, an inverter, and a DFF. Using the available functions of ABC, a given circuit is mapped to the gates in the cell library for optimal delay or for area saving. At this point, the mapped circuit is still treated as a regular CMOS netlist.

2.2.2 Path-Balancing

Gate-level pipelining of an SFQ circuit requires a clock pulse for the operation of every cell in the circuit, and the data inputs of each cell must be available before this clock pulse arrives to evaluate the cell. Hence, distributing the clock pulse to each cell in synchrony with the data flow of the circuit is very important for correct operation. One straightforward way to synchronize the clock distribution to each cell with its input data arrival times is to insert calculated delays in the clock distribution line for every connection between any two cells in the circuit. In graph-theoretic terms, the clock distribution network has a graph similar to that of the circuit, with a calculated delay inserted on every edge based on the timing characteristics of the nodes (cells) it connects. The delay is chosen so that the clock arrives shortly after all the data inputs have arrived at a node.
This clocking scheme is called the clock-follow-data scheme. In general, any flow-clocking scheme requires carefully designed timing for the whole clock distribution network [11, 32]. This may be easier to implement for a circuit that has a linear structure. However, it is a very difficult task for a circuit that has many interconnections among its nodes, and especially for larger circuits.

Another way to synchronize the clock pulse distribution with the data flow is to path-balance the circuit by inserting a suitable number of SFQ D flip-flops (DFFs) as buffers between two nodes of a circuit. This insertion follows the logic levels of those nodes, so that every path from any primary input to a connected primary output has the same number of nodes. The logic level of a cell is defined as the length of the longest path, in terms of cell count, from this cell to the primary inputs of the circuit. For example, Fig. 2.1 shows the mapped full-adder circuit with the level of each cell marked as a number on the cell. When a circuit is path-balanced, every cell in the circuit encounters the same number of cells when traced along any path from itself to a primary input of the circuit. The algorithm for path-balancing by inserting DFFs based on the levels of the cells in a circuit is shown in Algorithm 1.

Algorithm 1: Insertion of DFFs between cells in a circuit for path-balancing
    for each cell C in the circuit do
        for each fanin fi of cell C do
            calculate the level difference between C and fi
            if level difference > 1 then
                n = level difference - 1
                insert n DFFs between C and fi
            end
        end
    end

After the insertion of DFFs for path-balancing, a standard retiming algorithm available in ABC [35] is run to reduce the number of DFFs in the circuit. At this stage, ABC treats only the DFFs as sequential cells and the rest as combinational cells.
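The level computation and DFF insertion of Algorithm 1 can be sketched as follows. This is a minimal illustration and not the ABC implementation: the netlist is a hypothetical five-cell example (not the full adder of Fig. 2.1), and the names `NETLIST`, `level`, and `dffs_per_edge` are assumptions. Since a fanin exactly one level below its cell is already aligned, the sketch inserts (level difference - 1) DFFs on an edge. Balancing of primary outputs against each other is left out.

```python
# Path-balancing sketch: compute logic levels, then count DFFs needed per edge.
# Hypothetical mapped netlist: cell -> list of fanins.
# Names that are not keys of NETLIST are primary inputs (level 0).
NETLIST = {
    "x1":   ["a", "b"],
    "sum":  ["x1", "cin"],
    "a1":   ["a", "b"],
    "a2":   ["x1", "cin"],
    "cout": ["a1", "a2"],
}

def level(node):
    """Length of the longest path, in cells, from any primary input to node."""
    if node not in NETLIST:
        return 0
    return 1 + max(level(fanin) for fanin in NETLIST[node])

def dffs_per_edge():
    """Map each (fanin, cell) edge that needs buffering to its DFF count."""
    inserts = {}
    for cell, fanins in NETLIST.items():
        for fanin in fanins:
            diff = level(cell) - level(fanin)
            if diff > 1:
                inserts[(fanin, cell)] = diff - 1
    return inserts

plan = dffs_per_edge()
for (fanin, cell), count in plan.items():
    print(f"{fanin} -> {cell}: insert {count} DFF(s)")
print("total DFFs:", sum(plan.values()))
```

In this example the primary input `cin` feeds two level-2 cells and `a1` (level 1) feeds the level-3 cell `cout`, so three edges each receive one DFF. A retiming pass could then share or relocate these flops.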
Retiming is a circuit transformation that relocates the sequential elements (DFFs) of a circuit, balancing path delays while preserving functionality, in order to optimize the number of DFFs and the clock period. However, we use retiming mainly to reduce the number of flip-flops inserted by Algorithm 1. This keeps the number of DFFs added for path-balancing as small as possible.

2.2.3 Splitter-Insertion for Fanout implementation

SFQ technology is pulse-based logic, and it relies on transmitting a voltage pulse (for logic value 1) whose area is equivalent to a magnetic flux quantum. For this reason, it is not easy for a cell to have a fanout greater than one. However, in any synthesized circuit, there are many cells that have larger fanouts. In general, these fanouts are implemented by inserting the required number of splitter cells. A basic approach that inserts 1-to-2 splitters in a balanced binary tree structure, to minimize the delay through the splitters, is presented in Algorithm 2.

Algorithm 2: Insertion of 1-to-2 splitters in a balanced binary tree for fanout implementation
    for each cell and primary input C in the circuit do
        if fanout count N_f of cell C >= 2 then
            disconnect all fanout connections of C
            insert a splitter spl at the output of C
            AddSplTree(0, N_f, C, spl)
        end
    end

    Function AddSplTree(n_i, n_f, C, spl)
        if n_f - n_i is 2 then
            add spl outputs as fanins to fanouts of C
        else
            insert splitter spl1 at the output of spl
            AddSplTree(n_i, (n_f + n_i + 1)/2, C, spl1)
            if n_f - n_i is 3 then
                add the left-out fanout of C as a fanout cell of spl
            else
                insert splitter spl2 at the output of spl
                AddSplTree((n_f + n_i + 1)/2, n_f, C, spl2)
            end
        end

2.2.4 Generation of Final SFQ netlist

All the processing described so far is done inside the ABC software.
At this stage, a given netlist is fully synthesized with the available SFQ cells provided through the cell library, fully path-balanced with the least possible number of DFFs for synchronous operation, and has splitters inserted for fanout implementation. This netlist can be written in either BLIF or Verilog format. The BLIF netlist is processed externally to add the clock distribution network before being sent for placement and routing of the circuit. Zero-skew clocking with an H-tree, flow clocking, or a combination of the two can be used on the generated netlist [36].

2.3 Complex Cells

In generating a synthesized SFQ circuit, the number of inserted DFFs depends directly on the maximum logic level, i.e., the longest path in the circuit, along with the circuit architecture. For the same circuit functionality, the larger the number of nodes on the longest path (maximum level), the larger the number of DFFs that must be inserted to path-balance the circuit. To reduce the number of inserted DFFs, complex cells are proposed so that the number of nodes in a circuit, and thus the length of the longest path, can be reduced. Complex cells are generated by combining the functionality of neighboring nodes that are often repeated in a circuit into a super node (or complex cell) that serves as a standard cell. Because a complex cell can perform the function of more than one basic cell, fewer DFFs are required to path-balance the circuit, which reduces the area of the overall implementation. This implies that if a single logic cell can be designed to realize a relatively complex Boolean function, using this single cell instead of implementing the function with basic 2-input cells not only reduces the total cell count but also decreases the logical depth of the circuit.

Figure 2.2: Implementation of the AND3 cell: (a) as a single cell and (b) with 2-input AND cells. For (b), a path-balancing DFF and clock-follow-data clocking are used.
A simple example of an AND3 gate is shown in Fig. 2.2. In most cases, the length of the longest path in the circuit also decreases, which reduces the latency as well. Hence, complex cells can help reduce both the area and the latency of a synthesized SFQ circuit for synchronous operation.

2.3.1 Cells with more than 2-inputs

Complex cells can be divided into two categories: (i) stand-alone cells and (ii) inter-connected cells. Stand-alone cells are like basic SFQ gates and have only one interferometer (inductive storage loop) or a few interferometers in parallel. These cells do not require clocking of interferometers in series, but their interferometers might be clocked in parallel. Generating cells of this kind is usually difficult. An example of a stand-alone cell is the A+BC cell shown in Fig. 2.3(a). It has two parallel interferometers formed by J11-L8-J14 and J13-L9-J15. Inter-connected cells are formed by connecting more than two basic cells together to obtain the desired functionality of a complex cell. However, such a complex cell has only one clock input. All the basic cells in a complex cell are clocked successively from the input side to the output side, with delays inserted in the clock line according to the timing characteristics of the basic cells. An example of an inter-connected complex cell implementing the same A+BC function is shown in Fig. 2.3(b). This cell contains two levels of logic, and hence the clock is distributed consecutively to both levels in a clock-follow-data manner.

Figure 2.3: Examples of complex cells: (a) schematic of the stand-alone A+BC cell, (b) inter-connected A+BC cell.

Figure 2.4: Circuit schematics of stand-alone complex cells: (a) AND3 and (b) OR3.

The drawback of designing larger complex cells is that, in general, their clock-to-Q delay is larger, which decreases the frequency of operation.
The delay of inter-connected cells increases with the number of levels of cells placed in a complex cell. For example, the delay of the A+BC cell in Fig. 2.3(b) is almost double that of a single 2-input AND gate. This implies that the final clock period of a circuit increases in proportion to the number of levels in an inter-connected complex cell. When these cells are routed through PTLs, which is often the case [37], the circuit's clock frequency is not affected if the time taken by the longest PTL connection to transfer a pulse is larger than the increased delay due to complex cells. The increase in the delay of stand-alone cells, however, is much smaller than that of inter-connected cells. For the A+BC cell in Fig. 2.3(a), the setup time is the same as that of an OR gate and the clock-to-Q delay is the same as that of an AND gate. Here, the change in the clock frequency of the whole circuit is negligible. Along with the A+BC cell, 3-, 4-, and 5-input AND and OR cells are also designed as stand-alone complex cells. Schematics of the 3-input AND and OR cells are shown in Fig. 2.4.

2.3.2 High-Fanout Splitters

Since SFQ cells cannot have a direct fanout of more than one, 1-to-2 splitters (SPLIT2) are generally used at the output of a cell to drive a larger fanout. For fanouts greater than two, SPLIT2 cells are cascaded in various tree-like structures. For example, an algorithm to insert splitters in a balanced binary tree structure is presented in Algorithm 2. As can be seen, a large number of splitters is required to implement fanouts, which increases the overall area, delay, and power consumption of the circuit. Some of the cost of the splitters can be saved if high-fanout splitters are available in the library. To this end, 1-to-3 and 1-to-4 splitters (denoted by SPLIT3 and SPLIT4) are designed to reduce the cost of the final circuit.
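The saving can be made concrete by counting splitter cells. The sketch below follows the recursion of Algorithm 2 (same split point, $(n_f + n_i + 1)/2$ with integer division) to count SPLIT2 cells and tree depth for a given fanout, and compares the result with the smallest number of splitters possible when cells with up to four outputs are allowed. The function names are illustrative. The lower bound relies only on the fact that a 1-to-k splitter adds k - 1 outputs, and it does not account for the JJ count or delay of each splitter type.

```python
# Splitter-tree bookkeeping for one net with a given fanout count.

def split2_tree(n_i, n_f):
    """Return (splitter_count, depth) of the balanced SPLIT2 tree that
    drives the fanouts indexed n_i .. n_f - 1, as in Algorithm 2."""
    size = n_f - n_i
    if size == 1:
        return 0, 0            # a single fanout is driven directly
    if size == 2:
        return 1, 1            # one splitter feeds both fanouts
    mid = (n_f + n_i + 1) // 2
    left_count, left_depth = split2_tree(n_i, mid)
    right_count, right_depth = split2_tree(mid, n_f)
    return 1 + left_count + right_count, 1 + max(left_depth, right_depth)

def min_splitters(fanout, max_outputs):
    """Fewest splitters needed when each splitter has at most max_outputs
    outputs: every 1-to-k splitter adds k - 1 outputs to the net."""
    if fanout <= 1:
        return 0
    return -(-(fanout - 1) // (max_outputs - 1))  # ceiling division

for fanout in (3, 5, 8, 32):
    count, depth = split2_tree(0, fanout)
    print(f"fanout {fanout:2d}: SPLIT2 tree uses {count} cells, depth {depth}; "
          f"with up to 4 outputs at least {min_splitters(fanout, 4)} cells")
```

A SPLIT2-only tree always needs one cell fewer than the fanout, whereas allowing SPLIT3 and SPLIT4 cells cuts the cell count to roughly a third and shortens the tree, which is where the stage-delay benefit comes from.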
2.3.3 Analysis of Cells

The functionality of the designed complex cells was verified by JSIM simulations [38]. All JJs in the simulation have a critical current density of 10 kA/cm$^2$ and are shunted with the resistance values required to obtain a Stewart-McCumber parameter ($\beta_c$) value of two. However, the major challenge in designing complex cells is to achieve acceptable margins. To this end, PSCAN2 [39] is used to optimize the cell parameters and derive their margins. The critical margins of each complex cell are reported in Fig. 2.5. We believe the margins can still be increased with more optimization. The delays of the complex cells are shown in Fig. 2.6.

Figure 2.5: Margins of multiple-input AND and OR cells and of 1-to-2, 1-to-3, and 1-to-4 splitters.

Figure 2.6: Clock-to-Q delays of multiple-input AND and OR cells and input-to-output delays of 1-to-2, 1-to-3, and 1-to-4 splitters.

Table 2.1: Effect of high-fanout splitter cells on decoder (DEC) designs. Multiple-input cells were used. The last column shows the percentage of cost reduction when high-fanout splitters are used during logic synthesis.

DEC        2-Output Splitters        High-Fanout Splitters     Cost
Design     Depth   JJs     Cost      Depth   JJs     Cost      Imp.
3-to-8     2       231     272       2       207     237       13%
4-to-16    2       583     722       2       519     625       13%
5-to-32    2       1,315   1,709     2       1,125   1,383     19%

2.4 Results of Logic Synthesis with Complex Cells

To show the advantages of multiple-input cells and high-fanout splitters, and to illustrate the usage of the developed SFQ-specific circuit synthesis tool, the following circuits are used as benchmarks: multiplexer (MUX), decoder, and carry look-ahead adder (CLA) designs of different sizes. Multiplexers are used to illustrate the effect of multiple-input AND and OR gates, as they do not have fanouts within them. Decoder circuits are used to show the advantage of high-fanout splitters, since they generate several outputs from a relatively small number of inputs, which means that they contain large fanouts.
Finally, the CLA, which is a fast adder, is used to show the advantage of having a special cell dedicated to realizing a Boolean function commonly used in CLAs, i.e., A + BC.

To capture the effect of both the area and the clock frequency of the circuit, a cost function is defined in terms of the total number of JJs ($N_{JJ}$) and the worst-case stage delay of the circuit. In the place-and-route results, the worst-case interconnect delay ($D_{int}$) was found to be 60 ps using passive transmission lines (PTLs) and a zero-skew clock distribution network. To capture the worst-case stage delay, the maximum clock-to-Q delay of all gates in the circuit and the maximum splitter network delay (together denoted $D_{gate}$) are added to $D_{int}$. The product of the number of JJs and the normalized worst-case clock period is then taken as the cost function to compare the logic synthesis results:

$$ \text{Cost} = N_{JJ} \left( \frac{D_{gate} + D_{int}}{D_{int}} \right) = N_{JJ} \left( 1 + \frac{D_{gate}}{D_{int}} \right) \tag{2.1} $$

The implementation results for three sizes of multiplexer (MUX) circuits, given in Table 2.2, show that the cost of implementation is much smaller with 3-, 4-, and 5-input AND and OR cells than with 2-input cells only. Specifically, there is a significant improvement in the logical depth of the MUX circuit. As a result, while a multiple-input cell implementation of the 16-to-1 MUX takes six clock cycles to finish, its 2-input cell implementation takes 20 cycles to complete its operation. The advantage of high-fanout splitters can be seen in the decoder circuit results presented in Table 2.1. Though the JJ count reduction with high-fanout splitters is not significant, the main advantage comes from the decrease in the worst-case stage delay. With both multiple-input cells and high-fanout splitters, it can be observed in all cases that as the circuit size and/or the logic depth of the circuit increases, the savings in area and delay increase, resulting in a considerable cost reduction.

Table 2.2: Effect of multiple-input cells on multiplexer (MUX) designs. SPLIT3, SPLIT4, and "A + BC" cells were not used. The last column shows the percentage of cost reduction when multiple-input cells are used during logic synthesis.

MUX        2-Input Cells             Multiple-Input Cells      Cost
Design     Depth   JJs     Cost      Depth   JJs     Cost      Imp.
4-to-1     6       227     260       3       142     163       37%
8-to-1     11      683     783       4       367     441       44%
16-to-1    20      2,016   2,311     6       855     1,028     56%

Figure 2.7: Placed and routed 8-bit CLA netlist synthesized with the A+BC complex cell and 2-input cells. Each rectangle shows one cell in a row-based placement.

The CLA implementation results show the combined effect of multiple-input gates and high-fanout splitters. Along with these two, we have also implemented the CLA using the special function A+BC, as it is used in calculating the carry values. Our implementation of the A+BC standard cell has 22 JJs, compared with 9 JJs each for the 2-input AND and OR gates and 5 JJs for the DFF (a total of 23 JJs in a 2-input gate implementation of the A+BC function). Though the advantage in JJ count at the cell level is insignificant, at the circuit level the overall JJ count decreases because the A+BC cell combines the effect of two logic cells and thereby reduces the total number of cells in the circuit. As a result, our best implementations of the 4-bit and 8-bit CLAs have 401 JJs and 1131 JJs, respectively, which is significantly less than the other two implementations.

Table 2.3: Results for placed and routed CLA circuit netlists

CLA netlist & mapped cells    Total area (μm²)    Longest signal routing PTL (μm)    Longest clock routing PTL (μm)    Avg. via count
4-bit, only 2-input           1455840             1580                               724                               2.68
4-bit, 2-input & A+BC         676400              940                                695                               2.65
8-bit, only 2-input           5376800             2200                               1323                              2.72
8-bit, 2-input & A+BC         2737800             1250                               955                               2.50

We have also placed and routed the synthesized CLA circuit netlists, both with basic 2-input gates only and with the A+BC cell, using a basic zero-skew H-tree clock network.
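As a worked illustration of the cost function in Eq. (2.1), the following sketch evaluates it and inverts it for the 4-to-1 MUX row of Table 2.2. The function names are hypothetical, and the gate delay of roughly 8.7 ps is back-computed here from the tabulated JJ count and cost under the stated $D_{int}$ of 60 ps; it is not a value reported in the text.

```python
def sfq_cost(n_jj: int, d_gate_ps: float, d_int_ps: float = 60.0) -> float:
    """Eq. (2.1): JJ count scaled by the normalized worst-case clock period."""
    return n_jj * (1.0 + d_gate_ps / d_int_ps)


def implied_gate_delay(n_jj: int, cost: float, d_int_ps: float = 60.0) -> float:
    """Invert Eq. (2.1) to recover D_gate from a tabulated (JJ count, cost) pair."""
    return (cost / n_jj - 1.0) * d_int_ps


def cost_improvement_pct(cost_before: float, cost_after: float) -> float:
    """Percentage cost reduction, as in the last column of Tables 2.1 and 2.2."""
    return 100.0 * (cost_before - cost_after) / cost_before


if __name__ == "__main__":
    # 4-to-1 MUX with 2-input cells (Table 2.2): 227 JJs, cost 260.
    d_gate = implied_gate_delay(227, 260)
    print(f"implied D_gate = {d_gate:.2f} ps")
    print(f"cost check     = {sfq_cost(227, d_gate):.0f}")
    # Cost 260 with 2-input cells versus 163 with multiple-input cells.
    print(f"improvement    = {cost_improvement_pct(260, 163):.0f}%")
```

Because the tabulated costs are rounded, the back-computed delay is only approximate.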
The comparison of physical layout area and longest PTL lengths in Table 2.3 demonstrates that there is a big advantage in designing and synthesizing circuits with complex cells. The routed layout of the 8-bit CLA with 2-input cells and the A+BC stand-alone complex cell placed in rows is shown in Fig. 2.7. Details of placement and routing, along with information about PTL wiring, can be found in [36].

2.5 Conclusions

SFQ technology faces various challenges before large-scale circuits can be realized. Specifically, due to the fundamental differences with CMOS technology, advanced CMOS CAD tools cannot be used directly for SFQ logic circuits. In this chapter, a new logic optimization approach is presented using a CMOS logic synthesis tool, ABC. Some of the logic synthesis features of ABC are used and some new features are added to ABC to obtain an SFQ circuit netlist from a Verilog HDL functional description of a circuit. The synthesis flow and the algorithms used for path balancing (DFF insertion) and for splitter insertion (fanout implementation) are also presented. Retiming, a technique for sequential circuits in VLSI, is used to reduce the path-balancing DFF count for SFQ circuits. The concept of two kinds of complex cells, stand-alone and interconnected, is introduced. The design and analysis of 3-, 4-, and 5-input AND and OR gates, the A+BC cell, and high-fanout splitters are presented as part of stand-alone complex cells. The advantage of these cells is demonstrated by synthesizing multiplexer, decoder, and carry look-ahead adder circuits with the ABC-based synthesis flow described in this chapter.

Chapter 3
Timing Characterization of SFQ circuits for Static Timing Analysis

The focus of this chapter is to develop a timing characterization method for SFQ standard cells that accurately models and quickly computes the timing parameters of an SFQ circuit while taking the interaction among different cells into consideration.
The most widely used tools for timing extraction of SFQ cells are JSIM [38], WRSpice [40], and PSCAN [41], which are full-circuit (analog) simulation tools for SFQ logic. Timing extraction using these simulators requires considerable effort to build the whole netlist and simulate it for various combinations of inputs to make sure that there are no timing violations. Towards developing a tool for timing analysis, a timing characterization method that uses look-up tables (LUTs) is proposed in this chapter. This method is specifically designed for SFQ logic circuits and is similar to the nonlinear delay model (NLDM) based LUTs for CMOS circuits. After investigating the timing characteristics of SFQ cells, the following LUTs are presented in this chapter. The clock-to-Q delay of clocked cells, the input-to-output delay of non-clocked cells (i.e., JTL, splitter, and merger), and the output pulse width of each cell are stored in 2-dimensional (2D) LUTs indexed by (i) the first series inductance and (ii) the difference between the critical current and the actual current flowing through the first parallel Josephson junction (JJ) at the output of the cell. Setup and hold times are stored in 1-dimensional (1D) LUTs indexed by the incoming pulse width. In the rest of the chapter, the proposed timing characterization method is called the NLDM approach.

3.1 Prior Work

The conventional way of cell-based design and timing analysis of SFQ circuits is described in [42], where cells are designed with different bias currents to obtain different delays and timing analysis is carried out with a single delay value assigned to a cell regardless of its context in the circuit. Ignoring this context dependency can introduce an error in the delay calculation (e.g., an error of up to 23% in estimating the delay of a splitter cell [43]).
Another work on timing is STATS [44], a statistical timing analysis tool whose focus is on the delay variation of an RSFQ circuit under thermal noise and fabrication variations. Though it captures the effect of process corners and noise effectively, it does not take the interaction among cells into consideration. Moreover, this method is not sufficient for clocked SFQ cells (AND, OR, XOR, etc.), because it does not deal with clock-to-Q delay. The latest technique for static timing analysis of SFQ circuits relies on a timing characterization of the logic cells in a cell library, whereby each cell is loaded by all other cells in the library and the timing values corresponding to each combination of driver and driven cell are measured and stored in the library characterization data set [43]. Though this method identifies the problem and considers the interaction among cells, it relies on an exhaustive set of stored values rather than on capturing the loading dependence, i.e., finding the right set of parameters from which the effect of each driven cell on the timing of its driver cell can be concisely and accurately modeled. The main drawback of this approach is that the delay tables of all cells must be updated whenever a new cell is added to the library or an existing cell's circuit schematic is modified.

3.2 Background

In conventional VLSI circuits, setup and hold times are defined as the windows of time before and after the triggering edge of the clock, respectively, during which the input data must be stable. For SFQ logic circuits, these definitions change slightly because SFQ is a pulse-based logic family. An input data pulse must arrive before the clock pulse by at least the setup time for the data pulse to be read correctly. Similarly, an input data pulse must arrive after the clock pulse by at least the hold time for the data pulse not to race through the cell.
In general, if an input data pulse does not satisfy the setup time constraint, its consequent output will not be read by the coming clock pulse, which causes a read-before-write hazard in the circuit. Likewise, if an input data pulse does not satisfy the hold time constraint of a clock pulse, it may race through, which causes a write-after-write hazard in the circuit. The setup time can be calculated as the time difference between the longest signal path delay to the state-changing circuit element in the cell (the two-junction interferometer in the case of an SFQ cell) and the shortest clock path delay to the same element. Similarly, the hold time can be calculated as the time difference between the longest clock path delay to the interferometer in the cell and the shortest signal path delay to the same.

We explain these timing definitions with the example DFF cell shown in Fig. 3.1. The longest signal path delay is the time difference between the settling of the increase in $I_{L2}$ (the current through L2) and the arrival of the input data pulse at the input port of the cell. An input pulse arriving at the input port is stored in the interferometer loop J1-L2-J2, and this storage is evidenced by an increased current through L2. The time instant at which $I_{L2}$ stabilizes at a new value can be taken for the measurement of the longest signal path. The shortest clock path delay is the time difference between the switching of J3 (J2), if the state of the DFF is 0 (1), and the arrival of the clock pulse at the clock port. The switching time of J3 (J2) can be measured at the instant when the junction's phase has increased by $2\pi$. The longest clock path delay is the difference between the settling of the decrease in $I_{L2}$ and the arrival of the clock pulse, whereas the shortest signal path delay is the difference between the switching of J1 and the arrival of the data pulse.
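The path-delay definitions of setup and hold time reduce to two subtractions, sketched below. The function names and the path delays in the example are hypothetical and are not measured values from this work.

```python
def setup_time(signal_path_delays_ps, clock_path_delays_ps):
    """Longest signal path delay minus shortest clock path delay (to the interferometer)."""
    return max(signal_path_delays_ps) - min(clock_path_delays_ps)


def hold_time(signal_path_delays_ps, clock_path_delays_ps):
    """Longest clock path delay minus shortest signal path delay (to the interferometer)."""
    return max(clock_path_delays_ps) - min(signal_path_delays_ps)


if __name__ == "__main__":
    # Hypothetical path delays in ps, for illustration only.
    signal_paths = [2.0, 3.5]
    clock_paths = [1.5, 4.0]
    print(setup_time(signal_paths, clock_paths))  # 3.5 - 1.5
    print(hold_time(signal_paths, clock_paths))   # 4.0 - 2.0

    # A fast signal path combined with a slow clock path gives a negative setup time.
    print(setup_time([1.0], [3.0]))
```

The last call shows how a negative setup or hold time can arise when one path is much faster than the other.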
The above description is useful for understanding setup and hold times with an insight into the detailed schematic of a gate. To obtain the exact characterization of setup and hold times, we have used simulations. Clock-to-Q delay is the time difference between the appearance of the resultant output pulse at the output pin and the arrival of the corresponding clock pulse at the clock pin of the cell.

Figure 3.1: (a) SFQ D flip-flop (J1, J2, and L2 form an interferometer, whereas J0 and J3 mediate pulse propagation), (b) simulation results.

3.2.1 Measurement Procedure

In our setup time measurements [45], we simulate a cell with the (peak of the) data signal D arriving one-half clock cycle before the (peak of the) relevant clock pulse CLK and measure the clock-to-Q delay (refer to Fig. 3.1(a)). The delay value measured under this condition is considered the nominal clock-to-Q delay. Next, we move the D pulse closer and closer to the CLK pulse, noticing that the clock-to-Q delay starts to increase. The setup time of a cell is defined as the minimum separation between the D and CLK pulses for which the clock-to-Q delay increases by no more than 10% above the nominal value. Hold time is measured similarly, but by varying the position of an input data pulse D that follows the CLK pulse.

3.3 Timing Characterization

Static timing analysis (STA) is a method for analyzing and validating the timing characteristics of a digital circuit without requiring a full circuit simulation. STA takes the gate-level netlist of a circuit, the cell library with timing characterization data, and the timing requirements (e.g., worst-case circuit latency) as inputs, and determines whether the circuit meets the timing requirements. In general, setup time and hold time violations are checked in STA.
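A per-pulse version of these checks, using the pulse-based definitions of Section 3.2, might look like the sketch below. It is an illustration rather than the STA tool itself; the function names are hypothetical, and the DFF setup and hold values are the nominal ones reported in Table 3.2.

```python
def meets_setup(t_data_ps: float, t_clk_ps: float, setup_ps: float) -> bool:
    """Data meant to be read by this clock pulse must lead it by at least the setup time."""
    return (t_clk_ps - t_data_ps) >= setup_ps


def meets_hold(t_data_ps: float, t_clk_ps: float, hold_ps: float) -> bool:
    """Data meant for the next clock pulse must trail this one by at least the hold time."""
    return (t_data_ps - t_clk_ps) >= hold_ps


if __name__ == "__main__":
    DFF_SETUP_PS, DFF_HOLD_PS = 0.71, 0.03  # nominal DFF values from Table 3.2
    t_clk = 10.0
    print(meets_setup(9.0, t_clk, DFF_SETUP_PS))   # leads the clock by 1.0 ps
    print(meets_setup(9.5, t_clk, DFF_SETUP_PS))   # leads the clock by only 0.5 ps
    print(meets_hold(10.5, t_clk, DFF_HOLD_PS))    # trails the clock by 0.5 ps
```

A negative setup or hold time simply relaxes the corresponding inequality, so the same comparisons apply to cells such as the AND and OR gates in Table 3.2.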
A setup time violation occurs when a signal arrives too late and misses the time when it should advance, and a hold time violation occurs when an input signal arrives too soon after the clock pulse's arrival. The most popular STA tool in the VLSI design community is Synopsys PrimeTime [46]. This tool uses the Liberty library format (.lib) to accurately and efficiently capture the behavior of standard cells [47]. However, the STA techniques of traditional VLSI cannot be directly applied to SFQ circuits. For CMOS standard cells, the delay and output transition time of a cell are characterized by the input capacitance of the driven cell (output load) and the slew rate of the data signal. Each cell is associated with a 2D LUT of characterized delay values with input capacitance and slew rate as LUT keys. This characterization is not applicable in the SFQ context, since SFQ is a pulse-based logic and the output capacitance is not an accurate representation of a load. This implies that a new set of parameters must be found to enable accurate calculation of delay values for SFQ cells.

3.3.1 LUT for Clock-to-Q delay

Translating the LUT keys directly from the CMOS context to the SFQ context, the slew rate and the output capacitance map to the pulse width and the impedance seen at the output of a cell (looking into the driven cell). However, the mechanism by which these parameters contribute to the timing information is not the same. An SFQ cell is made of several JJs, and the width of the incoming input pulse is important for the $2\pi$ leap of the superconducting phase in the first junction at the input side of the gate. If the incoming pulse is sharper (short pulse width), the ramp up to exceeding the $I_c$ of the JJ will be sharper and the delay will be smaller; if it is wider (large pulse width), the result will be the opposite.
For simplicity, since circuits of similar design distribute the clock pulse to all cells in an SFQ circuit, the width of the clock pulse is considered equal for all logic cells. Hence, we do not treat the clock pulse width as a required parameter for calculating context-dependent delay, although it can be considered in future implementations. In the following text, the identification of the LUT keys for our timing characterization method is explained.

In all the simulations, all JJs have a critical current density of 10 kA/cm² and are shunted with the resistance required to obtain a $\beta_c$ value of 2.0 for Nb/AlOx/Nb Josephson junctions. In our JSIM simulations of JTLs with different numbers of JJs and SPICE simulations of the extracted RCSJ model [48], we found that the input impedance of a long JTL depends only on two JTL stages. Here, an inductance and a JJ with its bias current are viewed as one stage of a JTL. Using the RCSJ model of the JJ, we derived a simplified expression for the input impedance of two stages of JTL, so that it can also be used for a general SFQ cell to derive the LUT keys for timing characterization. The impedance expression used for a JJ is

$$ Z_{JJ} = \frac{1}{\frac{1}{j\omega L_J} + j\omega C_J + \frac{1}{R_S}} \tag{3.1} $$

where $L_J$ is the junction equivalent inductance [49]. The expression for the impedance of two stages of JTL after simplification is

$$ Z_{JTL2} = j\omega L_1 + \frac{1}{j\omega C_{J1}} \tag{3.2} $$

The major simplification is to neglect the lower-order $\omega$ terms, since we deal with very high frequencies here. Hence, as a first-order delay approximation of a circuit, the impedance depends only on the first JTL stage, or more accurately, on the first inductance connected at the output ($L_1$ in equation (3.2)) and the parasitic capacitance of the first JJ ($C_{J1}$ in equation (3.2)). From here on, to simplify our characterization, we only consider an input subcircuit of the loading circuit, comprising a first series-connected inductance (denoted $L_{series}$) and a first parallel-connected JJ (denoted $J_{para}$).
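The high-frequency simplification can be made explicit. The following is only a sketch, assuming that the two-stage JTL is the ladder $L_1$, $J_1$, $L_2$, $J_2$ with each junction modeled by Eq. (3.1); the symbols $L_2$, $Z_{J1}$, and $Z_{J2}$ are introduced here for illustration.

```latex
% At high frequency the capacitive branch of the RCSJ model dominates Eq. (3.1):
Z_{JJ} = \frac{1}{\frac{1}{j\omega L_J} + j\omega C_J + \frac{1}{R_S}}
\;\xrightarrow{\ \omega \to \infty\ }\; \frac{1}{j\omega C_J}

% Looking into the ladder, J_1 is in parallel with the second stage. Since
% |j\omega L_2| grows with \omega while |Z_{J1}| shrinks, the parallel
% combination is dominated by Z_{J1}:
Z_{JTL2} = j\omega L_1 + Z_{J1} \parallel \left( j\omega L_2 + Z_{J2} \right)
\;\approx\; j\omega L_1 + \frac{1}{j\omega C_{J1}}
```

Under these assumptions only the first series inductance and the first parallel junction survive in the expression, which is the motivation for restricting the load model to those two elements.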
Several JSIM simulations are done by varying the inductance $L_{series}$ to measure its effect on the clock-to-Q delay of different cells. The change in inductance yields the expected results: larger values of $L_{series}$ give larger delay values (far-end) than smaller values. This is because a large inductance at the output of a gate leads to a smaller current flowing towards the connected JJ, which therefore takes longer to switch and generate a pulse. Keeping the $L_{series}$ value constant, one can expect an increase in the parasitic capacitance of $J_{para}$ to decrease the overall impedance and, subsequently, the delay. The capacitance can be changed through one of three quantities: the dielectric constant, the dielectric thickness, or the area of the junction. The dielectric material and its thickness are fixed for a process, and changing the area would change the critical current, which is the most important property of an individual junction. Hence, the capacitance is not considered for the remainder of this discussion and the focus is instead on the $I_c$ of the JJ and the supercurrent flowing through it. Note that the $I_c$ of a JJ also represents its area.

It is well established, and also observed in our simulation results, that the bias current plays a major role in deciding the delay of a circuit. From here on, bias current refers to the externally supplied current to the JJs of a cell, whereas quiescent current (denoted by $I_q$) refers to the actual current passing through the JJ. The impact of the bias current on the cell delay can be associated with the switching of the junction, which occurs when the current through the JJ reaches its $I_c$, producing an SFQ pulse. Precisely, a larger bias current results in a larger quiescent current through the JJ, so for an incoming SFQ pulse it takes less time for the current to rise from the JJ's quiescent current to its critical current, leading to a smaller delay.
Conversely, a smaller bias current leads to a larger delay. It must be pointed out that the bias currents are under our control. In contrast, the quiescent current of the JJ is a result of the controllable bias current and cannot be controlled precisely by the direct application of current. In fact, the quiescent current depends on (i) all values of the inductances connected to the biasing point, (ii) the critical currents of the JJs connected to those inductances, and (iii) the quiescent currents through those JJs. This current distribution cannot be derived from simple impedance calculations, since JJs are nonlinear inductors [50].

Through simulations, we have identified that the gap between the critical current and the quiescent current ($I_c - I_q$) is a good parameter to use as a key for clock-to-Q delay LUTs; it is referred to as the deficit current from here on. To show the dependence of a cell's delay on the deficit current, JSIM simulation results are given in Table 3.1 for a circuit consisting of an AND cell with a load made of a constant $L_{series}$ of 4 pH and a parallel JJ, $J_{para}$, with critical current $I_c$ and biasing current source $I_b$. Both $I_c$ and $I_b$ are varied to obtain different deficit current values, and these correlate closely with the corresponding measured clock-to-Q delay values. Table 3.1 shows the clock-to-Q delay with respect to the gap for different $I_c$ and $I_b$. It can be seen that, for a given deficit current, the delays for different critical currents are within 0.1 ps of each other. The $I_q$ of the JJ with an $I_c$ of 100 μA is larger than the applied $I_b$ because of the current coming from the other biasing sources in the AND cell towards this junction. For the other JJs, part of the current from the applied $I_b$ goes to other junctions in the circuit. This is because of the current redistribution effect.

Figure 3.2: Circuit for clock-to-Q delay LUT generation.

Table 3.1: AND cell clock-to-Q (C2Q) dependence on $J_{para}$ junction currents (Ref. Fig. 3.2)

I_c (μA)    I_b (μA)    I_q (μA)    I_c - I_q (μA)    C2Q Delay (ps)
300         284         245         55                4.35
200         170         145         55                4.30
100         34          45          55                4.28
300         260         230         70                4.58
200         144         130         70                4.50
100         2           30          70                4.57

Based on the above observations of correlation when varying $L_{series}$ and the deficit current of the $J_{para}$ connected to $L_{series}$, we chose these two parameters as keys for the clock-to-Q delay LUT. Accordingly, all cells are characterized by these two parameters to generate delay tables. Since two adjacent cells in a circuit cannot afford to have a large current redistributed between them, which would lead to the failure of the circuit, our SFQ cells are designed to have minimal current redistribution. With our cell library, we observed only a small difference (less than 3%) in the quiescent current of the peripheral JJs of SFQ cells when the adjacent connected gate is varied, indicating that the effect of current redistribution is minimal. Note that for some cells, the deficit current may vary based on the state of the cell and may also depend on the input vector. However, not all cells show a variation in delay based on the state and/or input vector, and the variation is very small compared to the actual delay of the cell.

3.3.2 LUTs for Setup and Hold times

For an SFQ cell, setup and hold times depend on (i) the device parameters of the cell and (ii) the widths of the input pulses. Since device parameters are fixed for a given cell, we did not consider timing characterization based on the device parameters of cells. An SFQ cell receives two types of pulses from other cells: (1) the input data pulse and (2) the clock pulse. The pulse width matters because the larger the width of an SFQ pulse, the longer it takes to supply adequate current for the nearest JJ to switch.
The width of these pulses depends on the cells that generate them (assuming no or minimal effect of current redistribution). An SFQ pulse, which has an irregular shape, is generally represented in terms of its amplitude and pulse width. However, for our purposes, a simpler and more efficient representation is its half-width crossing time. For this, the time at which the pulse crosses 10% of its peak amplitude is taken as the beginning of the pulse, and the pulse duration is measured from then until the peak is reached. The 10% threshold is chosen to effectively distinguish SFQ pulses from other voltage fluctuations caused during operation. For example, 100 μV is taken as the beginning of the pulse for a pulse with a peak amplitude of 1 mV. Another way to determine the width of the pulse is to build a criterion on the area under the pulse, which is equal to the single flux quantum value ($\Phi_0$), such as the time at which the pulse reaches 50% of the $\Phi_0$ value. From our simulations, we observed that the shape of an SFQ pulse depends mainly on the $\beta_c$ value (Stewart-McCumber parameter) and the deficit current of the pulse-generating JJ. In our simulations, all JJs have $\beta_c = 2$, and a comparable $I_c - I_q$ value is adopted for the output JJs of all cells. Hence, the pulse width does not change much (it stays around 3.5 ps), and the nominal values of setup and hold times are presented in Table 3.2. However, in the simulation results for process corner cases, the output pulse width differed considerably.

Table 3.2: Setup and Hold Times of SFQ cells

Gate    Setup time (ps)    Hold time (ps)
DFF     0.71               0.03
AND     -1.78              2.87
OR      3.13               -2.21
XOR     0.68               1.15

3.3.3 Generation of LUTs

To generate LUTs for clock-to-Q delay, the circuit in Fig. 3.2 is used. The critical current of $J_{para}$ is fixed at 150 μA, as it is the approximate average value of the input/output junctions in the simulated cells. The inductance $L_{series}$ is varied from 0 pH to 14 pH in steps of 1 pH (i.e., 15 different values).
For every inductance value, $I_{b1}$ is varied from 0 μA to 150 μA in steps of 1 μA. Next, $I_q$ (and thus $I_c - I_q$) through $J_{para}$ as well as the clock-to-Q delay are measured. With these measurements, the LUT entries are generated with <$L_{series}$, deficit current> as the keys and the clock-to-Q delay as the value. Note that the deficit current values are discretized to 8 different values ranging from 30 μA to 100 μA in steps of 10 μA. Thus, a 2D LUT of size 15×8 is generated for each cell in the library. Appropriate and fixed values are chosen for the circuit components other than $L_{series}$ and $J_{para}$ during the LUT generation. They model the load and partially account for the current redistribution that happens when a gate is placed in the context of a circuit. For setup and hold times, 1D LUTs of size 10 are prepared by varying the width of the input data pulse. Again, the input data pulse width is discretized from 1 ps to 10 ps in steps of 1 ps. The values of setup and hold times are different for different gates and can be either positive or negative depending on the circuit schematic and the component values.

Fig. 3.3 gives an example of formatting these LUTs in a .lib-like format for an input pin and an output pin (as a scalar, a 1D LUT, and a 2D LUT). The input pin has a 1D LUT with setup time values indexed by the incoming pulse width, which is shown as index_1. Similarly, the hold value is stored as a constant scalar value. The output pin Q has a 2D LUT with clock-to-Q delay values indexed by $L_{series}$ and the deficit current, which are shown as index_1 and index_2, respectively. Only a few values are shown for illustration purposes.

Figure 3.3: Formatting timing parameters in .lib-like format for (a) input pin (b) output pin

3.3.4 Delay Model for Non-clocked cells

SFQ logic has several non-clocked cells such as the Josephson transmission line (JTL), splitter, merger, etc.
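A 2D LUT with the keys described in Section 3.3.3, which the text reuses for non-clocked cells, might be stored and queried as in the sketch below. The axes follow the text (15 inductance points and 8 deficit current points). The table entries are filled by a hypothetical placeholder function rather than characterized data, and the bilinear interpolation with clamping is an assumption borrowed from common NLDM practice, not a detail stated in the text.

```python
from bisect import bisect_right

L_SERIES_PH = [float(v) for v in range(0, 15)]        # 0..14 pH in 1 pH steps (15 points)
DEFICIT_UA = [float(v) for v in range(30, 101, 10)]   # 30..100 uA in 10 uA steps (8 points)


def toy_delay_ps(l_ph: float, deficit_ua: float) -> float:
    """Hypothetical placeholder for simulated clock-to-Q delays; NOT characterized data."""
    return 4.0 + 0.05 * l_ph + 0.015 * (deficit_ua - 30.0)


# 15 x 8 table indexed as C2Q_LUT[inductance index][deficit current index].
C2Q_LUT = [[toy_delay_ps(l, d) for d in DEFICIT_UA] for l in L_SERIES_PH]


def _bracket(axis, x):
    """Clamp x to the axis range and return (low index, high index, fraction between)."""
    x = min(max(x, axis[0]), axis[-1])
    hi = min(max(bisect_right(axis, x), 1), len(axis) - 1)
    lo = hi - 1
    return lo, hi, (x - axis[lo]) / (axis[hi] - axis[lo])


def lookup_delay_ps(lut, l_ph: float, deficit_ua: float) -> float:
    """Bilinear interpolation between the four surrounding LUT entries."""
    i0, i1, fl = _bracket(L_SERIES_PH, l_ph)
    j0, j1, fd = _bracket(DEFICIT_UA, deficit_ua)
    at_l0 = lut[i0][j0] * (1.0 - fd) + lut[i0][j1] * fd
    at_l1 = lut[i1][j0] * (1.0 - fd) + lut[i1][j1] * fd
    return at_l0 * (1.0 - fl) + at_l1 * fl


if __name__ == "__main__":
    # Load seen by the driver: 4 pH series inductance and 55 uA deficit current.
    print(f"{lookup_delay_ps(C2Q_LUT, 4.0, 55.0):.3f} ps")
```

In an STA flow, the two keys would come from the characterization data of the load cell, while the table itself belongs to the driving cell.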
These non-clocked cells are treated as combinational cells, and their input-to-output delay is calculated with the help of LUTs. For timing characterization, 2D LUTs with the same keys as for the clock-to-Q delay are generated for all non-clocked cells. These LUTs are used to calculate context-dependent input-to-output delays. For a combinational cell, the delay is also sensitive to the cell at its input side. For example, the delay of a splitter differs depending on the cell driving its input pin. LUTs based on input cell characteristics can be developed similarly to the LUTs shown in this chapter for the load-dependent delay. For simpler calculations, the worst-case delay of a cell can be used, resulting in conservative delay values.

3.3.5 Delay model for PTL connections

In an SFQ circuit, the output of a cell is often connected to the input of the next cell through a JTL or a passive transmission line (PTL). A JTL cell can be treated as any other non-clocked cell. When PTLs are used for the connection (Fig. 3.5(b)), which is highly preferred for long distances, a PTL transmitter (Tx) cell is attached to the output of the sending gate and a PTL receiver (Rx) cell is attached to the input of the receiving gate. This way, the PTL is connected between the Tx and Rx cells and not between the actual gates. For convenience, Tx and Rx cells are usually integrated into standard cells. The design of the Tx and Rx cells changes according to the characteristic impedance of the PTL. However, the PTL impedance does not change for an SFQ fabrication process [51]. For timing purposes, Tx and Rx cells can be considered combinational cells. In the case of a PTL connection, delay values are calculated with the following equations (refer to Fig. 3.5(b)): $T_{\text{Clk1-to-X}} = C2Q_{\text{Gate1}} + T_{Tx}$ and $T_{\text{Clk1-to-Q}} = C2Q_{\text{Gate1}} + T_{Rx} + T_{PTL} + T_{Tx}$. Here, $T$ and $C2Q$ represent the delay and the clock-to-Q delay of their subscripts, respectively.
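The two PTL delay equations can be sketched as follows. The function names and the component delays in the example are hypothetical. The default propagation speed is the typical value quoted in this section, and the 2200 μm length is the longest signal routing PTL of the 8-bit, 2-input-only CLA in Table 2.3.

```python
def ptl_delay_ps(length_um: float, speed_m_per_s: float = 1.25e8) -> float:
    """Time of flight on a PTL: length divided by propagation speed, in ps."""
    speed_um_per_ps = speed_m_per_s * 1e-6  # 1 m/s equals 1e-6 um/ps
    return length_um / speed_um_per_ps


def clk1_to_x_ps(c2q_gate1_ps: float, t_tx_ps: float) -> float:
    """Delay from the clock of the sending gate to the output of its Tx cell."""
    return c2q_gate1_ps + t_tx_ps


def clk1_to_q_ps(c2q_gate1_ps: float, t_tx_ps: float,
                 t_ptl_ps: float, t_rx_ps: float) -> float:
    """Delay from the clock of the sending gate to the output of the Rx cell."""
    return c2q_gate1_ps + t_tx_ps + t_ptl_ps + t_rx_ps


if __name__ == "__main__":
    t_ptl = ptl_delay_ps(2200.0)  # longest signal PTL of the 8-bit CLA in Table 2.3
    print(f"T_PTL = {t_ptl:.1f} ps")
    # Hypothetical component delays in ps, for illustration only.
    print(f"T_Clk1-to-Q = {clk1_to_q_ps(7.0, 4.0, t_ptl, 5.0):.1f} ps")
```

At 1.25 × 10^8 m/s a pulse covers 125 μm per picosecond, so for millimeter-scale routes the time of flight dominates the Tx and Rx cell delays.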
$T_{PTL} = l / speed_{PTL}$, where $l$ is the length of the PTL used for the interconnection and $speed_{PTL}$ is the speed of signal transmission on the PTL, whose typical value is around $1.25 \times 10^8$ m/s [12]. Dispersion of the SFQ pulse and PTL length limits [52] are not considered in these calculations. Since the design of the Tx and Rx cells is fixed, $T_{\text{Clk1-to-Y}}$ does not change for Gate 1 based on its context in the circuit, except for the change in $l$. So, we have a single look-up value each for the clock-to-Q delay of Gate 1 with Tx as the load and for the delay of Tx, $T_{Tx}$. However, for the calculation of the input-to-output delay of Rx, $T_{Rx}$, an LUT is maintained, as its value changes based on its load, Gate 2. To maintain signal integrity in the case of long PTL connections, a recommended practice is to place buffers (a connection of Rx and Tx) at regular length intervals of the PTL. The delay of such buffered PTLs can be found with equations similar to those mentioned earlier, with a single delay value for the buffer.

3.3.6 Impact of Bias Current

It is well known that the bias current level of a cell affects the delay of the circuit by affecting not only the delay of the cell itself but also that of its neighboring cells through the current redistribution effect. A bias current larger (smaller) than the designed value leads to a smaller (larger) delay. The discussion so far has focused on calculating the load-dependent delay and assumed that the bias current stays at the designed value for every cell. In Section 3.3.1, it is mentioned that the change in quiescent current due to current redistribution is less than 3%, and hence a single deficit current value is stored for every cell as characterized data. Since the bias current level also changes the level of the quiescent current, a simulation experiment is run to measure the clock-to-Q delay values while changing the bias current level.

Figure 3.4: Clock-to-Q delay values measured for the AND-OR connection over different bias current levels.
The bias current levels of both Gate 2 and Gate 3 in Fig. 3.6 are changed and the clock-to-Q delay of Gate 3 is measured. The results presented in Fig. 3.4 are for Gate 2 and Gate 3 being AND and OR gates, respectively. It is observed that the change in delay is around 4% for a 4% change in bias current, indicating that a change in delay due either to current redistribution or to a small bias current variation does not lead to a large variation in the calculated final delay value.

Figure 3.5: Interconnection of two SFQ cells with (a) JTL (b) PTL

3.4 Simulation Results

The test circuit in Fig. 3.6 is used to compare the results for clock-to-Q delay and for setup time and hold time violations between our NLDM approach and JSIM simulations. The first gate is a DFF, which is used to generate a non-ideal SFQ pulse. Gate 2 and Gate 3 are chosen from three common SFQ gates, i.e., AND, OR, and XOR, resulting in a total of nine test cases. For each test case, the minimum time period ($T_{min}$) is measured as the time from the arrival of Clk 2 until the output pulse of Gate 2 settles in Gate 3's internal flop (interferometer). In other words, this delay is the minimum time period required for Gate 2 to complete its computation (i.e., the C2Q delay of Gate 2) and prepare the result to be used by Gate 3 in the next clock cycle (i.e., the setup time of Gate 3). We have also verified, using the NLDM approach, that there are no setup time or hold time violations with the calculated $T_{min}$. The comparison of results is presented in Table 3.3. The maximum percentage difference between JSIM simulations and our NLDM approach for clock-to-Q delay is 2.1% among all test cases. Similarly, the maximum difference for setup and hold times is 3.2%. The differences between simulation and the NLDM approach can be accounted for by the assumptions that we made to obtain compact LUTs. The delay usually depends on the first two inductances and two junctions.
But we made the reasonable assumption of considering only the first series inductance and the first parallel junction, and we introduced the new parameter $I_c - I_q$ (deficit current) to obtain 2D LUTs instead of using multiple parameters, which would result in higher-dimensional LUTs. The accuracy varies from case to case because of the current redistribution effect, which was minimized by a robust design of the standard cell library and of the circuit used for LUT generation. For every cell, the first inductance and the deficit current value of the first JJ at the input side are stored in the cell characterization data. Given a gate-level netlist, the clock-to-Q delay or the input-to-output delay of a cell can be calculated using this data of the following (load) cell. This characterized deficit current value of the first JJ at the input is the most reasonable value for obtaining a result that is as accurate as possible. However, due to the current redistribution effect, this value changes slightly from case to case, depending on the gates that are connected adjacent to the gate under consideration.

For CMOS design tools (such as PrimeTime), the gate delays calculated with both the CCS and NLDM delay models differ by 5% from HSPICE simulations. Similarly, there is a 9% variation in setup time values [53].

Table 3.3: Comparison of JSIM simulations vs. our NLDM approach for the clock-to-Q delay under different test cases of Fig. 3.6

Gate 2   Gate 3   JSIM C2Q (ps)   NLDM C2Q (ps)   Difference (%)   Tmin (ps)
AND      AND      6.88            6.90            -0.2             7.39
OR       AND      4.78            4.80            -0.4             5.99
XOR      AND      8.60            8.50            +1.1             9.69
AND      OR       7.56            7.40            +2.1             10.30
OR       OR       5.55            5.60            -0.9             8.52
XOR      OR       9.91            9.84            +0.7             13.10
AND      XOR      6.62            6.57            +0.8             9.31
OR       XOR      4.48            4.47            +0.2             7.45
XOR      XOR      8.00            7.94            +0.7             10.90

Figure 3.6: Test circuit for the proposed timing characterization approach.

Results for the delay calculation of non-clocked cells (described in Section 3.3.4) are presented using a JTL interconnection.
A comparison of delay values is given in Table 3.4 for a varying number of JTL stages, as shown in Fig. 3.5(a), with n = {1, 2, 3, 4, 5}. Gate 1 and Gate 2 are fixed to AND and OR cells, respectively. The delay from point X to point Y in Fig. 3.5(a) is calculated with our NLDM approach and compared against JSIM simulations. The comparison shows a maximum difference of 3.6%. The NLDM input-to-output delay of splitters with different load cells is also compared against JSIM simulations, which resulted in a maximum difference of 4.0%. Similarly, a comparison of results for different combinations of PTL connection (varying the length of the PTL and the receiving gate) is presented in Table 3.5 (refer to Fig. 3.5(b)), with different PTL delay values and different loads, and with an AND gate as the (pulse) sending gate. The difference for the shorter PTL connection is comparatively larger because the effect of reflections is more pronounced at shorter lengths [54].

Table 3.4: Comparison of input-to-output delay with JTL interconnection of n stages

# of JTL stages   JSIM (ps)   NLDM (ps)   Difference (%)
1                 2.29        2.27        +0.87
2                 3.90        3.76        +3.58
3                 5.35        5.25        +1.86
4                 6.90        6.74        +2.31
5                 8.45        8.23        +2.60

Table 3.5: Comparison of delay with PTL interconnection (ref. Fig. 3.5(b))

Gate 1   Gate 2   T_PTL (ps)   T_Clk1-to-Q JSIM (ps)   T_Clk1-to-Q NLDM (ps)   Difference (ps)
AND      OR       10           25.80                   26.70                   -0.90
AND      OR       100          116.40                  116.70                  -0.30
AND      OR       1000         1016.40                 1016.70                 -0.30
AND      AND      100          116.20                  116.27                  -0.07
AND      XOR      100          116.00                  116.07                  -0.07

In general, to implement a circuit in SFQ logic, the circuit has to be path-balanced (with DFFs) and splitter-inserted [55] (for fanouts greater than 1) before trying to obtain the minimum clock period and the latency of the (gate-level pipelined) circuit. We have implemented a 4-bit ripple carry adder and a 4-bit array multiplier and compared the results of JSIM simulations with the NLDM approach.
From the JSIM simulations, we obtained minimum time periods of 8 ps for the adder and 14.1 ps for the multiplier, whereas the estimates from our NLDM approach are 8.15 ps (1.9% difference) for the adder and 15.65 ps (3.6% difference) for the multiplier. From Tables 3.4 and 3.5, it can be seen that the NLDM results are biased compared to JSIM. Since these values are biased in one direction (either positive or negative), they can be calibrated to obtain delay values that are as accurate as possible during STA of a large-scale circuit.

3.4.1 Process corners

The results provided so far are for the nominal values of the circuit components in our cell library. All the component values are subject to fabrication variations. To validate the characterization method presented in this paper, the results of the NLDM approach are compared with JSIM simulations for the process corners presented in Table 3.6 (which is used by [43] for their analysis of timing corners). Here, XI, XJ, XL, and XR refer to the scaling of bias currents, critical currents of JJs, inductances, and JJ shunt resistances, respectively. All the circuit components of a gate are changed according to a process corner definition and the corresponding delay LUTs are generated for each process corner. For the clock line, nominal values are used. Clock-to-Q delay values are calculated with our NLDM approach using these LUTs and are compared against JSIM simulations with different loads for the same process corners. Fig. 3.7 presents the delay comparison between JSIM simulation results and the NLDM approach for gates at process corners.

Table 3.6: Process Corner Definition

Corner           XI     XJ     XL     XR
Nominal          1      1      1      1
Fast-Fast (FF)   1.1    0.9    0.95   1.05
Fast (F)         1      0.9    0.95   1.05
Slow (S)         1      1.1    1.05   0.95
Slow-Slow (SS)   0.9    1.1    1.05   0.95

Figure 3.7: Clock-to-Q delay comparison for process corners between JSIM simulations and NLDM approach for OR gate with different loads (AND, OR, XOR).
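As a rough illustration of how the corner definitions of Table 3.6 could be applied before regenerating the delay LUTs, the sketch below scales the component values of one cell by the XI, XJ, XL, and XR factors. The scaling factors are those of Table 3.6; the dictionary layout, the function name `scale_cell`, and the nominal component values are hypothetical and are not taken from the cell library.

```python
# Scaling factors from Table 3.6: bias current (XI), JJ critical
# current (XJ), inductance (XL), and JJ shunt resistance (XR).
CORNERS = {
    "Nominal": {"XI": 1.0, "XJ": 1.0, "XL": 1.0, "XR": 1.0},
    "FF": {"XI": 1.1, "XJ": 0.9, "XL": 0.95, "XR": 1.05},
    "F": {"XI": 1.0, "XJ": 0.9, "XL": 0.95, "XR": 1.05},
    "S": {"XI": 1.0, "XJ": 1.1, "XL": 1.05, "XR": 0.95},
    "SS": {"XI": 0.9, "XJ": 1.1, "XL": 1.05, "XR": 0.95},
}

# Hypothetical nominal component values of one cell (illustration only).
NOMINAL_CELL = {
    "bias_uA": [175.0, 350.0],
    "jj_ic_uA": [250.0, 250.0],
    "inductance_pH": [2.0, 4.0],
    "shunt_ohm": [4.0, 4.0],
}

# Which scaling factor applies to which group of components.
FACTOR_FOR = {
    "bias_uA": "XI",
    "jj_ic_uA": "XJ",
    "inductance_pH": "XL",
    "shunt_ohm": "XR",
}


def scale_cell(nominal, corner):
    """Return a copy of the cell with every component scaled for `corner`."""
    factors = CORNERS[corner]
    return {
        group: [value * factors[FACTOR_FOR[group]] for value in values]
        for group, values in nominal.items()
    }


if __name__ == "__main__":
    for name in CORNERS:
        print(name, scale_cell(NOMINAL_CELL, name))
```

Each scaled cell would then be re-simulated to produce the LUTs for that corner, while the clock line keeps its nominal values as stated above.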
An average difference of only 1.6% and a maximum difference of 6% are observed among all process corners for the test cases. Though the delay variation does not cause the circuits to fail in the F and FF process corners, the S and SS corners are susceptible to the late arrival of data pulses, resulting in a setup violation that causes the circuit to fail. This failure can be avoided either by operating the circuit at a reduced frequency or by designing the clock-distribution line with the additional delay required to sustain the variations.

3.5 Conclusions

Due to the fundamental differences between SFQ technology and CMOS technology, advanced CMOS CAD tools such as timing characterizers cannot be used directly for SFQ logic circuits. A context-dependent timing characterization method is developed in this chapter, which generates LUTs for delay calculation by looking at the load of the cell. This is the NLDM method for SFQ circuits and is a step towards building an STA tool with load-dependent clock-to-Q delay and pulse-width-dependent setup and hold times, similar to the advanced CMOS timing design tools. In order to minimize the dimensions of the LUTs, we identified the critical parameters on which the delay of SFQ cells depends. We have also presented the approach to generate the required LUTs through simulation. The NLDM approach can find the clock cycle of an SFQ circuit without analog simulation of the whole circuit and can find setup and hold time violations with high accuracy. The clock-to-Q delay comparison of our NLDM approach with JSIM simulations gives a maximum difference of 2.1% for the nominal case and 6% over all process corners in our test cases of clocked cells.

Chapter 4

Simulation Analysis of ERSFQ Circuits

Several energy-efficient superconducting logic circuits have been proposed and are being developed, including SFQ [19, 56, 57, 58, 59] and adiabatic logic technologies [60].
One of the most promising and developed energy-efficient SFQ technologies is ERSFQ [56]. In contrast to canonical Rapid SFQ (RSFQ) [6, 61], ERSFQ has no static power dissipation. This is achieved by replacing the resistor-based dc bias current distribution network with a superconducting network based on bias-limiting Josephson junctions (JJs) and inductors. As a result, the dominant static power dissipation associated with the resistor network is eliminated. In the following text, an ERSFQ branch refers to the series connection of a bias inductance and a bias-limiting junction, which is connected to the bias bus to draw a bias current equal to the critical current ($I_c$) of the bias-limiting junction in normal operation. The $I_c$ of a bias-limiting junction determines the maximum bias current that can be fed to an ERSFQ branch in the zero-voltage (superconducting) regime.

In contrast to the current-bias source in RSFQ, a voltage bias source is used for biasing ERSFQ circuits. It is typically formed by a special Josephson transmission line (JTL), called the feeding JTL (FJTL), connected via bias-limiting junctions and large inductances to the bias bus to draw the bias current and to maintain the average voltage of the bias bus. Note that the FJTL is biased in the same way as the logic circuit. The average bias bus voltage is set by the frequency ($f_{FJTL}$) of the SFQ pulses fed to the FJTL (called the feeding clock) and is given by $V_b = f_{FJTL} \Phi_0$, where $\Phi_0$ is the magnetic flux quantum. This average voltage is equalized by the sum of the average voltages due to the switching of JJs in the ERSFQ logic circuit (due to circuit activity) and of its bias-limiting junctions. As a result, the dc bias current distributed to the ERSFQ gates is adaptively determined by balancing the average voltage across individual bias terminals against the voltage across the FJTL and the common bias bus. Typically, the logic circuit is operated at a clock frequency ($f_{CKT}$) lower than $f_{FJTL}$.
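The relation between feeding clock frequency and average bias bus voltage can be checked numerically. The short Python sketch below evaluates $V_b = f_{FJTL} \Phi_0$ using the standard value of the magnetic flux quantum; the function name and the two example frequencies are chosen only for illustration.

```python
# Magnetic flux quantum h / (2e) in webers (CODATA value).
PHI_0 = 2.067833848e-15


def bias_bus_voltage(f_fjtl_hz):
    """Average bias bus voltage V_b = f_FJTL * Phi_0, in volts."""
    return f_fjtl_hz * PHI_0


if __name__ == "__main__":
    for f_ghz in (2.0, 4.0):
        v_uv = bias_bus_voltage(f_ghz * 1e9) * 1e6
        print(f"f_FJTL = {f_ghz} GHz -> V_b = {v_uv:.2f} uV")
```

The resulting voltages are only a few microvolts: roughly 4.14 µV at 2 GHz and 8.27 µV at 4 GHz.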
In the following sections, the externally supplied bias current and its designed value for a circuit are represented by $I_{SB}$ and $I_{SBD}$, respectively.

Shawawreh et al. [62] have experimentally studied ERSFQ biasing and some of its effects on the operational margins of circuits, using a 64-bit shift register and an 8-to-1 NDRO-based multiplexer to represent synchronous and asynchronous circuits, respectively. In this paper, smaller versions of the same circuits are used to analyze and optimize the ERSFQ biasing network by simulation. We report the results of a detailed simulation-based analysis of the ERSFQ biasing network during circuit operation, together with the optimal values of various circuit components of the ERSFQ biasing network with consideration of the parameter margins of a circuit.

Figure 4.1: (a) Example of the ERSFQ biasing scheme with a JTL as the logic circuit. $L_{Bi}$, $L_{BFi}$ and $BJ_i$, $BJ_{Fi}$ represent the bias inductors and bias (limiting) junctions for the logic circuit and the feeding JTL, respectively. (b) High-level abstraction of ERSFQ biasing.

4.1 Simulation Setup

4.1.1 ERSFQ Biasing Scheme

To follow up the work in [62], we have used an 8-bit shift register (SR) representing synchronous circuits and an 8-to-1 NDRO-based multiplexer (MUX) representing asynchronous circuits as test benches for the simulation study. An example of the ERSFQ biasing scheme is shown in Fig. 4.1 with a simple JTL as the logic circuit. It can be seen that both the feeding JTL [56] and the logic circuit require biasing circuitry, which is a series combination of a bias inductance ($L_{Bi}$, $L_{BFi}$) and a bias-limiting junction ($BJ_i$, $BJ_{Fi}$). Each circuit is simulated with several sizes of FJTL to analyze the effect of the FJTL size on the operation of an ERSFQ circuit. Note that an increase in the FJTL size requires an increase in the externally supplied bias current to support the additional junctions of the FJTL. Below, we mention a few design issues for setting up our test benches.
All simulations are done in the WRSpice [40] circuit simulator.

Bias-limiting junctions

As per the ERSFQ biasing scheme, the bias-limiting junctions should have a critical current equal to the nominal (or designed) bias current for a specific branch, and the sum of all these critical current values ($I_{SBD}$) is equal to the externally supplied bias current ($I_{SB}$, from a current source) [56]. If the critical current values are kept exactly at the designed values, these junctions can switch more often than required. This is because a slightly higher current flows through the bias JJs due to the dynamic voltage changes on the bias bus. Even though this does not cause any problem for the bias current distribution, it can result in unnecessary dynamic power consumption. The IV-characteristics of the bias-limiting JJs are in the ERSFQ active current limiting range, as illustrated in [62, Fig. 6]. Hence, a critical current value slightly higher than the design value is preferred, to keep the bias current in each branch close to the critical current value without exceeding it. In all the simulations, the bias-limiting junctions are overdamped with a $\beta_c$ value (Stewart-McCumber parameter) of 0.5, and all other junctions in the logic circuit and the FJTL have a $\beta_c$ value of 1.

Power-up procedure

An important part of the ERSFQ biasing scheme is the initial distribution of the bias current (from the current source) along the bias bus to set the correct value of bias current for all the ERSFQ branches; this process is called the power-up procedure [19, 56]. A phase drop appears in the inductance of the superconducting bias bus during this procedure, and this phase drop is compensated by the automatic switching of the bias-limiting junctions of all branches until approximately correct bias currents are set in all branches with zero voltage across all bias junctions.
To ensure the phase balance in the bias bus, the current source should initially ramp up to a current value larger than the designed total current value ($I_{SBD}$) for a sufficient time (to allow the bias-limiting junctions to complete the bias current redistribution) and then ramp down to $I_{SBD}$. Similarly, adequate time should be given for settling to a steady-state average current value in all the branches before the circuit is operated. In the case of phase imbalance or inadequate settling time, the biasing scheme may not work as expected.

Feeding JTL

The FJTL plays the main role in setting up the voltage across the bias bus and in determining the operating margins of the circuit. Proper design and connection of all JJs in the FJTL are very important to ensure the benefit of a larger FJTL [56]. If the feeding clock does not propagate through all the JJs in the FJTL (due to issues such as missed connections, improper biasing or design, and fabrication issues), the operating margins of a circuit can become smaller, depending on the number of missed JJs. For example, if there is a missed connection at the feeding clock source, the margins of the circuit will be worse than those of the circuit without any feeding JTL.

4.1.2 Tolerance Comparison

It is generally accepted that the resistors in the bias network of an RSFQ circuit can be replaced by an ERSFQ biasing scheme without any changes to the rest of the circuit components of a logic gate [56]. No published study has yet confirmed this assumption by comparing the bias tolerances through either simulation or experiment, though its validity can be seen intuitively. As part of this study, we compared the margins and the Monte Carlo yield of a shift register circuit to check the validity of this assumption.

Margins

A 4-bit shift register is simulated with both ERSFQ and RSFQ biasing schemes.
Margins of a D-flipflop cell are found in the context of a shift register instead of simulating the cell in isolation. All the circuit components of the D-flipflop cell are individually varied from 50% to 150% of their nominal values and the point of failure is noted for every component. The result is presented in Fig. 4.2 as the range of each parameter over which the circuit functions when the other parameter values are kept constant at their individual designed values. RSFQ and ERSFQ cells have comparable margins, and the components for which the margin value depends on the biasing scheme are shown in the figure.

Yield

A Monte Carlo simulation with 10,000 iterations is carried out to find the yield of a 4-bit shift register with both ERSFQ and RSFQ biasing. A sigma value of 3% is taken for all three circuit components (JJ area, inductance, and resistance). This simulation is done with 100% correlation among circuit components of the same type and no correlation among circuit components of different types. This resulted in RSFQ having a yield of 92.9% compared to 94.6% for ERSFQ, indicating that the shift register circuit has a slight advantage with ERSFQ biasing.

Figure 4.2: (a) Schematic of D-Flipflop. (b) Margins of the circuit parameters in a D-Flipflop cell performing the same functionality in the context of a shift register circuit with ERSFQ (E:) and RSFQ (R:) biasing schemes. Parameters having the same margin values in both schemes are not shown here. The designed values are shown alongside the parameters.

4.2 Simulation Analysis of ERSFQ Biasing

In this work, the bias inductance, the bias junction critical current, and the feeding JTL size are analyzed independently in order to make recommendations for each of them separately.
Figure 4.3: (a) Variation in the delivered bias current and (b) the count of bias junction flips with bias inductance variation for an FJTL branch in ERSFQ circuit test benches.

4.2.1 Study on Bias Inductance

The inductance of the series-connected (bias) inductor ($L_B$) may not significantly affect the average bias current delivered to the circuit. However, it acts as a filter that reduces the magnitude of the deviation from the designed current delivered to the individual ERSFQ branches, as each switching event changes the gate bias current by $\Delta I = \Phi_0 / L_B$. Typically, a larger inductance value is preferred to decrease these bias current fluctuations and to decrease the bias current step increment after each bias junction phase flip. For a regular ERSFQ branch, the flux stored in the bias inductance in normal operation is $I_{cBJi} L_{Bi}$. When $I_{SB}$ is reduced below the $I_{SBD}$ value, to have the same fraction of bias current reduction in every ERSFQ branch, all bias inductances should store the same amount of flux, i.e., the same number of flux quanta ($I_{cBJi} L_{Bi} / \Phi_0$). However, in this study, we used a constant value of bias inductance for all the branches, as this is more practical from the perspective of layout on a chip than separate inductances for each branch chosen to store an equal amount of flux.

In this study, the bias inductance is varied from 10 pH to 900 pH for both the SR and MUX circuits to find the optimum value of bias inductance. Fig. 4.3(a) shows the profile of the current variation (difference between the maximum and minimum values of the current delivered) to an FJTL branch as the inductance increases from a smaller value to a larger value. Similarly, Fig. 4.3(b) shows the number of times the bias junction has switched ($2\pi$ phase shift, or a flip) over a fixed period of time for different bias inductance values.
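To give a feel for the magnitudes involved, the following Python sketch evaluates the bias current step $\Delta I = \Phi_0 / L_B$ per bias junction flip and the number of flux quanta $I_c L_B / \Phi_0$ stored in a bias inductor. The inductance values span the 10 pH to 900 pH range studied here; the 350 µA critical current is used only as an illustrative branch value, and the function names are hypothetical.

```python
# Magnetic flux quantum h / (2e) in webers (CODATA value).
PHI_0 = 2.067833848e-15


def bias_current_step(l_b_henry):
    """Bias current change per bias junction flip: delta_I = Phi_0 / L_B (A)."""
    return PHI_0 / l_b_henry


def stored_flux_quanta(i_c_ampere, l_b_henry):
    """Number of flux quanta stored in a bias inductor: I_c * L_B / Phi_0."""
    return i_c_ampere * l_b_henry / PHI_0


if __name__ == "__main__":
    i_c = 350e-6  # illustrative bias junction critical current (A)
    for l_ph in (10, 100, 300, 900):
        l_b = l_ph * 1e-12
        step_ua = bias_current_step(l_b) * 1e6
        quanta = stored_flux_quanta(i_c, l_b)
        print(f"L_B = {l_ph:4d} pH: step = {step_ua:7.2f} uA, "
              f"stored flux = {quanta:6.1f} Phi_0")
```

The step falls from roughly 207 µA at 10 pH to about 6.9 µA at 300 pH and 2.3 µA at 900 pH, so most of the reduction is already obtained at a few hundred picohenries.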
If the feeding clock frequency is greater than or equal to the maximum operating frequency of the circuit, which is the case for the results shown in Fig. 4.3, the bias junction of the FJTL is not supposed to switch. It can be observed that this holds only for inductance values greater than 210 pH. The unnecessary bias junction switching adds to the dynamic power consumption and also causes a variation in the voltage on the bias bus. The variation in current goes from approximately 220 µA to 3 µA as the bias inductance is varied from 10 pH to 900 pH for both test bench circuits, with a very small change in variation for values greater than 300 pH. Hence, it can be said that a larger inductance does not necessarily produce better results, and the optimum value can be taken as 300 pH. At the same time, it should be noted that a larger inductance leads to a longer settling time for setting up the bias currents in all the branches of an ERSFQ circuit.

4.2.2 Study on the Size of Feeding JTL

It is understood that the feeding JTL helps in maintaining a constant voltage on the bias bus, whose value is determined by the feeding clock frequency. The SR and MUX circuits are simulated with several sizes of feeding JTL. As the size of the FJTL increases, the current supplied by the current source ($I_{SB}$) needs to be increased to support the additional JJs in the feeding JTL. For example, the 8-bit SR circuit with an FJTL having a JJ count of 0, 16, and 64 requires a bias current of 8960 µA, 11,760 µA, and 20,160 µA, respectively. These circuits are expected to function properly when supplied with the designed current value. The characteristics of ERSFQ biasing are studied for the SR and MUX circuits with different sizes of FJTL by varying the level of $I_{SB}$ (45% to 125% of the designed value). Fig.
4.4 shows the profiles of the bias current delivered to an ERSFQ FJTL branch ($I_{FJTL}$) and to an ERSFQ logic circuit branch ($I_{CKT}$) with different FJTL sizes when the externally supplied bias current is varied. For the results shown in Fig. 4.4, both the FJTL and the circuit branches are designed to receive a bias current of 350 µA. It can be seen that $I_{CKT}$ stays at the designed value for $I_{SB}$ values less than the designed value when an FJTL is present. The larger the size of the FJTL, the lower the level of $I_{SB}$ down to which $I_{CKT}$ stays at the designed value. Simultaneously, $I_{FJTL}$ starts decreasing from the designed value once $I_{SB}$ is lower than the designed value ($I_{SBD}$) and reaches a constant value with further reduction in $I_{SB}$. However, the rate of descent of $I_{FJTL}$ is smaller for larger FJTL sizes.

Figure 4.4: Profiles of delivered bias current to both the feeding JTL and the logic circuit branches when $I_{SB}$ is varied for an 8-bit SR circuit with FJTLs of different JJ count.

Fig. 4.5 shows the profiles of the bias bus voltage ($V_{bias}$) when $I_{SB}$ is varied for different FJTL sizes. The feeding clock has a time period of 250 ps (a frequency of 4 GHz) and the SR circuit is clocked at a frequency of 2 GHz. The larger the FJTL size, the lower the value of externally supplied bias current ($I_{SB}$) down to which the bias bus voltage remains at the desired (expected) value. A second step can be observed at $4.14 \times 10^{-6}$ V; it is due to the SR operating clock frequency. These observations are analyzed for different effects on ERSFQ dynamics in the following sections.

Figure 4.5: Profiles of the bias bus voltage when $I_{SB}$ is varied for an 8-bit SR circuit with FJTLs of different JJ count.

4.2.3 Operational Bias Margins

The operational bias margins for both the 8-bit SR and 8-bit MUX circuits are given in Fig. 4.6. It can be observed that the MUX circuit has higher margins when compared
with the SR circuit for $I_{SB}$ values greater than the designed value. MUX circuits seem to be less sensitive to the presence of the FJTL and to the size of the FJTL in terms of change in margins. Note that the lower (higher) the value of bias supply at which the circuit still functions, the better the bias margins of the circuit are on the lower (higher) side.

Figure 4.6: Operational bias margins for SR and MUX circuits for different FJTL sizes. The number in the label beside each bar represents the count of JJs in the FJTL.

Role of feeding JTL

Since the recommended frequency of the feeding clock ($f_{FJTL}$) is greater than (or equal to) the maximum frequency of operation of the logic circuit ($f_{CKT}$), the voltage at the bias terminals of the feeding JTL (and subsequently on the bias bus) will be larger than the voltage at any point in the logic circuit. Because of this voltage difference, current will always flow towards the logic circuit blocks until the bias-limiting junctions in the branches switch due to a current greater than their critical current, which leads to the balancing of voltage in the circuit. Because of this greater voltage at the feeding JTL, even when $I_{SB}$ is less than the designed current, as long as the FJTL functions, the FJTL gives up its bias current to provide the designed bias to the logic circuit. When $I_{SB}$ is low enough to stop the FJTL from functioning, the bias current to the FJTL branch becomes constant and the bias current supplied to the logic circuit starts reducing. Hence, the larger the FJTL size, the better the bias margins on the lower side, as there are more FJTL segments to sacrifice current for the logic circuit.

Failure of the circuit

When $I_{SB}$ is further reduced after the current to the FJTL branch becomes constant, the bias supplied to the logic circuit starts reducing and the circuit fails once this supplied current is lower than the margins of any individual logic gate.
In the case of synchronous circuits, the circuit may fail soon after the bias current becomes lower than the designed current, since a lower bias leads to a larger delay and the circuit may fail with timing violations. For both synchronous and asynchronous circuits, the circuit is considered to have failed only when its operation does not produce the intended logical result according to the design, without regard to the delays of the output pulses at any stage.

Synchronous vs. Asynchronous circuits

In Fig. 4.6, the MUX margins on the higher side of bias are considerably larger than those of the SR circuit. Since the MUX is an asynchronous circuit, the delays of the cells need not be synchronized with a clock signal and hence, despite the change in propagation delay of the gates due to a higher bias, the circuit did not fail until 140% of the designed $I_{SB}$. In the case of the SR circuit, however, the synchronization of data and clock signals is disturbed as the cell delays are affected by the change in bias. Synchronization is also affected by the increased switching of bias junctions due to the increase in bias current, as it interferes with the clocking of the circuit. Similarly, the lower side of the bias margins of a MUX circuit is not as sensitive to the size of the FJTL as that of an SR circuit, for the same reasons as explained for the higher side of the bias margins. The larger the size of the feeding JTL, the lower the value of $I_{SB}$ down to which the logic circuit keeps receiving the designed value of bias current.

Figure 4.7: Minimum bias current at which SR and MUX circuits can still operate for different values of $f_{FJTL}$ with $f_{CKT}$ = 2 GHz. Best bias margins correspond to the minimum bias current supply.
4.2.4 Effect of Feeding Clock Frequency

The feeding clock (FJTL) frequency ($f_{FJTL}$) is varied for both the MUX and the SR circuits to find the effect of $f_{FJTL}$ on ERSFQ circuits by measuring the minimum $I_{SB}$ supply (as a percentage of the designed value) at which the circuit operates. For the SR circuit (synchronous), the clock frequency ($f_{CKT}$) is set to 2 GHz in simulation; the best bias margins are attained when $f_{FJTL} \geq f_{CKT}$, and the bias margins worsen for $f_{FJTL} < f_{CKT}$. There is no advantage in increasing $f_{FJTL}$ above $f_{CKT}$, since it does not improve the margins and leads to additional dynamic power dissipation due to a higher number of bias junction flips in the logic circuit when $f_{FJTL} > f_{CKT}$. A similar criterion applies to asynchronous circuits, but they do not have an operating clock frequency ($f_{CKT}$). Depending on the circuit operation, signals may arrive very close to each other at some section of the circuit. Hence, an $f_{FJTL}$ as large as possible would be better for asynchronous circuits, as it can be seen that the bias margins improve as $f_{FJTL}$ increases. Minimum bias supply values (corresponding to maximum bias margins) for different $f_{FJTL}$ values are presented in Fig. 4.7.

Figure 4.8: Minimum bias current at which SR circuit can still operate for different values of $f_{CKT}$ with $f_{FJTL}$ = 3.33 GHz.

4.2.5 Effect of Circuit Operation Frequency

The clock frequency of the SR circuit is varied to check whether it affects the ERSFQ circuit. Fig. 4.8 shows the minimum $I_{SB}$ supply value at which the SR circuit operates for different $f_{CKT}$ values when $f_{FJTL}$ is fixed at 3.33 GHz. The circuit has the best bias margins (minimum bias supply) when $f_{FJTL}$ is equal to the circuit clock frequency, and the margins become slightly worse when $f_{CKT}$ decreases. When $f_{CKT} = f_{FJTL}$, there is no switching in the bias JJs of the logic circuit, and the circuit achieves the best margins.
When $f_{CKT} < f_{FJTL}$, there is switching in the bias JJs, and it interferes with circuit operation and/or clock propagation, which causes slightly lower margins than in the earlier case. It is already seen in Section 4.2.4 that the margins worsen when $f_{CKT} > f_{FJTL}$.

4.3 Conclusion

Energy-efficient rapid single flux quantum (ERSFQ) circuits have become a viable alternative for the implementation of superconducting circuits because of the large static power consumption of RSFQ circuits. ERSFQ circuits are built upon the popular RSFQ logic circuits by replacing the power-dissipating resistor bias network with a bias network consisting of active devices. A simulation study of the ERSFQ biasing scheme is carried out by building simulation test benches for both synchronous (shift register) and asynchronous (multiplexer) circuits with different feeding JTL sizes. The dynamics of the biasing scheme are analyzed, and recommendations are worked out for the design of ERSFQ biasing in terms of the biasing inductance, the influence of the feeding Josephson transmission line (FJTL) and the effect of its size, the effect of the feeding clock frequency, and the effect of the circuit operating frequency.

Chapter 5

Reconfigurable SFQ Circuits: Superconducting Magnetic Field Programmable Gate Array

Field-programmable gate arrays (FPGAs) are among the most successful circuits in the semiconductor industry [63]. They are pre-fabricated CMOS circuits which can be electrically programmed in the field to become any circuit or system, as per the requirement of the user. Typically, an FPGA is a cheaper and faster solution than an application-specific integrated circuit (ASIC), especially for new circuit designs in the research and development phase [64]. Recently, a cryogenically cooled CMOS FPGA was used to implement a classical controller for quantum computing processors [65, 66], despite dissipating a significant amount of power.
Circuit energy efficiency is a priority for quantum computing applications requiring the cryogenic placement of FPGAs. Clearly, a superconducting energy-efficient FPGA would be an attractive option.

5.1 Previous Work

The first superconducting FPGA based on RSFQ logic was proposed in 2007 [67]. It relied on switches based on a derivative of a non-destructive readout (NDRO) circuit controlled by DC bias to program the routing and the lookup tables (LUTs) used for a logic block in the FPGA fabric. As a result, the switches occupied 65% of the total chip area. It also proposed the use of transformer coupling to control the switches, which at a large scale can potentially cause yield and crosstalk issues. Recently, another superconducting FPGA was proposed [68] based on reciprocal quantum logic (RQL) [59] and switchable phase shifters based on magnetic Josephson junctions embedded into DC SQUIDs. Although a complete operation or a detailed FPGA design was not elaborated, the use of SQUID-based switches and the combination of voltage-state (multi-SFQ) and SFQ signal regimes would make it challenging for a future implementation of such an FPGA to achieve high circuit density and energy efficiency.

5.1.1 Proposed work

In this chapter, a new and complete SFQ FPGA design describing all the necessary circuit blocks is presented. It is based on energy-efficient ERSFQ logic [56] with programmable DC biasing controlled by magnetic Josephson junctions (MJJs). This new approach allows us to avoid the use of SQUID- and NDRO-based switches and to achieve a much higher area efficiency. In the MIT-LL process, the typical area of an NDRO gate combined with a single JTL stage at the input and output pins (I/O JTL) is 40 × 60 µm². In contrast, the typical area of an MJJ is 2 × 2 µm² and, combined with its associated bias lines, the total area is 3 × 3 µm². Similarly, the bias current required for the operation of an NDRO-based switch is not less than 1500 µA.
In contrast, an MJJ-based switch can be implemented as part of an I/O JTL without any additional bias current. We propose two types of configurable logic blocks (CLBs), one for a LUT-based architecture and one for an architecture based on specific SFQ functions. To demonstrate the advantages of implementing an FPGA with MJJ-based switches rather than NDRO-based switches, our work on CLBs with NDRO-based switches is presented first and is then modified into CLBs with MJJ-based switches. We have also worked out SFQ FPGA designs that can operate both synchronously and asynchronously.

5.2 Basics of Field Programmable Gate Array

Several CMOS FPGA architectures are commercially available in the semiconductor industry from companies such as Xilinx [69] and Altera [70]. These companies use different FPGA architectures. However, all of these architectures contain (i) configurable logic blocks to implement the desired logic functions; (ii) a programmable routing structure that connects the CLBs according to the functionality of the circuit implemented on the FPGA; and (iii) input/output (I/O) blocks to make off-chip connections to the CLBs through the routing network. Based on the global arrangement of the routing structure, FPGA architectures can be classified as either island-style or hierarchical [64]. Our SFQ FPGA fabric is based on the island-style architecture, in which CLBs appear as islands in a sea of interconnects. In this architecture, CLBs are arranged in a two-dimensional grid formed by the routing network, which comprises interconnects organized as horizontal and vertical routing channels (or tracks) with programmable switches that make connections among CLBs and between I/O blocks and CLBs. Note that both island-style and hierarchical routing architectures could have been explored for our proposed SFQ FPGA. For this work, however, the focus is only on developing all the FPGA subcircuits and the fabric for the island-style architecture.
We use the following terminology for the three blocks that use programmable switches in the routing channels:

1. Switch box.
2. Horizontal connection block.
3. Vertical connection block.

The adaptation of the island-style FPGA architecture proposed in this chapter is shown in Fig. 5.1.

5.2.1 Overview of SFQ FPGA implementation

An SFQ FPGA cannot be directly derived from or implemented like its CMOS counterpart. None of the SFQ family technologies offers the major benefits of MOSFET switches and bidirectional wires, which makes programmable routing, and hence the implementation of an SFQ FPGA, difficult. SFQ connections are inherently unidirectional, and a three-terminal switch like a MOSFET for easy programming of routing channels is not yet available in SFQ technology, though considerable work is being done in that direction [71, 72].

Figure 5.1: Island-style architecture adaptation of SFQ FPGA with unidirectional and bidirectional data flow in the horizontal and vertical directions, respectively. A Configurable Logic Block (CLB) gets its inputs from the routing network through the Vertical Connection Block (VCB), and its outputs are carried to the routing network through the Horizontal Connection Block (HCB). I/P: Input, O/P: Output, I/O: Input/Output.

Because of the unidirectional nature and the cost of the routing network, horizontal data flow is in one direction only, from left to right, in our implementation of the SFQ FPGA. Vertically, however, two separate lines are employed, up (bottom to top) and down (top to bottom), for bidirectional data flow. Because of the timing requirements of clocking in gate-level pipelining, routing signals in both directions on horizontal tracks can become very difficult and would be expensive in terms of area and delay. Hence, bidirectional tracks are not implemented in the horizontal direction.
Thus, the input ports are located on the left side of the FPGA block, the output ports are located on the right side, and both input and output ports are located on the top and bottom sides. For the reasons mentioned above, the CMOS FPGA configurations of the switch box and the connection blocks cannot be used directly to implement the programmable routing in an SFQ FPGA. We have modified the Wilton switch box topology [73] in a way that is SFQ-specific and scalable to a larger number of routing channels. Our horizontal and vertical connection blocks serve dedicated functions in terms of routing and interconnection. These programmable routing blocks contain MJJs that are used as bias-limiting junctions in ERSFQ biasing to control the bias current delivered to the circuit components that implement a programmable switch. This leads to a more compact design than earlier switch implementations based on NDRO cells, in which the programmable switches consume a larger area than the other resources required for the FPGA implementation. In the rest of the chapter, unless mentioned otherwise, all logic cells are assumed to be clocked cells and the circuit (or FPGA) is assumed to operate synchronously.

Figure 5.2: (a) Program and store (PS) block implementation with NDROs and DFFs. A PS unit is shown in a dashed red rectangle, and a PS block is formed by serially connecting PS units. S2 represents a 1-to-2 splitter. Functional waveforms in Verilog HDL simulation: (b) Signals during programming mode: writing 0 1 0 1 (for PS units at positions 0 1 2 3). (c) Signals during reading mode. PS units at positions 0 and 2 do not produce an output pulse for the respective Read input.
5.3 Design and Details of SFQ FPGA Fabric

5.3.1 NDRO-based Configurable Logic Block (CLB)

Program and Store Block

Many commercially available CMOS FPGAs use static memory (SRAM) cells to program and store the lookup tables (LUTs) of the desired gates in the CLBs of the FPGA fabric. The Program and Store (PS) block is one of the building blocks of our NDRO-based CLB implementation. It programs and stores the data that configure a CLB into the desired gate, and its usage is explained in the following subsections. For SFQ technologies, SRAMs can be replaced by NDRO cells, though these cells cannot be programmed and used in the same way as SRAM cells. We propose a scan chain structure for NDROs, as illustrated in Fig. 5.2(a), to program them serially. The scan chain structure is used because of its built-in support for bit-serial programming. Parallel loading of the data to be programmed into the storage elements (NDROs) is not possible because of the limited number of I/O pins. Hence, a scan chain structure is used to load data serially into all the NDROs of the circuit. The scan chain mechanism is popular in the testing of CMOS circuits.

A scan chain is formed by serially connecting multiple PS units. A single unit is shown with a dashed rectangle in Fig. 5.2(a). Data is given serially at the input Data_in, one bit per programming clock pulse, Prg_clk. Each PS unit has a Prg_clk input, and the programming clock pulse can reach all units in a block either at the same time or consecutively, beginning from the first cell, i.e., the one whose Data_in input receives the serial data to be stored in the block. If the programming clock pulse does not arrive at all PS units at the same time, the difference between its arrival times at two consecutive cells cannot exceed the period with which data is given serially at Data_in. When the input data arrives, the input (either 1 or 0) is stored in a PS unit’s DFF.
Each (input) pulse at Prg_clk first resets the respective PS unit’s NDRO and then clocks the DFF to release the data value stored in it. The released value is then stored in the NDRO and is also passed to the next PS unit, which receives it together with the next Prg_clk pulse (if present). The serial input data given at Data_in therefore moves down by one PS unit with every programming clock pulse. The scan chain shown in Fig. 5.2(a) can be programmed with four Prg_clk pulses, with the value for the bottom-most PS unit applied to Data_in first and the value for the top-most PS unit applied last. The stored values can be read upon the arrival of the respective PS unit’s Read input. Hence, this PS block has two modes: programming mode and reading mode.

The Data_out pin of a PS block is connected to the Data_in pin of the next PS block in the FPGA fabric, so all PS blocks in the fabric are serially connected (forming one large PS block). Hence, all PS units in the fabric can be programmed by presenting the data to be programmed at the first PS unit’s Data_in pin as a serial bit stream. Programming clock pulses must be given to all PS units along with the input bit stream; their number should equal the number of PS units in the fabric, and their frequency should match that of the input bit stream. Programming new data into a PS block automatically erases the old data stored in it. Using this PS block, we have designed two CLBs, which are presented in the following subsections. These two types of CLBs are later modified by replacing the NDRO-based PS block with magnetic switches, as presented in a later section.

Figure 5.3: Implementation of the LUT-based CLB for a 2-input gate using a decoder with DFFCs, a PS block with NDRO-based switches, and a 4-to-1 merger. DFFC: D-flipflop with complementary outputs.

LUT-Based CLB

Fig. 5.3 shows our implementation of the lookup-table-based CLB unit for a 2-input gate. A four-PS unit block as shown in Fig.
5.2(a) is used to store the four output values for all four combinations of the two inputs of any 2-input gate. Initially, the PS block in a CLB has to be programmed, in programming mode, with the truth table of the gate to be implemented. The truth table stored in a PS block is held until the block enters programming mode again, i.e., until the next programming clock pulse arrives at the PS block. Once the programming of the PS block is finished, it is operated in reading mode. The circuitry to the left of the PS block implements an SFQ decoder, which issues only one of the four Read signals based on the inputs to the CLB. The Read signal then reads the corresponding value stored in the PS block to produce the output of the CLB. A 4-to-1 merger is used at the end to merge all four outputs of the PS block and collect the output signal at one node. The merger produces an output pulse for an incoming pulse at any of its inputs. Since at most one output can come out of the PS block per clock cycle, no two signals are ever merged; only one of the four Out signals is presented at the CLB output.

Function Selection (FS) Based CLB

An FS-based CLB consists of an actual implementation of logic gates instead of LUTs. In CMOS, this kind of CLB implementation is undesirable. However, the comparable cost of implementation and the relatively small size of an SFQ cell library make this implementation equally desirable for SFQ. Fig. 5.4 shows a (single-PS block) CLB implemented with four 2-input gates. One or more of them can be 1-input gates (e.g., an inverter or a D-flipflop). Each NDRO output of the PS block clocks one of the four gates in the CLB, and the PS block is programmed such that only the gate to be implemented has its NDRO set. Inputs A and B reach all four gates, but only the gate being implemented is clocked, and hence only that gate’s result is received at the CLB output.
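The programming and readout behavior of the NDRO-based PS block and the LUT-based CLB described above can be sketched with a small behavioral model. This is a minimal Python sketch, not the Verilog model used in this work. The names `PSBlock` and `lut_clb` are hypothetical, all PS units are assumed to receive Prg_clk at the same time, and the mapping of the inputs (A, B) to the Read position 2A + B is an assumption about the decoder ordering.

```python
class PSBlock:
    """Behavioral model of a scan chain of PS units (one DFF and one NDRO each)."""

    def __init__(self, n_units=4):
        self.dff = [0] * n_units    # bit waiting in each unit's DFF
        self.ndro = [0] * n_units   # bit held in each unit's NDRO

    def program_step(self, data_in):
        """Apply one bit at Data_in, followed by one Prg_clk pulse to all units."""
        self.dff[0] = data_in              # the incoming bit is stored in the first DFF
        released = list(self.dff)          # Prg_clk releases the bit held in every DFF
        self.ndro = list(released)         # each NDRO is reset, then stores the released bit
        self.dff = [0] + released[:-1]     # the released bit also moves to the next unit
        return released[-1]                # Data_out, which would feed the next PS block

    def program(self, values):
        """values[i] is the bit intended for the PS unit at position i (0 = top)."""
        for bit in reversed(values):       # the bottom-most unit's bit is sent first
            self.program_step(bit)

    def read(self, position):
        """A Read pulse produces an output pulse (1) only if that NDRO is set."""
        return self.ndro[position]


def lut_clb(ps_block, a, b):
    """LUT-based CLB: the decoder selects one Read line, the merger collects the output."""
    read_index = 2 * a + b                 # assumed decoder ordering (hypothetical)
    return ps_block.read(read_index)


# Programming pattern from Fig. 5.2(b): 0 1 0 1 for positions 0 1 2 3.
ps = PSBlock(4)
ps.program([0, 1, 0, 1])
print(ps.ndro)

# A CLB configured as XOR by storing its truth table.
xor_clb = PSBlock(4)
xor_clb.program([0, 1, 1, 0])
print([lut_clb(xor_clb, a, b) for a in (0, 1) for b in (0, 1)])
```

Programming the same block again with four new bits shifts the old contents out of the chain, which mirrors the statement that new data automatically erases the old data.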
Since inputs A and B reach all four gates in the CLB but only the implemented gate is clocked, the other three gates are not reset, implying that these gates must be reset if they are to be used later. To reset the CLB, all NDROs in the PS block are set and the CLB is then clocked once. To avoid this resetting before reprogramming, a triple-PS block CLB can be used, with two additional PS blocks that select the gate to which inputs A and B are delivered. However, this increases the cost of implementation two-fold. Note that the CLB needs to be reset only when the whole FPGA fabric is being reprogrammed to implement a different circuit.

Figure 5.4: Implementation of the FS-based CLB for four 2-input SFQ gates using a PS block with NDRO-based switches, an actual implementation of gates, and a 4-to-1 merger.

5.3.2 Programmable Routing

Programmable Switch Implementation

Our approach is based on the ability to program the critical current ($I_c$) of an MJJ by manipulating the magnetization of its ferromagnetic layers using a magnetic field or, eventually, spin-torque transfer. The MJJ is used in place of a DC bias-limiting junction in ERSFQ biasing. This allows the use of a single MJJ instead of bulky SQUIDs and SFQ gates (e.g., NDRO) to perform FPGA programming. Note that an MJJ is typically much smaller than a typical SQUID or SFQ gate. In principle, any type of MJJ exhibiting modulation of critical current [74, 75, 76, 77, 78, 79, 80] can be used as the programmable bias-current-limiting junction. However, we consider a SIsFS-type MJJ [77, 78, 79] preferable for several reasons: (i) simpler and higher-yield fabrication due to a simpler structure with a single ferromagnetic layer and somewhat larger dimensions (2 μm × 2 μm); (ii) an acceptable bias current flowing through the MJJ providing the necessary reference self-field; and (iii) a higher $I_c R_n$ compatible with that of regular JJs used in SFQ circuits. The SFT-based MJJ [80], due to its high $I_c R_n$, would also work as a programmable current-limiting junction in ERSFQ biasing for implementing switches.

Figure 5.5: Switch implementation with an MJJ as the limiting junction in ERSFQ biasing. (a) Circuit schematic and representational symbol for the MJJ-based switch. $I_{c0} = 100$ μA; $I_{c1} = I_{c2} = I_{c3} = 200$ μA; $L_1 = L_2 = L_3 = 4$ pH. (b) Circuit simulation: switch output Q when the $I_c$ of $MJ_0$ is 150 μA and 250 μA, showing the blocking and the passage of the input pulse, respectively.

Figure 5.6: (a) Switch box implementation. Inputs and outputs are represented by red and green labels, respectively. Dashed connection lines represent the programming of MJJ switches to let the pulse pass through them. (b-e) Representational figures: (b) 3-signal merger. (c) 2-signal merger. (d) 3-way splitter (S3) with switches attached at the outputs. (e) 2-way splitter (S2) with switches. (f) Functional waveforms of the Verilog HDL simulation of the switch box for the programmed switches shown in (a) with dashed connection lines.

Figure 5.7: The circuit implementation of a 2-way splitter with MJJ-based switches (Fig. 5.6(e)) used in FPGA subcircuits. BJ refers to a regular JJ used as a bias-limiting junction in ERSFQ biasing that does not require programming. MJ refers to a magnetic JJ with programmable $I_c$ that is used in the switch implementation.

Fig. 5.5(a) shows the implementation of a programmable switch with an MJJ used in ERSFQ biasing. Simulations (Fig. 5.5(b)) show that the incoming SFQ pulse passes from the input to Q only when the $I_c$ of the MJJ bias junction ($MJ_0$) is 250 μA (high). When the $I_c$ is 150 μA (low), the pulse does not pass because the bias current delivered is insufficient to make $J_2$ switch (undergo a $2\pi$ phase slip) upon the arrival of an incoming pulse. In this case, $J_0$ switches.
As one can see, the programmable switch is implemented using a very simple, robust, and compact circuit, which is essentially a variant of a Josephson transmission line (JTL) stage.

Switch Box

In a general CMOS FPGA, the same fixed number of metal tracks run horizontally and vertically, organized in channels. A programmable switch box is placed at each intersection of horizontal and vertical routing channels. In our FPGA fabric implementation, because of the proposed unidirectional data flow in the horizontal direction, we use two (or possibly more) horizontal tracks going from left to right and four vertical tracks, two each running in the up and down directions. We have modified the Wilton switch box topology for our switch box implementation to fit the unidirectional data flow in the horizontal direction and to account for the difference in the number of tracks between horizontal and vertical channels. It is presented in Fig. 5.6(a) and comprises splitters combined with the aforementioned programmable switches, and mergers. A 1-to-3 splitter is used for a signal coming from the left in order to transfer it to the top, right, or bottom. MJJ-based switches attached to the splitter outputs control the direction in which the signal is transferred. Similarly, a 1-to-2 splitter with switches is used for a signal coming from either the top or the bottom. The bias MJJs of the switches attached to these splitters are programmed so that the signals are routed according to the circuit being implemented on the FPGA. Fig. 5.7 shows the schematic of Fig. 5.6(e), which is represented as a dotted rectangle in the switch box architecture of Fig. 5.6(a). A 3-to-1 merger (2-to-1 merger) is used to merge the signals coming from the other three (two) directions on the right side (top and bottom).

Figure 5.8: Connection blocks (CB). (a) Vertical CB. (b) Horizontal CB.
Note that the programming of the MJJ-based switches, which follows the routing of signals, ensures that no more than one input signal is active for any merger.

Connection Blocks

In our SFQ FPGA implementation, the horizontal connection block (HCB) and the vertical connection block (VCB) connect the CLBs with the routing channels and are part of the programmable routing. The HCB and VCB have separate, dedicated functions. Inputs are taken from the routing network to the CLBs through VCBs, and the outputs of CLBs are taken to the routing network through HCBs. Their implementation is shown in Fig. 5.8. In the VCB, a signal from each vertical channel is split (with a switch at each output to control its destination), and one split output from each vertical channel is merged to drive one of the CLB inputs. Similarly, in the HCB, an output from a CLB is split (with switches to control the destinations) and then merged into each of the horizontal channels.

5.3.3 Magnetic CLB

Section 5.3.1 explained two kinds of CLBs in detail. However, those CLBs are implemented with NDROs, which consume a significantly large area and require extra steps for programming. We presented the NDRO-based CLBs first in order to explain our prior work and to illustrate the advantages and savings that come with the use of MJJ-based switches.

For an LUT-based CLB with magnetic switches (MJJ-based switches), the PS block in the CLB (Fig. 5.3) can be replaced with four instances of the MJJ switch (shown in Fig. 5.5(a)), each of which either transfers or blocks the signal from one of the four Read locations to the respective Out location (Fig. 5.9(a)).

Figure 5.9: MJJ-based magnetic CLBs: (a) LUT-based. (b) FS-based (triple-switch). (c) S4sw block: representation of a 4-way splitter with switches.
These four MJJs are programmed with critical currents that reflect the truth table of the gate to be implemented. For example, to implement an AND gate, the MJJs in the top three switches are programmed to have a low critical current (150 μA) and the MJJ of the last switch is programmed to have a high critical current (250 μA). With this programming, the decoded signal passes through the switch and produces a pulse at the CLB output only when both inputs arrive.

For an FS-based CLB with magnetic switches (MJJ-based switches), the PS block in Fig. 5.4 is replaced by a 1-to-4 splitter with switches attached to the splitter outputs, similar to the ones shown in Fig. 5.6(d) and Fig. 5.6(e). Only one of the four MJJs belonging to the four splitter outputs (S4sw block shown in Fig. 5.9(c)) is programmed to have a high critical current, and this splitter output clocks the gate to be used out of the four gates in the FS-based CLB. Because of this programming, although the inputs reach all the gates, only one gate is clocked, and it subsequently produces the output (which depends on the internal state of that gate as set by the inputs). After the replacement of NDRO-based switches with MJJ-based switches, we call the triple-PS block and single-PS block CLBs the triple-switch block and single-switch block CLBs, respectively. A comparison of the JJ count between NDRO-based CLBs and MJJ-based CLBs is shown in Table 5.1. Note that the bias JJs are the regular JJs used in the ERSFQ biasing scheme (e.g., $BJ_1$ in Fig. 5.7), and MJJs replace these regular biasing JJs wherever programming is required (e.g., $MJ_1$ in Fig. 5.7).

Table 5.1: Comparison of JJ count for CLBs

| CLB type | Switch type | Logic JJ count | Bias JJ count | MJJ count |
|---|---|---|---|---|
| LUT | NDRO | 137 | 33 | 0 |
| LUT | MJJ | 64 | 14 | 4 |
| FS, single-PS | NDRO | 156 | 38 | 0 |
| FS, single-switch | MJJ | 86 | 17 | 4 |
| FS, triple-PS | NDRO | 316 | 78 | 0 |
| FS, triple-switch | MJJ | 106 | 17 | 12 |
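The way the programmed critical currents encode a gate can be illustrated with a short behavioral sketch. This is a minimal Python model under stated assumptions, not the circuit-level or Verilog model used in this work. All function names are hypothetical, only the two programmed states (150 μA and 250 μA) are modeled, the Read position 2A + B is an assumed decoder ordering, and the gate list passed to the FS-based CLB is purely illustrative.

```python
IC_LOW_UA = 150    # low critical current: the switch blocks the pulse
IC_HIGH_UA = 250   # high critical current: the switch passes the pulse


def switch_passes(ic_ua):
    """Return True if an MJJ switch programmed to ic_ua passes an SFQ pulse."""
    if ic_ua == IC_HIGH_UA:
        return True
    if ic_ua == IC_LOW_UA:
        return False
    raise ValueError("only the two programmed I_c states are modeled")


def program_lut(truth_table):
    """Map a 4-entry truth table to the I_c values of the four MJJ switches."""
    return [IC_HIGH_UA if bit else IC_LOW_UA for bit in truth_table]


def magnetic_lut_clb(mjj_ic, a, b):
    """LUT-based magnetic CLB: the decoded Read pulse passes only a high-I_c switch."""
    read_index = 2 * a + b                     # assumed decoder ordering (hypothetical)
    return int(switch_passes(mjj_ic[read_index]))


def magnetic_fs_clb(mjj_ic, gates, a, b):
    """Single-switch FS-based CLB: only the gate whose clock switch is high is clocked."""
    clocked = [gate for gate, ic in zip(gates, mjj_ic) if switch_passes(ic)]
    if len(clocked) != 1:
        raise ValueError("exactly one MJJ should be programmed high")
    return clocked[0](a, b)


# AND gate: the top three switches are low, the last switch is high.
and_ic = program_lut([0, 0, 0, 1])
print(and_ic)
print([magnetic_lut_clb(and_ic, a, b) for a in (0, 1) for b in (0, 1)])

# FS-based CLB with an illustrative gate set; the third switch selects XOR.
gates = [lambda a, b: a & b, lambda a, b: a | b, lambda a, b: a ^ b, lambda a, b: a]
fs_ic = [IC_LOW_UA, IC_LOW_UA, IC_HIGH_UA, IC_LOW_UA]
print([magnetic_fs_clb(fs_ic, gates, a, b) for a in (0, 1) for b in (0, 1)])
```

The sketch treats each switch as an ideal pass/block element; it does not capture timing, bias redistribution, or the internal state of unclocked gates.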
5.3.4 Switch Programming

Fig. 5.10 describes our approach to FPGA programming: setting the bias MJJ of each MJJ-based ERSFQ switch to a high or low $I_c$ value. The MJJ limits the DC bias current delivered to the corresponding switch from a common power plane, depending on the value of its $I_c$. The $I_c$ can be programmed by applying currents via vertical (VAL) and horizontal (HAL) access lines that are magnetically coupled to each MJJ at their intersection in the crossbar structure [81] formed by the access lines (Fig. 5.10(a)). According to our estimate, each FPGA mosaic unit may require a maximum of 42 MJJs for a 2-input CLB (the maximum is reached for a mosaic with a two-input triple-switch FS-based CLB). One can arrange the FPGA programming layer as a matrix of blocks with 7 × 7 access lines, as shown in Fig. 5.10(b). Programming decoders can set the programming currents for each MJJ, as shown in Fig. 5.10(c). These decoders can be SFQ-based (e.g., [82, 83]) and located on the periphery of the FPGA fabric. The HALs and VALs are connected to the programming decoders through output current drivers. From a room-temperature (RT) controller, one can send the MJJ address and the programming signal (1/0), i.e., N address bits plus one programming bit that sets the MJJ to either the high or the low $I_c$ value. These bits can be sent in parallel through N + 1 lines or in series via a single line to an on-chip serial-to-parallel converter. The serial operation takes longer but requires the minimum number of lines. In general, programming speed is not a priority. This approach is also readily scalable, as the on-chip programming is done by the minimalistic MJJ crossbar wiring and the RT connection is minimized by on-chip periphery decoders and serial-to-parallel converters. The typical programming time for an MJJ is from 100 ps to 1 ns and depends on the programming current value (the currents through the VAL and HAL).

Since MJJs are typically fabricated using separate process steps compared to conventional SFQ JJs, the whole FPGA programming layer, including the power plane, programmable MJJs, and access lines, can be implemented separately from the FPGA logic and later connected with the rest of the SFQ circuit implementation. As a result of this vertical integration, the area overhead of the programming layer is minimized. A brief summary of the comparison between NDRO-based and MJJ-based switches is presented in Table 5.2.

Table 5.2: NDRO-based switches vs. MJJ-based switches

| Switch type | NDRO-based | MJJ-based |
|---|---|---|
| Active devices | Regular JJs | Regular and magnetic JJs |
| Implementation | Bulky SFQ cells (e.g., NDRO) | Part of biasing and I/O JTL, no additional cells |
| Area comparison | Larger | Smaller |
| Delay comparison | Larger | Smaller |
| Power comparison | Larger | Smaller |
| Programming method | Serial programming of NDROs in a scan chain structure | Magnetic coupling of MJJs with current lines in a crossbar structure |
| Additional circuitry for programming | Consumes a larger regular JJ chip space | Most of it is implemented on a separate layer |
| Fabrication process | Single-layer SIS JJ process | Both SIS and MJJ processes, preferably a double JJ-layer process |

Figure 5.10: Programming layer for MJJs on chip with current lines (access lines (AL)). (a) Programming unit of an MJJ. HAL: horizontal AL; VAL: vertical AL. (b) MJJs are located near the intersections of the crossbar formed by the HALs and VALs used for programming MJJs. (c) Using external decoders to access specific MJJs out of all MJJs belonging to the FPGA fabric.

Figure 5.11: Clock pulse distribution to synchronous CLBs in SFQ FPGA.

5.3.5 SFQ FPGA Operation

SFQ circuits (especially RSFQ, which is widely implemented) are operated in two well-known ways: synchronously and with asynchronous wave-pipelining.
Synchronous operation: each logic cell in the circuit requires a clock pulse to operate, and the implemented circuit determines a minimum clock period for proper operation. Several ways of distributing the clock pulse to every cell in a circuit are described in [11]. An SFQ FPGA fabric containing either LUT-based or FS-based CLBs supports synchronous operation of the FPGA. After all switches in an FPGA fabric have been programmed, each CLB represents a specific gate in the implemented circuit, and only a single clock pulse is required per operation of that gate. A straightforward way of distributing the clock to CLBs for synchronous operation is to use splitters and JTLs to form an H-tree, resulting in a zero-skew clocking scheme. Here, we present another way of distributing the clock to the CLBs, which is a variant of the clock-follow-data [11] clocking scheme and is shown in Fig. 5.11. A self-clocked DFF cell is made by feeding its data input to its clock input through a delay. The output of this self-clocked DFF cell is fed to the clock inputs of all CLBs in a column. The clock input of the last CLB in the column is fed to a self-clocked DFF cell, which in turn distributes the clock to the CLBs in the next column. Multiple self-clocked DFF cells can be used to distribute the clock to CLBs in separate sections of a column, depending on the total number of CLBs in the column. The delay element used in the self-clocked DFF cells can be engineered according to the actual implementation of the FPGA fabric so that the circuit operation matches the delays of routed signals between CLBs. The clock-follow-data scheme requires all cells of level i to be clocked and the input data to be prepared for the next level before any cell of level i+1 is clocked [10]. To implement this scheme, CLB columns are partitioned into groups designated for cells belonging to a specific level.
For example, column 1 belongs to level 1 cells and column n belongs to level n cells. However, the number of cells belonging to a level of a circuit can be larger than the number of CLBs in a column of the FPGA fabric. In such a case, the smallest group of consecutive columns that is sufficient to implement the cells of a level is assigned to that level. Hence, consecutive groups of columns from left to right represent consecutive levels in a circuit, from level 1 to the maximum level of that circuit. When the cells belonging to a level take up more than one column of CLBs, the clock distribution between those columns need not go through the self-clocked DFF; it is bypassed with a connection between them using an MJJ-based switch.

5.4 SFQ FPGA fabric extensions

Two possible extensions of the SFQ FPGA presented above are to use the fabric for asynchronous wave-pipelining (AWP) and to modify the fabric for gates with more than two inputs (multiple-input gates) or for more than four gates.

5.4.1 SFQ FPGA for Asynchronous Wave-Pipelining

In AWP, some of the logic cells in the circuit do not require a clock signal to operate, and signals travel through the circuit asynchronously [84], subject to additional timing requirements. However, a ready pulse that follows the data is used, after a small period of time, to reset or clock some of the cells, either to make them ready for the next set of input signals or to evaluate the current state of the cell. Since some gates produce their output upon the arrival of the input signals alone, without requiring a clock signal, only FS-based CLBs implemented with the desired combination of asynchronous and clocked cells can be used for AWP operation of the FPGA. A comparison of FS-based and LUT-based CLBs is provided in Table 5.3. The FS-based CLB for asynchronous operation of the FPGA is shown in Fig. 5.9(b).
In this case, the splitters distributing the inputs to the gates and the splitter distributing the clock to the gates have switches at their outputs (triple-switch block CLB), and they are programmed accordingly. Note that all inputs, including the clock, are directed toward the gate to be implemented in the CLB by programming the MJJ-based switches in the S4sw blocks. A reset/clock signal, as required by a cell in the implemented circuit, can be distributed for AWP operation with the same mechanism as described in Section 5.3.5. Zero-skew clocking with an H-tree implementation cannot be used for AWP operation.

Table 5.3: FS-based CLBs vs. LUT-based CLBs

| CLB | FS-based | LUT-based |
|---|---|---|
| Can implement clocked gates | Yes | Yes |
| Can implement non-clocked gates | Yes | No |
| Synchronous operation of FPGA | Yes | Yes |
| Asynchronous wave-pipelining | Yes | No |
| Any SFQ gate can be implemented | Yes | No |
| Smaller area (JJ count) | No | Yes |

5.4.2 SFQ FPGA with Multiple-Input Gates

The SFQ fabric presented in the sections above has CLBs implementing 2-input gates and a routing network that can route signals only for a circuit implemented with 2-input gates. This fabric can be extended to multiple-input gates by modifying the CLBs to handle gates with more than 2 inputs and by increasing the number of routing tracks accordingly. A LUT-based CLB can be modified as follows: (i) implement a decoder that can decode the maximum number of inputs that a gate can have in the desired CLB implementation; (ii) attach an MJJ-based switch at every decoder output; (iii) build a merge block that merges all of these switch outputs to give the CLB output. An FS-based CLB can be modified as follows: (i) implement the desired gates for the CLB and implement splitters (with switches) for carrying the inputs (and clock) to all the eligible gates; (ii) implement a merger circuit to merge the outputs of all the gates in the CLB. The routing network must also be modified according to the number of inputs.
The number of horizontal tracks and the number of vertical tracks in each of the up and down directions should be increased to at least the maximum number of inputs that a gate can have in the desired CLB implementation. Consequently, the switch box and connection blocks should be upgraded to handle the increased number of tracks and CLB inputs. An estimate of the JJ count for larger CLBs (for synchronous operation) is given in Table 5.4. The estimate is based on the following observations. A LUT-based CLB with $n$ inputs should implement a LUT with $2^n$ entries (thus an $n$-to-$2^n$ decoder with $2^n$ MJJ switches) and use a $2^n$-to-1 merger. An FS-based CLB with $n$ gates should implement gates with $\log_2 n$ inputs, $\log_2 n$ 1-to-$n$ splitters, one of which has MJJ switches at its outputs, and an $n$-to-1 merger. For FS-based CLBs, the JJ count can be smaller than the number given in the table, since not all of the $n$ gates will have $\log_2 n$ inputs.

Figure 5.12: FPGA implementation example: (a) A circuit block of the 8-bit ALU that contains all building blocks and the signal path from the inputs to the output of the asynchronous wave-pipelined ALU in [84]. (b) Synthesized (with all clocked cells), placed, and routed ALU block on our proposed SFQ FPGA. The FPGA fabric grid is shown with dotted lines. Routing on the top-to-bottom vertical tracks, the bottom-to-top vertical tracks, and the horizontal tracks is shown with blue, gold, and black lines, respectively.
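The scaling observations stated above for larger CLBs can be written out as a small calculation. This is a minimal Python sketch with hypothetical function names. It reproduces only the structural quantities named in the text (LUT entries, MJJ switch count, merger size, gate inputs) and makes no attempt to derive the logic or bias JJ counts, which depend on the actual cell designs.

```python
def lut_clb_structure(n_inputs):
    """Structural sizes of a LUT-based CLB with n_inputs inputs."""
    entries = 2 ** n_inputs
    return {
        "lut_entries": entries,
        "decoder": f"{n_inputs}-to-{entries}",
        "mjj_switches": entries,           # one MJJ switch per decoder output
        "merger": f"{entries}-to-1",
    }


def fs_clb_structure(n_gates):
    """Structural sizes of a single-switch FS-based CLB with n_gates gates."""
    if n_gates < 2 or n_gates & (n_gates - 1):
        raise ValueError("n_gates is assumed to be a power of two")
    gate_inputs = n_gates.bit_length() - 1   # log2(n_gates)
    return {
        "gate_inputs": gate_inputs,
        "one_to_n_splitters": gate_inputs,   # count as stated in the text
        "mjj_switches": n_gates,             # one splitter carries MJJ switches at its outputs
        "merger": f"{n_gates}-to-1",
    }


for n in (2, 3, 4):
    print("LUT", n, lut_clb_structure(n))
for g in (4, 8, 16):
    print("FS", g, fs_clb_structure(g))
```

For 2-, 3-, and 4-input LUT-based CLBs and for FS-based CLBs with 4, 8, and 16 gates, the MJJ switch counts come out as 4, 8, and 16, which agrees with the MJJ column of Table 5.4.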
Table 5.4: JJ count estimation for LUT-based CLBs with multiple inputs and single-switch FS-based CLBs with a larger number of gates

CLB type                    JJ count (Logic)   JJ count (Bias)   MJJ count
LUT-based with 2 inputs     64                 14                4
LUT-based with 3 inputs     152                35                8
LUT-based with 4 inputs     322                76                16
FS-based with 4 gates       86                 17                4
FS-based with 8 gates       190                35                8
FS-based with 16 gates      422                72                16

5.5 Results

All the proposed circuit elements are designed and simulated in the WRSpice circuit simulator with ERSFQ biasing. All circuit JJs have a β_c value of 1. For the simulations, the typical high and low I_c values of the MJJs are chosen based on the switch circuit implementation. Because simulation models for MJJs are lacking, the critical currents are set manually in the circuit simulator to either the low value (150 µA) or the high value (250 µA). Verilog models have also been developed for all the FPGA subcircuits, such as the CLB, PS block, switch box, HCB, and VCB, for simulating the complete FPGA circuit. Circuit blocks related to the fabric extensions presented in section 5.4 are also modeled in Verilog. All simulations gave the expected results and verified the operation of the FPGA.

5.5.1 Implementation Estimations

Table 5.5 shows the number of JJs required for each subcircuit in the SFQ FPGA and for an FPGA mosaic consisting of a CLB, a switch box, an HCB and a VCB. An FPGA fabric will be made of several copies of this mosaic arranged symmetrically in an array. A few JTLs that are not accounted for in the junction count might be needed for interconnection. However, the area estimations given in the table account for any extra JTLs required to lay out the mosaic circuit properly.

Table 5.5: JJ count and area estimation of FPGA subcircuits

FPGA subcircuit   JJ count (Logic)   JJ count (Bias)   MJJ count   Area estimation (µm^2)
HCB               28                 8                 4           14400
VCB               70                 22                12          33600
Switch Box        82                 26                14          48400
CLB               106                17                12          56200
Total mosaic      286                73                42          152600
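As a quick cross-check of Table 5.5, the following Python sketch sums the per-subcircuit figures into the mosaic totals and scales them to a fabric of a given size. It is a first-order estimate only: the function names are hypothetical, the fabric is assumed to be exactly rows x cols copies of the mosaic, the area unit is assumed to be square micrometres, and interconnect JTLs and edge effects are ignored in the junction counts, as the text notes.

```python
# Per-subcircuit figures copied from Table 5.5:
# (logic JJs, bias JJs, MJJs, area estimate in um^2, unit assumed).
MOSAIC_SUBCIRCUITS = {
    "HCB": (28, 8, 4, 14400),
    "VCB": (70, 22, 12, 33600),
    "Switch Box": (82, 26, 14, 48400),
    "CLB": (106, 17, 12, 56200),
}
FIELDS = ("logic_jj", "bias_jj", "mjj", "area")


def mosaic_totals() -> dict:
    """Sum each column of Table 5.5 over the four subcircuits of one mosaic."""
    return {
        field: sum(values[i] for values in MOSAIC_SUBCIRCUITS.values())
        for i, field in enumerate(FIELDS)
    }


def fabric_estimate(rows: int, cols: int) -> dict:
    """Scale the mosaic totals to a rows x cols array (first-order estimate)."""
    if rows < 1 or cols < 1:
        raise ValueError("rows and cols must be positive")
    return {field: total * rows * cols for field, total in mosaic_totals().items()}


if __name__ == "__main__":
    print("one mosaic:", mosaic_totals())
    print("4 x 4 fabric:", fabric_estimate(4, 4))
```

The computed mosaic totals match the "Total mosaic" row of the table, which suggests the column assignment used here is the intended one.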
For the implementation of a four-row, four-column FPGA fabric with FS-based CLBs, we estimate a maximum operating frequency of 15 GHz for synchronous operation. This frequency is calculated from the time required for a CLB to output its result on a horizontal routing channel and for that result to travel through the switch box, the routing channels, and the VCB to reach the input of a CLB in the next column.

5.5.2 Circuit Implementation Example on the FPGA Fabric

An 8-bit asynchronous wave-pipelined ALU is demonstrated in [84]. We have synthesized the building blocks of this ALU with all clocked cells so that it can be implemented on the designed FPGA fabric with synchronous operation. To assess the efficiency of our FPGA approach, we implemented a circuit containing all the building blocks of the ALU, as shown in Fig. 5.12(a). Fig. 5.12 shows the implementation (synthesis, placement and routing on the FPGA fabric) of a part of the ALU circuit containing all building blocks and the data path representing signal flow from the inputs to the output (refer to Figs. 1 and 2 in [84]). Logic synthesis of the circuit, placement on the FPGA fabric, and routing through the routing network are done manually. Fig. 5.12(b) shows the implementation of the ALU block with a clock-follow-data clocking scheme (presented in sec. 5.3.5) without buffer DFFs on the signal paths that travel to any level higher than the next level [10]. This implementation without buffer DFFs might require the FPGA to be operated at a lower frequency so that timing violations do not occur. It can be implemented on a 4 x 9 CLB array of the SFQ FPGA fabric with synchronous FS-based CLBs containing these four gates: D-flipflop with complementary outputs, AND gate, OR gate and XOR gate. Only 11 out of 36 CLBs are not used, resulting in a utilization of 69.5% of total CLBs.
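The two headline numbers in this subsection reduce to simple arithmetic, sketched below in Python with hypothetical helper names. The utilization is the fraction of CLBs in use, and the clock period is the reciprocal of the estimated operating frequency; the period is therefore the time budget for the CLB-to-CLB path described in section 5.5.1.

```python
def clb_utilization(total_clbs: int, unused_clbs: int) -> float:
    """Fraction of CLBs in the array that are occupied by the circuit."""
    if total_clbs <= 0 or not 0 <= unused_clbs <= total_clbs:
        raise ValueError("need 0 <= unused_clbs <= total_clbs and total_clbs > 0")
    return (total_clbs - unused_clbs) / total_clbs


def clock_period_ps(frequency_ghz: float) -> float:
    """Clock period in picoseconds for a frequency given in GHz."""
    if frequency_ghz <= 0:
        raise ValueError("frequency must be positive")
    return 1000.0 / frequency_ghz


if __name__ == "__main__":
    # 4 x 9 array with 11 unused CLBs, as in the example above.
    print(f"utilization: {100 * clb_utilization(4 * 9, 11):.1f}%")
    # 15 GHz estimate for the four-row, four-column fabric.
    print(f"clock period: {clock_period_ps(15.0):.1f} ps")
```

With 25 of 36 CLBs in use, the computed ratio agrees with the quoted utilization to within rounding, and a 15 GHz clock leaves roughly 67 ps for a result to travel from one CLB to a CLB in the next column.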
For the maximum frequency of operation (or for clock distribution using an H-tree), buffer DFFs must be inserted on signal paths that travel more than one level. For this implementation, an FPGA fabric with a 5 x 9 CLB array is required, and it will have a utilization of 71% of total CLBs. Note that the implementation of a complete ALU circuit can result in a lower utilization of CLBs, since there will be more signals to route across different ALU blocks similar to the block shown in Fig. 5.12(a).

5.5.3 Discussion

Some discussion points to consider are as follows.
(i) We do not expect to use any passive transmission lines (PTLs) in the implementation of the SFQ FPGA fabric, since our layout estimations show that all subcircuits can be laid out side by side and connected to each other with JTLs (if needed). Avoiding PTLs helps in decreasing the delay.
(ii) Similar to the vertical routing channels, two horizontal routing channels can also be run in both directions, left to right and right to left. The trade-off between the implementation cost and the routing advantage of bidirectional horizontal tracks guided us towards unidirectional horizontal tracks. However, when implementing circuits such as a complete 8-bit ALU, a few strategically placed bidirectional horizontal tracks can help in increasing the utilization percentage of CLBs.
(iii) CAD tools and algorithms for logic synthesis of a circuit for CLB-specific SFQ FPGA fabrics, placement of synthesized gates on the fabric, and routing among CLBs are considered future work. This work mainly focuses on the fabric design.
(iv) New timing techniques (for clocking the CLBs), along with changes in the routing channel structure, can result in variations of the fabric that increase the utilization percentage of the CLBs and/or the frequency of operation. For example, (a) having two more vertical routing channels will help in routing different P and G signals (e.g., P^1_i, G^1_i, P^2_i, G^2_i in Fig. 5.12(b)) across several ALU blocks for the implementation of the whole ALU. Otherwise, the unavailability of vertical channels for routing across blocks, due to the interconnections within a block, results in under-utilization of CLBs; (b) some circuits (e.g. tree-based adders) have signals flowing among identical blocks in an organized manner. Phase-wise clocking of different blocks according to the signal flow can help in reducing the number of buffer DFFs and/or the overall latency of the implemented circuit.

5.5.4 Implementation Considerations

Implementation of the proposed magnetic SFQ FPGA would require co-fabrication of MJJs and the conventional superconductor-insulator-superconductor (SIS) junctions used in SFQ circuits. Such a fabrication process has recently been demonstrated, in which both types of junction are fabricated within a 4-layer process [85]. A greater advantage will be achieved with MJJs and SIS JJs located on different vertically integrated layers, similar to the double SIS JJ layer process recently developed in Japan [86]. Alternatively, one can use multi-chip module (MCM) integration with the logic layer and the programming layer implemented on different chips. However, this would require a large number of fully superconducting bump bonds. Currently, such MCM technology with superconducting bonds is demonstrated only for < 4 K operation [87]. Overall, the MCM integration approach appears to be more challenging and less scalable than the double-JJ-layer integrated fabrication process described above.

5.6 Energy Saving Techniques for Large Regular ERSFQ Circuits like FPGAs

ERSFQ biasing requires constant feeding of clock pulses to the FJTL to maintain a constant bias bus voltage. This causes additional dynamic power consumption in both the FJTL and all the bias-limiting JJs of the circuit, but it is necessary for the circuit to function with good margins.
However, when there is no circuit activity in some parts of the circuit, this additional power consumption is a loss. If this dynamic power consumption can be avoided, not only is power saved, but interference from one part of the circuit to another can also be reduced. The following subsections focus on saving this unnecessarily dissipated power.

5.6.1 Feeding Clock Choking: Dynamic Sleep Regime

The idea of feeding clock choking is to distribute the feeding clock separately to different sections of the circuit and to stop the feeding clock to circuit blocks that have no circuit activity. This way, we can implement a sleep regime for a section of the integrated circuit and save the dynamic power that would otherwise be dissipated in the absence of activity in that section. We propose to use magnetic Josephson junctions (MJJs) and to exploit their programmable critical current, which is set through the magnetization of the ferromagnetic layer of the MJJ [77, 79, 80]. Katam et al. [88] explain the use of MJJs as bias-limiting junctions in ERSFQ biasing to create a programmable switch. One such switch can be placed at the beginning of the FJTL of each circuit block and be programmed to choke the feeding clock (stop its propagation) when there is no circuit activity in that block.

The critical current of MJJs can be changed between high and low values by applying an external magnetic field. The field reverses the magnetization of a ferromagnetic layer in the MJJ. This magnetic field can be applied using current-carrying lines integrated close to the MJJs, by passing the current required to induce magnetization reversal. An illustration of placing MJJs in a crossbar and accessing a specific MJJ through decoders is given in [88].
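The control policy behind feeding clock choking can be illustrated with a small bookkeeping model in Python. This is a purely behavioral sketch under stated assumptions: the class and function names are hypothetical, the MJJ switch at the head of each block's FJTL is reduced to a boolean `choked` flag (the mapping of high or low critical current to that flag is not modeled), and the number of feeding pulses delivered stands in for the dynamic power spent on feeding.

```python
from dataclasses import dataclass


@dataclass
class CircuitBlock:
    name: str
    active: bool          # True if the block has circuit activity
    choked: bool = False  # state of the (hypothetical) switch at the FJTL input


def program_switches(blocks) -> None:
    """Choke the feeding clock of every block that has no circuit activity."""
    for block in blocks:
        block.choked = not block.active


def feeding_pulses_delivered(blocks, clock_pulses: int) -> dict:
    """Feeding-clock pulses that reach each block's FJTL in this simple model."""
    return {block.name: 0 if block.choked else clock_pulses for block in blocks}


if __name__ == "__main__":
    fabric = [
        CircuitBlock("CLB_0", active=True),
        CircuitBlock("CLB_1", active=False),
        CircuitBlock("SB_0", active=True),
        CircuitBlock("SB_1", active=False),
    ]
    program_switches(fabric)
    delivered = feeding_pulses_delivered(fabric, clock_pulses=1000)
    saved = sum(1000 - pulses for pulses in delivered.values())
    print(delivered)
    print("feeding pulses avoided:", saved)
```

In this model the saving scales directly with the number of idle blocks, which is why the technique is attractive for fabrics in which many blocks are unused.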
5.6.2 Current Recycling

The technique of serially biasing SFQ circuit blocks so that the bias current of one circuit block is re-used for another is called current recycling, and it has been demonstrated for RSFQ circuits [89]. Due to the presence of the FJTL, this technique can be difficult to implement for ERSFQ circuits. A straightforward and intuitive way of implementing current recycling is to have an FJTL only in the topmost block and bias the logic circuits serially. However, this does not work, because the blocks below do not have a feeding JTL and therefore lose the bias tolerance that comes from the FJTL's bias current sacrifice. Fig. 5.13(a) shows a proper way of implementing the current recycling technique, with an FJTL present in all the serially biased blocks. One can come up with other ways of implementing current recycling that may save some hardware cost.

Figure 5.13: (a) Implementation of the current recycling technique for ERSFQ logic circuits. MC represents magnetic coupling using inductors. (b), (c) Input and output pulses of a D-flipflop in the shift register circuit of the upper block under current recycling, with feeding clock choking (b) absent and (c) present in the lower block.

5.6.3 Current Recycling with Feeding Clock Choking

Choking the feeding clock is particularly useful when implementing the current recycling technique for ERSFQ circuits. For current recycling, the circuit has to be divided into several blocks with similar bias current that are biased serially. Since the circuit is already divided into blocks, we can place an MJJ-based switch at the entrance of the feeding clock of each block to implement feeding clock choking. Whenever a certain block is not used in the circuit, the feeding clock for that block can be choked. As a result, current enters and leaves the block without any dynamic or static power being consumed, so idle circuit blocks achieve zero power consumption.
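The benefit of serial biasing and its equal-current requirement can be made concrete with a minimal Python sketch. The function names and the example currents are hypothetical and chosen only for illustration; the model simply contrasts a supply that must source the sum of all block currents (parallel biasing) with one that sources a single block current re-used by every block (serial biasing), and it rejects partitions whose block currents differ by more than a chosen tolerance.

```python
def parallel_supply_current(block_currents_ma) -> float:
    """Conventional parallel biasing: the supply provides the sum of all block currents."""
    return float(sum(block_currents_ma))


def serial_supply_current(block_currents_ma, tolerance_ma: float = 0.0) -> float:
    """Current recycling: the same current flows through every serially biased block.

    All blocks must draw (nearly) the same bias current, so the partition is
    rejected if any block deviates from the first by more than tolerance_ma.
    """
    if not block_currents_ma:
        raise ValueError("need at least one block")
    reference = block_currents_ma[0]
    for current in block_currents_ma:
        if abs(current - reference) > tolerance_ma:
            raise ValueError("blocks must have equal bias current for serial biasing")
    return float(reference)


if __name__ == "__main__":
    # Illustrative values only: four blocks drawing 100 mA each.
    blocks = [100.0, 100.0, 100.0, 100.0]
    print("parallel biasing:", parallel_supply_current(blocks), "mA")
    print("serial biasing:  ", serial_supply_current(blocks), "mA")
```

A regular structure such as an FPGA fabric or a memory array satisfies the equal-current check naturally, because its blocks are copies of one another.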
Superconducting FPGAs and memories are circuits to which this technique can be applied directly, as they are already divided into blocks with equal bias current. In most cases, for any FPGA architecture, the utilization ratio of configurable logic blocks (CLBs) is less than one [90]. In such a scenario, non-utilized CLBs and switch boxes can be kept in the dynamic sleep regime by employing the feeding clock choking technique.

For proof of concept, 4-bit shift registers are simulated with serial biasing using the current recycling technique and with feeding clock choking. The simulation results (Fig. 5.13(c)) show another advantage of choking the feeding clock: SFQ pulses in the upper block have comparatively lower noise levels when the bottom block's feeding clock is choked.

5.7 Conclusion

We have designed the first superconducting energy-efficient magnetic FPGA. We used the ERSFQ biasing scheme in combination with magnetic Josephson junctions to obtain a switch implementation that can be programmed with an external current source. We have designed both NDRO-switch-based and magnetic-switch-based CLBs; the former are programmed serially through an SFQ scan chain in the CLB structure, and the latter through magnetic coupling with currents in the crossbar structure formed by the current lines. CLBs are designed for asynchronous operation, at no higher cost, as well as for synchronous operation. We have modified the CMOS switch box architecture and designed connection blocks appropriately in the context of a unidirectional SFQ FPGA. A programming methodology to program the critical current of MJJs to either low or high values is presented. We simulated all the designed circuits in the WRSpice circuit simulator and verified their functionality. We have also built Verilog models for each FPGA subcircuit to ease simulation of the whole FPGA structure.
To demonstrate the functionality of the proposed FPGA approach, a circuit containing all the building blocks of an ALU is synthesized, placed and routed on the fabric. According to the estimations, our FPGA fabric takes much less area than the previous implementations. An innovative feeding clock choking mechanism, analogous to power gating in CMOS technology, is also proposed; it uses magnetic Josephson junctions to put circuits into a dynamic sleep regime.

Chapter 6
Conclusion and Future Work

The goal of this research is to push the existing tools and methodologies for single flux quantum (SFQ) technology towards building large-scale circuits. To this end, design, analysis, simulation, and tool development were carried out on several interconnected topics that contribute to building large-scale superconducting circuits. The conclusions are summarized below, along with directions for future research.

6.1 Completed Work

6.1.1 Circuit Synthesis

A new circuit synthesis tool is developed with logic optimization using a CMOS logic synthesis tool, ABC. We have used some of the logic synthesis features of ABC and added some new features to ABC to obtain an SFQ circuit netlist from a Verilog HDL functional description of a circuit. The synthesis flow and the algorithms used for D-flipflop insertion, which path-balances the circuit, and for splitter insertion, which implements fanout, are presented in the work. Retiming techniques for sequential circuits in VLSI are used to reduce the path-balancing DFF count for SFQ circuits. Results for synthesized, placed and routed circuit netlists are given as examples. This work can be found in [55].

6.1.2 Design of Complex Cells

The concept of two kinds of complex cells, stand-alone and interconnected cells, is introduced. The design and analysis of 3-, 4-, and 5-input AND and OR gates, the A+BC cell, and high-fanout splitters are presented as part of stand-alone complex cells.
The advantage of these cells is demonstrated through the synthesis of multiplexer, decoder and carry look-ahead adder circuits using the ABC-based synthesis flow described above. The complex cells help in reducing the overall area and the latency of the circuit. This work can be found in [55, 91].

6.1.3 Timing Characterization of Standard Cells

A context-dependent timing characterization method for SFQ standard cells is developed. It generates look-up tables (LUTs) for delay calculation based on the load of the cell. This is the non-linear delay model (NLDM) method for SFQ circuits, and it is a step towards building an STA tool with load-dependent clock-to-Q delay and pulse-width-dependent setup and hold times, similar to the advanced CMOS timing design tools. In order to minimize the dimensions of the LUTs, we identified the critical parameters on which the delay of SFQ cells depends: (i) the first series inductance at the output of the cell and (ii) the difference between the critical current and the actual current flowing through the first parallel Josephson junction (JJ) at the output of the cell. We have also presented the approach to generate the required LUTs through simulation. The NLDM approach can find the clock cycle of an SFQ circuit without analog simulation of the whole circuit and can find setup and hold time violations with high accuracy. Comparing the clock-to-Q delay of our NLDM approach with JSIM simulations gives a maximum difference of 2.1% for the nominal case and 6% across all process corners in our test cases of clocked cells. This work can be found in [13].

6.1.4 Design of Superconducting FPGA Based on MJJs

We have designed the first superconducting energy-efficient magnetic FPGA. We used the ERSFQ biasing scheme in combination with magnetic Josephson junctions to obtain a switch implementation that can be programmed with an external current source.
We have designed both NDRO-switch-based and magnetic-switch-based CLBs; the former are programmed serially through an SFQ scan chain in the CLB structure, and the latter through magnetic coupling with currents in the crossbar structure formed by the current lines. CLBs are designed for asynchronous operation, at no higher cost, as well as for synchronous operation. We have modified the CMOS switch box architecture and designed connection blocks appropriately in the context of a unidirectional SFQ FPGA. A programming methodology to program the critical current of MJJs to either low or high values is presented. We simulated all the designed circuits in the WRSpice circuit simulator and verified their functionality. We have also built Verilog models for each FPGA subcircuit to ease simulation of the whole FPGA structure. To demonstrate the functionality of the proposed FPGA approach, a circuit containing all the building blocks of an ALU is synthesized, placed and routed on the fabric. According to the estimations, our FPGA fabric takes much less area than the previous implementations. A new technique, feeding clock choking, is proposed: a circuit solution that uses magnetic Josephson junctions to achieve zero power consumption in idle circuit blocks. This technique performs a function similar to power gating in CMOS circuits. We also show in simulation how current recycling can be implemented in ERSFQ circuits, with and without feeding clock choking, to demonstrate the advantages of feeding clock choking. This work can be found in [88, 92].

6.2 Future Work

6.2.1 Standard Cell Library

A suite of basic cells has been designed for SFQ technology and is being used as a standard cell library that is optimized and verified through simulation.
These cells can be developed at the layout level and extracted with extraction tools like InductEx [93] to further optimize them and make them ready for successful fabrication and testing. Once a proper layout is made for each gate in the standard cell library, the gates can be characterized and used for building several tools in the physical design flow of superconducting circuits.

6.2.2 Building a Static Timing Analysis Tool

The timing characterization method is built along the lines of the NLDM method for CMOS circuits so that static timing analysis can be done without extensive simulation of a given circuit. For an SFQ circuit, the timing analysis will be very different from that of a CMOS VLSI circuit for the following reasons, among many others: (1) the sequential nature of each logic gate; (2) the propagation time of a signal through the clock distribution line is comparable to the data propagation time through the logic circuit; (3) different delay models are needed for signal propagation through passive transmission lines (PTLs) and through Josephson-junction-based logic. A tool has to be built that takes different clocking strategies into consideration.

6.2.3 Building Optimal Synthesis and Placement Tools for FPGA

A superconducting FPGA architecture is designed with a feasible switch circuit and all building blocks, such as configurable logic blocks, switch boxes and connection blocks. A basic clocking scheme and a programming scheme are also developed, with room for future improvement in the architecture. Though the synthesis tool developed in this research can be used for the synthesis of a circuit on the fabric, an architecture-specific synthesis tool with placement and routing as constraints can help in efficiently using the CLBs and routing resources of the fabric.
6.2.4 New Timing and Clocking Techniques for the FPGA

Though a basic clocking scheme is presented in this work, innovative ways of distributing the clock to the CLBs on the fabric can help in reducing the latency of a circuit implemented on the fabric and in improving CLB utilization. Some of the features of traditional VLSI can be used to build asynchronous ways of clocking.

6.2.5 Building Current Recycling Layouts and Algorithms

One of the important developments required for the implementation of large-scale SFQ circuits is the current recycling technique. Though the concept is well known and is known to help solve high bias current problems, no successful circuit implementation has been found so far. The principal challenge is designing the inductive coupling circuits that interface between recycling layers at the layout level and getting them to work as designed after fabrication. Designing current recycling layouts that follow up on the ERSFQ simulation benches with current recycling presented in this research would be a worthwhile contribution.

To apply the current recycling technique to a specific circuit, algorithms need to be developed that divide the circuit into blocks while being aware of placement and routing requirements and keeping the serially biased blocks at the same bias current. These algorithms will look different for different biasing schemes.

Bibliography

[1] L. Wilson, “International technology roadmap for semiconductors (ITRS),” Semiconductor Industry Association, vol. 1, 2013.
[2] D. S. Holmes, A. L. Ripple, and M. A. Manheimer, “Energy-efficient superconducting computing—power budgets and requirements,” IEEE Transactions on Applied Superconductivity, vol. 23, no. 3, pp. 1701610–1701610, 2013.
[3] “Single flux quantum (SFQ) circuit fabrication and design: status and outlook,” https://beyondcmos.ornl.gov/, accessed: 2016-04-06.
[4] D. S.
Holmes, “Superconducting computing: Lessons from an emerging technology,” in Energy Efficient Electronic Systems (E3S), 2015 Fourth Berkeley Symposium on. IEEE, 2015, pp. 1–1.
[5] P. Bunyk, A. Oliva, V. Semenov, M. Bhushan, K. Likharev, J. Lukens, M. Ketchen, and W. Mallison, “High-speed single-flux-quantum circuit using planarized niobium-trilayer Josephson junction technology,” Applied Physics Letters, vol. 66, no. 5, pp. 646–648, 1995.
[6] K. Likharev and V. Semenov, “RSFQ logic/memory family: A new Josephson-junction technology,” IEEE Transactions on Applied Superconductivity, vol. 50, no. 1, 1991.
[7] T. Van Duzer and C. W. Turner, “Principles of superconductive devices and circuits,” 1981.
[8] S. Yorozu, Y. Kameda, H. Terai, A. Fujimaki, T. Yamada, and S. Tahara, “A single flux quantum standard logic cell library,” Physica C: Superconductivity, vol. 378, pp. 1471–1474, 2002.
[9] P. Bunyk, K. Likharev, and D. Zinoviev, “RSFQ technology: Physics and devices,” in High Speed Integrated Circuit Technology: Towards 100 GHz Logic. World Scientific, 2001, pp. 257–305.
[10] N. Katam, A. Shafaei, and M. Pedram, “Design of Complex Rapid Single-Flux-Quantum Cells with Application to Logic Synthesis,” in Superconductive Electronics Conference (ISEC), 2017 16th International. IEEE, 2017.
[11] K. Gaj, E. G. Friedman, and M. J. Feldman, “Timing of multi-gigahertz rapid single flux quantum digital circuits,” in High Performance Clock Distribution Networks. Springer, 1997, pp. 135–164.
[12] H. Suzuki, S. Nagasawa, K. Miyahara, and Y. Enomoto, “Characteristics of driver and receiver circuits with a passive transmission line in RSFQ circuits,” IEEE Transactions on Applied Superconductivity, vol. 10, no. 3, pp. 1637–1641, 2000.
[13] N. K. Katam and M. Pedram, “Timing characterization for static timing analysis of single flux quantum circuits,” IEEE Transactions on Applied Superconductivity, vol. 29, no. 6, pp. 1–8, Sep. 2019.
[14] L. A. Abelson and G. L.
Kerber, “Superconductor integrated circuit fabrication technology,” Proceedings of the IEEE, vol. 92, no. 10, pp. 1517–1533, 2004.
[15] D. Yohannes, A. Kirichenko, S. Sarwana, and S. K. Tolpygo, “Parametric testing of HYPRES superconducting integrated circuit fabrication processes,” IEEE Transactions on Applied Superconductivity, vol. 17, no. 2, pp. 181–186, 2007.
[16] H. Numata and S. Tahara, “Fabrication technology for Nb integrated circuits,” IEICE Transactions on Electronics, vol. 84, no. 1, pp. 2–8, 2001.
[17] R. Harris, J. Johansson, A. Berkley, M. Johnson, T. Lanting, S. Han, P. Bunyk, E. Ladizinsky, T. Oh, I. Perminov et al., “Experimental demonstration of a robust and scalable flux qubit,” Physical Review B, vol. 81, no. 13, p. 134510, 2010.
[18] S. K. Tolpygo, V. Bolkhovsky, T. J. Weir, A. Wynn, D. E. Oates, L. M. Johnson, and M. A. Gouker, “Advanced fabrication processes for superconducting very large-scale integrated circuits,” IEEE Transactions on Applied Superconductivity, vol. 26, no. 3, pp. 1–10, 2016.
[19] O. A. Mukhanov, “Energy-efficient single flux quantum technology,” IEEE Transactions on Applied Superconductivity, vol. 21, no. 3, pp. 760–769, 2011.
[20] C. J. Fourie, C. Shawawreh, I. V. Vernik, and T. V. Filippov, “High-Accuracy InductEx Calibration Sets for MIT-LL SFQ4ee and SFQ5ee Processes,” IEEE Transactions on Applied Superconductivity, vol. 27, no. 2, pp. 1–5, 2017.
[21] Y. Sakashita, T. Ono, Y. Yamanashi, and N. Yoshikawa, “Design and high-speed component tests of an SFQ FFT Processor Using the 10 kA/cm^2 Nb Advanced Process,” in Superconductive Electronics Conference (ISEC), 2015 15th International. IEEE, 2015, pp. 1–3.
[22] Y. Yamanashi, T. Van Duzer, and N. Yoshikawa, “A novel power line to reduce the magnetic field of supply currents in Josephson Digital Circuits,” IEEE/CSC European Superconductivity News Forum (ESNF), October 2008.
[23] H. Suzuki, T. Ono, and N.
Yoshikawa, “Experimental and simulation results of a symmetrical pad to reduce a stray ground current in superconducting integrated circuits,” in Journal of Physics: Conference Series, vol. 871. IOP Publishing, 2017, p. 012067.
[24] K. Sano, T. Shimoda, Y. Abe, Y. Yamanashi, N. Yoshikawa, N. Zen, and M. Ohkubo, “Reduction of the supply current of single-flux-quantum time-to-digital converters by current recycling techniques,” IEEE Transactions on Applied Superconductivity, vol. 27, no. 4, pp. 1–5, 2017.
[25] S. Narayana, Y. A. Polyakov, and V. K. Semenov, “Evaluation of flux trapping in superconducting circuits,” IEEE Transactions on Applied Superconductivity, vol. 19, no. 3, pp. 640–643, 2009.
[26] Y. Yamanashi, H. Imai, and N. Yoshikawa, “Influence of magnetic flux trapped in moats on superconducting integrated circuit operation,” IEEE Transactions on Applied Superconductivity, 2018.
[27] A. M. Kadin, R. J. Webber, and D. Gupta, “Current leads and optimized thermal packaging for superconducting systems on multistage cryocoolers,” IEEE Transactions on Applied Superconductivity, vol. 17, no. 2, pp. 975–978, 2007.
[28] A. Akimoto, Y. Yamanashi, N. Yoshikawa, A. Fujimaki, S. Yorozu, and H. Terai, “Consideration of logic synthesis and clock distribution networks for SFQ logic circuits,” Physica C: Superconductivity and its Applications, vol. 426, pp. 1687–1692, 2005.
[29] V. Adler, C.-H. Cheah, K. Gaj, D. K. Brock, and E. G. Friedman, “A cadence-based design environment for single flux quantum circuits,” IEEE Transactions on Applied Superconductivity, vol. 7, no. 2, pp. 3294–3297, 1997.
[30] K. Takagi, N. Kito, and N. Takagi, “Circuit description and design flow of superconducting SFQ logic circuits,” IEICE Transactions on Electronics, vol. 97, no. 3, pp. 149–156, 2014.
[31] Y. Kameda, S. Yorozu, and Y.
Hashimoto, “Automatic single-flux-quantum (SFQ) logic synthesis method for top-down circuit design,” in Journal of Physics: Conference Series, vol. 43. IOP Publishing, 2006, p. 1179.
[32] N. M. Muchuka, “Hardware description language modelling and synthesis of superconducting digital circuits,” Ph.D. dissertation, Stellenbosch: Stellenbosch University, 2017.
[33] C. A. Hamilton and K. C. Gilbert, “Margins and yield in single flux quantum logic,” IEEE Transactions on Applied Superconductivity, vol. 1, no. 4, pp. 157–163, 1991.
[34] Berkeley Logic Synthesis and Verification Group, “ABC: A System for Sequential Synthesis and Verification.” [Online]. Available: http://www.eecs.berkeley.edu/~alanmi/abc/
[35] P. Pan, “Continuous retiming: Algorithms and applications,” in Computer Design: VLSI in Computers and Processors, 1997. ICCD’97. Proceedings., 1997 IEEE International Conference on. IEEE, 1997, pp. 116–121.
[36] S. N. Shahsavani, T.-R. Lin, A. Shafaei, C. J. Fourie, and M. Pedram, “An integrated row-based cell placement and interconnect synthesis tool for large SFQ logic circuits,” IEEE Transactions on Applied Superconductivity, vol. 27, no. 4, pp. 1–8, 2017.
[37] N. Kito, K. Takagi, and N. Takagi, “Automatic wire-routing of SFQ digital circuits considering wire-length matching,” IEEE Transactions on Applied Superconductivity, vol. 26, no. 3, pp. 1–5, 2016.
[38] E. S. Fang, “A Josephson integrated circuit simulator (JSIM) for superconductive electronics application,” in Extended Abstracts of 1989 International Superconductivity Electronics Conference (ISEC’89), 1989.
[39] P. Shevchenko, “PSCAN2 Superconductor Circuit Simulator.” [Online]. Available: http://pscan2sim.org/
[40] WRSpice, http://www.wrcad.com/wrspice.html (accessed January 31, 2019).
[41] S. Polonsky, P. Shevchenko, A. Kirichenko, D. Zinoviev, and A.
Rylyakov, “PSCAN’96: New software for simulation and optimization of complex RSFQ circuits,” IEEE Transactions on Applied Superconductivity, vol. 7, no. 2, pp. 2685–2689, 1997.
[42] H. Hayakawa, N. Yoshikawa, S. Yorozu, and A. Fujimaki, “Superconducting digital electronics,” Proceedings of the IEEE, vol. 92, no. 10, pp. 1549–1563, 2004.
[43] A. Inamdar, D. Amparo, B. Sahoo, J. Ren, and A. Sahu, “RSFQ/ERSFQ cell library with improved circuit optimization, timing verification, and test characterization,” IEEE Transactions on Applied Superconductivity, vol. 27, no. 4, pp. 1–9, 2017.
[44] M. E. Çelik and A. Bozbey, “Statistical timing analysis tool for SFQ cells (STATS),” in Superconductive Electronics Conference (ISEC), 2013 IEEE 14th International. IEEE, 2013, pp. 1–3.
[45] V. Stojanovic and V. G. Oklobdzija, “Comparative analysis of master-slave latches and flip-flops for high-performance and low-power systems,” IEEE Journal of Solid-State Circuits, vol. 34, no. 4, pp. 536–548, 1999.
[46] Synopsys, https://www.synopsys.com/implementation-and-signoff/signoff/primetime.html (accessed January 31, 2019).
[47] Q. Xie, X. Lin, Y. Wang, M. J. Dousti, A. Shafaei, M. Ghasemi-Gol, and M. Pedram, “5nm FinFET standard cell library optimization and circuit synthesis in near-and super-threshold voltage regimes,” in VLSI (ISVLSI), 2014 IEEE Computer Society Annual Symposium on. IEEE, 2014, pp. 424–429.
[48] L. Ponta, A. Carbone, M. Gilli, and P. Mazzetti, “Resistively and capacitively shunted Josephson junctions model for unconventional superconductors,” in Electrical and Computer Engineering (CCECE), 2011 24th Canadian Conference on. IEEE, 2011, pp. 000644–000647.
[49] B. Dimov, V. Todorov, V. Mladenov, and F. H. Uhlmann, “The Josephson transmission line as an impedance matching circuit,” WSEAS Transactions on Circuits and Systems, vol. 3, no. 5, pp. 1341–1346, 2004.
[50] A. Odintsov, V. Semenov, and A.
Zorin, “Specific problems of numerical analysis of the Josephson junction circuits,” IEEE Transactions on Magnetics, vol. 23, no. 2, pp. 763–766, 1987. [51] S. Razmkhah and A. Bozbey, “Design of the passive transmission lines for different stripline widths and impedances,” IEEE Trans. Appl. Supercond, vol. 26, no. 8, 2016. [52] K. Takagi, M. Tanaka, S. Iwasaki, R. Kasagi, I. Kataeva, S. Nagasawa, T. Satoh, H. Akaike, and A. Fujimaki, “SFQ propagation properties in passive transmission lines based on a 10- Nb-layer structure,” IEEE Transactions on Applied Superconductivity, vol. 19, no. 3, pp. 617–620, 2009. [53] T. El Motassadeq, “CCS vs NLDM comparison based on a complete automated correlation flow between PrimeTime and HSPICE,” in Electronics, Communications and Photonics Conference (SIECPC), 2011 Saudi International. IEEE, 2011, pp. 1–5. [54] Y . Hashimoto, S. Yorozu, Y . Kameda, A. Fujimaki, H. Terai, and N. Yoshikawa, “Design and investigation of gate-to-gate passive interconnections for SFQ logic circuits,” IEEE transactions on applied superconductivity, vol. 15, pp. 3814–3820, 2005. [55] N. K. Katam and M. Pedram, “Logic optimization, complex cell design, and retiming of single flux quantum circuits,” IEEE Transactions on Applied Superconductivity, vol. 28, no. 7, pp. 1–9, 2018. [56] D. Kirichenko, S. Sarwana, and A. Kirichenko, “Zero static power dissipation biasing of RSFQ circuits,” IEEE Transactions on Applied Superconductivity, vol. 21, no. 3, pp. 776– 779, 2011. [57] M. H. V olkmann, A. Sahu, C. J. Fourie, and O. A. Mukhanov, “Implementation of energy efficient single flux quantum digital circuits with sub-aJ/bit operation,” Superconductor Science and Technology, vol. 26, no. 1, p. 015002, 2012. [58] M. Tanaka, M. Ito, A. Kitayama, T. Kouketsu, and A. Fujimaki, “18-GHz, 4.0-aJ/bit op- eration of ultra-low-energy rapid single-flux-quantum shift registers,” Japanese Journal of Applied Physics, vol. 51, no. 5R, p. 053102, 2012. [59] Q. P. Herr, A. Y . 
Herr, O. T. Oberg, and A. G. Ioannidis, “Ultra-low-power superconductor logic,” Journal of Applied Physics, vol. 109, no. 10, p. 103903, 2011. [60] N. Takeuchi, D. Ozawa, Y . Yamanashi, and N. Yoshikawa, “An adiabatic quantum flux parametron as an ultra-low-power logic device,” Superconductor Science and Technology, vol. 26, no. 3, p. 035010, 2013. [61] O. Mukhanov, V . Semenov, and K. Likharev, “Ultimate performance of the RSFQ logic circuits,” IEEE Transactions on Magnetics, vol. 23, no. 2, pp. 759–762, 1987. 129 [62] C. Shawawreh, D. Amparo, J. Ren, M. Miller, M. Kamkar, A. Sahu, A. Inamdar, A. Kirichenko, O. Mukhanov, and I. Vernik, “Effects of adaptive DC biasing on operational margins in ERSFQ circuits,” IEEE Transactions on Applied Superconductivity, vol. 27, no. 4, pp. 1–6, 2017. [63] I. Kuon, R. Tessier, and J. Rose, “FPGA architecture: Survey and challenges,” Foundations and Trends in Electronic Design Automation, vol. 2, no. 2, pp. 135–253, 2008. [64] U. Farooq, Z. Marrakchi, and H. Mehrez, “FPGA Architectures: An Overview,” in Tree- based Heterogeneous FPGA Architectures. Springer New York, 2012, pp. 7–48. [65] I. Conway Lamb, J. Colless, J. Hornibrook, S. Pauka, S. Waddy, M. Frechtling, and D. Reilly, “An FPGA-based instrumentation platform for use at deep cryogenic temper- atures,” Review of Scientific Instruments, vol. 87, no. 1, p. 014701, 2016. [66] H. Homulle, S. Visser, B. Patra, G. Ferrari, E. Prati, F. Sebastiano, and E. Charbon, “A reconfigurable cryogenic platform for the classical control of quantum processors,” Review of Scientific Instruments, vol. 88, no. 4, p. 045103, 2017. [67] C. J. Fourie and H. van Heerden, “An RSFQ superconductive programmable gate array,” IEEE Transactions on Applied Superconductivity, vol. 17, no. 2, pp. 538–541, 2007. [68] W. R. Reohr and R. J. V oigt, “Superconducting cell array logic circuit system,” Mar. 14 2017, US Patent 9,595,970. [69] A. Cosoroaba and F. 
Rivoallon, “Achieving higher system performance with the Virtex-5 family of FPGAs,” White Paper: Virtex-5 Family of FPGAs, Xilinx WP245 (v1. 1.1), 2006. [70] D. Singh, “Implementing FPGA design with the OpenCL standard,” Altera whitepaper, 2011. [71] I. P. Nevirkovets, O. Chernyashevskyy, G. V . Prokopenko, O. A. Mukhanov, and J. B. Ket- terson, “Superconducting-Ferromagnetic Transistor,” IEEE Transactions on Applied Su- perconductivity, vol. 24, no. 4, pp. 1–6, 2014. [72] I. P. e. a. Nevirkovets, “Control of supercurrent in hybrid superconducting–ferromagnetic transistors,” IEEE Transactions on Applied Superconductivity, vol. 25, no. 3, pp. 1–5, 2015. [73] S. J. Wilton, “Architectures and algorithms for field-programmable gate arrays with em- bedded memory,” Ph.D. dissertation, Department of Electrical and Computer Engineering, University of Toronto, 1997. [74] B. Baek, W. H. Rippard, S. P. Benz, S. E. Russek, and P. D. Dresselhaus, “Hybrid superconducting-magnetic memory device using competing order parameters,” Nature Communications, vol. 5, p. 3888, 2014. [75] M. Abd El Qader, R. Singh, S. N. Galvin, L. Yu, J. Rowell, and N. Newman, “Switching at small magnetic fields in Josephson junctions fabricated with ferromagnetic barrier layers,” Applied Physics Letters, vol. 104, no. 2, p. 022602, 2014. 130 [76] B. M. Niedzielski, E. Gingrich, R. Loloee, W. Pratt, and N. O. Birge, “S/F/S Josephson junctions with single-domain ferromagnets for memory applications,” Superconductor Sci- ence and Technology, vol. 28, no. 8, p. 085012, 2015. [77] T. I. Larkin, V . V . Bol’ginov, V . S. Stolyarov, V . V . Ryazanov, I. V . Vernik, S. K. Tolpygo, and O. A. Mukhanov, “Ferromagnetic Josephson switching device with high characteristic voltage,” Applied Physics Letters, vol. 100, no. 22, p. 222601, 2012. [78] V . V . Ryazanov, V . V . Bol’ginov, D. S. Sobanin, I. V . Vernik, S. K. Tolpygo, A. M. Kadin, and O. A. 
Mukhanov, “Magnetic Josephson junction technology for digital and memory applications,” Physics Procedia, vol. 36, pp. 35–41, 2012. [79] I. V . Vernik, V . V . Bol’ginov, S. V . Bakurskiy, A. A. Golubov, M. Y . Kupriyanov, V . V . Ryazanov, and O. A. Mukhanov, “Magnetic Josephson junctions with superconducting in- terlayer for cryogenic memory,” IEEE Transactions on Applied Superconductivity, vol. 23, no. 3, pp. 1 701 208–1 701 208, 2013. [80] I. P. Nevirkovets and O. Mukhanov, “A memory cell for high density arrays based on multi- terminal superconducting-ferromagnetic device,” in Superconductive Electronics Confer- ence (ISEC), 2017 16th International. IEEE, 2017. [81] G. Cerofolini, “The Crossbar Structure,” Nanoscale Devices, pp. 45–52, 2009. [82] I. Vernik, A. Kirichenko, O. Mukhanov, and T. Ohki, “Energy-efficient and compact ERSFQ decoder for cryogenic RAM,” IEEE Transactions on Applied Superconductivity, vol. 27, no. 4, pp. 1–5, 2017. [83] A. Kirichenko, I. Vernik, O. Mukhanov, and T. Ohki, “ERSFQ 4-to-16 decoder for energy- efficient RAM,” IEEE Transactions on Applied Superconductivity, vol. 25, no. 3, pp. 1–4, 2015. [84] T. V . Filippov, A. Sahu, A. F. Kirichenko, I. V . Vernik, M. Dorojevets, C. L. Ayala, and O. A. Mukhanov, “20GHz Operation of an Asynchronous Wave-Pipelined RSFQ Arithmetic-Logic Unit,” Physics Procedia, vol. 36, pp. 59–65, 2012. [85] I. Dayton, T. Sage, E. Gingrich, M. Loving, T. Ambrose, N. Siwak, S. Keebaugh, C. Kirby, D. Miller, A. Herr et al., “Experimental demonstration of a Josephson magnetic memory cell with a programmable-junction,” arXiv preprint arXiv:1711.01681, 2017. [86] T. Ando, S. Nagasawa, N. Takeuchi, N. Tsuji, F. China, M. Hidaka, Y . Yamanashi, and N. Yoshikawa, “Three-dimensional adiabatic quantum-flux-parametron fabricated using a double-active-layered niobium process,” Superconductor Science and Technology, 2017. [87] B. Foxen, J. Mutus, E. Lucero, R. Graff, A. Megrant, Y . Chen, C. Quintana, B. Burkett, J. 
Kelly, E. Jeffrey et al., “Qubit compatible superconducting interconnects,” arXiv preprint arXiv:1708.04270, 2017. 131 [88] N. K. Katam, O. A. Mukhanov, and M. Pedram, “Superconducting magnetic field pro- grammable gate array,” IEEE Transactions on Applied Superconductivity, vol. 28, no. 2, pp. 1–12, 2018. [89] J. Kang and S. Kaplan, “Current recycling and SFQ signal transfer in large scale RSFQ circuits,” IEEE transactions on applied superconductivity, vol. 13, no. 2, pp. 547–550, 2003. [90] L. Cheng, F. Li, Y . Lin, P. Wong, and L. He, “Device and architecture cooptimization for FPGA power reduction,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 7, pp. 1211–1221, 2007. [91] N. Katam, A. Shafaei, and M. Pedram, “Design of multiple fanout clock distribution net- work for rapid single flux quantum technology,” in Design Automation Conference (ASP- DAC), 2017 22nd Asia and South Pacific. IEEE, 2017, pp. 384–389. [92] N. K. Katam, O. A. Mukhanov, and M. Pedram, “Simulation analysis and energy saving techniques for ERSFQ circuits,” IEEE Transactions on Applied Superconductivity, vol. in press, 2018. [93] C. J. Fourie, “Full-gate verification of superconducting integrated circuit layouts with In- ductEx,” IEEE Transactions on Applied Superconductivity, vol. 25, no. 1, pp. 1–9, 2015. 132
Abstract
The demand for high computing performance and energy efficiency has been driving the development of semiconductor technology for decades. Complementary metal-oxide-semiconductor (CMOS) technology is the most widely used integrated circuit technology in today’s electronics. However, with increasing challenges to the physical scaling of CMOS devices and the conclusive end of Moore’s law in sight, there is a significant need to search for new device technologies and circuit fabrics that would allow performance and energy efficiency to keep scaling beyond the end of CMOS scaling. In this context, superconductive digital electronics (SDE), especially Josephson junction (JJ)-based single flux quantum (SFQ) logic, has emerged as a very promising “beyond-CMOS” device technology, with a verified speed of 370 GHz for simple digital circuits and a switching energy per bit of ∼10⁻¹⁹ J at T = 4.2 K (liquid helium temperature).

Though SFQ technology has a clear lead in terms of speed and power consumption, several technical challenges must be overcome before it becomes a realistic option for large-scale, high-performance, and energy-efficient computing systems of the future. The main challenge is that its circuit fabrics and architectures differ from those of current semiconductor technology; consequently, efficient simulation and design automation techniques and tools for SFQ logic must be developed. The objective of this research is to design and develop circuit techniques for SFQ technology toward realizing large-scale circuits with JJs.

This work presents (i) automated synthesis of digital synchronous rapid single-flux-quantum (RSFQ) circuits using a CMOS logic synthesis tool, achieved by adding new features to the tool and modifying existing ones, and (ii) the design and analysis of complex cells and their use in synthesizing relatively large RSFQ circuits. Complex cells in this work include multiple-input AND and OR cells, high-fanout splitters, and the A + BC cell. Synthesis results show a significant reduction in logical depth and in the number of Josephson junctions when complex cells are used during technology mapping.

Another technique presented here deals with timing characterization of SFQ logic cells to enable static timing analysis (STA) of circuits belonging to any SFQ logic family. Existing methodologies and tools for timing analysis of SFQ circuits lack a load-dependent timing characterization method for calculating the context-dependent delay of cells, comparable to the nonlinear delay model (NLDM) used for CMOS circuits. Accordingly, a new timing characterization method for SFQ logic cells is presented, which relies on low-dimensional look-up tables (LUTs) to store the clock-to-output delay, setup, and hold times of clocked cells and the input-to-output delay of non-clocked cells in an SFQ standard cell library. The accuracy of the proposed LUT-based timing characterization method is compared against JSIM simulations, showing a maximum error of only 2.1% for the tested clocked cells under different loads.

Finally, a new architecture for superconducting field-programmable gate arrays (FPGAs) is presented that utilizes the latest developments in SDE, namely energy-efficient rapid single flux quantum (ERSFQ) biasing and magnetic Josephson junctions (MJJs). Towards developing an SFQ-specific FPGA, new designs of FPGA subcircuits for both synchronous and asynchronous operation of SFQ circuits are presented in this work. MJJs are used as bias-limiting junctions in ERSFQ biasing to implement programmable switches in various subcircuits of the proposed FPGA fabric. Designs of all FPGA subcircuits are developed and verified through circuit simulation. Verilog hardware description language (HDL) models are also developed for all FPGA subcircuits to facilitate large-scale FPGA simulations of a desired circuit implemented on the proposed FPGA fabric. This fabric can be utilized for implementing quantum control algorithms for quantum computing.
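
The LUT-based timing characterization described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the table axes or the interpolation scheme, so the following assumes a one-dimensional table indexed by a generic load value, linear interpolation between entries, and clamping outside the table range. The function names `lut_delay` and `max_relative_error` and all numeric values are hypothetical and are not taken from the dissertation.

```python
from bisect import bisect_right


def lut_delay(loads, delays, load):
    """Return a delay for `load` from a 1-D look-up table.

    Assumption (not from the source): `loads` is sorted in ascending
    order, and the delay is linearly interpolated between neighbouring
    entries and clamped to the end values outside the table range.
    """
    if len(loads) != len(delays) or len(loads) < 2:
        raise ValueError("table needs at least two matching entries")
    if load <= loads[0]:
        return delays[0]
    if load >= loads[-1]:
        return delays[-1]
    i = bisect_right(loads, load)  # index of first entry greater than load
    x0, x1 = loads[i - 1], loads[i]
    y0, y1 = delays[i - 1], delays[i]
    return y0 + (y1 - y0) * (load - x0) / (x1 - x0)


def max_relative_error(predicted, reference):
    """Largest relative deviation of LUT predictions from reference values."""
    return max(abs(p - r) / abs(r) for p, r in zip(predicted, reference))


if __name__ == "__main__":
    # Hypothetical table: load index values and clock-to-output delays.
    loads = [1.0, 2.0, 4.0]
    delays = [5.0, 6.0, 8.0]
    print(lut_delay(loads, delays, 3.0))
```

An accuracy check like the one reported in the abstract would compare such interpolated values against circuit-simulator results, for example with `max_relative_error`.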
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Electronic design automation algorithms for physical design and optimization of single flux quantum logic circuits
Designing efficient algorithms and developing suitable software tools to support logic synthesis of superconducting single flux quantum circuits
Verification and testing of rapid single-flux-quantum (RSFQ) circuit for certifying logical correctness and performance
Development of electronic design automation tools for large-scale single flux quantum circuits
Multi-phase clocking and hold time fixing for single flux quantum circuits
High performance and ultra energy efficient computing using superconductor electronics
Redundancy driven design of logic circuits for yield/area maximization in emerging technologies
Clocking solutions for SFQ circuits
Charge-mode analog IC design: a scalable, energy-efficient approach for designing analog circuits in ultra-deep sub-µm all-digital CMOS technologies
Trustworthiness of integrated circuits: a new testing framework for hardware Trojans
Variation-aware circuit and chip level power optimization in digital VLSI systems
Energy efficient design and provisioning of hardware resources in modern computing systems
Library characterization and static timing analysis of asynchornous circuits
Automatic conversion from flip-flop to 3-phase latch-based designs
An asynchronous resilient circuit template and automated design flow
Improving efficiency to advance resilient computing
An FPGA-friendly, mixed-computation inference accelerator for deep neural networks
Optimal redundancy design for CMOS and post‐CMOS technologies
A logic partitioning framework and implementation optimizations for 3-dimensional integrated circuits
Design and testing of SRAMs resilient to bias temperature instability (BTI) aging
Asset Metadata
Creator
Katam, Naveen Kumar (author)
Core Title
Advanced cell design and reconfigurable circuits for single flux quantum technology
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
07/28/2019
Defense Date
03/19/2019
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
bias current,cell design,energy-efficient computing,field-programmable gate array,logic synthesis,magnetic Josephson junctions,OAI-PMH Harvest,serial biasing,superconducting electronics
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Pedram, Massoud (committee chair), Beerel, Peter (committee member), Gupta, Sandeep (committee member), Nakano, Aiichiro (committee member)
Creator Email
naveenkk15@gmail.com,nkatam@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-194619
Unique identifier
UC11663015
Identifier
etd-KatamNavee-7649.pdf (filename),usctheses-c89-194619 (legacy record id)
Legacy Identifier
etd-KatamNavee-7649.pdf
Dmrecord
194619
Document Type
Dissertation
Rights
Katam, Naveen Kumar
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA