Development of electronic design automation tools for large-scale single flux quantum circuits
DEVELOPMENT OF ELECTRONIC DESIGN AUTOMATION TOOLS FOR LARGE-SCALE SINGLE FLUX QUANTUM CIRCUITS

by Ting-Ru Lin

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

December 2020

Copyright 2020 Ting-Ru Lin

Acknowledgements

This thesis was accomplished with the precious assistance of my Ph.D. advisor, Prof. Massoud Pedram. He led me to a new research area, electronic design automation (EDA), in which I had very little background knowledge and no research experience. He spent time teaching me how to explore this new area, locate the problems, and propose corresponding solutions. Prof. Pedram usually encouraged me to investigate the state-of-the-art methodologies in other areas to refine the traditional methodology in the EDA area, which motivated me to acquire diverse knowledge. This encouragement applied not only to academic research but also extended to my daily life. His influence on my life cannot be overstated, as I have learned so much from him from different perspectives.

Next, I would like to thank the other committee members of my thesis defense exam, Prof. Aiichiro Nakano and Prof. Sandeep Gupta, for providing valuable feedback on my research and helping me push the research boundary.

I was lucky to meet and work with so many excellent people. The alumni and current members of the SPORT lab led by Prof. Pedram have my sincere appreciation. They include Prof. Shahin Nazarian, Shuang Chen, Tiansong Cui, Alireza Shafaei, Di Zhu, Luhao Wang, Naveen Kumar Katam, Bo Zhang, Hassan Afzalikusha, Mahdi Nazemi, Ghasem Pasandi, Soheil Nazar Shahsavani, Mohammad Saeed Abrishami, Amir Erfan Eshratifar, Marzieh Vaeztourshizi, Arash Fayyazi, Amirhossein Esmaili, Souvik Kundu, Mingye Li, and Haolin Cong. Moreover, Prof. Lizhong Chen, Yunfan Li, Drew Donald Penney, Ji Li, Ramy Tadros, Jizhe Zhan, Huimei Cheng, and Fangzhou Wang also provided significant help to me over the past years.

Finally, I would like to express my sincerest appreciation to my parents and my wife for their unconditional support and love. Without any of the aforementioned people, this dissertation would not have been possible.

Table of Contents

List of Figures
List of Tables
Abstract
Related Publications
1 Introduction
    1.1 Status of Single Flux Quantum Circuits
    1.2 Retiming for Single Flux Quantum Circuits
    1.3 Routing for Single Flux Quantum Circuits
    1.4 Thesis Contributions
    1.5 Thesis Organization
2 Background
    2.1 Single Flux Quantum Technology
        2.1.1 Josephson Junctions
        2.1.2 Single Flux Quantum Pulses
    2.2 Single Flux Quantum Communications
        2.2.1 Basic SFQ Cells
        2.2.2 Transfer of Single Flux Quantum Pulses
        2.2.3 Path Balanced SFQ Circuits
        2.2.4 SFQ Timing Fundamentals
    2.3 Unique Characteristics of SFQ Designs
        2.3.1 Fabrication
        2.3.2 Cooling Systems
        2.3.3 Clock Net Topology
        2.3.4 Power Consumption
3 Path-Balancing Retiming for High-performance Single Flux Quantum Circuits
    3.1 Motivation
    3.2 High-performance SFQ Circuit Designs
        3.2.1 Standard SFQ Cell Library
        3.2.2 Single Clock Architecture
        3.2.3 Dual Clock Architecture
    3.3 System Models
        3.3.1 Preliminaries
        3.3.2 Path-Balancing Retiming Models
    3.4 Fully Path Balanced Circuits
        3.4.1 Fully Path-Balancing Retiming
        3.4.2 FPB-CREM Problem Formulation
        3.4.3 FPB-CREM Problem Complexity
        3.4.4 FPB-CREM Solution Method
    3.5 Partially Path Balanced Circuits
        3.5.1 Partially Path-Balancing Retiming
        3.5.2 PPB-CREM Problem Formulation
        3.5.3 PPB-CREM Problem Complexity
        3.5.4 PPB-CREM Solution Method
    3.6 Experimental Results and Discussions
    3.7 Conclusion
4 qGDR: A Routing Tool for Single Flux Quantum Circuits
    4.1 Motivation
    4.2 Routing Procedure
        4.2.1 Standard SFQ Cell Library
        4.2.2 Global Router
        4.2.3 Feedback System
        4.2.4 Detailed Router
    4.3 Experimental Results and Discussions
    4.4 Conclusion
5 Post-Routing Optimization for Working Frequency of Single Flux Quantum Circuits
    5.1 Motivation
    5.2 Post-Routing Optimization Procedure
        5.2.1 Standard SFQ Cell Library
        5.2.2 Machine Learning
        5.2.3 Critical Path Optimization
        5.2.4 Comprehensive Path Rectification
    5.3 Experimental Results and Discussions
    5.4 Conclusion
6 Status, Plans, and Conclusions
    6.1 Summary
    6.2 Future Work
    6.3 Conclusions
References

List of Figures

2.1 Josephson junctions. (a) A schematic of a JJ. (b) An RCSJ model. (c) A symbol of a JJ. (d) A JJ with a resistor.
2.2 An SFQ pulse. (a) A simulated pulse. (b) A graphic representation of an SFQ pulse.
2.3 Illustration of an SFQ cell and representation of bits. (a) OR cell. (b) A signal waveform plot.
2.4 (a) An SFQ D flip-flop. (b) An SFQ splitter.
2.5 Passive transmission line (PTL) [29], where two overdamped junctions are connected by a microstrip line characterized by its line length $L$, its microstrip impedance, the time of flight of the pulse, the speed of transmission $c$, and the bias current $I_b$.
2.6 Josephson transmission line (JTL) [29], where JJs are connected in parallel by superconducting strips of relatively low inductance. The dc current biases each JJ to a pre-critical state ($I_b < I_0$).
2.7 Cell_i, Cell_j, and Cell_k are SFQ logic cells with a clock input, such as AND, OR, or DFF. S stands for a splitter.
2.8 Timing diagram of two connected clocked cells, denoted by Cell_i and Cell_j.
3.1 SFQ circuits with clocked cells. (a) A circuit in the single clock architecture. (b) A circuit in the dual clock architecture. S: splitter; R: register; gray rectangle: NDRO.
3.2 A directed retiming graph, $G_r(V_r, E_r)$. The edges related to the condition on $y(v_2) + y(v_n)$ are marked as red dotted lines in this figure.
4.1 Routing process. (a) A circuit layout. (b) Global routing. (c) Detailed routing.
4.2 Workflows of qGDR.
4.3 (a) A standard cell consisting of clock and logic parts. JTL: Josephson transmission line. (b) A cell layout partitioned by the global router with tiles. (c) A cell layout partitioned by the detailed router with bins.
4.4 Multilayer global routing approach. (a) Routing region decomposition. (b) 2-layer routing graph. (c) Single-layer routing graph construction after compression. (d) Single-layer graph routing. The solid lines are routing wires of nets. (e) Layer assignment from the single-layer graph to the multilayer graph.
4.5 L-shaped routing.
4.6 Rip-up and re-route process.
4.7 Maze routing algorithm. (a) Bounding box formation. b: source vertex. X: blockage. (b) Double fanout wave propagation. (c) Trace back.
4.8 Maze routing algorithm with an expanding box. (a) Wave propagation. b: source vertex. X: blockage. (b) Bounding box expansion. (c) Connection wire commitment.
4.9 The layer assignment. (a) Single-layer global routes. (b) Dynamic programming process. (c) Multilayer global routes.
4.10 Feedback system. (a) Vertical routing track increase. (b) Horizontal routing track increase.
4.11 Detailed routing. (a) Grid graph construction. (b) Routing masks. (c) Detailed routing results.
4.12 Routing results of an 8-bit integer divider. (a) Circuit layout. (b) Via density graph. (c) M1 wire density graph. (d) M3 wire density graph.
4.13 Histogram of the wire length of the 8-bit integer divider.
5.1 Workflows of post-routing optimization.
5.2 Density-based spatial clustering of applications with noise (DBSCAN) algorithm in the machine learning stage.
5.3 Maze routing algorithm. (a) Bounding box formation. The dotted line represents the previous routing wire. b: source vertex. X: blockage. (b) Fanout wave propagation. (c) Trace back.
5.4 Plowing operations.
5.5 Interconnections between clocked cells. (a) An interconnection between two clocked cells. The black triangles represent clockless cells for forwarding clock signals. (b) Two interconnections between pairs of clocked cells.
5.6 Fixing hold time violations by temporary blockage placement.
5.7 Routing results of a 16-bit Kogge-Stone adder. (a) M1 wire density graph using Qrouter. (b) M3 wire density graph using Qrouter. (c) M1 wire density graph using qGDR. (d) M3 wire density graph using qGDR.
5.8 Pie charts of the timing factors of a critical path. (a) 4-bit KSA. (b) 8-bit IntDiv. Values are in picoseconds.
5.9 Execution time of benchmark circuits with different numbers of nets.
5.10 M1 wire density graph after qGDR. (a) 8-bit Mul. (b) 8-bit IntDiv. (c) c499. (d) c880.

List of Tables

3.1 Standard SFQ Cell Library
3.2 Circuit Specification
3.3 Results for Building Fully Path Balanced Circuits
3.4 Results for Building Partially Path Balanced Circuits ( = 1)
3.5 Results for Building Partially Path Balanced Circuits ( = 2)
4.1 Standard SFQ Cell Library
4.2 Circuit Specification
4.3 Routing Results of Kogge-Stone Adders, Array Multipliers, Integer Dividers and C-series Circuits of ISCAS Benchmarks
4.4 Routing Results of Kogge-Stone Adders, Array Multipliers, Integer Dividers and C-series Circuits of ISCAS Benchmarks by Qrouter and qGDR
5.1 Standard SFQ Cell Library
5.2 Circuit Specification
5.3 Routing Results of Kogge-Stone Adders, Array Multipliers, Integer Dividers and C-series Circuits of ISCAS Benchmarks by Qrouter w/o and w/ Post-routing Optimization
5.4 Routing Results of Kogge-Stone Adders, Array Multipliers, Integer Dividers and C-series Circuits of ISCAS Benchmarks by qGDR w/o and w/ Post-routing Optimization
5.5 Timing Analysis of Kogge-Stone Adders, Array Multipliers, Integer Dividers and C-series Circuits of ISCAS Benchmarks

Abstract

Single-flux-quantum (SFQ) circuit technologies are promising digital circuit technologies with high-speed and extremely low-power advantages. With the emergence of large-scale SFQ circuits, it is desirable to develop generalized electronic design automation (EDA) tools to assist designers in maintaining the advantages of SFQ circuits. In this thesis, we present two EDA tools: a post-synthesis tool and a routing tool. The former focuses on power minimization using a generalized retiming transformation, whereas the latter emphasizes efficient wire routing with timing optimization.

Retiming is a circuit transformation whereby registers are relocated to optimize performance, area, or energy consumption; it has reached a high level of maturity in CMOS designs. However, the recent emergence and rapid rise of non-CMOS technologies are introducing new and important variants of the standard retiming problems. We present a generalized retiming transformation, applicable to SFQ designs, where the retiming solution must achieve full path balancing of the circuit while simultaneously minimizing the energy consumption of inserted registers under performance constraints. This optimization problem, which is called a constrained register energy minimization (CREM) problem, is precisely formulated and solved in polynomial time. We further extend the CREM problem formulation to retime the circuit under a dual clocking architecture, which requires only partial (bounded depth difference) path balancing. It is shown that the extended CREM problem is NP-complete.
We thus propose a polynomial-time approximation algorithm with a bounded error to solve this retiming variant.

A unique characteristic of SFQ circuits is that large clock topologies are generally required to pass clock signals to distributed clocked cells, which have a fanout limitation, for signal evaluation and state transition. The large clock topologies and the fanout limitation make it very challenging to control the clock skew and path delays of an SFQ circuit during the routing step. We present an integrated global and detailed router for SFQ circuits, qGDR, which aims at reducing the impedance mismatch during signal transfer by minimizing the total number of used vias. The global router allocates routing resources while minimizing the via usage by using a dynamic layer assignment algorithm. The detailed router follows the global routing results to complete the routing task by resorting to a maze routing algorithm. We further complement qGDR with a novel post-routing optimization framework which shortens paths to improve working frequencies and meanders paths to fix hold time violations. In this framework, machine learning is applied to analyze wire distributions, and a maze routing algorithm is then performed to re-route targeted paths.

The performance of our post-synthesis and routing tools is tested using a library of SFQ logic cells following the MIT-LL SFQ5ee process technology. The results verify that the proposed retiming transformation can significantly reduce register count and register energy consumption. The routing results confirm that qGDR can use only two routing layers to route an 8-bit integer divider with more than 40,000 Josephson junctions in a reasonable time. Moreover, the post-routing optimization can efficiently improve the clock frequency and resolve all hold time violations.

Related Publications

S. N. Shahsavani, T.-R. Lin, A. Shafaei, and M. Pedram, "An Integrated Row-based Cell Placement and Interconnect Synthesis Tool for Large SFQ Logic Circuits," IEEE Transactions on Applied Superconductivity, March 2017.

T.-R. Lin, T. Edwards, and M. Pedram, "qGDR: A Via-Minimization-Oriented Routing Tool for Large-Scale Superconductive Single-Flux-Quantum Circuits," IEEE Transactions on Applied Superconductivity, Oct. 2019.

T.-R. Lin and M. Pedram, "Reducing the Maximum Length of Connections in Single Flux Quantum Circuits During Routing," International Superconductive Electronics Conference (ISEC), Aug. 2019.

T.-R. Lin, Bo Zhang, and M. Pedram, "Post-Routing Optimization of the Working Clock Frequency of Single Flux Quantum Circuits," IEEE Transactions on Applied Superconductivity, June 2020.

[Accepted] T.-R. Lin and M. Pedram, "Retiming for High-performance Superconductive Circuits with Register Energy Minimization," International Conference on Computer Aided Design (ICCAD), 2020.

[Editing] T.-R. Lin and M. Pedram, "A Fast Multi-stage Router for Single Flux Quantum Circuits."

Chapter 1
Introduction

1.1 Status of Single Flux Quantum Circuits

Energy consumption by computational and Internet-related systems is increasing exponentially and is estimated to reach 15% of the total energy consumption in the world [1]. Computational systems based on superconductive electronics (SCE) are non-CMOS technologies that promise energy-efficient computing [2]. The switching devices in SCE are Josephson junctions (JJs), which propagate single-flux-quantum (SFQ) pulses through logic cells on the order of approximately 1 ps and dissipate only approximately $10^{-19}$ J per JJ switching event [3]. Rapid single flux quantum (RSFQ) circuits are capable of reaching approximately 70 GHz [1]. Energy-efficient RSFQ (ERSFQ) [4-7] and efficient single flux quantum (eSFQ) [8, 9] are developed based on RSFQ with different biasing schemes to further reduce the power consumption [10]. As a result, the static power dissipation of an ERSFQ cell is almost zero.
Furthermore, the dynamic energy dissipation per SFQ INV cell is about $10^{-18}$ J, whereas a minimum-size CMOS INV driving another identical INV in an industrial 12 nm FinFET technology consumes about $10^{-15}$ J. In the limit of CMOS scaling, this energy efficiency advantage still stays at 200X or higher [4].

SFQ fabrication technologies have been developed for over two decades, and the number of JJs in SCE circuits has increased from 1,000 to 809,000 [1, 11]. The advancement of the fabrication technologies motivates research on large-scale SFQ circuit designs and on SFQ circuit optimization based on physical characteristics. In CMOS technology development, continued improvements in circuit integration levels and circuit performance have been made possible by the advancement of electronic design automation (EDA) methodologies and tools. It was not until recently that many SCE EDA tools were proposed [12-14]. Commercial EDA tools for CMOS circuits, by contrast, have been constantly refined since 1981, which is marked as the beginning of EDA, and they have since facilitated the progress of large-scale and even very-large-scale-integrated (VLSI) CMOS circuits. The mature EDA tools for CMOS designs, however, cannot be directly applied to SFQ designs because of the peculiar characteristic that every logic gate requires a clock to operate. This characteristic makes enormous clock tree topologies necessary to distribute clock signals to SFQ logic cells, which is rarely seen in CMOS designs.

The physical characteristics of superconductive materials in SFQ designs also motivate researchers to rethink conventional EDA tools. In detail, JJs are the active components in SFQ circuits, whereas transistors are the main components in CMOS circuits. The passive elements of SFQ circuits are inductors, while those of CMOS circuits are capacitors.
Furthermore, the connections between fundamental cells in SFQ circuits are formed either by Josephson transmission lines (JTLs) or passive transmission lines (PTLs) instead of the simple metal wires used in CMOS circuits. The binary representation in SFQ circuits relies on short SFQ pulses instead of constant voltage levels, which encourages researchers to reevaluate possible limitations of conventional EDA tools [15]. Because of the magnetic nature of SFQ pulses, shielding layers or other protection schemes are required to prevent unexpected influences of one cell on another. The protection schemes can impose challenging limitations on EDA tools, such as fewer routing layers. The aforementioned characteristics strongly imply that powerful and specialized SCE EDA tools are needed to facilitate the progress of large-scale SFQ circuits.

1.2 Retiming for Single Flux Quantum Circuits

The logic synthesis step of standard EDA workflows generates logic cells and register cells based on the expected circuit functionality. Standard logic cells in SFQ designs are known as clocked cells because they, unlike CMOS logic cells, need input clock signals to generate data signals at their outputs. Since the input data signals of any clocked cell must arrive within a target clock period, many registers are inserted between pairs of clocked cells to fully balance the path delays of the inputs to any SFQ logic cell. To pursue optimal SFQ designs, we generalize the retiming transformation to tackle a constrained register energy minimization (CREM) problem, whose objective is to build a wave-pipelined circuit with minimal register energy consumption while meeting performance constraints. Many retiming algorithms have been proposed to address similar problems [16-22], but none of them is tailored to the CREM problem. The reasons are as follows:

- Standard logic cells in SFQ designs, unlike CMOS logic cells, generate output signals only after receiving clock signals. However, prior retiming algorithms were developed for standard CMOS designs, in which logic cells produce output signals as soon as input signals arrive. This key difference means that prior retiming algorithms for register energy minimization with performance constraints cannot be applied to SFQ designs directly, because SFQ logic cells are clocked cells instead of clockless cells.

- Due to the limited driving capability of SFQ cells, the output pin of each cell can only be connected to one input pin of another cell. This driving limitation requires splitters, which replicate an input signal and generate the same signal at each of their two output pins. These splitters, unlike standard SFQ cells, do not receive clock signals or perform logic operations but merely repeat their input signals with an extra delay. This characteristic poses a significant challenge for register energy minimization when performance constraints are imposed.

- Prior retiming algorithms can only be performed after initial registers are placed or inserted, either by human effort or by some algorithm. However, the existing algorithms for initial register insertion in SFQ designs generally lack optimality guarantees for the final synthesis results. Thus, the optimization of performance, area, or energy consumption using retiming algorithms could be compromised, since prior retiming algorithms rely on the initial register insertion.

1.3 Routing for Single Flux Quantum Circuits

In standard EDA workflows, the placement step determines the locations of circuit cells on a die surface, and it affects layout routability, circuit performance, and heat distribution [23]. Then, the routing step builds the connection wires of individual nets to interconnect the pins of placed cells or the pads at the chip boundary without violating the design rules provided by chip foundries [12, 23].
Given a circuit, the most important objective of the routing step is to ensure that every net is assigned a physical wire connection. Other common objectives include minimizing the total wirelength after routing and satisfying the required timing budget of each net. The difficulty of solving the routing problem depends on net counts, timing constraints, and available routing resources. Many routing frameworks have been proposed to tackle conventional CMOS designs, but the development of SFQ routing tools is still in its infancy [15]. In particular, the development of a router for SFQ circuits is more challenging for the following reasons.

- The driving limitation of SFQ cells introduces a large number of inserted splitters. The number of inserted splitters is even comparable to the number of other cells in complicated circuits (e.g., array multipliers). The cell count increase due to splitter insertion results in net count growth, which increases the difficulty of the routing task significantly.

- Modern SFQ process technologies provide only a very limited number of routing layers for wire deployment. Specifically, in the MIT-LL SFQ5ee process technology, only two (Nb) metal layers, a bottom layer M1 and a top layer M3, are reserved as routing layers for signal routing. Of the other layers, M5 is used for biasing, whereas M6 is used to implement connections (and inductors) inside the cells. Modern CMOS/FinFET technologies, by contrast, can offer more than 10 routing layers.

- In SFQ combinational logic circuits, clocked cells receive a clock signal, and hence a path delay is defined as a single cell delay plus the path propagation delay. If there are splitters along the path, we have to account for splitter delays plus multiple wire delays. Similar to CMOS circuits, the path propagation delays are the most critical factor in determining working frequencies, but the unique characteristics of clocked cells in SFQ circuits hinder direct implementations of conventional routing optimization algorithms.

1.4 Thesis Contributions

This thesis studies the electronic design automation (EDA) problem of large-scale SFQ circuits. We discuss and provide robust frameworks for post-synthesis optimization and for a routing procedure with timing optimization. Our contributions are summarized as follows.

First, we propose a generalized retiming transformation which integrates register insertion and a conventional retiming transformation to optimally solve a constrained register energy minimization (CREM) problem. We abstract an arbitrary synthesized circuit without path-balancing registers as a directed graph model and then formulate the CREM problem as an integer linear programming (ILP) problem based on the graph model. Through a dual problem transformation, the formulated ILP problem is shown to correspond to the well-known minimum cost flow problem, which admits polynomial-time algorithms.

Second, we extend the CREM problem to build an arbitrary circuit under an advanced dual clocking architecture with an imbalance bound. We prove the extended problem to be NP-complete by reducing the vertex cover problem to it in polynomial time and space. Thus, we present a polynomial-time approximation algorithm with a proven bounded error to efficiently solve the extended CREM problem. On 14 benchmark circuits, our experimental results demonstrate that our approach reduces the register count by 38% and the register energy consumption by 50% on average compared to the prior work [24]. Moreover, the solution acquired by our approximation algorithm is on average only 1.08X away from the optimal solution.

Third, we develop a via-minimization-oriented routing tool, qGDR, for large-scale SFQ circuit designs.
qGDR integrates a global router and a detailed router to accomplish the full-chip routing task. The global router cuts the routing regions of circuits into tiles and assigns loose (global) tile-to-tile paths to all global nets while attempting to minimize the number of used vias. Following the global routing results, the detailed router generates detailed wires and vias for each connection path of individual nets. A feedback system is embedded in qGDR to automatically increase the number of available routing tracks until all nets are routed.

Fourth, we demonstrate a post-routing optimization framework which reduces the path delay of the critical paths of a large-scale SFQ circuit while controlling the clock skew. Furthermore, all hold time violations are resolved by meandering routing paths. The whole optimization framework consists of machine learning, critical path optimization, and comprehensive path rectification. Machine learning inspects local wire deployments and conducts a full-chip wire distribution analysis. Equipped with the distribution information, critical paths are identified and then re-routed with alternative short wires during critical path optimization. Last but not least, the comprehensive path rectification performs a rip-up and re-route process to increase clock skew for working frequencies and to build detour paths for hold time violations.

Fifth, the robustness of qGDR with post-routing optimization is tested on 13 SFQ circuits, including Kogge-Stone adders (KSA), array multipliers (Mul), integer dividers (IntDiv), and some of the ISCAS c-series benchmarks. The developed routing tool is capable of generating a compact layout for each circuit within 0.5 CPU hours. Afterward, the post-routing optimization improves the working frequency by 7% on average and resolves all hold time violations of the qGDR routing results.
Finally, qGDR with post-routing optimization can generate large-scale SFQ circuits with satisfactory working frequencies and without any hold time violations.

While this thesis focuses on the discussion and the development of SFQ routing tools, there are many challenging research topics specific to SFQ technologies. The recent progress of SFQ technologies enables the fabrication of complicated multilayer structures, which require powerful 3D inductance extraction tools for reliable impedance and timing analyses [25]. The thermal issues of integrated SFQ systems as commercial products motivate studies on efficient cryocooling systems or similar cooling systems [26]. Last but not least, a complete computer built with SFQ technologies is impossible without SFQ storage elements. Nevertheless, the development of SFQ storage elements based on standard SFQ technologies is still slow-paced compared to that of SFQ computing elements. Thus, more research efforts on SFQ storage elements are highly anticipated. We also note that, beyond the router proposed in this thesis, other electronic design automation (EDA) tools for SFQ circuits are still lacking [13]. Breakthroughs in SFQ designs can be expected once complete and advanced EDA tools are available.

1.5 Thesis Organization

The thesis is organized as follows. Chapter 2 introduces background knowledge about SFQ technologies, SFQ communications, and the differences between SFQ designs and CMOS designs. Chapter 3 elaborates a generalized retiming transformation which builds high-performance SFQ circuits with minimal register energy consumption. Chapter 4 specifies a two-stage approach of global routing followed by detailed routing. Chapter 5 details a post-routing optimization framework which refines the two-stage approach to improve working frequencies. Chapter 6 concludes our work and describes a research topic related to timing optimization as future work.
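As a rough illustration of the register and splitter overheads discussed in Sections 1.2 and 1.3, the sketch below counts what a naive, level-based balancing would insert into a small netlist. It is not the CREM formulation of Chapter 3. The function names, the toy netlist, and the choice of as-soon-as-possible logic levels are assumptions made only for illustration. Primary inputs are treated as level 0, and register sharing between fanout branches is ignored.

```python
from collections import defaultdict, deque

def logic_levels(nodes, edges):
    """Longest-path (ASAP) level of every node in a DAG; sources sit at level 0."""
    succ = defaultdict(list)
    indeg = {n: 0 for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    level = {n: 0 for n in nodes}
    queue = deque(n for n in nodes if indeg[n] == 0)
    while queue:  # Kahn-style topological traversal
        u = queue.popleft()
        for v in succ[u]:
            level[v] = max(level[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return level

def balancing_overhead(nodes, edges):
    """Return (registers, splitters) for a naive full path balancing.

    An edge that skips k levels receives k registers (hypothetical policy,
    no sharing). A driver with fanout f needs f - 1 two-output splitters.
    """
    level = logic_levels(nodes, edges)
    registers = sum(level[v] - level[u] - 1 for u, v in edges)
    fanout = defaultdict(int)
    for u, _ in edges:
        fanout[u] += 1
    splitters = sum(f - 1 for f in fanout.values())
    return registers, splitters

# Toy netlist (hypothetical): g1 = AND(a, b), g2 = OR(g1, a)
nodes = ["a", "b", "g1", "g2"]
edges = [("a", "g1"), ("b", "g1"), ("g1", "g2"), ("a", "g2")]
registers, splitters = balancing_overhead(nodes, edges)
print(registers, splitters)
```

In this toy netlist, the edge from `a` to `g2` skips one level and `a` drives two pins, so each overhead appears once. Even this crude count suggests why register energy and net count become first-order concerns as circuits grow.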
Chapter 2

Background

This chapter introduces the basic concepts of single flux quantum (SFQ) designs as well as common academic terms, which are necessary for understanding the context of this thesis. We start with a brief introduction of superconductive electronics and SFQ pulses. We then shed light on SFQ communication mechanisms between SFQ cells in circuits. The last part of this chapter covers other unique characteristics of SFQ designs to help readers explore other research areas.

2.1 Single Flux Quantum Technology

Some materials exhibit zero electrical resistance and expel magnetic fields when they are cooled below a characteristic critical temperature. This combination of zero electrical resistance and magnetic field expulsion is termed superconductivity [27, 28]. The field expulsion is known as the Meissner effect, which states that any sufficiently weak magnetic field is ejected from the interior of the material. The presence of the Meissner effect implies that a superconductor is not merely a perfect conductor, because its interior magnetic field cannot take a nonzero value.

2.1.1 Josephson Junctions

The active component of superconductive electronics is a two-terminal Josephson junction (JJ), shown in Figure 2.1. Figure 2.1(a) depicts a JJ, which is formed by sandwiching an insulator between two superconductors. The material most commonly used for the superconductors is niobium (Nb), with atomic number 41. The superconductors in this thesis are Nb-based unless otherwise specified.

Figure 2.1: Josephson junctions. (a) A schematic of a JJ. (b) An RCSJ model. (c) A symbol of a JJ. (d) A JJ with a resistor.

Figure 2.1(b) shows the resistively and capacitively shunted junction (RCSJ) model of a JJ with a bias current $I_b$.
There are three types of currents passing through different elements in the RCSJ model: a Josephson current (Josephson junction), a normal current (resistance, $R$), and a displacement current (capacitance, $C$). The Josephson junction serves as a non-linear current source of a JJ, which is specified by two relations

$$J_s(\varphi) = J_c \sin\varphi \tag{2.1}$$

$$\frac{d\varphi}{dt} = \frac{4\pi e}{h} V(t) = \frac{2\pi}{\Phi_0} V(t) \tag{2.2}$$

where $J_s$ is the supercurrent density; $J_c$ is the critical current density ($I_c$ is the corresponding critical current); $\varphi$ is the phase across the JJ; $h$ is the Planck constant; $e$ is the elementary charge; $V(t)$ is the voltage across the junction; and $\Phi_0 = h/2e$ is a single quantum of superconducting flux. Equations 2.1 and 2.2 are also known as the current-phase relation and the voltage-phase relation, respectively. The resistance and the capacitance in the RCSJ model act as parasitic electrical elements of a JJ. $I_b$ is nonzero if current-biased Josephson junctions are implemented in SFQ designs. The electrical symbol of a JJ is shown in Figure 2.1(c), where $I_c$ represents the critical current of the JJ.

If a JJ is provided with a dc input current $i$, we can derive the following equation based on the RCSJ model

$$i = I_c \sin\varphi + \frac{V(t)}{R} + C\frac{dV(t)}{dt} = I_c \sin\varphi + \frac{\Phi_0}{2\pi R}\frac{d\varphi}{dt} + C\frac{\Phi_0}{2\pi}\frac{d^2\varphi}{dt^2}. \tag{2.3}$$

A Josephson time constant $\tau_J$ and a Stewart-McCumber parameter $\beta_c$ are defined as

$$\tau_J = \frac{\Phi_0}{2\pi I_c R} \tag{2.4}$$

$$\beta_c = \frac{RC}{\tau_J} = \frac{2\pi I_c R^2 C}{\Phi_0}. \tag{2.5}$$

If the Stewart-McCumber parameter is much larger than 1, the JJ is termed underdamped, and we need to periodically reset the JJ to remove the hysteresis [29]. To avoid the periodic reset, we can connect a small resistor in parallel with the JJ, which steers current away from the JJ, as shown in Figure 2.1(d). As a result, the Stewart-McCumber parameter is around 1 or much smaller than 1, and such JJs are termed overdamped. Please note that all SFQ cells and SFQ circuits in this thesis are built with overdamped JJs.
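As a rough numerical illustration of Equations 2.4 and 2.5, the following sketch evaluates the Josephson time constant and the Stewart-McCumber parameter and classifies the damping regime. The junction values (critical current, capacitance, and the two resistances) are hypothetical and chosen only for illustration, and the hard threshold at 1 is a simplification of the "much larger than 1" versus "around 1 or smaller" wording above.

```python
import math

PHI_0 = 2.07e-15  # magnetic flux quantum h/(2e) in Wb (rounded value)


def josephson_time_constant(i_c, r):
    """tau_J = Phi_0 / (2*pi*I_c*R), Equation 2.4 (SI units)."""
    return PHI_0 / (2.0 * math.pi * i_c * r)


def stewart_mccumber(i_c, r, c):
    """beta_c = R*C / tau_J = 2*pi*I_c*R^2*C / Phi_0, Equation 2.5."""
    return r * c / josephson_time_constant(i_c, r)


def damping_regime(beta_c):
    # Simplified split at beta_c = 1 (an assumption of this sketch).
    return "underdamped" if beta_c > 1.0 else "overdamped"


# Hypothetical junction: I_c = 150 uA, C = 0.1 pF.
i_c, c = 150e-6, 0.1e-12
for r_parallel in (50.0, 2.0):  # effective parallel resistance in ohms
    beta = stewart_mccumber(i_c, r_parallel, c)
    print(f"R = {r_parallel:4.1f} ohm: beta_c = {beta:8.3f} ({damping_regime(beta)})")
```

With these assumed values, lowering the parallel resistance moves the junction from the underdamped to the overdamped regime, which is the effect of the shunt resistor in Figure 2.1(d).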
2.1.2 Single Flux Quantum Pulses

SFQ digital circuits are built with JJs, which can generate single flux quantum (SFQ) pulses for signal communications. In SFQ digital circuits, a binary digit is defined by the quantized area of picosecond SFQ voltage pulses $V(t)$ [29]. In contrast, a binary digit in CMOS digital circuits is represented by a dc voltage level. An SFQ pulse is illustrated in Figure 2.2(a), and the area of $V(t)$ of an SFQ pulse satisfies

$$\int V(t)\,dt = \Phi_0 = \frac{h}{2e} \approx 2.07\ \text{mV}\cdot\text{ps}. \tag{2.6}$$

This SFQ pulse can be naturally generated, reproduced, amplified, and memorized by SFQ cells, and these cells, which provide the same functionality as CMOS cells, are used to build SFQ digital circuits. For simplicity, we use Figure 2.2(b) as a graphic representation of an SFQ pulse.

Figure 2.2: An SFQ pulse. (a) A simulated pulse. (b) A graphic representation of an SFQ pulse.

Since the SFQ pulse is a voltage signal within a short interval instead of a constant voltage held for a period of time, we need to revisit the definition of a binary signal in SFQ circuits. According to prior work [29], an SFQ pulse received at a receiving terminal during a clock period is identified as a binary '1', whereas the absence of an SFQ pulse at the terminal during a period is recognized as a binary '0'. Figure 2.3 illustrates an example of an SFQ OR cell in which one input terminal IN1 receives a '1' while the other input terminal IN2 receives a '0' in a clock period $T$. With these input signals, the OR cell generates another SFQ pulse as an output signal '1' in the next period. More timing details are discussed later.

Figure 2.3: Illustration of an SFQ cell and representation of bits. (a) OR cell. (b) A signal waveform plot.

2.2 Single Flux Quantum Communications

Communications between superconductive cells rely on SFQ pulses and interconnection lines between cells. The first part of this section describes the working mechanism of SFQ cells given different SFQ input pulses.
The second part characterizes the interconnection lines between a pair of SFQ cells, and the last part specifies the necessary timing conditions for successful SFQ communications between cells.

2.2.1 Basic SFQ Cells

The uniqueness of SFQ cells is that all of these cells except an SFQ splitter receive a clock signal to evaluate the current memory state in each clock period. The SFQ cells with a clock input are known as clocked cells, whereas the SFQ cells without a clock input are termed clockless cells. The SFQ cells generally have the structure of an inductive storage loop. The structure is the celebrated superconducting quantum interferometer (also known as DC SQUID) [29], and it is used as a memory element to conserve a current loop circulating in either a clockwise or a counterclockwise direction. We use an SFQ D flip-flop (DFF) to explain the working mechanism of the interferometer.

Figure 2.4(a) depicts a DFF with an interferometer which is formed by $J_1$, $J_2$, and $L_1$. $J_0$ and $J_3$ in Figure 2.4(a) are used to steer pulse propagation. There are two stable memory states of the DFF, a logic '0' and a logic '1'. A logic '1' is held by a clockwise current loop, while a logic '0' is held by a counterclockwise current loop. If an SFQ pulse arrives at the input of the DFF, its arrival at $J_1$ changes the state to logic '1' and the current loop circulates clockwise. When the state is '1', the arrival of a CLK pulse causes $J_2$ to leap. Otherwise, the arrival of a CLK pulse causes $J_3$ to leap. The former generates an output SFQ pulse while the latter generates no SFQ pulse. While the illustrated DFF consists of only one interferometer, there could be multiple interferometers in an SFQ cell. Please note that the current loop keeps circulating during the operation until the memory state of the cell is evaluated by an SFQ clock signal. Thus, each clocked cell can be functionally viewed as a combination of a logic cell and a register cell.
The state evaluation, however, is a destructive process (i.e., the state is reset to '0') for general SFQ cells, and these SFQ cells are called destructive read-out (DRO) cells. The peculiar working mechanisms of the interferometer lead to the fundamental differences between SFQ designs and CMOS designs [30].

Figure 2.4: (a) An SFQ D flip-flop. (b) An SFQ splitter.

Figure 2.4(b) illustrates an SFQ splitter, which reproduces an input SFQ pulse at each of its two (or even more) outputs without decreasing the pulse amplitude. The critical currents of the JJs at the outputs are larger than that at the input (i.e., $I_{c1} < I_{c2}$ and $I_{c1} < I_{c3}$), while the inductances at the outputs are smaller than that at the input (i.e., $L_1 > L_2$ and $L_1 > L_3$) [29]. The splitter is required in SFQ designs because the fanout of general SFQ cells is only one due to a driving ability limitation [15]. This limitation is one of the key differences between SFQ circuits and CMOS circuits. Since the splitters in SFQ circuits are added for fanout requirements, they are clockless cells by default.

2.2.2 Transfer of Single Flux Quantum Pulses

There are two types of transmission lines interconnecting pairs of SFQ cells for signal communications: passive transmission lines (PTL) and active Josephson transmission lines (JTL). Figure 2.5 depicts a PTL where two JJs are connected by a microstrip line. Owing to the low loss and low dispersion of the microstrip line, PTLs transfer an SFQ pulse ballistically with a speed of $1.25 \times 10^8$ m/s (roughly 40% of the speed of light in vacuum) [31], so the propagation delay between SFQ cells scales linearly with the PTL length. In contrast, the delay of wires in CMOS circuits is quadratically proportional to the wire length because wire connections are usually modeled as RC or RLC networks. The main disadvantage of utilizing PTLs is that a PTL driver and a PTL receiver are required at the launch end and the capture end of each PTL, respectively [15].
Thus, extra chip area and power consumption are expected. Moreover, the impedance of the PTL drivers and the PTL receivers has to be consistent with the impedance of the connected PTLs to ensure low signal loss and no dispersion during signal transmission.

Figure 2.5: Passive transmission line (PTL) [29] where two overdamped junctions are connected by a microstrip line. (a) Schematic. (b) Simulation. The figure labels the line length $L$, the microstrip impedance, the time of flight of the pulse, the speed of transmission $c$, and the bias current $I_b$.

Figure 2.6 shows a JTL where multiple active JJs are joined in parallel by low-inductance strips with a matched inductance $L \approx \Phi_0 / I_c$. These JJs are all biased in the pre-critical state ($I_b < I_c$) with a dc current. In contrast to PTLs, the constant dc current in JTLs represents an additional power overhead of SFQ designs. To transfer an SFQ pulse, a $2\pi$-leap in J1 is triggered by an arriving pulse from input A, and the triggered pulse then causes another $2\pi$-leap in J2. The phase leap continues from input A to output B.

Figure 2.6: Josephson transmission line (JTL) [29] where JJs are connected in parallel by superconducting strips of a relatively low inductance. (a) Schematic. (b) Simulation. The dc current biases each JJ to the pre-critical state ($I_b < I_0$).

2.2.3 Path Balanced SFQ Circuits

In general, all data paths between logic cells of a circuit need to be synchronized to generate correct output signals. However, SFQ designs, unlike CMOS designs, do not consist of a cloud of combinational clockless logic cells between two registers. The reason is that clock pulses are fed to each SFQ logic cell to ensure correct operation. Thus, each logic cell along a data path requires the arrival times of all its input signals to fall within a time interval set by a clock or a synchronization signal. One well-known way to synchronize all data paths is to realize path balancing [32, 33].
The first step of path balancing is to define the logic level of a clocked cell as the length of the longest path, in terms of clocked cell count, from a primary input to the clocked cell [32]. If there is a logic level difference between the inputs of a clocked cell, registers are repeatedly inserted between the input and the clocked cell until the logic level difference is zero. Please note that registers and DFFs are interchangeable in this thesis. The register insertion continues until all logic level differences of the circuit are zero. We term an SFQ circuit with no logic level difference a fully path balanced (FPB) circuit. The FPB architecture, however, implies large energy overheads because of the enormous number of inserted registers. Moreover, the routing difficulty increases with the inserted registers because the additional registers with clock inputs could significantly change the whole clock signal distribution. More details about FPB circuits and other advanced architectures are studied in Chapter 3.

2.2.4 SFQ Timing Fundamentals

High working frequencies of tens to hundreds of gigahertz have been the foremost advantage of SFQ circuits over CMOS circuits [34]. This advantage is expected to persist after experts offload heavy physical design tasks to emerging SFQ EDA tools for large-scale SFQ circuits [13, 35]. The timing behavior of SFQ circuits is principally governed by the same rules and constraints as that of CMOS circuits [34]. This subsection elaborates on the necessary timing fundamentals of SFQ circuits with a single-phase synchronous clocking scheme, which is the dominant design scheme due to its simplicity. These fundamentals can be extended to multi-phase clocking schemes or even asynchronous schemes through clock scheme partitions. More details can be found in [34].

Successful data exchanges between a pair of connected clocked SFQ cells must satisfy two conditions: (1) a setup time condition and (2) a hold time condition.
We explain and derive the conditions using Figure 2.7 and Figure 2.8. In Figure 2.7, a pair of clocked cells are connected by a data path, and clock signals are forwarded from a global clock input to these connected cells. Figure 2.8 illustrates a timing diagram that represents the SFQ pulses in Figure 2.7 in the time domain. Please note that an interconnection between two cells in an SFQ circuit can consist of multiple JTLs/PTLs and splitters for fanout requirements.

Figure 2.7: Cell i, Cell j, and Cell k are SFQ logic cells with a clock input such as AND, OR, or DFF. S stands for a splitter.

Figure 2.8: Timing diagram of two connected clocked cells, denoted by Cell i and Cell j.

Referring to Figure 2.8, a clock period $T$ denotes the time interval between two consecutive clock pulses, and the interval is measured between the peaks of the two pulses. We define clock skew as the time difference of receiving the clock signal between two cells. The clock skew $\mathrm{skew}_{i,j}$ between a clocked cell $i$ and a clocked cell $j$ in Figure 2.7 is expressed as

$$\mathrm{skew}_{i,j} = t^{clk}_j - t^{clk}_i, \tag{2.7}$$

where $t^{clk}_i$ ($t^{clk}_j$) is the arrival time of the clock signal at the clock input of cell $i$ ($j$) from the primary clock input. A data path delay denotes the pulse propagation time from the first clocked cell of a path to the last clocked cell, and it includes the clock-to-q delay of the first cell and the cell delays of the splitters along the path. We term the first cell the launch cell and the last cell the capture cell in this thesis. In Figure 2.7, the data path delay $\delta_{i,j}$ of forwarding a signal from a cell $i$ to a cell $j$ through $N_{i,j}$ splitters is given by

$$\delta_{i,j} = t^{c2q}_i + N_{i,j}\, t^{splitter} + \sum_{k=1}^{N_{i,j}+1} t^{wire}_k, \tag{2.8}$$

where $t^{c2q}_i$ is the clock-to-q delay of cell $i$; $t^{splitter}$ is the cell delay of a clockless splitter; and $t^{wire}_k$ is the wire propagation delay of the $k$-th JTL/PTL segment of the path.
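Equation 2.8 can be evaluated with a few lines of code. The sketch below is a minimal illustration, assuming that every wire segment is a PTL whose delay is its length divided by the $1.25 \times 10^8$ m/s pulse speed quoted in Section 2.2.2, and that PTL driver and receiver delays are ignored. The segment lengths, the 6.8 ps clock-to-q delay, and the 5.7 ps splitter delay are illustrative inputs, and the function names are hypothetical.

```python
PTL_SPEED_M_PER_S = 1.25e8  # ballistic PTL pulse speed quoted in Section 2.2.2


def ptl_delay_ps(length_um):
    """Delay of one PTL segment in ps; linear in its length (given in micrometres)."""
    return length_um * 1e-6 / PTL_SPEED_M_PER_S * 1e12


def data_path_delay_ps(t_c2q_ps, t_splitter_ps, wire_delays_ps):
    """Equation 2.8: clock-to-q delay + N splitter delays + (N + 1) wire delays."""
    n_splitters = len(wire_delays_ps) - 1
    if n_splitters < 0:
        raise ValueError("a data path needs at least one wire segment")
    return t_c2q_ps + n_splitters * t_splitter_ps + sum(wire_delays_ps)


# Hypothetical path: launch cell, wire, splitter, wire, splitter, wire, capture cell.
wires = [ptl_delay_ps(length) for length in (500.0, 250.0, 1000.0)]
delay = data_path_delay_ps(t_c2q_ps=6.8, t_splitter_ps=5.7, wire_delays_ps=wires)
print(f"wire delays: {[round(w, 2) for w in wires]} ps, path delay: {delay:.1f} ps")
```

Passing the wire delays as a list keeps the splitter count tied to the number of segments, which mirrors the $N_{i,j}$ and $N_{i,j}+1$ bounds in Equation 2.8.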
With the clock skew $\mathrm{skew}_{i,j}$ of Equation 2.7 and the data path delay $\delta_{i,j}$ of Equation 2.8, we can specify the setup and hold timing conditions [34, 35]

$$\text{Setup:}\quad \mathrm{skew}_{i,j} + \delta_{i,j} + t^{setup}_j \le T \tag{2.9}$$

$$\text{Hold:}\quad hold_j \le \mathrm{skew}_{i,j} + \delta_{i,j} \tag{2.10}$$

where $t^{setup}_j$ is the setup time of cell $j$ and $hold_j$ is the hold time of cell $j$. (Footnote 1: Both setup and hold time conditions are derived for nominal conditions but can be generalized if path delay variances or other factors are considered.) These two conditions apply to any data path with cells $i$ and $j$ as the launch cell and the capture cell of the path. The first condition ensures no signal loss by capture cells during operation, whereas the second condition ensures signal integrity of output pulses from launch cells. Referring to Figure 2.8, the SFQ pulse from Cell i in the fourth row should arrive ahead of the gray time interval of Cell j in the fifth row. As we can see, the characteristics of clocked cells and clockless cells in SFQ circuits revise the details of both conditions, which encourages the development of new timing optimization strategies rather than the direct utilization of conventional CMOS strategies.

The maximum working frequency of an SFQ circuit is the inverse of the minimum clock period $T_{min}$, which is given by

$$T_{min} = \max_{i,j}\left(\mathrm{skew}_{i,j} + \delta_{i,j} + t^{setup}_j\right). \tag{2.11}$$

This equation indicates that the maximum working frequency is limited by the paths with the worst setup time condition. These paths are known as critical paths, including critical data paths related to data path delays and critical clock paths related to clock skew. A net refers to a direct connection between two SFQ cells, and there can be multiple nets included in a path, as shown in Figure 2.7. A physical interconnection of a net in an SFQ circuit is built with a physical wire.

The hold time condition specified in Equation 2.10 prevents signal race errors during circuit operation. A signal race error happens when the pulse fired from the launch cell undermines the pulse from the capture cell within one clock period.
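The conditions in Equations 2.9 to 2.11 can be checked mechanically for a list of paths. The following sketch is a minimal illustration that follows the sign convention of those equations as written; the `Path` record, the function names, and all numerical values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Path:
    skew_ps: float   # clock skew between launch and capture cell (Equation 2.7)
    delay_ps: float  # data path delay (Equation 2.8)
    setup_ps: float  # setup time of the capture cell
    hold_ps: float   # hold time of the capture cell


def meets_setup(p, period_ps):
    """Equation 2.9: skew + delay + setup <= T."""
    return p.skew_ps + p.delay_ps + p.setup_ps <= period_ps


def meets_hold(p):
    """Equation 2.10: hold <= skew + delay."""
    return p.hold_ps <= p.skew_ps + p.delay_ps


def min_clock_period_ps(paths):
    """Equation 2.11: the worst setup requirement over all paths."""
    return max(p.skew_ps + p.delay_ps + p.setup_ps for p in paths)


# Three hypothetical launch/capture pairs (all values in ps).
paths = [
    Path(skew_ps=2.0, delay_ps=20.0, setup_ps=3.0, hold_ps=4.0),
    Path(skew_ps=-1.0, delay_ps=32.0, setup_ps=3.0, hold_ps=4.0),
    Path(skew_ps=-5.0, delay_ps=8.0, setup_ps=3.0, hold_ps=4.0),
]
t_min = min_clock_period_ps(paths)
print(f"T_min = {t_min:.1f} ps, f_max = {1e3 / t_min:.2f} GHz")
print("hold satisfied:", [meets_hold(p) for p in paths])
```

In this made-up data set the second path sets the minimum clock period, while the third path is short enough to break the hold condition even though it easily meets setup.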
Referring to Figure 2.8, a signal race error arises when the arrival time of the SFQ pulse from Cell i in the fourth row is within the left part of the gray time interval of Cell j in the fifth row. To quantify the hold time violations, we define the hold slack of a path with data path delay $\delta_{i,j}$ as

$$\mathrm{Slack}^{hold}_{i,j} = \mathrm{skew}_{i,j} + \delta_{i,j} - hold_j. \tag{2.12}$$

If the hold slack is no less than zero, the race error will not happen. Otherwise, the signal integrity of this path may be severely compromised during operation. Timing margins can be added to the right side of both Equation 2.11 and Equation 2.12 if more factors such as process variations [36] are considered.

2.3 Unique Characteristics of SFQ Designs

SFQ designs have diverse characteristics that distinguish them from CMOS designs, and these characteristics encourage researchers to reevaluate the well-developed circuit design workflow, especially electronic design automation (EDA) tools. Some characteristics have been discussed in previous sections, such as binary bit representation, destructive read-out, splitter insertion, and timing fundamentals. This section introduces more characteristics to help readers gain a better understanding of SFQ circuits.

2.3.1 Fabrication

SFQ fabrication technologies are distinguished by the critical current density ($J_c$) of JJs. For example, the critical current density of SFQ5ee is 100 MA/m$^2$ and that of SFQ5hs is 200 MA/m$^2$ [37]. Moreover, there are different types of fabrication layers for SFQ circuit manufacture. Niobium (Nb) metal layers are reserved for inductor implementations and PTL deployment, and a silicon dioxide layer is used between metal layers for isolation. A Josephson junction (Nb/AlOx/Nb) requires a layer of aluminum oxide as a special layer. More details can be found in [38].

SFQ circuit fabrication technologies have been developed for two decades, and development has been fast-paced in recent years.
From 2014 to 2018, MIT Lincoln Laboratory (MIT-LL) released 5 new fabrication technologies, from SFQ3ee to SFQ7ee [37]. Specifically, the line width of the critical layers has been reduced from 500 nm to 250 nm and the number of available Nb metal layers has increased from 4 to 10. Other well-known SFQ foundries such as Hypres, Inc., Advanced Industrial Science and Technology (AIST), and D-Wave Systems, Inc. are also making progress at a fast pace [39-42].

2.3.2 Cooling Systems

Superconducting electronic systems consisting of SFQ circuits are immersed in a boiling bath of liquid helium during circuit operation. It is known that the critical temperature of niobium for SFQ wires is 9.3 K and the temperature of liquid helium is 4 K. During laboratory experiments, the liquid helium is stored in a tank, and the superconducting systems immersed in the liquid helium include not only SFQ circuits but also long cryoprobes and large numbers of wires and cables for biasing and signal input/output [26]. In order to commercialize superconducting electronic systems, the superconducting electronics industry developed compact, user-friendly cryocoolers in which electrical leads and interface electronics are integrated with the superconducting electronic systems. To the best of our knowledge, the thermal constraints of cryocoolers are still under discussion [26]. Regardless of the type of cooling system, the cost of cooling system maintenance is not a negligible factor when researchers claim the advantages of SFQ circuits over conventional VLSI circuits [4].

2.3.3 Clock Net Topology

SFQ designs have a large number of clock net connections because each clocked cell receives a clock signal to evaluate received inputs and then generate an output. In a large-scale SFQ circuit, register insertion or other techniques for input signal synchronization usually add extra clock net connections. A clock net
A clock net 28 topology determines how clock signals are delivered from the primary input to clocked cells. A well-designed clock topology can even control or optimize the clock skew. However, not every well-designed clock topology for CMOS circuits is applicable to SFQ circuits due to common small data path delays and large clock skew of SFQ circuits. Specically, the value of clock skew in CMOS circuits is generally much smaller than that of most data path delays because there are many logic cells with no clock input along most paths. In contrast, the clock skew in SFQ circuits is comparable or even larger than that of most data path delays because SFQ logic cells are clocked cells. Moreover, the clock waveform strays away from the ideal waveform due to parasitic eects. To cope with the above issues, some clock net topologies are proposed for SFQ circuits including H-tree, concurrent- ow, counter- ow and clock-follow-data [34]. 2.3.4 Power Consumption There are two types of power consumption: dynamic power consumption and static power consumption. The dynamic power consumption of SFQ circuits depends on SFQ evaluations and communications using the single ux quantum ( 2:07 10 15 Js=C [4]). In detail, dynamic power consumption is contributed by JJs with a 2-phase change and is formulated by P D = I b f, where I b denotes the bias current through a JJ andf represents the switching frequency. Generally, the JJs in an SFQ circuit enjoy the critical current of 150 A and is biased at 70-75% 29 of the I c for the circuit operation. If we envision a 32-bit Kogge-Stone adder with 20; 000 JJs working at 20 GHz, its dynamic power consumption is about 46.57 W. The static power consumption depends on SFQ circuit designs. In the rapid single ux quantum (RSFQ) design, the JJs are biased with resistors and the static power consumption is calculated by P S = I b V b , where V b is the bias volt- age across a JJ. Typically, the bias voltage V b in RSFQ circuits is set around 2.6 mV. 
If we implement a 32-bit Kogge-Stone adder in the RSFQ architecture, the static power consumption is about 5.46 mW ($\gg$ 46.57 μW). The power consumption of RSFQ circuits with bias resistors is obviously dominated by the static power consumption, whereas the static power of CMOS circuits is conventionally negligible. The power consumption of RSFQ circuits even shows quadratic growth when the area with embedded JJs increases. To cope with the large power consumption, an inductive bias network has been proposed to replace the resistor-biased network in RSFQ circuits. The SFQ circuits with the inductive bias network are known as energy-efficient RSFQ (ERSFQ) circuits, but they are data dependent and require re-balancing [4]. Efficient single flux quantum (eSFQ) circuits are developed using a passive network that feeds the bias current to the inductive network through clocked interferometers. Both ERSFQ and eSFQ achieve zero static power consumption after removing the bias resistors. To the best of our knowledge, many SFQ circuit designs and optimizations for power consumption minimization are still ongoing.

Chapter 3

Path-Balancing Retiming for High-performance Single Flux Quantum Circuits

Retiming is a circuit transformation in which registers are relocated for the optimization of performance, area, or energy consumption in such a way that the functional behavior of the circuit remains the same [16]. Over the past decades, retiming has attained a significant level of maturity in CMOS designs for diverse applications including but not limited to testability [43, 44], logic re-synthesis [45, 46], circuit partitioning [47, 48], and physical planning [49]. However, with the approaching end of Moore's law, researchers have started exploring promising non-CMOS technologies for building high-performance and energy-efficient systems. The circuits built with non-CMOS technologies generally encompass an enormous number of registers or buffer cells due to the physical limitations of fundamental devices [4, 50, 51]. This unique characteristic motivates researchers to reevaluate the possibility of generalizing or even advancing retiming for rapidly evolving non-CMOS applications [21, 24].

This chapter describes a path-balancing retiming transformation which integrates register insertions and a conventional retiming transformation to optimally solve the constrained register energy minimization (CREM) problem. We abstract an arbitrary synthesized circuit without path-balancing registers as a directed graph model and then formulate the CREM problem as an integer linear programming (ILP) problem based on the graph model. The formulated ILP problem is shown to correspond, through dual problem transformation, to a well-known minimum cost flow problem with polynomial-time algorithms. Next, we extend the CREM problem for building an arbitrary circuit under an advanced dual clocking architecture with an imbalance bound (i.e., converging paths should be path balanced only within the said imbalance bound; we can think of this factor as a positive slack on relative sequential depths of the said paths). We refer to such circuits as partially path balanced (PPB) circuits, and we denote this problem as the PPB-CREM problem. We prove the extended problem to be NP-complete by reducing the vertex cover problem to it in polynomial time and space. Thus, we present a polynomial-time approximation algorithm with a proven bounded error to efficiently solve the PPB-CREM problem. Given 14 benchmark circuits, our experimental results demonstrate that our approach reduces the register count by 38% and the register energy consumption by 50% on average compared to the prior work [24]. Moreover, the solution acquired by our approximation algorithm is on average only 1.08X away from the optimal solution.

The remainder of this chapter is organized as follows.
Section 1 discusses the motivation for the development of generalized retiming; Section 2 provides background on SFQ circuits including design architectures; Section 3 describes system graph models; Section 4 and Section 5 detail path-balancing retiming for SFQ designs; Section 6 provides experimental results; Section 7 concludes.

3.1 Motivation

Standard logic cells in SFQ designs are known as clocked cells because they, unlike CMOS logic cells, need input clock signals to generate output data signals. Since the input data signals of any clocked cell must arrive within a clock period window (input signals that arrive in previous clock periods will be "consumed and forgotten" by the receiving cell), many registers are needed in an SFQ circuit to fully balance the path delays of the inputs to any SFQ logic cell. An SFQ circuit built with clocked cells and path-balancing registers may be considered a fully wave-pipelined circuit with a large number of pipeline stages. Taking an 8-bit integer divider as an example, there are more than 70 pipeline stages even after logic synthesis optimization [21]. The wave-pipelined architecture with a large number of pipeline stages is also prevalent in circuit designs built with cutting-edge nanotechnologies [50, 51]. Wave-pipelined SFQ circuits encourage researchers to review the conventional CMOS retiming transformation because the enormous number of balancing registers results in energy overheads and performance limitations. The impetus to reduce the energy overheads while satisfying performance requirements brings up a constrained register energy minimization (CREM) problem, whose objective is to build a wave-pipelined combinational circuit with minimal register energy consumption while meeting performance constraints. We tackle the CREM problem by generalizing the retiming transformation, which places the balancing registers with mathematical optimization guarantees.
Even though a large body of prior work addresses similar problems, the proposed frameworks are not tailored to the CREM problem. Monteiro et al. [17] describe a retiming transformation which relocates registers for minimal register energy consumption. However, their retiming transformation can only be performed after initial balancing registers are inserted either by human effort or by heuristic algorithms, which generally lack optimality guarantees for final synthesis results. The retiming results thus depend on the initial procedure of register placement or register insertion, and this argument applies to the retiming transformations described in [16, 19-21, 33]. Shenoy [18] elaborates on a constrained minimum area problem in which registers are relocated using retiming to minimize the total register count subject to performance constraints. While this problem is close to our problem, it is obvious that register count minimization cannot promise energy minimization. Moreover, the direct application of the CMOS retiming transformation to SFQ circuits fails at performance optimization because the retiming transformation was developed under the assumption that logic cells are all clockless cells. Thus, the retiming transformation needs to be revised or generalized with the knowledge of SFQ designs to satisfy specified performance requirements.

3.2 High-performance SFQ Circuit Designs

3.2.1 Standard SFQ Cell Library

The standard SFQ cell library used in this chapter is the one developed by Sunmagnetics [52] and adheres to the MIT-LL SFQ5ee process technology rules [53]. There are different types of cells in this library: DC2SFQ, SFQ2DC, PTL driver/receiver, NDRO (non-destructive readout flip-flop), Splitter, NOT, DFF, two-input AND, two-input OR, and two-input XOR. An SFQ cell can only drive one other cell because of the physical limitations of Josephson junctions (JJs) [4, 50, 51].
The first input of the NDRO cell is "set" whereas the second input is "reset". The splitter, a clockless cell, only receives data signals. Logic cells such as AND, NOT, OR, and XOR, known as clocked cells, receive both data signals and clock signals. Data signals appropriately change the internal state of the cell upon arrival at the inputs of an SFQ cell. The clock signal produces an appropriate output signal while resetting the internal state of the cell back to its default state. In this library, the bias dc current of each JJ is approximately 100 μA. When extra current flows into a biased JJ, the summation of its dc bias current and the extra current may exceed the critical current level of the JJ, causing the JJ to leave its superconductive state and emit a quantum flux pulse of fixed magnetic flux value $\Phi_0 = 2.0678 \times 10^{-15}$ Wb (1 Wb = 1 Volt-second = 1 Ampere-Henry). The JJ subsequently returns to its superconductive state. A JJ undergoing such a "leap" is called an active JJ. Given different input signals and a clock signal, the dynamic energy consumption of a cell is proportional to the number of active JJs in the cell during any clock period. More details are provided in Table 3.1.

Table 3.1: Standard SFQ Cell Library

Cell      Height (μm)  Width (μm)  #Inputs (+Clk)  #Outputs  Prop. delay (ps)  Input signals: energy (10^-19 J)
Splitter  50           40          1 (+0)          2          5.7              (0): 0, (1): 6.21
NOT       50           70          1 (+1)          1         13.0              (0): 8.28, (1): 8.28
DFF       50           60          1 (+1)          1          6.8              (0): 6.21, (1): 12.42
NDRO      50           90          2 (+1)          1         10.0              (0,0): 6.21, (0,1): 12.42, (1,0): 14.49
AND       50           70          2 (+1)          1          8.7              (0,0): 8.28, (0,1): 10.35, (1,0): 10.35, (1,1): 16.56
OR        50           70          2 (+1)          1          6.0              (0,0): 4.14, (0,1): 10.35, (1,0): 10.35, (1,1): 16.56
XOR       50           70          2 (+1)          1          6.3              (0,0): 8.28, (0,1): 14.49, (1,0): 14.49, (1,1): 22.77

3.2.2 Single Clock Architecture

SFQ designs in a single clock architecture refer to SFQ circuits with a single global clock input for all cells. In these circuits, clock signals are a steady periodic sequence of SFQ pulses, with the clock period being defined as the inter-arrival time of two consecutive pulses on the clock signal input. A pipeline stage is simply a clocked cell plus any signal splitters and JTL connections (or PTL connections with the required PTL driver and receiver pairs). With this view of a pipeline stage, SFQ circuits follow the same timing rule as pipelined CMOS circuits in that all input signals of any cell must arrive in the same clock period for correct operation.

Take the operation of an SFQ AND cell as an example. An operation error occurs in an AND cell if its inputs with logic 1 value arrive in two different clock periods, because the internal state of the AND cell is reset after every clock pulse. We describe a circuit built to operate correctly under the single clock architecture as a fully path balanced (FPB) circuit and define it formally as follows.

Definition 1. A combinational circuit is a fully path balanced (FPB) circuit if the inputs for any clocked logic cell in the circuit arrive within the same clock period. In other words, the difference between the number of clocked cells on the shortest path and that on the longest path from the primary inputs to a clocked cell is zero, and this condition holds for all clocked cells in the circuit.

This definition suggests that any clocked cell in an FPB circuit can only receive output signals from the cells in its previous pipeline stage. Thus, many registers are typically inserted between pairs of clocked cells to attain correct logic behavior. The number of these registers can approach the number of all other cells [21, 24]. We elaborate on the register requirement of an FPB circuit using Figure 3.1(a). Let the pipeline stage of INV1 and OR1, which receive Inputs, be 1. As the inputs of AND1 are provided by INV1 and OR1 (after passing through splitter S1), the pipeline stage of AND1 is 2. Similarly, the pipeline stage of INV2 is 3. R1 and R2 are inserted to balance the delays of the output signal from S1.
Therefore, all signals from Outputs are generated by the cells in the 3rd pipeline stage.

3.2.3 Dual Clock Architecture

As shown in [24], an SFQ circuit can be realized with fewer path-balancing registers by adopting a dual clock architecture with two clock sources: a fast clock and a slow clock. The frequency of the fast clock is $\delta+1$ times that of the slow clock (where $\delta \in \mathbb{Z}^+ \cup \{0\}$). The architecture requires that each set of primary inputs be fed repeatedly for $\delta$ additional cycles of the fast clock. As a result, each set of primary outputs can only be acquired after $\delta+1$ fast clock cycles, which correspond to one cycle of the slow clock. This result suggests that the throughput of an SFQ circuit in the single clock architecture is $\delta+1$ times higher than the throughput of the circuit in the dual clock architecture. However, this is not a critical factor in many applications because the data processing throughput is generally limited by other micro-architectural considerations or data dependencies. The advantage of repeating inputs for $\delta$ cycles is that the circuit can accommodate an imbalance bound of $\delta$, which denotes the maximum difference between the number of clocked cells on the shortest path and that on the longest path from the primary inputs to any clocked cell.

Figure 3.1: SFQ circuits with clocked cells. (a) A circuit in the single clock architecture. (b) A circuit in the dual clock architecture. S: splitter; R: register; gray rectangle: NDRO.

We describe a circuit built in a dual clock architecture as a partially path balanced (PPB) circuit and define it as follows:

Definition 2. A combinational circuit is a partially path balanced (PPB) circuit if the input signals for any clocked logic cell in the circuit arrive within a window of $\delta+1$ clock cycles. In other words, the imbalance bound of the circuit is $\delta$. Evidently, the primary inputs to PPB circuits must be persistently present for $\delta+1$ clock cycles, and the output can be read only every $\delta+1$ clock cycles.
Notice that an FPB circuit is a special case of a PPB circuit where $\delta = 0$. We explain the details of a PPB circuit using Figure 3.1(b). The fast clock signal is sent to the clock input of all clocked cells. As shown in [24], we can partition an arbitrary PPB circuit into modules, each of which consists of a repeat band, a logic block, and a mask band. The inputs of the logic block are fed repeatedly by the outputs of the repeat band, which consists only of NDRO cells. The reset and set inputs of each NDRO are fed by the slow clock and the output of the previous module, respectively. The logic block determines the functional behavior of the circuit, and we use the circuit in Figure 3.1(a) to represent the logic block. The mask band is formed by 2-input AND cells whose data inputs are fed by the slow clock and the outputs of the logic block. Thus, the outputs of a module are updated at the rate of the slow clock and are therefore the correct outputs of the whole circuit. In Figure 3.1(b), if we repeat the inputs to the logic block two additional times ($\delta = 2$), we can remove all registers in the logic block.

Although the design approach for FPB circuits has matured, the design approach for PPB circuits is still in its infancy. Pasandi and Pedram [24] proposed a heuristic algorithm that generates PPB circuits with a time complexity of $O(|E| + |V|)$. Using their approach, the register count can be reduced by more than 50% for many benchmark circuits given $\delta = 4$. However, the reductions may be as low as 10% in other benchmark circuits. These inconsistent reductions point to the necessity of developing a more rigorous approach to building optimal PPB circuits. Since this chapter addresses the optimization problem, the cost of the repeat and mask bands is not considered because it is a fixed cost. Moreover, details of SFQ clock delivery networks are not discussed here due to page limitations, but they can be found in [13].
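Definitions 1 and 2 can be checked mechanically on a netlist. The following Python sketch is only an illustration and is not part of the framework described in this chapter: it counts the clocked cells on the shortest and on the longest path from the primary inputs to every clocked cell and reports the imbalance of the register-free circuit. The netlist `preds`, the cell names, the pseudo-node `"IN"` for the primary inputs, and the helper `imbalance_per_cell` are all hypothetical; the topology is only loosely modeled on a logic block without path-balancing registers.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def imbalance_per_cell(preds, clocked):
    """Return {clocked cell: longest minus shortest clocked-cell count}.

    preds maps each cell to the cells driving it; the hypothetical node
    "IN" stands for the primary inputs. Depths count the clocked cells on
    a path from the primary inputs up to and including the cell itself.
    """
    shortest, longest = {}, {}
    # static_order() yields every node after all of its predecessors.
    for cell in TopologicalSorter(preds).static_order():
        own = 1 if cell in clocked else 0
        drivers = preds.get(cell, [])
        shortest[cell] = own + min((shortest[d] for d in drivers), default=0)
        longest[cell] = own + max((longest[d] for d in drivers), default=0)
    return {c: longest[c] - shortest[c] for c in clocked}


# Hypothetical register-free netlist (an assumption, not taken from the thesis).
preds = {
    "INV1": ["IN"],
    "OR1": ["IN"],
    "S1": ["OR1"],            # splitter: clockless
    "AND1": ["INV1", "S1"],
    "INV2": ["AND1"],
    "AND2": ["INV2", "S1"],   # fed from stage 3 and, via S1, from stage 1
}
clocked = {"INV1", "OR1", "AND1", "INV2", "AND2"}

gaps = imbalance_per_cell(preds, clocked)
required_bound = max(gaps.values())  # imbalance of the register-free circuit
is_fpb = required_bound == 0         # Definition 1: every gap must be zero

print(gaps)
print("imbalance:", required_bound, "FPB:", is_fpb)
```

In this toy netlist only `AND2` is unbalanced: it can be made fully path balanced by two registers on the short branch, or left register-free if the inputs are repeated for two additional fast clock cycles.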
3.3 System Models

This section defines the notation and terminology used in this chapter and elaborates on path-balancing retiming models.

3.3.1 Preliminaries

We model a combinational circuit by a graph $G(V,E)$. Each vertex $v_i \in V$ represents a cell whose propagation delay is denoted by $d(v_i)$. Each directed edge $e_{i,j} \in E$ corresponds to a connection from the output of $v_i$ to the input of $v_j$. The weight $w(e_{i,j})$ of $e_{i,j}$ denotes the number of registers on the connection. Note that we view each clocked cell as a composition of an "immobile" (permanently attached) register and a clockless cell [24]. Thus, if $e_{i,j}$ ends at a vertex that represents a clocked cell, the value of $w(e_{i,j})$ is increased by 1. The expected energy consumption of a register along $e_{i,j}$ is calculated as follows:

$\epsilon_{i,j} = Pr^1_{i,j} E^1_r + Pr^0_{i,j} E^0_r$, (3.1)

where $Pr^1_{i,j}$ and $Pr^0_{i,j}$ are the probabilities of forwarding logic '1' and '0' values from $v_i$ to $v_j$ through $e_{i,j}$, and $E^1_r$ and $E^0_r$ are the internal register energy consumptions of generating '1' and '0', respectively. A path $p$ from $v_i$ to $v_j$ is symbolized as $v_i \xrightarrow{p} v_j$. We use $w(p)$ to denote the sum of the edge weights of path $p$. Similarly, we use $d(p)$ to denote the sum of the vertex delays of path $p$ (inclusive of the delays of $v_i$ and $v_j$). If there are two or more paths from $v_i$ to $v_j$, all such paths are considered independently of each other. The clock period $T$ of $G(V,E)$ is calculated as follows:

$T = \max_{p:\, w(p)=0} d(p)$. (3.2)

A circuit retiming is an integer vertex labelling function $r : V \to \mathbb{Z}^+ \cup \{0\}$ on $G(V,E)$ such that $w_r(v_i,v_j) = w(v_i,v_j) + r(v_j) - r(v_i)$ for each edge $e_{i,j}$. A retiming is legal exactly when $w_r(v_i,v_j) \geq 1$ for all $e_{i,j} \in E$. One can define $W(v_i,v_j) = \min_p w(p)$ over all paths $p : v_i \to v_j$ and $D(v_i,v_j) = \max_p d(p)$ over all paths $p$ such that $w(p) = W(v_i,v_j)$.
The former denotes the minimum register count along any path from $v_i$ to $v_j$, whereas the latter denotes the maximum (vertex) delay among all paths from $v_i$ to $v_j$ (inclusive of the endpoints of the path) that have weight $W(v_i,v_j)$. Well-known conditions for a legal retiming transformation in which the clock period is less than or equal to $T_r$ are the following [54]:

1. $\forall e_{i,j}$, we have $r(v_i) - r(v_j) \leq w(e_{i,j})$, and
2. $\forall p : v_i \to v_j$ with $d(p) > T_r$, we have $r(v_i) - r(v_j) \leq w(p) - 1$.

The second condition can be simplified to remove the path dependence, resulting in the condition that $\forall v_i, v_j$ such that $D(v_i,v_j) > T_r$, we have $r(v_i) - r(v_j) \leq W(v_i,v_j) - 1$.

3.3.2 Path-Balancing Retiming Models

In the context of SFQ circuits, we define $s(v_j)$ as the pipeline stage (sequential depth) of $v_j$. If all inputs of $v_j$ are fed by primary signals, $s(v_j)$ is 0 if $v_j$ is a clockless cell and 1 if $v_j$ is a clocked cell. Otherwise, $s(v_j)$ is calculated as follows:

$s(v_j) = \max_{v_i \in V,\, e_{i,j} \in E} \big( s(v_i) + w(e_{i,j}) \big)$. (3.3)

We can interpret $s(v_j)$ as the number of clock cycles $v_j$ needs to wait for its immediate inputs to be ready after a set of primary input values is launched into the circuit. In our work, we denote a path-balancing retiming operation as an integer-valued edge-labelling function $w_r : E \to \mathbb{Z}^+ \cup \{0\}$ on $G(V,E)$. More precisely, the path-balancing retiming replaces $w(e_{i,j})$ in $G(V,E)$ by $w_r(e_{i,j})$ ($\geq w(e_{i,j})$) to form a new (fully or partially) path-balancing graph $G_r(V,E)$. See Lemma 1 for a formal exposition of the edge weight update function. The subscript "r" on a symbol (e.g., $s_r(v_j)$) indicates that the symbol is analogously defined on or derived from $G_r(V,E)$. Given an arbitrary synthesized circuit without any path-balancing registers, we aim to build the corresponding FPB/PPB circuit with minimal register energy while meeting performance constraints using path-balancing retiming.
This problem is called the constrained register energy minimization (CREM) problem.

3.4 Fully Path Balanced Circuits

In this section, we describe how to build an FPB circuit by a path-balancing retiming transformation and present the CREM problem formulation, its complexity, and solution methods.

3.4.1 Fully Path-Balancing Retiming

In this subsection, we prove the critical properties of FPB circuits and verify the validity of the proposed path-balancing retiming transformation following the definition in [54].

Lemma 1. Let $G_r(V,E)$ be a transformed version of $G(V,E)$ obtained by a path-balancing retiming transformation. $G_r(V,E)$ represents an FPB circuit if and only if $s_r(v_j) - s_r(v_i) = w_r(e_{i,j}), \forall e_{i,j} \in E$.

Proof. If $G_r(V,E)$ is an FPB circuit, all input signals of any vertex $v_j$ must arrive within the $s_r(v_j)$-th clock cycle. Since these input signals are generated by the preceding vertices, we have $s_r(v_j) = s_r(v_i) + w_r(e_{i,j}), \forall e_{i,j} \in E$. Conversely, a circuit where $s_r(v_j) = s_r(v_i) + w_r(e_{i,j}), \forall e_{i,j} \in E$ exhibits the property that an input signal arriving at $v_i$ within the $s_r(v_i)$-th clock period propagates on $e_{i,j}$ and arrives at $v_j$ within the $s_r(v_j)$-th clock period, where $s_r(v_i) + w_r(e_{i,j}) = s_r(v_j)$. This holds true for all inputs of every vertex. Therefore, the circuit is an FPB circuit.

Lemma 2. Let $G_r(V,E)$ be the FPB version of $G(V,E)$. The clock period of $G_r(V,E)$ is less than or equal to $T_r$ if and only if $\forall v_i, v_j \in V$ such that $D_r(v_i,v_j) > T_r$, we have $-W_r(v_i,v_j) = s_r(v_i) - s_r(v_j) \leq -1$.

Proof. If $G_r(V,E)$ is an FPB circuit with its clock period being less than or equal to $T_r$, then, from Equation 3.2, we have $w_r(p) \geq 1$ for any path $p : v_i \to v_j$ that satisfies $d_r(p) > T_r$. Since $G_r(V,E)$ represents an FPB circuit, $\forall p : v_i \to v_j$ we have $w_r(p) = W_r(v_i,v_j) = s_r(v_j) - s_r(v_i)$ based on Lemma 1, so every such path counts toward $D_r(v_i,v_j)$, i.e., $D_r(v_i,v_j) = \max_p d_r(p)$.
We thus have $s_r(v_i) - s_r(v_j) \leq -1, \forall v_i, v_j \in V$ such that $D_r(v_i,v_j) > T_r$. Conversely, if $G_r(V,E)$ is an FPB circuit with $s_r(v_i) - s_r(v_j) \leq -1, \forall v_i, v_j \in V$ such that $D_r(v_i,v_j) > T_r$, the weight of any path with $D_r(v_i,v_j) > T_r$ must be at least 1. Since there is no zero-weight path with delay larger than $T_r$, the clock period of $G_r(V,E)$ is less than or equal to $T_r$.

Definition 3. Let $C$ and $C_r$ be two combinational circuits. Suppose that for every configuration $c$ of $C$, there exists a configuration $c_r$ of $C_r$ such that when $C$ is started in $c$ and $C_r$ is started in $c_r$, the two circuits exhibit the same behavior. The two circuits $C$ and $C_r$ are then said to be equivalent.

Lemma 3. Given $G(V,E)$, let $C$ be a correct FPB circuit of $G$ and $C_r$ be a transformed FPB circuit obtained by using a fully path-balancing retiming transformation. $C$ and $C_r$ are equivalent.

Proof. We prove this lemma by an induction argument similar to that used in [54]. Let $t_0$ be the maximum difference of the largest stage delays from a primary input to any vertex between $C$ and $C_r$, which may be represented as follows:

$t_0 = \max_{v_i \in V} |s(v_i)T - s_r(v_i)T_r|$,

where $T$ and $T_r$ denote the clock periods of $C$ and $C_r$, respectively. Suppose both $C$ and $C_r$ start at time zero and run with an arbitrary sequence of inputs. For all $t \geq t_0$, the operation performed by a vertex $v_i$ in $C$ at time $t$ is the same as that performed by $v_i$ in $C_r$ at time $t - s(v_i)T + s_r(v_i)T_r$ based on Lemma 1. Thus, the behaviors of $C$ and $C_r$ are indistinguishable for $t \geq t_0$.

3.4.2 FPB-CREM Problem Formulation

We perform the path-balancing retiming on a combinational circuit modeled by $G(V,E)$ to solve the CREM problem with performance constraints, which is formulated as follows.
Minimize:

$\sum_{e_{i,j} \in E} \epsilon_{i,j}\, w_r(e_{i,j})$. (3.4)

Subject to:

$s_r(v_i) - s_r(v_j) \leq -1, \quad \forall v_i, v_j \in V : D_r(v_i,v_j) > T_r$;
$s_r(v_j) - s_r(v_i) - w_r(e_{i,j}) = 0, \quad \forall e_{i,j} \in E$;
$w_r(e_{i,j}) \geq c_{i,j}, \quad \forall e_{i,j} \in E$;
$s_r(v_i), w_r(e_{i,j}) \in \mathbb{Z}^+ \cup \{0\}, \quad \forall v_i \in V, e_{i,j} \in E$.

The objective function is the sum of the expected energy consumption of the registers on all edges. The energy consumption of immobile registers is also included in the objective function but is not reported because it is a constant. The first condition ensures that the clock period of the transformed circuit is equal to or less than $T_r$ based on Lemma 2. The second condition guarantees that the transformed circuit is a correct FPB circuit given Lemma 1 and Lemma 3. The third condition sets the minimum number of registers on each edge.

3.4.3 FPB-CREM Problem Complexity

Herein, we prove that the described CREM problem is polynomially solvable and propose a solution method.

Theorem 1. The CREM problem for building an FPB circuit with its clock period equal to or less than $T_r$ is a polynomially solvable problem.

Proof. The second and third conditions of the CREM problem can be replaced by $s_r(v_i) - s_r(v_j) \leq -c_{i,j}, \forall e_{i,j} \in E$, because $s_r(v_i) - s_r(v_j) = -w_r(e_{i,j}) \leq -c_{i,j}$. We then replace the objective function $\sum_{e_{i,j} \in E} \epsilon_{i,j}\, w_r(e_{i,j})$ with $\sum_{v_j \in V} \big[ \sum_{v_i \in FI(v_j)} \epsilon_{i,j} - \sum_{v_k \in FO(v_j)} \epsilon_{j,k} \big] s_r(v_j)$, where FI/FO stands for fan-in/fanout. The CREM problem is thus reformulated as follows.

Minimize:

$\sum_{v_j \in V} \Big[ \sum_{v_i \in FI(v_j)} \epsilon_{i,j} - \sum_{v_k \in FO(v_j)} \epsilon_{j,k} \Big] s_r(v_j)$. (3.5)

Subject to:

$s_r(v_i) - s_r(v_j) \leq -1, \quad \forall v_i, v_j \in V : D_r(v_i,v_j) > T_r$;
$s_r(v_i) - s_r(v_j) \leq -c_{i,j}, \quad \forall e_{i,j} \in E$;
$s_r(v_i) \in \mathbb{Z}^+ \cup \{0\}, \quad \forall v_i \in V$.

The dual form of the reformulated problem is a minimum cost network flow problem, for which polynomial algorithms exist [16].

3.4.4 FPB-CREM Solution Method

The CREM problem can be solved by any solver for a minimum cost network flow problem.
Values of pipeline stages are then derived from the values returned by the solver. The dominant cost of this method is solving the minimum cost network flow problem, where the number of performance constraints is bounded by $O(|V|^2)$. Therefore, the CREM problem can be solved in $O((|E|+|V|^2)^2 \log(|V|) + |V|(|E|+|V|^2) \log^2(|V|))$ time [55], which is the same complexity as that of conventional constrained retiming algorithms [18].

3.5 Partially Path Balanced Circuits

As in the previous section, we elaborate on path-balancing retiming for building a PPB circuit before formulating the corresponding optimization problem and describing our solution methods.

3.5.1 Partially Path-Balancing Retiming

We prove a critical property of PPB circuits and verify the equivalence between FPB and PPB circuits.

Lemma 4. Let $G_r(V,E)$ be the transformed version of $G(V,E)$ obtained by using a path-balancing retiming transformation. $G_r(V,E)$ is a PPB circuit with an imbalance bound of $\delta$ if and only if $s_r(v_j) - s_r(v_i) - w_r(e_{i,j}) \leq \delta, \forall e_{i,j} \in E$.

Proof. If $G_r(V,E)$ is a PPB circuit, the input signals of a vertex $v_j$ are ready in the $s_r(v_j)$-th clock period. Thus, any input signal sent from $v_i$ to $v_j$ over $e_{i,j}$ must arrive after the $(s_r(v_j) - \delta - 1)$-st clock period. Thus, we have $s_r(v_j) - s_r(v_i) - w_r(e_{i,j}) \leq \delta, \forall e_{i,j} \in E$. Conversely, a graph model with $s_r(v_j) - s_r(v_i) - w_r(e_{i,j}) \leq \delta, \forall e_{i,j} \in E$ guarantees that all input signals of $v_j$ from its preceding vertices arrive no earlier than the $(s_r(v_j) - \delta)$-th clock period. The model is therefore a PPB circuit.

Lemma 5. Given $G(V,E)$, let $C$ be its corresponding correct FPB circuit and $C_r$ be the transformed circuit with an imbalance bound of $\delta$. Then $C$ and $C_r$ are equivalent.

Proof. We prove this lemma in the same way as Lemma 3.
Let $t_0$ be the maximum difference of the largest stage delay from a primary input to any vertex between $C$ and $C_r$, represented as follows:

$t_0 = \max_{v_i \in V} |s(v_i)T - (s_r(v_i) + \delta)T_r|$.

Suppose $C$ and $C_r$ start at time zero and run with an arbitrary sequence of inputs. For all $t \geq t_0$, the operation performed by a vertex $v_i$ in $C$ at time $t$ is the same as the operation performed by $v_i$ in $C_r$ at time $t - s(v_i)T + (s_r(v_i) + \delta)T_r$ based on Lemma 4. Thus, the behaviors of $C$ and $C_r$ are indistinguishable for $t \geq t_0$.

3.5.2 PPB-CREM Problem Formulation

We extend the CREM problem to building a PPB circuit with minimal energy consumption while meeting performance constraints. Given an imbalance bound of $\delta$, the PPB-CREM problem is formulated as follows.

Minimize:

$\sum_{e_{i,j} \in E} \epsilon_{i,j}\, w_r(e_{i,j})$. (3.6)

Subject to:

$W_r(v_i,v_j) \geq 1, \quad \forall v_i, v_j \in V : D_r(v_i,v_j) > T_r$;
$s_r(v_i) + w_r(e_{i,j}) - s_r(v_j) \leq 0, \quad \forall e_{i,j} \in E$;
$s_r(v_j) - s_r(v_i) - w_r(e_{i,j}) \leq \delta\, b_r(e_{i,j}), \quad \forall e_{i,j} \in E$;
$\sum_{e_{i,j} \in E} b_r(e_{i,j}) \leq |FI(v_j)| - 1, \quad \forall v_j \in V$;
$w_r(e_{i,j}) \geq c_{i,j}, \quad \forall e_{i,j} \in E$;
$s_r(v_i), w_r(e_{i,j}) \in \mathbb{Z}^+ \cup \{0\}, \quad \forall v_i \in V, e_{i,j} \in E$;
$b_r(e_{i,j}) \in \{0, 1\}, \quad \forall e_{i,j} \in E$.

The objective function is the sum of the expected energy consumption of the registers on all edges. As before, the energy consumption of immobile registers is included in the objective function but is not reported. The first condition ensures that if the worst delay of the path(s) with the minimum path weight is larger than $T_r$, the path(s) from $v_i$ to $v_j$ have at least one register between their source and sink vertices. Note that the condition $W_r(v_i,v_j) = s_r(v_j) - s_r(v_i)$ only holds for FPB circuits. The second condition states that the pipeline stage of a vertex is at least that of each input vertex plus the weight of the connecting edge. The third condition indicates that the input signals of a vertex can arrive within a window of $\delta+1$ clock cycles.
$b_r(e_{i,j})$ is a binary variable that is set to 1 if the connection from $v_i$ to $v_j$ is not on any longest path from a primary input to $v_j$. The fourth condition thus captures the fact that $s_r(v_j) - s_r(v_i) = w_r(e_{i,j})$ when the connection from $v_i$ to $v_j$ is along a longest path from a primary input to $v_j$ through the circuit; otherwise, we may have $s_r(v_j) - s_r(v_i) - w_r(e_{i,j}) \leq \delta$.

3.5.3 PPB-CREM Problem Complexity

Theorem 2. The proposed PPB-CREM problem for building a PPB circuit is NP-complete.

Proof. We prove that the PPB-CREM problem is NP-complete by a polynomial-time reduction from the NP-complete vertex cover problem (VCP). Given an undirected graph $G(V,E)$, VCP asks for a subset $S \subseteq V$ such that every edge has at least one endpoint in $S$, and is formulated as follows.

$\min \sum_{v_i \in V} \epsilon_i\, y(v_i)$
subject to
$y(v_i) + y(v_j) \geq 1, \quad \forall e_{i,j} \in E$;
$y(v_i) \in \{0, 1\}, \quad \forall v_i \in V$.

Here $y(v_i) = 1$ if the vertex is in $S$; otherwise, $y(v_i) = 0$.

Problem reduction: We create $G_r(V_r,E_r)$ as shown in Figure 3.2 to show how an arbitrary vertex cover problem is reduced to the PPB-CREM problem. In $G_r(V_r,E_r)$, we create a super-source vertex $v_S$ and a super-terminal vertex $v_T$, the pair being connected by $e_{S,T}$ with the condition $w_r(e_{S,T}) \geq 2$. For each vertex $v_i \in V$, we create an edge $e^i_{k,l} \in E_r$ with its tail/head vertices and assign the register energy consumption along this edge as $\epsilon_i$. Such an edge is marked as a solid edge in Figure 3.2. For each condition $y(v_i) + y(v_j) \geq 1$ (let $i < j$ always be true), three edges are created in $G_r(V_r,E_r)$: one from $v_S$ to the tail vertex of $e^i_{k,l}$, one from the head vertex of $e^i_{k,l}$ to the tail vertex of $e^j_{k,l}$, and one from the head vertex of $e^j_{k,l}$ to $v_T$. We assign the register energy consumption along each of these three edges as $\sum_{v_i \in V} \epsilon_i$. The edges created for the VCP conditions are marked as dotted edges in Figure 3.2.
As a result, we reduce an arbitrary vertex cover problem based on $G(V,E)$ to the PPB-CREM problem with $\delta = 1$ based on $G_r(V_r,E_r)$ without performance constraints.

Figure 3.2: A directed retiming graph, $G_r(V_r,E_r)$. The edges related to the condition $y(v_2) + y(v_n) \geq 1$ are marked as red dotted lines in this figure.

Reduction time analysis: The described reduction is done in $O(|V| + |E|)$ time, and the solution acquired by solving the PPB-CREM problem can be mapped back to the VCP in $O(|V|)$ time. Thus, our problem is NP-complete.

3.5.4 PPB-CREM Solution Method

We resort to an approximation algorithm for the PPB-CREM problem by replacing $W_r(v_i,v_j) \geq 1$ with $s_r(v_i) - s_r(v_j) \leq -1$ and assigning the value 1 to all $b_r(e_{i,j})$. Writing $z_r(e_{i,j}) = s_r(v_i) + w_r(e_{i,j})$, so that $z_r(e_{i,j}) - s_r(v_i) = w_r(e_{i,j})$, the simplified problem is as follows.

Minimize:

$\sum_{e_{i,j} \in E} \epsilon_{i,j}\, \big( z_r(e_{i,j}) - s_r(v_i) \big)$. (3.7)

Subject to:

$s_r(v_i) - s_r(v_j) \leq -1, \quad \forall v_i, v_j \in V : D_r(v_i,v_j) > T_r$;
$z_r(e_{i,j}) - s_r(v_j) \leq 0, \quad \forall e_{i,j} \in E$;
$s_r(v_j) - z_r(e_{i,j}) \leq \delta, \quad \forall e_{i,j} \in E$;
$s_r(v_i) - z_r(e_{i,j}) \leq -c_{i,j}, \quad \forall e_{i,j} \in E$;
$s_r(v_i) \in \mathbb{Z}^+ \cup \{0\}, \quad \forall v_i \in V$.

The dual form of the problem is a minimum cost flow problem, which can be solved in $O((|E|+|V|^2)^2 \log(|V|) + |V|(|E|+|V|^2) \log^2(|V|))$ time [55].

For an arbitrary pair of $v_i$ and $v_j$ with $D_r(v_i,v_j) > T_r$, the first condition of the simplified problem cannot guarantee $W_r(v_i,v_j) \geq 1$. If $W_r(v_i,v_j) = 0$ but $s_r(v_i) - s_r(v_j) \leq -1$ (which is guaranteed by the first condition), we can locate at least one edge along any path from $v_i$ to $v_j$ such that the weight of the located edge can be increased by one while the pipeline stage values of its connecting vertices remain the same. To meet the performance constraints, we perform a graph search to increase the edge weight, if necessary. As a result, we can have $W_r(v_i,v_j) \geq 1, \forall v_i, v_j \in V : D_r(v_i,v_j) > T_r$ by adding no more than $|E|$ registers to $G_r(V,E)$ while keeping the existing pipeline stages of all vertices intact.
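The check behind this graph search is the clock period of Equation 3.2, the largest total vertex delay over register-free paths. The following Python sketch illustrates that check only, under stated assumptions: the function `clock_period`, the three-edge netlist, and the choice of which edge receives the extra register are hypothetical, the delays are the OR, Splitter, and AND values of Table 3.1, and the search strategy of the actual framework is not reproduced.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def clock_period(delay, weight):
    """Equation 3.2: largest vertex-delay sum over paths with zero weight.

    delay:  {vertex: propagation delay in ps}
    weight: {(u, v): number of registers on edge u -> v}
    """
    preds = {v: [] for v in delay}
    for (u, v) in weight:
        preds[v].append(u)
    arrival = {}
    for v in TopologicalSorter(preds).static_order():
        # Only register-free edges extend a combinational path into v.
        start = max(
            (arrival[u] for u in preds[v] if weight[(u, v)] == 0),
            default=0.0,
        )
        arrival[v] = start + delay[v]
    return max(arrival.values())


# Hypothetical chain: OR1 -> S1 -> S2 -> AND1 (delays from Table 3.1, in ps).
delay = {"OR1": 6.0, "S1": 5.7, "S2": 5.7, "AND1": 8.7}
weight = {
    ("OR1", "S1"): 0,
    ("S1", "S2"): 0,
    ("S2", "AND1"): 1,  # immobile register of the clocked AND cell
}

before = clock_period(delay, weight)

# Raising the weight of one edge on the offending path by one register.
repaired = dict(weight)
repaired[("S1", "S2")] += 1
after = clock_period(delay, repaired)

print(round(before, 1), round(after, 1))
```

In this toy chain the register-free path OR1, S1, S2 sets the clock period; one added register on the S1 to S2 edge splits it, and the period is then set by the OR1, S1 segment.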
Based on the second and the third conditions of the simplified problem, the value of $s_r(v_j)$ is in the range $[\max_{v_i} z_r(e_{i,j}),\ \max_{v_i} z_r(e_{i,j}) + \delta]$. To apply the solutions to the PPB-CREM problem, we need to check whether $s_r(v_j) = \max_{v_i} z_r(e_{i,j})$ holds for every vertex by performing a graph search again. If not, we evaluate the energy overhead of increasing the weight of the different edges connected to $v_j$ to achieve $s_r(v_j) = \max_{v_i} z_r(e_{i,j})$. We then choose the best action. No more than $|V|$ registers will be added to $G_r(V,E)$ to ensure correct stage values. Proofs of the above statements are straightforward and hence omitted here.

The graph search for changing the edge weights to meet the specified clock delays and for setting correct stage values can be done in $O(|V||E|)$ time. Thus, the time complexity of the whole approximation algorithm is $O((|E|+|V|^2)^2 \log(|V|) + |V|(|E|+|V|^2) \log^2(|V|))$. The bounded error of our algorithm is $O(|E| + |V|)$ since the solution value of the objective function of the reformulated problem is less than or equal to that of the PPB-CREM problem.

3.6 Experimental Results and Discussions

To evaluate the proposed optimization framework, we synthesized Kogge-Stone adders (KSA), array multipliers (Mul), and integer dividers (IntDiv) using the SFQ logic synthesis tools from [21, 33]. Some of the ISCAS c-series and EPFL benchmarks were also synthesized. We used the best clock period acquired by the prior approach of [21, 33] as our performance constraint. More details are specified in Table 3.2. The number of performance constraints for SFQ circuits is actually far less than $|V|^2$ because we only need to consider the paths that start from one cell and reach another cell without passing through any clocked cell in between. Our framework is written in Python with a Python optimization tool library, and the hardware environment for generating the experimental results is a Linux machine with an Intel(R) Core(TM) i7-6700 CPU @ 3.40 GHz and 16.0 GB of RAM.
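The implementation itself is not reproduced here. As a toy illustration of the FPB-CREM formulation of Section 3.4.2, the following Python sketch finds the optimal stage labels of a very small circuit by exhaustive search rather than by the minimum cost network flow solver used in this chapter, and it omits the performance constraints. Everything specific in it is an assumption made for illustration: the netlist, the pseudo-source `"IN"` for the primary inputs, the use of the DFF energies of Table 3.1 as register energies, a signal probability of 0.5 on every edge, and a minimum weight of 1 on edges that end at a clocked cell (its immobile register).

```python
from itertools import product

E1, E0 = 12.42, 6.21  # DFF energies for logic 1 / logic 0, Table 3.1 (1e-19 J)


def expected_register_energy(p1):
    """Equation 3.1, assuming Pr0 = 1 - Pr1."""
    return p1 * E1 + (1.0 - p1) * E0


def solve_fpb_crem(edges, source, max_stage):
    """Exhaustively minimize sum(energy * w_r) subject to
    w_r(e) = s_r(head) - s_r(tail) and w_r(e) >= c(e).

    edges: {(tail, head): (c, energy)}; the source is pinned to stage 0.
    Returns (cost, stages, weights), or None if max_stage is too small.
    """
    free = sorted({v for e in edges for v in e} - {source})
    best = None
    for labels in product(range(max_stage + 1), repeat=len(free)):
        s = dict(zip(free, labels))
        s[source] = 0
        w = {(u, v): s[v] - s[u] for (u, v) in edges}
        if any(w[e] < edges[e][0] for e in edges):
            continue  # violates a minimum-register condition
        cost = sum(edges[e][1] * w[e] for e in edges)
        if best is None or cost < best[0]:
            best = (cost, s, w)
    return best


# Hypothetical netlist (an assumption, not a circuit from the thesis).
clocked = {"INV1", "OR1", "AND1", "INV2", "AND2"}
netlist = [
    ("IN", "INV1"), ("IN", "OR1"), ("OR1", "S1"),
    ("INV1", "AND1"), ("S1", "AND1"),
    ("AND1", "INV2"), ("INV2", "AND2"), ("S1", "AND2"),
]
eps = expected_register_energy(0.5)
edges = {(u, v): (1 if v in clocked else 0, eps) for (u, v) in netlist}

cost, stages, weights = solve_fpb_crem(edges, "IN", max_stage=4)
# Path-balancing registers are those beyond the required minimum per edge.
extra = {e: weights[e] - edges[e][0] for e in edges}

print(stages)
print({e: n for e, n in extra.items() if n > 0}, round(cost, 3))
```

The exhaustive search is exponential in the number of vertices and is only meant to make the constraints tangible; the polynomial bound of Theorem 1 relies on the dual minimum cost network flow formulation instead.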
We start by comparing our approach to the state-of-the-art approach proposed in [21, 33] for building FPB circuits. Their approach resorts to a two-step method for achieving cell count minimization with performance optimization. Specifically, they utilize dynamic programming algorithms to minimize the number of initial path-balancing registers as well as other cells. They then apply conventional constrained retiming algorithms to meet performance constraints while minimizing the final number of path-balancing registers [18]. Note that although each step guarantees an optimal result, the final result may not be optimal. In contrast to their approach, our retiming can perform register count minimization while meeting performance constraints without reliance on initial path-balancing registers.

Table 3.2: Circuit Specification
Circuit | #Clocked Cells | #Clockless Cells | Cell Area (mm$^2$) | Cell Energy ($10^{-19}$ J) | Clock (ps)
16-bit KSA | 194 | 178 | 1.04 | 2335 | 20.1
32-bit KSA | 469 | 437 | 2.52 | 5415 | 20.1
8-bit Mul | 320 | 320 | 1.76 | 4025 | 14.4
16-bit Mul | 1408 | 1408 | 7.74 | 18117 | 14.4
8-bit IntDiv | 601 | 491 | 3.09 | 7963 | 25.8
16-bit IntDiv | 2095 | 1684 | 10.7 | 27970 | 25.8
c499 | 350 | 309 | 1.84 | 5414 | 23.4
c1908 | 436 | 348 | 2.22 | 6304 | 23.1
c3540 | 1356 | 1054 | 6.85 | 16964 | 25.8
c6288 | 2121 | 1690 | 10.8 | 29119 | 22.8
Coding-cavlc | 841 | 708 | 4.35 | 10921 | 22.8
Int2Float | 274 | 227 | 1.41 | 3509 | 22.8
Sin | 8283 | 7053 | 43.10 | 123654 | 25.8
Voter | 10334 | 6980 | 50.00 | 134952 | 20.1

Table 3.3 reports the number of balancing registers and the register energy consumption. We do not report the running time because the time complexity of the prior approach [21, 33] is the same as that of our approach due to the constrained retiming algorithms (the running time is less than 4 minutes for all circuits and just a few seconds for most of them). As expected, all values are lower except #Registers of 16-bit Mul, because the lowest register energy can be attained at different values of #Registers.
Based on the average ratios, we can see that the two-step procedure cannot promise optimal results. The main reason is that the first step, the dynamic programming algorithms, restricts the number of movable path-balancing registers available to the conventional retiming algorithms. We would like to emphasize that even though the energy reduction of about 5% compared to the reference is not that significant, it is still noteworthy because of the ultra-high operating frequency of SFQ circuits (e.g., 20 GHz and higher).

Table 3.3: Results for Building Fully Path Balanced Circuits
Circuit | #Registers: FPB [21] + [33] | #Registers: FPB | Register Energy ($10^{-19}$ J): FPB [21] + [33] | Register Energy ($10^{-19}$ J): FPB
16-bit KSA | 220 | 206 (0.93) | 1894 | 1806 (0.95)
32-bit KSA | 580 | 522 (0.90) | 4915 | 4550 (0.92)
16-bit Mul | 3390 | 3391 (1.00) | 30621 | 30621 (1.00)
16-bit IntDiv | 15418 | 15140 (0.98) | 134147 | 130904 (0.97)
c3540 | 1174 | 1117 (0.95) | 10625 | 9978 (0.93)
c6288 | 3426 | 3393 (0.99) | 31288 | 30645 (0.97)
Coding-cavlc | 556 | 549 (0.98) | 5059 | 4977 (0.98)
Sin | 11954 | 11445 (0.96) | 112030 | 106083 (0.94)
Average Ratio | - | 0.96 | - | 0.95

Table 3.4 reports the results of building PPB circuits with $\delta = 1$ using our approach and the state-of-the-art approach described in [24]. The state-of-the-art approach is a greedy approach with a time complexity of $O(|E| + |V|)$. The optimal results generated by ILP solvers are marked as PPB opt and serve as reference values. The optimal results of the Sin and Voter benchmarks are not provided because they could not be generated within 24 hours. Although PPB circuits are developed for reducing the register count, the prior approach can still result in a large number of registers when $\delta$ is small because it is unaware of the influence of each register insertion on the final results. Take circuit Sin as an example: #Registers of Sin in PPB is larger than that of Sin in FPB. The reason is that although many registers are removed in the early stage of this approach, far more registers are added in its late stage.
If we compare the values of #Registers and Register Energy between Table 3.3 and Table 3.4, we find that the energy reduction reaches 22% on average after the implementation of a dual clock architecture using our approach, and so does the reduction of #Registers. A reduction of no less than 30% in #Registers is even attained for some circuits, including 16-bit KSA, 32-bit KSA, c3540, and Coding-cavlc.

Table 3.4: Results for Building Partially Path Balanced Circuits ($\delta = 1$)
Circuit | #Registers: PPB opt | #Registers: PPB [24] | #Registers: PPB | Register Energy ($10^{-19}$ J): PPB opt | Register Energy: PPB [24] | Register Energy: PPB | Run Time (s): PPB opt | Run Time: PPB [24] | Run Time: PPB
16-bit KSA | 105 | 127 (1.20) | 117 (1.11) | 971 | 1131 (1.16) | 1068 (1.09) | 4.17 | 0.01 | 2.84
32-bit KSA | 277 | 369 (1.33) | 325 (1.17) | 2563 | 3165 (1.23) | 2915 (1.13) | 10.3 | 0.02 | 6.66
8-bit Mul | 630 | 1153 (1.83) | 669 (1.06) | 5663 | 11843 (2.09) | 5936 (1.04) | 5.86 | 0.02 | 5.48
16-bit Mul | 3047 | 10427 (3.42) | 3175 (1.04) | 27937 | 110180 (3.94) | 28899 (1.03) | 162.4 | 0.11 | 97.2
8-bit IntDiv | 1808 | 2712 (1.50) | 1842 (1.01) | 15692 | 35766 (2.27) | 15950 (1.01) | 11.2 | 0.04 | 10.3
16-bit IntDiv | 14384 | 25551 (1.77) | 14490 (1.00) | 123971 | 412116 (3.32) | 124793 (1.00) | 648.2 | 0.18 | 245.6
c499 | 168 | 176 (1.04) | 171 (1.01) | 1564 | 2434 (1.55) | 1592 (1.01) | 4.90 | 0.01 | 4.59
c1908 | 540 | 622 (1.15) | 594 (1.10) | 5122 | 7366 (1.43) | 5517 (1.07) | 5.73 | 0.02 | 5.46
c3540 | 703 | 948 (1.34) | 817 (1.16) | 6234 | 11092 (1.77) | 7059 (1.13) | 800.5 | 0.05 | 18.7
c6288 | 2681 | 8757 (3.26) | 2960 (1.10) | 24327 | 131960 (5.42) | 26549 (1.09) | 65.5 | 0.11 | 50.0
Coding-cavlc | 304 | 456 (1.50) | 368 (1.21) | 2749 | 4890 (1.77) | 3258 (1.18) | 21.5 | 0.03 | 17.6
Int2Float | 139 | 192 (1.38) | 175 (1.25) | 1303 | 2028 (1.55) | 1584 (1.21) | 3.97 | 0.01 | 3.66
Sin* | - | 22781 (-) | 10055 (-) | - | 306565 (-) | 92034 (-) | >86400 | 0.52 | 5.36
Voter* | - | 4335 (-) | 4034 (-) | - | 50456 (-) | 33607 (-) | >86400 | 0.46 | 4.65
Average Ratio | - | 1.72 | 1.10 | - | 2.29 | 1.08 | - | - | -
*: The experiments were run on an Intel Xeon E5-2450 v2 CPU with 32 GB RAM.
Based on the average ratios reported in Table 3.4, our approach reduces the register count by 38% and the register energy consumption by 50% compared to the prior approach [24]. Moreover, the register energy consumptions of 16-bit Mul and c6288 produced by the prior approach are more than 3X those of the optimal results. In contrast, our approximation algorithm produces results that are within 10% of the optimal results on average. These improvements demonstrate the benefits of developing rigorous approximation algorithms for building PPB circuits. Notice that our approximation algorithm has considerably longer running times on some large circuits (e.g., 16-bit IntDiv) than the prior approach. However, these running times are still acceptable because they are no more than a few minutes on the largest benchmarks. It is also worth stating that conventional retiming algorithms for area minimization with performance constraints have the same time complexity as our algorithm [16, 18].

We study the values in the fifth and the seventh columns of Table 3.4 to compare our results with the optimal results when $\delta = 1$. The register energy acquired by our approach is only 1.08X the optimal results on average, even though the error is theoretically bounded by $O(|E| + |V|)$. Comparing the result acquired by our approach against the optimal result for the 16-bit IntDiv, we confirm that our proposed approach is feasible for large circuits that need more than 14,000 registers for path balancing. A relatively large ratio is observed for Coding-cavlc, a part of a context adaptive variable-length coding (CAVLC) video encoder. This circuit contains look-up tables for coefficients, total zeros, trailing ones, and other signals, resulting in a relatively complex and non-repetitive circuit design [56]. Thus, many of the pipeline stage values assigned by our minimum cost flow solver have large errors. A similar circuit design is also found in Int2Float.
We further examine the bounded error of our approximation algorithm by setting =2 and provide our results in Table 3.5.

Table 3.5: Results for Building Partially Path Balanced Circuits ( = 2)

| Circuits | #Registers: PPB_opt | #Registers: PPB [24] | #Registers: PPB | Register Energy (10^-19 J): PPB_opt | Register Energy (10^-19 J): PPB [24] | Register Energy (10^-19 J): PPB | Run Time (s): PPB_opt | Run Time (s): PPB [24] | Run Time (s): PPB |
|---|---|---|---|---|---|---|---|---|---|
| 16-bit KSA | 78 | 83 (1.06) | 95 (1.21) | 726 | 763 (1.05) | 863 (1.18) | 5.01 | 0.01 | 3.28 |
| 32-bit KSA | 217 | 252 (1.16) | 277 (1.27) | 2020 | 2257 (1.11) | 2468 (1.22) | 7.32 | 0.04 | 6.38 |
| 8-bit Mul | 574 | 1050 (1.83) | 635 (1.10) | 5174 | 10802 (2.08) | 5582 (1.07) | 5.45 | 0.01 | 5.15 |
| 16-bit Mul | 2926 | 9964 (3.40) | 3172 (1.08) | 26881 | 105484 (3.92) | 28660 (1.06) | 128.5 | 0.10 | 91.1 |
| 8-bit IntDiv | 1668 | 2501 (1.50) | 1721 (1.03) | 14434 | 33393 (2.31) | 14850 (1.02) | 11.0 | 0.04 | 9.60 |
| 16-bit IntDiv | 13923 | 24708 (1.77) | 14132 (1.01) | 119757 | 401869 (3.35) | 121471 (1.01) | 485.9 | 0.17 | 236.9 |
| c499 | 136 | 144 (1.05) | 136 (1.00) | 1266 | 1937 (1.53) | 1266 (1.00) | 4.72 | 0.01 | 4.40 |
| c1908 | 441 | 622 (1.41) | 508 (1.18) | 4171 | 6101 (1.46) | 4758 (1.14) | 7.51 | 0.01 | 4.98 |
| c3540 | 506 | 706 (1.39) | 639 (1.26) | 4483 | 8140 (1.81) | 5506 (1.22) | 2074 | 0.05 | 17.7 |
| c6288 | 2388 | 8228 (3.44) | 2783 (1.16) | 21727 | 124518 (5.73) | 24914 (1.14) | 13297 | 0.11 | 47.5 |
| Coding-cavlc | 219 | 323 (1.47) | 285 (1.30) | 1969 | 3433 (1.74) | 2522 (1.28) | 2741 | 0.03 | 16.64 |
| Int2Float | 99 | 132 (1.33) | 124 (1.25) | 900 | 1365 (1.51) | 1111 (1.23) | 6.12 | 0.01 | 3.65 |
| Sin* | - | 21341 (-) | 9563 (-) | - | 288708 (-) | 87886 (-) | >86400 | 0.54 | 5.76 |
| Voter* | - | 2757 (-) | 2874 (-) | - | 33723 (-) | 24813 (-) | >86400 | 0.36 | 5.66 |
| Average Ratio | - | 1.73 | 1.15 | - | 2.30 | 1.13 | - | - | - |

*: The experiments were run on an Intel Xeon E5-2450 v2 CPU with 32GB RAM.

By comparing the values in Table 3.4 and Table 3.5 generated by our approach, we observe a further reduction in register count and register energy after the increase of . The reduction achieved by our algorithm remains far more than that reported by the prior approach [24].
However, the register energy ratio between our results and optimal results increases from 1.08 to 1.13 when increases from 1 to 2, which is consistent with the derived error bound of O(|E| + |V|). While the average competitive ratio between our results and optimal results does increase with the increase of , we believe that this does not represent a significant drawback because the substantial throughput drop (recall that the throughput is inversely proportional to +1) is generally not considered for most designs even when the peak target throughput is set by other design constraints.

3.7 Conclusion

We presented a constrained register energy minimization (CREM) problem for superconductive designs in which register insertions are performed in the post-synthesis step for building a high-performance pipeline circuit with minimal register energy consumption. We formulated this problem as a mathematical optimization problem and proved that it is polynomially solvable through dual transformation. Moreover, the CREM problem was extended to an advanced dual clock architecture, and the extended problem was proven to be NP-complete. To tackle the extended CREM problem, we presented a polynomial-time algorithm with a proven bounded error. We evaluated the feasibility and the robustness of our algorithm with 14 benchmark circuits. Experimental results showed that our approach reduces the register count by 38% and the register energy consumption by 50% compared to the state-of-the-art. Moreover, the average ratio of the register energy consumption between our solutions and optimal solutions is only 1.08. In closing, we point out that although the application target for our generalized retiming framework in this chapter was chosen to be superconductive SFQ circuits, the proposed formulations and algorithms have general applicability, e.g., to CMOS wave-pipelined circuits and other (emerging) non-CMOS circuits.
Chapter 4

qGDR: A Routing Tool for Single Flux Quantum Circuits

Single-flux-quantum (SFQ) circuit technologies are promising digital circuit technologies with high-speed and extremely low-power characteristics. However, heavy wire routing tasks are finished either by considerable human effort or by commercial routing tools with few physical considerations for SFQ circuits. In this chapter, we present an integrated global and detailed router for SFQ circuits, qGDR, which aims at reducing impedance mismatch during signal transfer by minimizing the total number of vias used. The global router allocates routing resources while minimizing the via usage by a dynamic layer assignment algorithm. The detailed router follows the global routing results to complete the routing task by resorting to a maze routing algorithm. Following the MIT-LL SFQ5ee process technology, qGDR can use only two routing layers to route an 8-bit integer divider with more than 45,000 Josephson junctions in about 30 minutes.

The remainder of this chapter is organized as follows: Section 4.1 discusses the motivation for router development; Section 4.2 presents the routing procedure and the implementation of qGDR; Section 4.3 details the simulation results; Section 4.4 concludes this chapter.

4.1 Motivation

After the cell placement step (cf. Figure 4.1), circuit routing generates the connection wires of individual nets to interconnect pins of a cell or pads at the chip boundary without violating design rules provided by chip foundries [12, 23]. The most important objective of routing is to ensure that all nets of a circuit are routed. Other common objectives include minimizing the total wirelength after routing and satisfying the required timing budget of each net. The difficulty of solving the routing problem increases with the number of nets. To make it manageable, the routing problem is usually solved through a two-stage approach of global routing followed by detailed routing [23].
After identifying the routing regions of a circuit layout, global routing first partitions these regions into tiles and then generates loose connection wires for nets, as shown in Figure 4.1(b). Following the loose wires from the global routing, detailed routing assigns wire segments and vias to form exact connection wires for nets, as shown in Figure 4.1(c). Notice that detailed routing must account for design rules. Currently, the heuristic algorithms of both global routing and detailed routing are mainly developed for CMOS circuits [23,57], which may not be the best fit for SFQ circuits. Thus, we develop a complete routing tool, qGDR, tailored for large-scale SFQ circuits.

Figure 4.1: Routing process. (a) A circuit layout. (b) Global routing. (c) Detailed routing.

4.2 Routing Procedure

Multilayer routing of large-scale SFQ circuits is a very complex combinatorial problem. qGDR solves this problem by utilizing a (well-known) two-stage approach of global routing followed by detailed routing. The workflows of qGDR are shown in Figure 4.2. Referring to the MIT-LL SFQ5ee process technology [53], we only have two (Nb) metal layers for signal routing, denoted as a bottom layer M1 (connections on this layer are created using striplines that are sandwiched between grounded M0 and M2 layers) and a top layer M3 (connections on this layer are created using striplines that are sandwiched between grounded M2 and M4 layers). M5 is used for the biasing, whereas M6 is used to implement connections (and inductors) inside the cells. Although M6 can also be used for routing among cells, in this work, we ignore it as an inter-cell routing resource since we want our PTL connections to be surrounded from above and below with ground wires. In fact, it makes sense to use M6 in combination with M5 to make JTL connection cells. This, again, falls outside the scope of our present work.
Both global and detailed routers route nets in a sequential manner to avoid the scalability problem which happens when all nets are routed concurrently. Note that because of the limited fanout drive capability of SFQ cells, we consider the case that explicit splitters are used to generate a fanout of two from a signal source. Therefore, all nets in a typical SFQ circuit are 2-pin nets. The explanations in this chapter address 2-pin nets, but qGDR is more general and can handle any number of pins per signal net.

Figure 4.2: Workflows of qGDR.

4.2.1 Standard SFQ Cell Library

The standard SFQ cell library used in this chapter is adopted from [12]. Figure 4.3(a) shows that there are two parts to each SFQ standard cell in our library: 1) a logic design part realizing a Boolean function and 2) a built-in clock distribution part splitting and/or passing clock pulses. Precisely, the clock distribution part contains a splitter that provides the clock signal to the attached logic part. Furthermore, the clock part can forward the clock pulse to a next cell. The design of all SFQ standard cells follows the MIT-LL SFQ5ee process technology [53].

Figure 4.3: (a) A standard cell consisting of clock and logic parts. JTL: Josephson transmission line. (b) A cell layout partitioned by the global router with tiles. (c) A cell layout partitioned by the detailed router with bins.

There are eight types of standard cells: SplitterCLK, Splitter, NOT, DFF, AND, OR, XOR, and NDRO. SplitterCLK, which has no logic part, is used for clock tree synthesis. The height of the clock part is 40 µm and that of the logic part is 80 µm. Therefore, the height of all cells is 120 µm, except for SplitterCLK, whose height is 40 µm. All signal input and output pins of the logic part are in the M1 layer. All signal input and output pins of the clock part are in the M3 layer. All cells are dc biased and each bias pillar is of size 2.5 µm x 2.5 µm. The propagation (clock-to-Q) delay of AND is 8.7 ps.
More details are listed in Table 4.1 below.

As stated above, the MIT-LL SFQ5ee process technology [53] provides two routing layers for the routing procedure, and each layer is used for PTLs with a propagation speed of 100 µm/ps. In this work, we do not use Josephson transmission lines (JTLs) for data signal routing because JTLs require special routing (especially when they cross in an orthogonal direction to one another), must be sized properly, and, on long interconnections, result in larger propagation delays relative to PTLs [58,59]. Notice that each PTL requires a PTL driver at the driving point and a PTL receiver at the receiving point. The PTL driver and receiver are embedded in each standard cell. The design of the PTL driver and receiver is an intricate task requiring matching the source impedances of the circuits with the characteristic impedance of the PTL interconnect. More details can be found in [31]. The width and the pitch of PTLs are 5 µm and 10 µm, respectively. PTLs on two layers connect with each other through a via, and the size of a via is 5 µm x 5 µm. In [31], the simulation suggests the PTL length can be more than 5 mm if no loss and dispersion are encountered. Furthermore, it has been shown that, practically speaking, the length of the longest PTL of a fabricated SFQ chip could reach 3.95 mm [60]. We have thus conservatively set the length limit to be 2 mm in this work.

Table 4.1: Standard SFQ Cell Library

| Cells | SplitterCLK | Splitter | NOT | DFF | AND | OR | XOR | NDRO |
|---|---|---|---|---|---|---|---|---|
| Height (µm) | 40 | 120 | 120 | 120 | 120 | 120 | 120 | 120 |
| Width (µm) | 40 | 30 | 30 | 40 | 50 | 50 | 50 | 50 |
| #JJs | 3 | 6 | 12 | 11 | 16 | 14 | 18 | 18 |
| #Inputs (+Clk) | 0 (+1) | 1 (+1) | 1 (+1) | 1 (+1) | 2 (+1) | 2 (+1) | 2 (+1) | 1 (+1) |
| #Outputs (+Clk) | 0 (+2) | 2 (+1) | 1 (+1) | 1 (+1) | 1 (+1) | 1 (+1) | 1 (+1) | 1 (+1) |
| Delay (ps) | 5.7 | 5.7 | 13.0 | 6.8 | 8.7 | 6.0 | 6.3 | 10.0 |

4.2.2 Global Router

The first stage of qGDR is global routing (cf. Figure 4.4). The global router divides the routing regions into tiles and determines tile-to-tile wires for nets.
This is very helpful to the overall routing process because the global router can quickly generate loose routing solutions, which are subsequently improved during a detailed routing step [23,57].

To begin with, routing regions are decomposed into rectangular tiles on different layers. Next, we construct an L-layer routing graph, $G^L(V^L, E^L)$. Each vertex $v_i^l$ in $V^L$ represents an individual tile $i$ in the $l$-th layer ($l = 1, \ldots, L$). If two tiles in layer $l$ are adjacent, we add a (routing) edge between the corresponding vertices. In addition, we add a (via) edge between a pair of tiles which are aligned in the vertical direction (one is directly above the other). For the MIT-LL process, $L = 2$ because only two metal layers are available for PTL routing. As shown in Figure 4.4(a), routing regions are decomposed into a 3 by 3 tile array for the bottom routing layer and another 3 by 3 tile array for the top routing layer. Colored vertices in this figure represent net pins. Vertices of the same color belong to the same net and must be connected through a routing tree constructed on the said graph. The objective of the global router is to generate the shortest connection wires of the specified nets in the 2-layer routing graph (which is a 3-D grid) subject to the available capacity constraints for the tiles. These connection wires are also known as tile-based wires. Next, we explain how the global router achieves its goal by referring to Figure 4.2.

Figure 4.4: Multilayer global routing approach. (a) Routing region decomposition. (b) 2-layer routing graph. (c) Single-layer routing graph construction after compression. (d) Single-layer graph routing. The solid lines are routing wires of nets. (e) Layer assignment from the single-layer graph to the multilayer graph.
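The routing graph construction described above can be illustrated with a short sketch. It is not taken from qGDR; the function name `build_routing_graph` and the `(layer, row, column)` vertex encoding are assumptions made only for this example. Routing edges join adjacent tiles within a layer, and via edges join vertically aligned tiles on neighboring layers.

```python
from itertools import product


def build_routing_graph(rows, cols, layers):
    """Build an L-layer tile graph (hypothetical helper, not qGDR code).

    A vertex is a tuple (layer, row, col). Returns the vertex list, the
    routing edges (same layer, adjacent tiles) and the via edges
    (same row/col, adjacent layers).
    """
    vertices = list(product(range(layers), range(rows), range(cols)))
    routing_edges = []
    via_edges = []
    for l, r, c in vertices:
        if c + 1 < cols:  # neighbor to the right in the same layer
            routing_edges.append(((l, r, c), (l, r, c + 1)))
        if r + 1 < rows:  # neighbor above in the same layer
            routing_edges.append(((l, r, c), (l, r + 1, c)))
        if l + 1 < layers:  # tile directly above on the next layer
            via_edges.append(((l, r, c), (l + 1, r, c)))
    return vertices, routing_edges, via_edges


# The 3 by 3 tile arrays on two layers used in Figure 4.4(a).
vertices, routing_edges, via_edges = build_routing_graph(3, 3, 2)
print(len(vertices), len(routing_edges), len(via_edges))
```

For the two 3 by 3 tile arrays, this yields 18 vertices, 24 routing edges (12 per layer), and 9 via edges.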
The length (measured along the x direction) and width (measured along the y direction) of a tile are specified as integer multiples of the PTL pitch (i.e., the minimum allowed distance between the center of one PTL and the center of an adjacent PTL). Therefore, the length and width of the tile indicate the maximum available routing tracks for the tile (called the maximum capacity). So, if the tile has a length of 3 PTL pitches and a width of 2 PTL pitches, then we can have a maximum of three y-direction wires and a maximum of two x-direction wires going through the tile. Note that because a tile may include blockages, the available routing tracks of the tile may be fewer than the maximum values. After abstracting each tile as a vertex, qGDR annotates the right connection edge and the top connection edge of the vertex with the number of the available routing tracks (called available capacity) in the horizontal and vertical directions, respectively. For example, if a vertex represents a tile that has a length of 3 PTL pitches and a width of 2 PTL pitches with no blockages inside, the available routing capacities of the top connection edge and the right connection edge of the vertex are 3 and 2, respectively. With a blockage of one horizontal track and one vertical track inside the tile, the available routing capacities of the top connection edge and the right connection edge of the vertex drop to 2 and 1, respectively.

Single-layer Routing Graph Construction

There are two main approaches for the multilayer global routing problem: direct and indirect. The direct routing method works directly on the multilayer routing graph but can take an excessive amount of CPU time for large-scale circuits [61].
In contrast, the indirect routing method compresses the multilayer routing graph into a single-layer routing graph, solves the routing problem on this graph, and reconstructs the multilayer routing solution by doing layer assignment on the solutions obtained from the single-layer routing graph. The objective of the layer assignment is to minimize the total number of vias used. The global router in qGDR is based on the indirect routing approach in order to handle large-scale SFQ circuits.

First, we build a single-layer routing graph $G(V, E)$ by compressing the multilayer routing graph $G^L(V^L, E^L)$, as shown in Figure 4.4(c). The pin locations in the multilayer graph are projected into the single-layer graph and are represented by the colored vertices. Second, we define the available capacity $c(e_i)$ of each edge $e_i$ in the single-layer graph to be the summation of the available capacities of the corresponding edges in the two routing layers. Note that the blockages are the forbidden routing regions with a minimum size of 1 horizontal pitch x 1 vertical pitch. In this chapter, the four bias pillars at the corners of each cell (used to provide DC bias currents to an SFQ logic cell) are considered as blockages. For example, if two vertically aligned tiles (on two different metal layers) have a width of 2 pitches and include no internal blockages, then the right connection edge of the represented vertex in the single-layer routing graph will have an available capacity of 4. The available capacity of the top connection edges of the tile in the single-layer routing graph is similarly calculated. Third, we define the routing demand $d(e_i)$ as the number of routing wires going through $e_i$. The overflow value of an edge $e_i$ may then be calculated as:

$$ ov(e_i) = \begin{cases} d(e_i) - c(e_i), & \text{if } d(e_i) - c(e_i) > 0 \\ 0, & \text{otherwise} \end{cases} \tag{4.1} $$

That is, the overflow value of $e_i$ is greater than 0 when there is excessive routing demand on $e_i$.
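The capacity and overflow bookkeeping described above can be sketched in a few lines. The function names `tile_capacity`, `compressed_capacity`, and `overflow` are hypothetical and chosen only for this illustration; the numbers reproduce the examples given in the text (a tile of 3 by 2 pitches, and two stacked tiles of width 2).

```python
def tile_capacity(length_pitches, width_pitches,
                  blocked_vertical=0, blocked_horizontal=0):
    """Return (top, right) available capacities of one tile.

    The top edge carries y-direction wires, so its capacity is the tile
    length in pitches minus blocked vertical tracks. The right edge
    carries x-direction wires, so its capacity is the tile width in
    pitches minus blocked horizontal tracks.
    """
    top = max(length_pitches - blocked_vertical, 0)
    right = max(width_pitches - blocked_horizontal, 0)
    return top, right


def compressed_capacity(per_layer_capacities):
    """Capacity of a single-layer edge: sum over the routing layers."""
    return sum(per_layer_capacities)


def overflow(demand, capacity):
    """Overflow of an edge as in Equation (4.1)."""
    return max(demand - capacity, 0)


print(tile_capacity(3, 2))            # no blockage
print(tile_capacity(3, 2, 1, 1))      # one vertical and one horizontal track blocked
print(compressed_capacity([2, 2]))    # right edges of two stacked tiles of width 2
print(overflow(6, 4), overflow(3, 4))
```

The four printed results are `(3, 2)`, `(2, 1)`, `4`, and `2 0`, matching the worked examples in the text.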
The global router seeks to minimize the total overflow value in the single-layer routing graph.

Netlist Mapping

The circuit netlist provides a list of pin connections of nets $N = \{n_i\}$ of an SFQ circuit. After the routing region decomposition, $N = \{n_i\}$ is separated into two net sets: a global net set $\{n_i^g\}$ and a local net set $\{n_i^l\}$. For each global net $n_i^g$, the two pins of the net are located in different tiles. For each local net $n_i^l$, the pins are located in one tile. The global router of qGDR focuses on routing global nets because the local nets are less critical for the performance of the large-scale SFQ circuit and are handled by the detailed router.

Netlist mapping refers to the process of generating a loose global routing solution. Therefore, the step of sorting nets is skipped to save time. The mapping is done by performing L-shaped routing. The L-shaped routing connects each 2-pin net with no more than 1 bend, as shown in Figure 4.5. If there is a multi-pin net $n_i^g$, the rectilinear Steiner minimal tree (RSMT) for $n_i^g$ is generated by the estimation algorithm of FLUTE [62]. An RSMT is a tree with the minimum total edge length (measured in Manhattan distances) that connects a set of nodes, possibly using Steiner routing nodes. The multi-pin net is then decomposed into several two-pin nets for the subsequent L-shaped routing. During the L-shaped routing, the cost of using an edge $e_i$ for a net connection is given by

$$ cost(e_i) = d(e_i) - c(e_i) \tag{4.2} $$

where $d(e_i)$ and $c(e_i)$ are the demand and the capacity of $e_i$, respectively. We call $cost(e_i)$ the edge cost. Note that this cost increases monotonically as the L-shaped routing process progresses. Herein, we define the routing cost of a wire as the summation of the edge costs of all edges which comprise the wire. When there are multiple connection wires for selection, the L-shaped routing prefers a connection wire with lower routing cost.

Figure 4.5: L-shaped Routing.
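The L-shaped routing step can be sketched as follows. This is an illustrative sketch rather than qGDR code: tiles are `(x, y)` tuples on a single-layer grid, the edge cost is taken to be demand minus capacity (our reading of Equation (4.2)), and the helper names and the `default_capacity` value are assumptions.

```python
def straight(a, b):
    """Tiles on the straight segment from a to b, inclusive (same row or column)."""
    (x0, y0), (x1, y1) = a, b
    if x0 == x1:
        step = 1 if y1 >= y0 else -1
        return [(x0, y) for y in range(y0, y1 + step, step)]
    step = 1 if x1 >= x0 else -1
    return [(x, y0) for x in range(x0, x1 + step, step)]


def l_paths(src, dst):
    """The (at most two) candidate wires with no more than one bend."""
    paths = []
    for corner in ((dst[0], src[1]), (src[0], dst[1])):
        path = straight(src, corner) + straight(corner, dst)[1:]
        if path not in paths:
            paths.append(path)
    return paths


def edges_of(path):
    """Undirected edges along a path, each as a sorted pair of tiles."""
    return [tuple(sorted(pair)) for pair in zip(path, path[1:])]


def route_l_shaped(src, dst, demand, capacity, default_capacity=2):
    """Pick the cheaper L-shaped wire and add it to the edge demands."""
    def wire_cost(path):
        # Routing cost of a wire = sum of edge costs (demand - capacity).
        return sum(demand.get(e, 0) - capacity.get(e, default_capacity)
                   for e in edges_of(path))

    best = min(l_paths(src, dst), key=wire_cost)
    for e in edges_of(best):
        demand[e] = demand.get(e, 0) + 1
    return best


demand, capacity = {}, {}
first = route_l_shaped((0, 0), (2, 1), demand, capacity)
second = route_l_shaped((0, 0), (2, 1), demand, capacity)
print(first)
print(second)
```

Routing the same pin pair twice shows the effect of the growing edge cost: the first wire bends at tile (2, 0), and the second wire is pushed onto the other L shape, which bends at (0, 1).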
Routing Demand Calculation and Overflowing Edge Identification

After the netlist mapping, the demand and the overflow values of all edges are calculated, and edges with non-zero overflow values are identified as overflowing edges. If there are no overflowing edges, the global router moves on to the layer assignment process. Otherwise, the global router starts the process of determining a group of nets to be ripped up and rerouted.

Determination of a Group of Nets to be Ripped Up

To begin with, the global router selects nets with connection wires going through any overflowing edges. The global router only rips up a subset of the said selected nets because the number of the ripped-up nets should be relatively small [63]. There are three steps for determining a group of nets to be ripped up. First, all selected nets are ranked with a scoring function. The score of a net $n_i^g$ in the $k$-th iteration of the rip-up and re-route process is calculated as:

$$ score(n_i^g) = c_1 \frac{\mathit{OverflowEdgeCnt}}{\mathit{WireLengthLB}} + c_2 \left(\mathit{BendCnt} \cdot \log(k + 1) - \mathit{PinCnt}\right) + c_3 \frac{\mathit{ActualWireLength}}{\mathit{WireLengthLB}} \tag{4.3} $$

where OverflowEdgeCnt denotes the number of the overflowing edges used for routing net $n_i^g$; BendCnt is the number of times the routing direction changes between the horizontal and vertical directions; PinCnt refers to the number of connection pins of $n_i^g$; and $c_1$, $c_2$, and $c_3$ are pre-determined weighting coefficients. Next, all scores are normalized to lie in the interval [0.1, 0.9], and a random number generator is used to generate a uniformly distributed random number in the interval [0, 1] for each net. Finally, the global router rips up a net if the score of the net is larger than the random number generated by the generator. As a result, nets with higher scores are more likely to be ripped up.

We provide more details about the scoring function in this paragraph.
OverflowEdgeCnt in Equation (4.3) is divided by WireLengthLB because the OverflowEdgeCnt of long connection wires is generally larger than that of short connection wires. If we did not divide OverflowEdgeCnt by WireLengthLB, the scores of the long connection wires could be much larger than those of the short wires. As a result, the rip-up process might be applied only to the long connection wires. The second term in Equation (4.3) emphasizes BendCnt as the iteration count goes up. We make this term the dominant term of the scoring function because we want to minimize the total number of vias (an important consideration in routing of PTLs in superconductive electronics). Since there is a preferred direction of routing for each metal routing layer, wires with a high number of bends after the global routing usually result in routing solutions with a high number of vias after the detailed routing. The third term in the scoring function focuses on the minimization of the total wire length of the complete routing solution.

Round 1: Rip-up and Re-route to Minimize Edge Overflows

There are five iterations in round 1. In each iteration, the global router rips up the nets which are determined by the previous process and reroutes them. The general framework of the rip-up and re-route process is illustrated by an example in Figure 4.6. The global router sorts the ripped-up nets in a non-decreasing order of their normalized scores, updates the edge costs, and performs the re-route process according to the said order. The cost of using an edge $e_i$ in the $k$-th iteration is given by:

$$ cost(e_i) = \left(1 + \frac{h^k_{e_i}}{k}\right) \cdot penalty \tag{4.4} $$

$$ h^{k+1}_{e_i} = \begin{cases} h^k_{e_i} + 1, & \text{if } e_i \text{ is overflowing} \\ 0, & \text{otherwise} \end{cases} \tag{4.5} $$

where $h^k_{e_i}$ is the history cost of edge $e_i$, $penalty$ is the penalty cost, and $h^k_{e_i}/k \le 1$. In the re-routing process, the global router uses a maze routing algorithm to find connection wires with the least routing cost. There are three steps in the maze routing algorithm: bounding box formation, wave propagation, and trace back. We explain these steps on a 7 by 7 tile array shown in Figure 4.7. Notice that a tile is equivalent to a vertex in the routing graph, and thus, these two terms will be used
There are three steps in the maze routing algorithm: bonding box formation, wave propagation, and trace back. We explain these steps on a 7 by 7 tile array shown in Figure 4.7. Notice that a tile is equivalent to a vertex in the routing graph, and thus, these two terms will be used 79 (c) (b) net1 net2 (a) Rip-up Re-route Figure 4.6: Rip-up and re-route process. interchangeably in this chapter. First, the global router chooses a search space by forming a rectangular bounding box containing source vertices of each global net, as shown in Figure 4.7(a). Second, minimum costs from source vertices to other vertices in the search space are calculated by using a wave (frontier) propagation method until a minimum-cost connection wire between the source vertices is found. Wire cost calculation is not performed for any blockages because they are forbidden routing regions. The costC(v i ;v s ) of creating a wire from vertexv i to source vertex v s is written as: C(v i ;v s ) = min v j C(v i ;v j ) +C(v j ;v s ) (4.6) where v j can be an arbitrary vertex in the search space. In the interesting case of v j being an adjacent vertex ofv i , if we lete ij denote the edge connectingv i andv j , then C(v i ;v j ) is equal to cost(e ij ). Figure 4.7(b) shows a simple example with all edge costs set to 1. Finally, the wire connecting the two source vertices with the least cost is identied (and committed to) by performing a trace back from from 80 one source vertex to the other. The wire commitment increases the edge costs of all used edges by a constant penalty, wherepenalty is the same constant used in Equation (4.4). Because of this increase in the edge costs, the subsequently performed re-routing step will prefer to route nets utilizing unused graph edges and thereby minimize edge over ows. b X X XX b b X 3 1 2 3 X2 2 XX2 1 3 3 2 1 b b X 1 X 2 XX 3 3 2 1 b (a) (b) (c) Figure 4.7: Maze routing algorithm. (a) Bounding box formation. b: source vertex. X: blockage. 
(b) Double fanout wave propagation. (c) Trace back. Round 2: Rip-up and Re-route with Expanded Search Space and Max- imum Routing Cost Threshold When there are no over owing edges after the round 1 process, the global router starts the layer assignment process. Otherwise, the global router repeats the rip-up and re-route process (up to three more times) as part of a round 2 process. The 81 scoring function used to identify the nets to be ripped up is, however, dierent. More precisely, the scoring function for the round 2 process is given by: score(n g i ) =c 1 OverflowEdgeCnt WireLengthLB +c 2 (BendCntPinCnt) +c 3 ActualWireLength WireLengthLB : (4.7) Here, we remove the weight of log(k + 1) before BendCnt in Equation (4.3) so that minimization of BendCnt is not emphasized as iteration count increases. The reason is that the global router is now trying to nd global routing results without any over owing edges and via count minimization is less important. Note, however, that the likelihood of ripping up nets with many bends still remains higher than that of ripping up other nets. The global router must update the edge costs before the re-routing process. The cost of using an edge e i in the k-th iteration is calculated as follows: cost(e i ) = (1 +c 4 log(kh k e i + 1))penalty; (4.8) where h k e i is the history cost dened in Equation (4.5) and c 4 is a weighting coef- cient. Notice that we don't reset the value of h k e i after the round 1, so the value monotonically increases during the whole global routing process. If some edges are frequently identied as over owing edges, the factor oflog(kh k e i +1) in the above 82 equation can discourage the global router from using these edges due to higher costs compared to using other edges. In round 2, the global router runs the maze routing algorithm with an expanded search space and a maximal routing cost threshold for 10 passes. Figure 4.8 demon- strates an example. 
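The three maze routing steps (bounding box formation, wave propagation, trace back) can be sketched with a standard Dijkstra-style search restricted to the box. This is a minimal illustration, not the qGDR implementation: the function `maze_route`, the `margin` parameter (how far the box is widened beyond the two pins), and the `(x, y)` tile encoding are assumptions, and chip boundaries are ignored.

```python
import heapq


def maze_route(src, dst, blocked, edge_cost, margin=0):
    """Least-cost wire from src to dst inside a bounding box.

    blocked is a set of forbidden tiles; edge_cost(v, w) must return a
    non-negative cost. Returns (path, cost), or (None, None) when no
    wire exists inside the box.
    """
    # Step 1: bounding box around the two pins, widened by `margin`.
    x_lo, x_hi = min(src[0], dst[0]) - margin, max(src[0], dst[0]) + margin
    y_lo, y_hi = min(src[1], dst[1]) - margin, max(src[1], dst[1]) + margin

    # Step 2: wave (frontier) propagation from src.
    best = {src: 0}
    parent = {}
    frontier = [(0, src)]
    while frontier:
        cost, v = heapq.heappop(frontier)
        if v == dst:
            break
        if cost > best[v]:
            continue  # stale heap entry
        x, y = v
        for w in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = x_lo <= w[0] <= x_hi and y_lo <= w[1] <= y_hi
            if not inside or w in blocked:
                continue
            new_cost = cost + edge_cost(v, w)
            if new_cost < best.get(w, float("inf")):
                best[w] = new_cost
                parent[w] = v
                heapq.heappush(frontier, (new_cost, w))
    if dst not in best:
        return None, None

    # Step 3: trace back from dst to src along the parent pointers.
    path = [dst]
    while path[-1] != src:
        path.append(parent[path[-1]])
    path.reverse()
    return path, best[dst]


def unit_cost(v, w):
    return 1


tight = maze_route((0, 0), (2, 0), {(1, 0)}, unit_cost, margin=0)
wide_path, wide_cost = maze_route((0, 0), (2, 0), {(1, 0)}, unit_cost, margin=1)
print(tight, wide_cost)
```

In this example the tight box contains only the blocked tile between the pins, so no wire is found; widening the box by one tile in each direction lets the search detour around the blockage at a cost of 4. An acceptance test against a routing cost upper bound could be added by comparing the returned cost with that bound.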
In detail, the bounding box can be expanded to broaden the search space in successive passes. The bounding box expands in each direction by exactly one vertex after each pass, as shown from Figure 4.8(a) to Figure 4.8(b). To minimize the number of the overflowing edges, the routing cost of $n_i^g$ in the $m$-th pass cannot be larger than a routing cost upper bound, which is given by:

$$ UBound(n_i^g, 1) = \mathit{LowerBoundLength} \cdot penalty \tag{4.9} $$

$$ UBound(n_i^g, m > 1) = UBound(n_i^g, m - 1) + c_5 \cdot penalty \tag{4.10} $$

where $c_5$ is another weighting coefficient. As an example, in Figure 4.8(c), if $UBound(n_i^g, 2)$ of the shown net is 6 or higher, we will accept the connection wire; otherwise, we will reject it. After routing the newly found connection wire, edge costs of all used edges are increased by $penalty$.

Figure 4.8: Maze routing algorithm with an expanding box. (a) Wave propagation. b: source vertex. X: blockage. (b) Bounding box expansion. (c) Connection wire commitment.

Last-Gasp Re-routing of Any Nets While Avoiding Any Edge Overflows

If there is no overflowing edge after the round 2 process, the global router will continue with the layer assignment process. Otherwise, the global router starts the last-gasp re-routing process to eliminate all edge overflows. Here, last-gasp re-routing simply means that this re-routing step is the final re-routing attempt of the global routing stage. The way of determining the group of nets to be ripped up is the same as before, but a different scoring function is used. The scoring function for the last-gasp round is given by:

$$ score(n_i^g) = c_1 \frac{\mathit{OverflowEdgeCnt}}{\mathit{ActualWireLength}} + c_2 \cdot \mathit{BendCnt} \tag{4.11} $$

where $c_1$ and $c_2$ are the same weighting coefficients used in Equation (4.3).

The global router must update the edge costs before the last-gasp re-routing process can commence.
The cost of using $e_i$ during the last-gasp process is:

$$ cost(e_i) = \begin{cases} 1, & \text{if } e_i \text{ is overflowing} \\ 0, & \text{otherwise} \end{cases} \tag{4.12} $$

The last-gasp process comprises 10 passes. The routing algorithm used in each pass is the same as that used in round 2. However, the routing cost upper bound is set to zero (which means no connection wire can include any overflowing edges). As a result, the final global routing solution will not include any overflowing edges.

Layer Assignment to Global Routes

The objective of the layer assignment process is to transform single-layer global routes into multilayer global routes without changing the routing topology or generating any overflowing edges [57]. To clarify the problem, let $\{n_i^g\}$ be the routed net set and $T(n_i^g) = \{v_j\}$ be the vertex set for routing $n_i^g$ in the single-layer routing graph, $G(V, E)$. The layer assignment problem is to generate the vertex set $T^L(n_i^g) = \{v_j^l\}$ in the multilayer routing graph, $G^L(V^L, E^L)$, for each $n_i^g$ in $\{n_i^g\}$ without increasing the number of overflowing edges. Figure 4.9(a) depicts an example where edges along a connection wire can be assigned either to the M1 layer or to the M3 layer. In [61], a dynamic layer assignment algorithm is proposed for solving the layer assignment problem while minimizing the number of generated vias. The global router in qGDR implements this algorithm with some modifications, as explained next.

Figure 4.9: The layer assignment. (a) Single-layer global routes. (b) Dynamic programming process. (c) Multilayer global routes.

There are two main parts in the dynamic layer assignment algorithm: net order determination and single-net layer assignment.
If there is a multi-pin net (which has been decomposed into several two-pin nets as part of the netlist mapping process), the global router produces the complete routing solution for the said multi-pin net by combining the routes of the corresponding two-pin nets. The global router determines the net order by sorting all routed nets by their scores from high to low. The scoring function is:

$$ score(n_i^g) = \frac{c_6}{\mathit{ActualWireLength}} + \frac{c_7}{\mathit{PinCnt}} + c_8 \frac{\sum_{e_j \in Wire(n_i^g)} d(e_j)}{\sum_{e_j \in Wire(n_i^g)} c(e_j)} \tag{4.13} $$

where $Wire(n_i^g)$ is the connection wire of net $n_i^g$ and $c_6$, $c_7$, and $c_8$ are weighting coefficients. Given the first term in this equation, the layer assignment for long wires is postponed relative to that of shorter wires because the layer assignment of long wires tends to quickly consume valuable routing resources [61]. In general, the layer assignment process favors the routing layer where the connection pins are located. The layer assignment of long wires may thus utilize the routing resources which are also needed by many other nets. Indeed, a better routing strategy is to route long wires using infrequently utilized routing layers. The second term in Equation (4.13) is added because we would like to do the layer assignment of the two-pin nets before doing the multi-pin nets. The third term in Equation (4.13) considers the presence of congested routing regions. Indeed, the layer assignment of routed nets in the congested regions has a significant impact on the quality of the final global routing results. Therefore, if the connection wire of a net goes through congested regions, the global router should do the layer assignment of such nets before other nets.

The single-net layer assignment is a dynamic programming algorithm, which generates $T^L(n_i^g) = \{v_j^l\}$ in the multilayer routing graph for one net at a time while minimizing the via count for the assignment.
To begin with, the tree topology of each routed net is built and one source vertex is identified as a child/leaf vertex, as shown in Figure 4.9(b). Assume that the child vertex $v_s$ is in the $l$-th routing layer. Next, this dynamic programming algorithm calculates the wire cost from a vertex $v_i^{l_1}$ in the $l_1$-th layer to the source vertex $v_s^{l}$ by the following formula:

$$C(v_i^{l_1}, v_s^{l}) = \min_{l_2} \left[ C(v_i^{l_1}, v_j^{l_2}) + C(v_j^{l_2}, v_s^{l}) \right] \tag{4.14}$$

where $v_j^{l_2}$ is the root vertex of the sub-tree of $v_i^{l_1}$. When $l_2$ and $l_1$ are different, $\mathrm{Wire}(v_i^{l_1}, v_j^{l_2})$ is calculated as $|l_1 - l_2|$ to represent the required via count. The layer assignment is determined by tracing back the least-cost wire. Notice that the layer assignment should not increase the number of overflowing edges, so the algorithm sets up two constraints while searching for the solution:

$$ov(e_j^{l}) \le \frac{\max_{e_k \in \mathrm{Wire}(n_i^g)} ov(e_k)}{L}, \quad e_j \in \mathrm{Wire}(n_i^g) \tag{4.15}$$

$$\sum_{l=1}^{L} ov(e_j^{l}) \le ov(e_j) \tag{4.16}$$

where $ov(e_j^{l})$ is the overflow value of $e_j^{l}$ in $G^L(E^L, V^L)$, following the definition of Equation (4.1). Equation (4.15) states that if we use $e_j^{l}$ in the multilayer routing graph to represent $e_j$ in the single-layer routing graph, the updated $ov(e_j^{l})$ cannot be larger than the maximum $ov(e_k)$ divided by $L$, where $e_k$ is an edge used for routing $n_i^g$ in $G(E, V)$ and $L$ is the number of layers in the routing graph. Equation (4.16) states that after the assignment, the summation of $ov(e_j^{l})$ over the different layers cannot be larger than $ov(e_j)$ in $G(E, V)$. Following Equation (4.14) with the said two constraints, the single-net layer assignment transforms all $T(n_i^g) = \{v_j\}$ into $T^L(n_i^g) = \{v_j^l\}$ without changing the routing topology or increasing overflow values while minimizing the via count.

4.2.3 Feedback System

After the global routing, qGDR checks the number of successfully routed nets and, if the number is more than 90% of the total net count, it starts the detailed routing process. 
Otherwise, a feedback mechanism is triggered. Herein, we assume that all cells are placed in a predefined set of rows, $\{row_i\}$, $i = 1, \ldots, n$, on the chip. Note that this assumption is compatible with the general electronic design automation (EDA) flow and tools targeting cell-based ASIC design [23].

Routing Track Increase

If the feedback system is triggered for the first time, qGDR initializes the value of an internal feedback counter to 0. Each time the feedback system is triggered, the counter value is incremented by 1. If the counter value is divisible by 3, the feedback system increases the number of vertical routing tracks between individual cells, as shown in Figure 4.10(a). Otherwise, the feedback system increases the number of horizontal routing tracks between rows of cells, as shown in Figure 4.10(b).

When the feedback system increases the vertical routing track count, the system identifies the row that contains the largest number of cells. The space between each two neighboring cells in this row is increased by 1 vertical pitch by shifting the cells in that row rightward. The rightward shift changes the x-coordinates of the cells in this row. We denote by x the largest change of the cell x-coordinates in this row. For example, if there are 100 cells in the row with the largest number of cells, x is (100-1) vertical pitches.

Before increasing the horizontal routing track count by an upward shift of y pitches, the feedback system checks how many failed nets are connected to the cells in $row_i$. There are three cases: 1) if 80% or more of the failed nets are connected to the cells in $row_i$, all cells in $row_i$ are shifted upward by 5 horizontal pitches; 2) if 30% or more but less than 80% of the failed nets are connected to the cells in $row_i$, all cells in $row_i$ are shifted upward by 2 horizontal pitches; 3) if less than 30% of the failed nets are connected to the cells in $row_i$, all cells in $row_i$ are shifted upward by only 1 horizontal pitch. 
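The track-increase rules above can be summarized in a short sketch. The function names are hypothetical, and the counter is passed in as a plain integer because the text does not fix whether the divisibility test runs before or after the increment.

```python
from typing import List


def track_increase_kind(counter: int) -> str:
    """Vertical tracks when the counter is divisible by 3, else horizontal."""
    return "vertical" if counter % 3 == 0 else "horizontal"


def vertical_shifts(num_cells: int) -> List[int]:
    """Rightward shift, in vertical pitches, of each cell in the fullest row.

    Every gap between neighboring cells grows by one pitch, so the k-th cell
    (counting from 0 at the left) moves by k pitches.
    """
    return list(range(num_cells))


def upward_shift_pitches(failed_in_row: int, failed_total: int) -> int:
    """Upward shift of a row, in horizontal pitches, from its failed-net share."""
    if failed_total <= 0:
        raise ValueError("the feedback system runs only when some nets failed")
    # Integer comparison avoids floating-point rounding at the 80% and 30% edges.
    if failed_in_row * 100 >= 80 * failed_total:
        return 5
    if failed_in_row * 100 >= 30 * failed_total:
        return 2
    return 1
```

For a row of 100 cells, `max(vertical_shifts(100))` gives the 99-pitch largest change used in the example above.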
All cell rows above the affected row are moved by an upward shift of y pitches. After increasing the routing track counts, qGDR restarts the global routing process with the new cell placement.

Figure 4.10: Feedback System. (a) Vertical Routing Track Increase. (b) Horizontal Routing Track Increase.

4.2.4 Detailed Router

Following the global routing results, the detailed router of qGDR determines the specific wire segments and vias used for routing each net without violating any design rules. The objective of the detailed router is to generate as many routed nets as possible. Following Figure 4.2, we explain how the detailed router approaches this objective. Our detailed router is built on top of the open-source Qrouter tool [64].

Grid Graph Construction and Net Ordering for Routing

The routing regions of circuits are first decomposed into bins to construct a grid graph, as shown in Figure 4.11(a). The width and the length of each bin are one vertical pitch and one horizontal pitch, respectively. After the grid graph construction, the detailed router orders the nets such that nets for which no global routing results exist (such nets are mostly local nets) have a higher priority than those with global routing results. Such a net ordering typically leads to better routing results [23]. The following paragraph explains the details of the ordering process.

First, the smallest (rectangular) bounding box for each net is determined. Next, $width(n_i)$ and $length(n_i)$ of $net_i$ are set to the width and the length of its smallest bounding box. For nets without any global routing results, those with smaller values of $\min(width(n_i), length(n_i))$ are routed before the other ones. In case of a tie, the net with the larger pin count is routed before the other one(s). 
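The ordering rule for nets without global routing results maps naturally onto a composite sort key. The sketch below is a hypothetical illustration: the `Net` class and its field names are assumptions, and nets that do have global routing results are simply kept in their input order after the local nets, since ranking them needs the separate comparison function.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Net:
    name: str
    width: int              # width of the smallest bounding box
    length: int             # length of the smallest bounding box
    pin_count: int
    has_global_route: bool  # True if the global router produced a route


def order_for_detailed_routing(nets: List[Net]) -> List[Net]:
    """Local nets first, ordered by min(width, length), ties by larger pin count."""
    local = sorted(
        (n for n in nets if not n.has_global_route),
        key=lambda n: (min(n.width, n.length), -n.pin_count),
    )
    # Nets with global routes keep their input order in this sketch.
    with_global = [n for n in nets if n.has_global_route]
    return local + with_global
```

Negating the pin count in the key lets a single ascending sort handle both criteria.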
For nets with global routing results, those that have smaller values as generated by the following comparison function are given higher priority:

$$f(n_i) = c_9 \cdot \max_{e_i \in \mathrm{Wire}(n_i^g)} \left[ c(e_i) - d(e_i) \right] + c_{10} \cdot \mathrm{ViaCnt}(n_i^g) + c_{11} \cdot \min(width(n_i), length(n_i)) + c_{12} \cdot \mathrm{PinCnt} \tag{4.17}$$

where $n_i^g$ denotes the corresponding global net of $n_i$ and $\mathrm{ViaCnt}(n_i^g)$ is the number of vias used after the layer assignment process for routing $n_i^g$. This comparison function is designed to place the nets which have a higher chance of being routed before the others.

Figure 4.11: Detailed Routing. (a) Grid Graph Construction. (b) Routing Masks. (c) Detailed Routing Results.

Detailed Net Routing

The detailed router also utilizes a maze routing algorithm for 10 passes with a maximal routing cost threshold. However, the details of the routing algorithm differ between the global and detailed routers, as explained next.

To improve the search efficiency, we use three types of routing masks to define the search space for different nets, analogous to the bounding box used in the global routing: (1) a rectangular mask, (2) a trunk-and-branches mask, and (3) a global-routing based mask. The rectangular mask is the same as the minimum bounding box and is applied to the 2-pin nets without any global routing results. For example, the search spaces of Net1 and Net2 in Figure 4.11(b) are determined by the rectangular mask. qGDR applies the trunk-and-branches mask to the multi-pin nets without any global routing results. More information about the trunk-and-branches mask can be found in [23]. The global-routing based mask is applied to nets with global routing results. In detail, the global routing process generates a single-layer tile-based routing wire for each net and then the mask of each net is formed by projecting the single-layer tile-based wire onto all available routing layers. For example, the blue mask of Net3 in Figure 4.11(b) is a global-routing based mask. 
The figure shows that the tile size is 2 horizontal pitches by 2 vertical pitches and that Net3 is routed with an L-shaped tile-based wire.

Wave propagation starts after the search space has been determined. It operates in the same way as it does during the global routing process. The detailed router, however, calculates the cost of building a connection wire from a source bin, $b_s^{l}$, in the $l$-th layer, to another bin, $b_i^{l_1}$, in the $l_1$-th layer by the following formula:

$$C(b_i^{l_1}, b_s^{l}) = \min_{b_j^{l_2}} \left[ g(b_i^{l_1}) + C(b_i^{l_1}, b_j^{l_2}) + C(b_j^{l_2}, b_s^{l}) \right] \tag{4.18}$$

where $b_j^{l_2}$ is a bin adjacent to $b_i^{l_1}$. $g(b_i^{l_1})$ is equal to CrossoverCost if $b_i^{l_1}$ is over or under an unrouted pin. Otherwise, it is zero. $g(\cdot)$ helps prevent the pins of a net from getting surrounded by wires of other nets before the pins are used for connection. The value of $C(b_i^{l_1}, b_j^{l_2})$ depends on three parameters: (1) SegCost, (2) JogCost, and (3) ViaCost. The initial value of $C(b_i^{l_1}, b_j^{l_2})$ is 0. SegCost is added to $C(b_i^{l_1}, b_j^{l_2})$ if $l_1 = l_2$ and $b_j^{l_2}$ is propagated to $b_i^{l_1}$ in the preferred direction of the $l_1$-th routing layer. JogCost is added to $C(b_i^{l_1}, b_j^{l_2})$ if $l_1 = l_2$ and $b_j^{l_2}$ is propagated to $b_i^{l_1}$ against the preferred direction of the $l_1$-th routing layer. Moreover, the addition of SegCost and JogCost is weighted if the length (wirelength) of the wire connecting $b_s^{l}$ and $b_i^{l_1}$ is more than 1.05X of the longest wire found by the global router. Such a condition occurs when there are many detours in the wire. ViaCost is added to $C(b_i^{l_1}, b_j^{l_2})$ if $l_1$ is not equal to $l_2$. ViaCost is increased if the via count for building the wire is higher than the maximum via count allowed per net. As a result, the maximum wirelength and the maximum via count are controlled during the detailed routing procedure.

The cost of building a connection wire for a net $n_i$ is limited by a maximum routing cost threshold. 
The maximum routing cost threshold in the $k$-th pass is given by

$$\mathrm{threshold}(n_i, k) = 2 \cdot \mathrm{threshold}(n_i, k-1) \tag{4.19}$$

$$\mathrm{threshold}(n_i, 1) = c_{11} + c_{12} \cdot \mathrm{ViaCost} + c_{13} \cdot \min(width(n_i), length(n_i)) \cdot \mathrm{SegCost} \tag{4.20}$$

where $c_{11}$, $c_{12}$, and $c_{13}$ are weighting coefficients. Note that the detailed router routes each net without ripping up routed nets. The nets which fail the detailed net routing are kept in a list as failed nets for the next process.

Detailed Rip-up and Re-route

After the detailed net routing, the failed nets in the list are re-routed. The routing algorithm in the re-route process is still the maze routing algorithm with a maximal routing cost threshold. The search space is now the whole grid graph, so the detailed router can focus on routing as many nets as possible regardless of the search efficiency.

Another cost parameter, CollisionCost, is introduced into the wire cost calculation during the wave propagation. To be specific, a short circuit connection between two nets is not allowed in the previous process, so bin $b_i^{l_1}$ is unusable for routing a failed net if this bin is already used. However, $b_i^{l_1}$ becomes usable in the detailed rip-up and re-route process, but a CollisionCost term is added to the wire cost of a failed net if the said wire uses $b_i^{l_1}$. Therefore, $g(b_i^{l_1})$ in Equation (4.18) includes CollisionCost if $b_i^{l_1}$ is already used by a routed net. After the connection wire of a net is committed, nets that are shorted with the wire are ripped up and the ripped-up nets are appended to the list of nets to be re-routed. By adding CollisionCost, $\mathrm{threshold}(n_i, 1)$ may be relaxed and is subsequently calculated as follows:

$$\mathrm{threshold}(n_i, 1) = c_{11} + c_{12} \cdot \mathrm{ViaCost} + c_{13} \cdot \min(width(n_i), length(n_i)) \cdot \mathrm{SegCost} + c_{14} \cdot \mathrm{CollisionCost}. \tag{4.21}$$

$\mathrm{threshold}(n_i, k)$ is calculated by Equation (4.19). The detailed rip-up and re-route process continues until all nets are routed. 
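To make the threshold schedule of Equations (4.19) to (4.21) concrete, the following sketch evaluates the first-pass threshold and its doubling per pass. The function names and all numeric values are placeholders chosen for illustration; the actual coefficients and cost parameters used by qGDR are not given in the text.

```python
def first_pass_threshold(width: float, length: float, seg_cost: float,
                         via_cost: float, c11: float, c12: float, c13: float,
                         c14: float = 0.0, collision_cost: float = 0.0) -> float:
    """threshold(n_i, 1) as in Equation (4.20).

    Passing a nonzero c14 and collision_cost gives the relaxed form of
    Equation (4.21) used during rip-up and re-route.
    """
    return (c11 + c12 * via_cost
            + c13 * min(width, length) * seg_cost
            + c14 * collision_cost)


def pass_threshold(first: float, k: int) -> float:
    """threshold(n_i, k) from Equation (4.19): the bound doubles every pass."""
    if k < 1:
        raise ValueError("passes are numbered from 1")
    return first * 2 ** (k - 1)
```

Because the bound doubles each pass, the tenth pass allows 512 times the first-pass cost, so nets that fail early passes get progressively more freedom to detour.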
However, there are two "early stop" criteria such that the process can stop in a reasonable time even if a fully routed SFQ circuit has not been obtained. These criteria are: 1) If more than 10 routed nets need to be ripped up to route a net, this net is declared unroutable by the detailed router. 2) Before the rip-up and re-route process, the global router initializes a counter to 0 and the variable FailedNetCnt to the number of failed nets. The counter is incremented by 1 each time the re-route function is called. When the counter value is equal to eight times FailedNetCnt, the detailed router checks the detailed routing progress. If the number of remaining failed nets is fewer than FailedNetCnt/2, the detailed router updates FailedNetCnt, resets the counter value to 0, and the detailed rip-up and re-route process continues. Otherwise, the detailed rip-up and re-route process is terminated with a list of all unrouted nets.

Routing Result Generation

Finally, the routing result for the target circuit is generated in the design exchange format (DEF). If all nets of the circuit are routed, the routing process is concluded. Otherwise, qGDR executes the routing track increase process of the feedback system and restarts the global and detailed routing processes anew.

4.3 Experimental Results and Discussions

The proposed post-routing optimization is implemented in about 8,000 lines of C. The hardware environment for the experimental results is a Linux machine with an Intel(R) Xeon(R) CPU E7-8837 @ 2.67 GHz.

To test the proposed framework in two-layer routing, we synthesize Kogge-Stone adders (KSA), array multipliers (Mul), and integer dividers (IntDiv). Furthermore, we synthesize a few of the ISCAS c-series benchmark circuits with different net counts to confirm the robustness of our framework. The synthesized netlists are path-balanced and all fan-outs are implemented with splitters. 
All of the SFQ cell placements are then generated by a placement tool developed specifically for SFQ logic netlists [12]. The clock tree topology is the H-tree. More details are specified in Table 4.2. The values of #Nets in the third column of Table 4.2 are about 1.5X of those of #Cells in the second column. The reason is that the number of SplitterCLK cells is close to that of the other cells due to the H-tree clock topology. Since the fanout counts of SplitterCLK and other cells are 2 and 1, respectively, #Nets is about 1.5X of #Cells. Inputs to the proposed framework are open standard library exchange format (LEF) and design exchange format (DEF) files. LEF includes design rules and cell layouts, whereas DEF includes circuit netlists and the corresponding circuit layouts.

Table 4.2: Circuit Specification

| Circuit | #Cells | #Nets | Area (mm^2) | IdealTotalWL (µm) |
|---|---|---|---|---|
| 4-bit KSA | 171 | 258 | 1.75 | 4.83e4 |
| 8-bit KSA | 534 | 776 | 3.87 | 1.44e5 |
| 16-bit KSA | 1215 | 1847 | 8.88 | 4.03e5 |
| 32-bit KSA | 3753 | 5311 | 30.9 | 1.62e6 |
| 4-bit Mul | 526 | 771 | 4.24 | 1.32e5 |
| 8-bit Mul | 3458 | 4815 | 25.1 | 9.51e5 |
| 4-bit IntDiv | 1092 | 1636 | 9.41 | 3.68e5 |
| 8-bit IntDiv | 7363 | 10555 | 86.0 | 2.87e6 |
| c432 | 2291 | 3500 | 25.0 | 9.70e5 |
| c499 | 2091 | 3045 | 18.2 | 9.52e5 |
| c880 | 3649 | 5129 | 33.8 | 1.59e6 |
| c1355 | 2130 | 3124 | 21.5 | 1.12e6 |
| c1908 | 3706 | 5255 | 30.3 | 1.45e6 |

Fig. 4.12(a) shows the layout of the most complex SFQ circuit reported in this paper: an 8-bit integer divider with 10,555 nets. The circuit consists of more than 45,000 Josephson junctions. qGDR finishes routing this circuit in half an hour of CPU time. To analyze the distribution of used routing resources, we generate a via density graph, an M1 wire density graph, and an M3 wire density graph by using a convolution matrix. Fig. 4.12(b) is the via density graph, which clearly shows that the vias in the layout are evenly distributed. Fig. 4.12(c) is the M1 wire density graph and shows that some routing regions are congested by M1 wires. However, these regions are small compared to the whole layout. Fig. 
4.12(d) is the M3 wire density graph. It shows that M3 routing resources are less utilized by qGDR. This is because the pins of the logic part of the logic cells are implemented in the M1 layer.

Table 4.3: Routing Results of Kogge-Stone Adders, Array Multipliers, Integer Dividers and C-series Circuits of ISCAS Benchmarks (columns 2 to 5: global routing; columns 6 to 10: detailed routing)

| Circuit | Global #RoutedNets | Global TotalWL (µm) | Global #Vias | Global Time (s) | Detailed TotalWL (µm) | Detailed MaxWL (µm) | Detailed #Vias | Detailed #MaxVia | Detailed Time (s) | Total Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| 4-bit KSA | 241 | 4.55e4 | 362 | 0.27 | 5.55e4 | 920 | 353 | 6 | 1.40 | 1.67 |
| 8-bit KSA | 753 | 1.48e5 | 1208 | 0.50 | 1.71e5 | 1400 | 1086 | 14 | 7.13 | 8.61 |
| 16-bit KSA | 1794 | 4.16e5 | 2986 | 1.94 | 4.72e5 | 2130 | 3025 | 16 | 40.5 | 42.5 |
| 32-bit KSA | 5224 | 1.52e6 | 6880 | 9.79 | 1.80e6 | 5380 | 7703 | 26 | 777 | 787 |
| 4-bit Mul | 757 | 1.33e5 | 1115 | 0.51 | 1.49e5 | 1510 | 893 | 8 | 3.13 | 3.64 |
| 8-bit Mul | 4786 | 9.67e5 | 6550 | 4.02 | 1.04e6 | 3490 | 4523 | 18 | 96.8 | 101 |
| 4-bit IntDiv | 1620 | 3.65e5 | 2315 | 1.17 | 4.15e5 | 2170 | 2187 | 19 | 24.5 | 25.7 |
| 8-bit IntDiv | 10527 | 2.86e6 | 11998 | 24.43 | 3.05e6 | 6960 | 9959 | 24 | 1583 | 1607 |
| c432 | 3462 | 9.41e5 | 4338 | 3.70 | 1.06e6 | 4620 | 4560 | 16 | 150 | 154 |
| c499 | 2970 | 8.72e5 | 4238 | 3.04 | 1.07e6 | 4810 | 5031 | 24 | 390 | 393 |
| c880 | 5054 | 1.51e6 | 6785 | 9.25 | 1.75e6 | 7030 | 6969 | 24 | 725 | 734 |
| c1355 | 3048 | 1.03e6 | 4280 | 4.58 | 1.25e6 | 5650 | 5575 | 28 | 449 | 454 |
| c1908 | 5209 | 1.41e6 | 6855 | 6.77 | 1.60e6 | 5880 | 6783 | 24 | 530 | 537 |

Figure 4.12: Routing results of an 8-bit integer divider. (a) Circuit layout. (b) Via density graph. (c) M1 wire density graph. (d) M3 wire density graph.

We compare the performance of previous routers and qGDR by analyzing the running time for routing large-scale SFQ circuits [1]. To be fair, we consider circuits with a similar number of nets because the running time strongly depends on the net count [1]. In [58], the router based on the A* algorithm finishes routing an 8-bit microprocessor (MPU) with 248 nets in 707.2 s. On the other hand, qGDR finishes routing a 4-bit KSA with 258 nets in 1.67 s, as reported in the second row of Table 4.3. Furthermore, the area of the MPU is 41.96 mm^2 whereas the area of the 4-bit KSA is 1.75 mm^2. 
This area reduction is the advantage of full-chip routing over channel routing because there are more available routing tracks for full-chip routing given the same layout area. In [65], the router based on an integer linear programming solver could not finish routing a 16-bit Sklansky adder in 2,000 s. On the other hand, qGDR finishes routing a 16-bit Kogge-Stone adder in 42.5 s.

Table 4.4: Routing Results of Kogge-Stone Adders, Array Multipliers, Integer Dividers and C-series Circuits of ISCAS Benchmarks by Qrouter and qGDR (each ratio is qGDR relative to Qrouter)

| Circuit | TotalWL (µm) Qrouter | TotalWL (µm) qGDR (Ratio) | MaxWL (µm) Qrouter | MaxWL (µm) qGDR (Ratio) | #Vias Qrouter | #Vias qGDR (Ratio) | #MaxVia Qrouter | #MaxVia qGDR (Ratio) | Total Time (s) Qrouter | Total Time (s) qGDR (Ratio) |
|---|---|---|---|---|---|---|---|---|---|---|
| 4-bit KSA | 5.72e4 | 5.55e4 (0.91) | 1010 | 920 (0.91) | 375 | 354 (0.94) | 8 | 6 (0.75) | 1.52 | 1.67 (1.10) |
| 8-bit KSA | 1.76e5 | 1.71e5 (0.97) | 1390 | 1400 (1.01) | 1185 | 1086 (0.92) | 12 | 14 (1.17) | 9.28 | 8.61 (0.93) |
| 16-bit KSA | 5.12e5 | 4.74e5 (0.93) | 2360 | 2130 (0.90) | 3776 | 3025 (0.80) | 20 | 16 (0.8) | 60.6 | 42.5 (0.70) |
| 32-bit KSA | 1.86e6 | 1.80e6 (0.97) | 5470 | 5380 (0.98) | 9500 | 7703 (0.81) | 28 | 26 (0.93) | 868 | 787 (0.91) |
| 4-bit Mul | 1.57e5 | 1.49e5 (0.95) | 1430 | 1510 (1.06) | 1000 | 893 (0.89) | 10 | 8 (0.80) | 4.11 | 3.64 (0.89) |
| 8-bit Mul | 1.07e6 | 1.04e6 (0.97) | 3230 | 3490 (1.08) | 5414 | 4523 (0.84) | 16 | 18 (1.13) | 130 | 101 (0.78) |
| 4-bit IntDiv | 4.35e5 | 4.15e5 (0.95) | 2190 | 2170 (0.99) | 2578 | 2187 (0.85) | 20 | 19 (0.95) | 30.1 | 25.7 (0.85) |
| 8-bit IntDiv | 3.08e6 | 3.05e6 (0.99) | 7050 | 6960 (0.99) | 11416 | 9959 (0.87) | 24 | 24 (1.00) | 1425 | 1607 (1.13) |
| c432 | 1.09e6 | 1.06e6 (0.97) | 4460 | 4620 (1.04) | 5346 | 4560 (0.85) | 16 | 16 (1.00) | 146 | 154 (1.05) |
| c499 | 1.11e6 | 1.07e6 (0.96) | 5030 | 4810 (0.96) | 6357 | 5031 (0.77) | 22 | 24 (1.09) | 406 | 393 (0.97) |
| c880 | 1.79e6 | 1.75e6 (0.98) | 6990 | 7030 (1.01) | 8317 | 6969 (0.84) | 26 | 24 (0.92) | 723 | 734 (1.02) |
| c1355 | 1.29e6 | 1.25e6 (0.97) | 5530 | 5650 (1.02) | 6424 | 5575 (0.87) | 28 | 28 (1.00) | 545 | 454 (0.83) |
| c1908 | 1.65e6 | 1.60e6 (0.98) | 5840 | 5880 (1.01) | 8238 | 6783 (0.82) | 22 | 24 (1.09) | 601 | 537 (0.89) |
| Average Ratio | - | 0.96 | - | 1.00 | - | 0.85 | - | 0.97 | - | 0.93 |
The layout area of the 16-bit Sklansky adder is 39.6 mm^2 (the area of the 16-bit KSA is 8.88 mm^2). The above comparisons demonstrate the performance improvement of qGDR over prior work in routing large-scale SFQ circuits.

We now discuss the results in Table 4.3 in more detail. If we compare #Nets in the third column of Table 4.2 and #RoutedNets in the second column of Table 4.3, we can see that more than 90% of the nets of each attempted circuit can be routed with loose tile-to-tile wires in very short times. Moreover, this result implies successful routing resource allocation by the global router. TotalWL in the third column under the global routing is no less than 80% of TotalWL in the sixth column under the detailed routing, which shows the high fidelity of the global routing results. In general, #Vias in the fourth column under the global routing is between 0.75X and 1.25X of #Vias in the eighth column under the detailed routing. The most significant difference is observed when routing the 4-bit Mul because the global router requests a via at almost every turning point of the routing wires. In contrast, the detailed router prefers to route detour wires without using vias. As a result, the total via count for routing the 4-bit Mul with 771 nets by the global router is about 1.25X of that by the detailed router. The ninth column in Table 4.3 shows that the #MaxVia values are all fewer than 30. The control of via count under the detailed routing demonstrates a successful integration between the global and detailed routers. As a result, qGDR is expected to mitigate the potential impedance mismatch due to vias in large-scale SFQ circuit layouts. (The maximum number of vias is somewhat large due to having access to only two PTL routing layers.)

We further report the routing results of both Qrouter and qGDR for the SFQ circuits in Table 4.4. The implementation of the global router in qGDR effectively reduces #Vias (0.85X). 
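As a quick arithmetic check of the 0.85X figure, the short sketch below recomputes the per-circuit via ratios from the #Vias columns of Table 4.4 and averages them. The variable names are illustrative only; the numbers are copied from the table.

```python
# (Qrouter #Vias, qGDR #Vias) per circuit, copied from Table 4.4.
vias = {
    "4-bit KSA": (375, 354),
    "8-bit KSA": (1185, 1086),
    "16-bit KSA": (3776, 3025),
    "32-bit KSA": (9500, 7703),
    "4-bit Mul": (1000, 893),
    "8-bit Mul": (5414, 4523),
    "4-bit IntDiv": (2578, 2187),
    "8-bit IntDiv": (11416, 9959),
    "c432": (5346, 4560),
    "c499": (6357, 5031),
    "c880": (8317, 6969),
    "c1355": (6424, 5575),
    "c1908": (8238, 6783),
}

# Ratio of qGDR to Qrouter for each circuit, then the arithmetic mean.
ratios = {name: qgdr / qrouter for name, (qrouter, qgdr) in vias.items()}
average_ratio = sum(ratios.values()) / len(ratios)

print(f"average #Vias ratio: {average_ratio:.2f}")
```

The mean of the thirteen ratios rounds to 0.85, in agreement with the Average Ratio row of the table.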
qGDR incurs 0.97X of the running time of Qrouter; the gain is limited by the complex maze routing algorithm used in the detailed routing process of qGDR. We plan to investigate more efficient maze routing algorithms or other routing algorithms in the near future.

The MaxWL of the 8-bit IntDiv and c880 in Table 4.3 may seem large, but the value is only 1.01X and 1.04X of the ideal MaxWL, respectively. Figure 4.13 shows the histogram of the wire length of the 8-bit IntDiv. According to the histogram, the wire length generated by qGDR is comparable to that of the ideal router. More precisely, the MaxWL achieved by the ideal router is 6,700 µm whereas the value is 6,960 µm for qGDR, which is only 1.04X of the ideal (minimum possible) value. The best strategy for improving MaxWL is to adjust the cell placement to minimize the maximum distance between the connection pins of a net.

Figure 4.13: Histogram of the wire length of the 8-bit integer divider.

We note that the total routing time is dominated by the detailed routing (see the fifth and the eleventh columns in Table 4.3). The large CPU time for the detailed routing step stems from the need to search the full grid graph during the detailed rip-up and re-route process. This run time may be reduced by limiting the search space to a portion of the full grid graph.

4.4 Conclusion

We presented an integrated global and detailed router, qGDR, for large-scale SFQ circuits fabricated in the MIT-LL SFQ5ee process technology. qGDR realizes the full-chip routing procedure using only 2 PTL routing layers. The router comprises two steps: global and detailed routing. A global router efficiently allocates the routing resources of PTL wires in a single-layer routing graph. Furthermore, the global router minimizes the total number of vias by a dynamic programming algorithm to mitigate the impedance mismatch. 
A detailed router follows the global routing results to maximize the number of routed nets by the rip-up and re-route process. Moreover, a feedback system in qGDR keeps adding more routing tracks in an automated manner until fully routed layouts are generated by qGDR. As a result, qGDR is capable of generating a compact layout of an 8-bit integer divider consisting of more than 45,000 Josephson junctions in 0.5 CPU hour without any manual intervention.

Chapter 5

Post-Routing Optimization for Working Frequency of Single Flux Quantum Circuits

Routing tools determine the exact values of data path delays and clock skew. Advanced routing tools are expected to optimize the routing process for the critical paths that limit the maximum working frequency based on data path delays and clock skew. Another challenge for advanced routing tools is resolving hold time violations, which are prevalent in circuits with many small data path delays. Due to the uniqueness of SFQ designs, researchers have revised the definitions of data paths and clock paths and reformulated the corresponding delays. The development of specialized strategies for refining routing tools is thus highly anticipated to solve the puzzle of SFQ timing optimization created by data path delays and clock skew.

We present a post-routing optimization framework which reduces the delay of the critical paths of large-scale SFQ circuits while controlling the clock skew. Furthermore, the hold time violations are all resolved by meandering routing paths. The whole optimization framework consists of machine learning, critical path optimization, and comprehensive path rectification. Machine learning observes local wire deployments and conducts a full-chip wire distribution analysis. Equipped with the distribution information, critical paths are identified and then re-routed with alternative short wires during critical path optimization. 
Last but not least, comprehensive path rectification refines clock skew for frequency maximization and builds detour paths to resolve hold time violations. Given 14 SFQ circuits routed by the state-of-the-art routing tool, the proposed framework not only improves the maximum working frequency by 7% on average but also resolves the hold time violations in 200 seconds for all circuits.

The rest of this chapter is organized as follows: Section 1 describes our motivation for developing post-routing optimization; Section 2 details the proposed post-routing optimization framework; Section 3 provides our experimental results; Section 4 concludes this chapter.

5.1 Motivation

The high working frequency of SFQ circuits is one of the main reasons why SFQ circuits are promising elements of high-performance superconductive computers. The working frequency, however, has been decreasing with the scaling of SFQ circuits because of fabrication-induced variations [36]. Even worse, the high working frequency may no longer hold after the use of routing tools which lack timing optimization specialized for SFQ circuits. Specifically, the delays of the wires deployed by routing tools significantly influence working frequencies because SFQ cell delays are usually only a few picoseconds (14 ps in our library).

To the best of our knowledge, the development of many SFQ routing tools [15, 58] follows that of CMOS routing tools, which search for and deploy the wires for nets based on delicate cost functions. In [58], the router based on the A* algorithm sequentially routes SFQ circuits with hundreds of nets, while qGDR proposed in [15] completes routing large-scale circuits with thousands of nets sequentially through a maze routing algorithm. The drawback of both routing tools is the weak guarantee of reserving sufficient routing resources for critical nets, because other nets might have consumed most of the resources before the critical nets are routed. 
In [65], the routing tool realizes a consistent delay for all paths in the same stage through integer programming solvers. However, the feasible solution of a consistent delay for each stage is unlikely to reach the minimum delay of all stages because no priority is given to critical paths or critical nets. Worse still, a scalability problem occurs when we apply the routing tool proposed in [65] to large-scale SFQ circuits. As a result, none of the aforementioned routing tools possesses strong enough SFQ timing knowledge to maximize working frequencies, let alone resolve hold time violations.

To amend the routing results generated by state-of-the-art SFQ routing tools, we propose a post-routing optimization framework which improves the maximum operating frequencies and also resolves the hold time violations of large-scale SFQ circuits. Our framework explores alternative short wires for both critical data paths and critical clock paths for working frequency improvements. Then, the proposed framework targets the paths with hold time violations and generates meandering wires through a maze routing algorithm for the targeted paths.

5.2 Post-Routing Optimization Procedure

The proposed framework is depicted in Figure 5.1. There are three stages: machine learning, critical path optimization, and comprehensive path rectification.

The machine learning stage begins by studying the wire distributions of a given circuit and then groups the distributed routing wires into clusters. The grouping is built on the local wire density of the routing regions occupied by irregular routing wires.

The critical path optimization identifies critical paths through breadth-first search and applies a ripup-and-reroute process to the identified paths. The breadth-first search explores all paths from the primary inputs to the primary outputs and identifies the critical paths which determine the working frequency of the given circuit. 
During the ripup-and-reroute process, the nets of the identified critical paths are ripped up and then re-routed by resorting to a maze routing algorithm. The implemented maze routing algorithm allocates alternative short wires for the nets while evaluating the risk of causing an infinite ripup-and-reroute process based on the machine learning results. The alternative wires of the critical data nets are then frozen to prevent them from being ripped up again. If there are ripped-up nets after re-routing the critical nets, a re-routing process is kept running until all nets are routed. Both the critical path identification and the ripup-and-reroute process are repeated as long as the path length of the critical paths is improved.

The comprehensive path rectification refines the critical clock paths and minimizes the number of hold time violations. More clock nets in the whole clock distribution topology are targeted to further increase the clock skew of the critical paths through the ripup-and-reroute process. The ripup-and-reroute process in the comprehensive path rectification stage is repeated for only a few iterations to prevent aggressively changing the routed clock nets. In the end, the paths with hold time violations are located and detour wires with large delays are created to replace the old wires, resolving the hold time violations of the given SFQ circuit.

Referring to the MIT-LL SFQ5ee process technology [1], we only have two (Nb) metal layers for signal routing, denoted as a bottom layer M1 (connections on this layer are created using striplines that are sandwiched between the grounded M0 and M2 layers) and a top layer M3 (connections on this layer are created using striplines that are sandwiched between the grounded M2 and M4 layers). Both M1 and M3 are used for PTL connections surrounded by top and bottom ground wires/layers. 
M5 is used for the biasing, whereas M6 is used to implement connections (and inductors) inside the cells. Following prior work [15], M6 is reserved for the design of an SFQ cell library and is not used for passive transmission lines (PTLs). In this paper, clockless splitters are used to generate a fanout of two for an input signal to address the fanout limitation, which requires each net in a typical SFQ circuit to be a 2-pin net. The explanations in this paper address 2-pin nets, but our ideas can be extended to multi-pin nets by focusing on the longest interconnection of each multi-pin net.

Figure 5.1: Workflows of post-routing optimization.

5.2.1 Standard SFQ Cell Library

The standard SFQ cell library used in this paper follows prior work [12, 15]. There are two parts in each SFQ cell in this library: (1) a logic design part realizing a Boolean function and (2) a built-in clock distribution part splitting and/or passing clock pulses. There are eight types of standard cells: SplitterCLK, Splitter, NOT, DFF, AND, OR, XOR, and NDRO. SplitterCLK, which has no logic part, is used for clock tree synthesis. The height of the clock part is 40 µm and that of the logic part is 80 µm. Therefore, the height of all cells is 120 µm except that the height of SplitterCLK is 40 µm. The pins for the signal input and output of the logic part are in the M1 layer whereas the pins for the signal input and output of the clock part are in the M3 layer. All cells are dc biased and each bias pillar is of size 2.5 µm x 2.5 µm. Details of the timing parameters are listed in Table 5.1. The design of all SFQ standard cells follows the MIT-LL SFQ5ee process technology [53]. 
Table 5.1: Standard SFQ Cell Library

Cells | SplitterCLK | Splitter | NOT | DFF | AND | OR | XOR | NDRO
Height (µm) | 40 | 120 | 120 | 120 | 120 | 120 | 120 | 120
Width (µm) | 40 | 30 | 30 | 40 | 50 | 50 | 50 | 50
#Inputs (+Clk) | 0 (+1) | 1 (+1) | 1 (+1) | 1 (+1) | 2 (+1) | 2 (+1) | 2 (+1) | 1 (+1)
#Outputs (+Clk) | 0 (+2) | 2 (+1) | 1 (+1) | 1 (+1) | 1 (+1) | 1 (+1) | 1 (+1) | 1 (+1)
Clock-to-q Delay (ps) | 5.7 | 5.7 | 13.0 | 6.8 | 8.7 | 6.0 | 6.3 | 10.0
Setup Time (ps) | - | - | 3.9 | 1.1 | 0.0 | 2.6 | 4.8 | 10.0
Hold Time (ps) | - | - | 6.1 | 4.0 | 4.7 | 3.1 | 4.8 | 10.0

The MIT-LL SFQ5ee process technology supports two routing layers, and these layers are reserved for PTLs with a propagation speed of 100 µm/ps in our work. Following prior work [12, 15], we do not use Josephson transmission lines (JTLs) for routing because JTLs require special routing when they cross one another orthogonally [58, 59]. Moreover, long JTLs introduce a significant delay, which works against the objective of maximizing the working frequency. Notice that each PTL requires a PTL driver at the driving point and a PTL receiver at the receiving point. The PTL driver and receiver are embedded in each standard cell with a characteristic impedance matched to the PTL connections. The length of PTLs can reach 5 mm with negligible signal loss; more details can be found in [31]. The width and the pitch of PTLs are 5 µm and 10 µm, respectively. PTLs on the two layers connect with each other through a via, and the size of a via is 5 µm × 5 µm.

5.2.2 Machine Learning

Mass resource reallocation is likely to happen when re-routing critical nets because most routing resources are already consumed after the routing step. Mass resource reallocation after the routing step, however, should be avoided because it might weaken the achievements of the routing step, such as via minimization [15].
To assess the probability of mass resource reallocation, we utilize a well-known machine learning algorithm called density-based spatial clustering of applications with noise (DBSCAN) to analyze wire distributions [66]. DBSCAN is an unsupervised learning algorithm which groups points distributed in a space based on local densities, as shown in Figure 5.2.

Transformation of Continuous Routing Wires

The input of DBSCAN is a set of discrete data points, so we transform the continuous routing wires of an SFQ circuit into discrete points. To begin with, the whole routing area is cut into bins, as shown in Figure 5.2(a). The width and the length of a bin are the pitch size in the x-direction and in the y-direction, respectively. Continuous wires over bins are transformed into discrete points by assigning an unlabeled point to the center of each bin that is passed by a wire. The white empty circles in Figure 5.2(b) illustrate these assigned points.

Figure 5.2: Density-based spatial clustering of applications with noise (DBSCAN) algorithm in the machine learning stage.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

The discrete point representation allows DBSCAN to perform wire distribution analysis based on local point densities without tracing irregular continuous wires. The distributed unlabeled points are grouped by DBSCAN, and point grouping requires two constants: the radius of a search window (R) and a grouping threshold (MinPts). Algorithm 1 describes the details of DBSCAN [66]. There are four main steps, which are repeated until all points are labeled:

1. Select any unlabeled point as the center point of a search window of radius R and count the number of points enclosed within the search window. The unlabeled points (also known as UNCLASSIFIED) refer to the white empty points in Figure 5.2(b);
2. If the number is at least MinPts, label the selected point with a cluster ID (ID) and group all enclosed points as a seed set.
Otherwise, label the selected point as NOISE and go back to Step 1;
3. Count the enclosed points within the search window centered at one chosen point from the seed set. If there are at least MinPts enclosed points, the chosen point is labeled with the same cluster ID as in Step 2 and the other enclosed points without a label are added to the seed set;
4. Repeat Step 3 until all points in the seed set have been chosen; this repetition is illustrated in Figure 5.2(c). Then, increase the cluster ID by 1 and go back to Step 1 if any points are unlabeled.

Figure 5.2(d) illustrates a grouping result in which all points are labeled with ID=1, ID=2, or NOISE. The grouping result can be used to identify congested regions in the layout because regions occupied by points with the same ID tend to experience higher routing resource pressure than other regions. The time complexity of DBSCAN is O(N log N), where N is the number of points in the complete layout space.

Algorithm 1 DBSCAN
Input: Points, R, MinPts
ID = 1
for p in Points do
    if p is UNCLASSIFIED then
        enclosedPts = RangeQuery(Points, p, R)
        if |enclosedPts| < MinPts then
            label p as NOISE
            continue
        label p with ID
        seeds = enclosedPts \ {p}
        for p_s in seeds do
            enclosedPts = RangeQuery(Points, p_s, R)
            if |enclosedPts| ≥ MinPts then
                label p_s with ID
                for p_n in enclosedPts do
                    if p_n is UNCLASSIFIED then
                        seeds = seeds ∪ {p_n}
                        label p_n with ID
                    else if p_n is NOISE then
                        label p_n with ID
        ID = ID + 1

5.2.3 Critical Path Optimization

The objective of critical path optimization is to reduce the wirelength of critical nets without violating any design rules. The wirelength reduction is expected to be realized without ripping up many non-critical nets because significant layout changes could compromise the objectives achieved by specialized SFQ routers, e.g., via count minimization in [15].
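As an aside on the machine learning stage of Section 5.2.2, the bin transformation and Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the framework's C implementation: the function names (`wires_to_points`, `range_query`, `dbscan`, `id_count`) are hypothetical, wires are assumed to be axis-aligned segments given in bin coordinates, and the range query is a brute-force scan, so the sketch runs in O(N^2) rather than the O(N log N) that a spatial index would give. Unlike the listing of Algorithm 1, the sketch labels every point at the moment it enters the seed set, as the original DBSCAN formulation does.

```python
from collections import deque

NOISE = 0           # label for points in sparse regions
UNCLASSIFIED = -1   # label of a point that has not been visited yet

def wires_to_points(segments):
    """Turn axis-aligned wire segments (in bin coordinates) into bin-center points."""
    points = set()
    for (x0, y0), (x1, y1) in segments:
        for x in range(min(x0, x1), max(x0, x1) + 1):
            for y in range(min(y0, y1), max(y0, y1) + 1):
                points.add((x, y))
    return sorted(points)

def range_query(points, p, radius):
    """Brute-force search window: every point within `radius` of p, p included."""
    return [q for q in points
            if (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2 <= radius ** 2]

def dbscan(points, radius, min_pts):
    """Return a dict mapping each point to a cluster ID (1, 2, ...) or NOISE."""
    labels = {p: UNCLASSIFIED for p in points}
    cluster_id = 1
    for p in points:
        if labels[p] != UNCLASSIFIED:
            continue
        enclosed = range_query(points, p, radius)
        if len(enclosed) < min_pts:
            labels[p] = NOISE
            continue
        labels[p] = cluster_id
        seeds = deque()
        for q in enclosed:
            if labels[q] == UNCLASSIFIED:
                seeds.append(q)          # still needs its own range query
                labels[q] = cluster_id
            elif labels[q] == NOISE:
                labels[q] = cluster_id   # border point, no expansion needed
        while seeds:
            s = seeds.popleft()
            neighbors = range_query(points, s, radius)
            if len(neighbors) < min_pts:
                continue                 # s stays a border point of the cluster
            for n in neighbors:
                if labels[n] == UNCLASSIFIED:
                    seeds.append(n)
                    labels[n] = cluster_id
                elif labels[n] == NOISE:
                    labels[n] = cluster_id
        cluster_id += 1
    return labels

def id_count(labels, p):
    """Number of points sharing p's cluster ID; zero if p is NOISE."""
    if labels[p] == NOISE:
        return 0
    return sum(1 for label in labels.values() if label == labels[p])
```

For example, two adjacent five-bin wires form one dense cluster, while an isolated single-bin stub is labeled NOISE when `radius=1.5` and `min_pts=4` are used; the per-bin count returned by `id_count` is the quantity that the routing cost later uses to penalize congested regions.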
This section describes the proposed critical path optimization following Figure 5.1.

Critical Path Identification using Breadth-first Search

Breadth-first search (BFS) is a graph search algorithm which explores outward from a source node in all possible directions, adding nodes one layer at a time [67]. The BFS produces a BFS tree rooted at the source node on the set of nodes reachable from the source node. In this chapter, the nodes refer to the primary inputs, the primary outputs, the clocked cells, and the clockless cells of an SFQ circuit. The source node in a BFS tree is a primary input and the leaf nodes should be primary outputs unless there is a cell with no output connection. Since there are multiple primary inputs in an SFQ circuit, we need to run the BFS for each input to explore all connections for critical path identification. Each run differs slightly depending on whether the primary input is a data input (including a control signal input) or a clock input.

If the BFS starts from a data input, any net connection which provides a clock signal to a clocked cell is ignored during the search. After each search, we can create a BFS tree in which the explored cells are connected by directed edges. Following the directed edges, we find all paths in the produced BFS tree by identifying all propagation paths which start from one clocked cell and end at another clocked cell without passing through other clocked cells. The same BFS procedure is applied to all primary inputs to identify all propagation paths in an SFQ circuit. These identified paths are the data paths.

If the BFS starts from a clock input, the outward exploration stops when the reached node is a clocked cell. This stop criterion comes from the assumption that each clock signal is totally absorbed by a clocked cell [29]. To the best of our knowledge, this assumption is true for standard SFQ circuits. Since the BFS is a simple algorithm, the stop criterion can be revised easily if advanced SFQ circuits are considered.
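The data path identification described above can be sketched in Python as follows. The sketch rests on several assumptions that are not part of the original text: the netlist is given as a hypothetical `fanout` dictionary that contains data edges only (clock nets are already excluded, as for a BFS started from a data input), `clocked` is the set of clocked cell names, and no loop consists of clockless cells alone. The names `reachable_cells` and `data_paths` are illustrative.

```python
from collections import deque

def reachable_cells(fanout, source):
    """Plain BFS: visit nodes layer by layer, starting from a primary input."""
    order, seen, queue = [], {source}, deque([source])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in fanout.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

def data_paths(fanout, clocked, primary_inputs):
    """List every path from a launch cell to the next clocked (capture) cell.

    A path may cross clockless cells such as splitters, but it ends at the
    first clocked cell it reaches.
    """
    paths = set()
    for primary_input in primary_inputs:
        for cell in reachable_cells(fanout, primary_input):
            if cell not in clocked:
                continue
            queue = deque([(cell,)])      # partial paths starting at the launch cell
            while queue:
                path = queue.popleft()
                for nxt in fanout.get(path[-1], []):
                    if nxt in clocked:
                        paths.add(path + (nxt,))    # reached a capture cell
                    else:
                        queue.append(path + (nxt,)) # keep crossing clockless cells
    return sorted(paths)

# Toy netlist: a DFF drives a splitter, which drives an AND and an XOR cell.
fanout = {
    "in0": ["dff0"], "in1": ["and0"],
    "dff0": ["spl0"], "spl0": ["and0", "xor0"],
    "and0": ["out0"], "xor0": ["out1"],
}
clocked = {"dff0", "and0", "xor0"}
print(data_paths(fanout, clocked, ["in0", "in1"]))
```

Each returned tuple is one data path whose wire delays would then be evaluated against the timing constraint to find the critical ones; partial paths that run into a primary output are simply dropped.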
Different from the data path, a clock path refers to the whole propagation path starting from the source node (the primary clock input) to the leaf node (the reached clocked cell). Based on Equation (2.11), we can find the critical data and clock paths which constrain the minimum clock period of an SFQ circuit. The subsequent steps of the critical path optimization reduce the path delay of the critical data and clock paths through a ripup-and-reroute process.

Ripup-and-Reroute for Critical Data Paths

The primary target of the critical path optimization is the critical data paths in an SFQ circuit because, in general, the magnitude of the clock skew of two critical clock paths is much smaller than the path delay of the critical data path. As discussed in Section 2.2.4, the critical data path can be formed either by a very long wire or by multiple wires separated by clockless splitters. The number of critical data nets of the critical data path is 1 in the former case and more than 1 in the latter case.

The objective of the ripup-and-reroute process is to efficiently reduce the wirelength of the critical data nets in a large-scale SFQ circuit without ripping up many non-critical nets. We rip up and re-route the critical data nets one by one, instead of ripping up all critical data nets at the same time, to avoid significant layout changes. In the re-routing process, a maze routing algorithm is performed to find an alternative short wire for each ripped-up critical net [15]. There are three main steps in the maze routing algorithm: (1) bounding box formation, (2) wave propagation, and (3) trace back. We explain these steps on the 7 by 7 bin array shown in Figure 5.3, in which the maze routing algorithm creates a wire connection for a net with two pins labeled b.

Figure 5.3: Maze routing algorithm. (a) Bounding box formation. The dotted line represents the previous routing wire. b: source vertex. X: blockage.
(b) Fanout wave propagation. (c) Trace back.

To begin with, a search space is specified by forming the minimal rectangular bounding box that encloses the previous routing wire of the critical data net, as shown in Figure 5.3(a). This bounding box guarantees that an alternative wire for this net is either the original routing wire with the same wirelength or another, shorter wire. Then, the maze routing algorithm starts searching for the wire that routes the net with a minimum cost. Routing costs of creating a wire are calculated from the bin where the source pin is located to the other reachable bins using a wave (frontier) propagation method. Routing cost calculation is skipped for the bins occupied by blockages, which forbid wire deployment. The cost of building a wire to reach a bin $b_j^{l_1}$ in the $l_1$-th layer from the source bin of a net $n_i$ is given by

$$C(b_j^{l_1}, n_i) = \min_{b_k^{l_2}} \left[ f(b_j^{l_1}, b_k^{l_2}) + g(b_j^{l_1}, n_i) + C(b_k^{l_2}, n_i) \right], \qquad (5.1)$$

where $b_k^{l_2}$ is a bin adjacent to $b_j^{l_1}$. The value of $f(b_j^{l_1}, b_k^{l_2})$ depends on three parameters [15]: SegCost, JogCost, and ViaCost. The initial value of $f(b_j^{l_1}, b_k^{l_2})$ is 0. SegCost is added to $f(b_j^{l_1}, b_k^{l_2})$ if $l_1 = l_2$ and $b_j^{l_1}$ is aligned with $b_k^{l_2}$ along the preferred direction of the $l_1$-th routing layer. JogCost is added to $f(b_j^{l_1}, b_k^{l_2})$ if $l_1 = l_2$ and $b_j^{l_1}$ is not aligned with $b_k^{l_2}$ along the preferred direction of the $l_1$-th routing layer. ViaCost is added to $f(b_j^{l_1}, b_k^{l_2})$ if $l_1 \neq l_2$. $g(b_j^{l_1}, n_i)$ is zero if $b_j^{l_1}$ is not passed by a routing wire. Otherwise, $g(b_j^{l_1}, n_i)$ is given by

$$g(b_j^{l_1}, n_i) = c_{15}\,\mathrm{IDCount}(b_j^{l_1}) + c_{16}\,\frac{\mathrm{WL}(b_j^{l_1})}{\mathrm{MaxWL}} + c_{17}\,\frac{1}{1 + \exp\left(1 - \mathrm{PreWL}(n_i)/\mathrm{MaxWL}\right)}, \qquad (5.2)$$

where $c_{15}$, $c_{16}$, and $c_{17}$ are weighting coefficients. $\mathrm{IDCount}(b_j^{l_1})$ returns the number of bins which are labeled with the same cluster ID as $b_j^{l_1}$, and it returns zero if $b_j^{l_1}$ is labeled with NOISE.
For example, if $b_j^{l_1}$ is occupied by a blue point marked with ID2 in Figure 5.2(d), $\mathrm{IDCount}(b_j^{l_1})$ returns 27 for cost calculation. $\mathrm{WL}(b_j^{l_1})$ is the length of the wire which passes $b_j^{l_1}$ and MaxWL is the maximal wirelength of all nets in Manhattan distance. $\mathrm{PreWL}(n_i)$ is the previous physical wirelength of the given net $n_i$ before the rip-up.

We provide more details about Equation (5.2) in this paragraph. The first term in the equation helps our router to predict how many bins could be affected in the worst case if the net in $b_j^{l_1}$ is ripped up, because $\mathrm{IDCount}(b_j^{l_1})$ returns the number of bins sharing the same cluster ID as $b_j^{l_1}$. Specifically, re-routing the nets whose wires pass through congested regions is challenging because routing algorithms usually need to create alternative wires for these nets using the scarce routing resources within the congested regions. The price of granting such re-routing is a significant change of the routing layout compared to the initial routing layout. To minimize the probability of mass resource reallocation, we make IDCount() the dominant factor during the ripup-and-reroute process for the critical data nets. $\mathrm{WL}(b_j^{l_1})$ in Equation (5.2) is divided by MaxWL for normalization because $\mathrm{WL}(b_j^{l_1})$ of long wires is generally larger than that of short wires. If we did not divide $\mathrm{WL}(b_j^{l_1})$ by MaxWL, the value of $g(b_j^{l_1}, n_i)$ for a bin occupied by a long wire could be much larger than that for a bin occupied by a short wire, and only the bins occupied by short wires would be usable. The third term in Equation (5.2) adds a constant cost for utilizing an occupied bin when re-routing $n_i$. The constant cost for re-routing a large net is bigger than that for re-routing a small net because we want to reduce the bin count for net re-routing. The cost is not proportional to the previous physical wirelength of the net $n_i$ because we do not want to emphasize the previous routing result.
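Equations (5.1) and (5.2), together with the wave propagation and trace back steps, can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the router of [15] or the framework's C code: wave propagation is realized here with a priority queue (Dijkstra's algorithm), bins are tuples `(layer, x, y)` with layer 0 standing for M1 and layer 1 for M3, the preferred directions (horizontal on layer 0, vertical on layer 1) and the values of `SEG_COST`, `JOG_COST`, and `VIA_COST` are placeholders, and `occupied` is a hypothetical dictionary mapping an occupied bin to its `(IDCount, WL)` pair.

```python
import heapq
import math

SEG_COST, JOG_COST, VIA_COST = 1.0, 2.0, 3.0  # placeholder values (assumption)
PREFERRED = {0: "x", 1: "y"}                  # assumed preferred direction per layer

def f_cost(src, dst):
    """f in Eq. (5.1): cost of one step between adjacent bins (layer, x, y)."""
    if src[0] != dst[0]:
        return VIA_COST
    direction = "x" if src[2] == dst[2] else "y"
    return SEG_COST if direction == PREFERRED[dst[0]] else JOG_COST

def g_cost(bin_info, pre_wl, max_wl, c15, c16, c17):
    """g in Eq. (5.2): zero for a free bin, otherwise the three weighted terms."""
    if bin_info is None:
        return 0.0
    count, wl = bin_info
    return (c15 * count + c16 * wl / max_wl
            + c17 / (1.0 + math.exp(1.0 - pre_wl / max_wl)))

def maze_route(source, target, box, blocked, occupied, pre_wl, max_wl,
               c15=1.0, c16=1.0, c17=1.0):
    """Wave propagation and trace back inside the bounding box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    cost, parent = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        c, cur = heapq.heappop(heap)
        if cur == target:
            break
        if c > cost[cur]:
            continue                      # stale heap entry
        layer, x, y = cur
        for nxt in [(layer, x + 1, y), (layer, x - 1, y),
                    (layer, x, y + 1), (layer, x, y - 1), (1 - layer, x, y)]:
            inside = x0 <= nxt[1] <= x1 and y0 <= nxt[2] <= y1
            if not inside or nxt in blocked:
                continue                  # outside the box or a blockage
            step = f_cost(cur, nxt) + g_cost(occupied.get(nxt), pre_wl,
                                             max_wl, c15, c16, c17)
            if c + step < cost.get(nxt, math.inf):
                cost[nxt] = c + step
                parent[nxt] = cur
                heapq.heappush(heap, (c + step, nxt))
    if target not in cost:
        return None, math.inf             # no wire exists inside the box
    path = [target]
    while path[-1] != source:             # trace back from target to source
        path.append(parent[path[-1]])
    return path[::-1], cost[target]
```

On an empty 7 by 7 array the sketch returns the straight wire; when a bin in the middle of that wire is blocked on both layers, the cheapest alternative jogs around the blockage on the same layer because two vias cost more than two jogs with these placeholder weights.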
Finally, the wire connecting the pins of $n_i$ with the minimum routing cost is identified by performing trace back from the terminal pin to the source pin in a backward propagation manner, as shown in Figure 5.3(c). To control the number of nets to be ripped up, the routing cost of creating a wire for $n_i$ cannot be larger than an upper bound, which is given by

$$\mathrm{Ubound}(n_i) = c_{18}\,\mathrm{PreWL}(n_i)\,\mathrm{SegCost}, \qquad (5.3)$$

where $c_{18}$ is a weighting coefficient. As an example, in Figure 5.3(c), if $\mathrm{Ubound}(n_i)$ of the shown net is 6 or higher, we accept the traced wire; otherwise, we reject it. The accepted wire is committed as the physical routing wire for $n_i$, and the nets shorted with this committed wire are ripped up. These ripped-up nets will be re-routed in a subsequent ripup-and-reroute process for non-critical nets.

Freezing Re-routed Critical Data Paths

The alternative wires of the re-routed critical data nets should not be modified during the ripup-and-reroute process for non-critical nets; otherwise, the wirelength reduction would be lost. A straightforward strategy to protect the wirelength reduction is to add a huge penalty to the routing cost for ripping up any critical data net. However, even a huge penalty can still be acceptable when a large upper bound is set for exploring feasible wires in a large-scale SFQ circuit. Thus, we propose a more rigorous control strategy: the wires of the re-routed critical data nets are temporarily frozen as blockages. We describe the wires as frozen because the condition is not permanent. The frozen wires can be thawed for a small movement if special conditions are observed. We first describe how the wires are frozen and then explain when the frozen wires can be thawed.

We freeze the wires of the re-routed critical data nets by placing temporary blockages along these wires. Take Figure 5.3(c) as an example.
If the bins located by trace back are utilized for creating a wire for the critical data net, these bins are marked with "X" after the wire creation. As a result, the ripup-and-reroute process for non-critical nets cannot touch the wires of any re-routed critical data nets. Moreover, this rigorous control strategy allows the proposed framework to escape from the potential convergence problem caused by repetitive ripup-and-reroute processes.

A lane of temporary blockages can be viewed as a barricade which, unfortunately, can hinder the routing algorithm from creating a wire for a net. Figure 5.4(a) depicts the effective exploration space for a net with two pins, where two parallel barricades lie in the middle of the space, one on the bottom routing layer and one on the top routing layer. Given this scenario, the routing algorithm cannot create a wire connecting the pins without crossing the barricades. If this special condition (or a similar condition) is observed, we thaw a small segment of the frozen wire and then move the thawed segment. We call the movement a plowing operation, which sweeps the frozen wire segments while maintaining the wire connection, as in prior work [68]. The special condition is not limited to the scenario in which the top barricade is placed directly above the bottom one. Any parallel barricades which hinder the re-routing process are recognized as the special condition, and we specify more details in the next paragraph.

Figure 5.4: Plowing operations.

The plowing operation is called when the exploration space for re-routing a net is partitioned by parallel barricades. To begin with, an L-shaped wire for this net is deployed temporarily in the top routing layer (i.e., the M3 layer), and this L-shaped wire will pass the bin(s) occupied by the frozen wire. The bin shared by both the L-shaped wire and the frozen wire is called an intersection bin, as shown in Figure 5.4(a).
Then, the plowing operation thaws a short wire segment centered at the intersection bin and moves the thawed wire segment in the direction orthogonal to the barricades. The top layer is considered first because its wire density is usually lower than that of the bottom layer [15]. The plowing operation is applied to the bottom layer (i.e., the M1 layer) if there is no free space to move the short wire segment located in the top barricade. Notice that the plowing operation is called several times if there are multiple parallel barricades which partition the whole exploration space into several parts.

Ripup-and-reroute for Critical Clock Paths

The proposed critical path optimization not only reduces the path delay of the critical data paths but also increases the clock skew ($skew_{i,j} = t_j^{clk} - t_i^{clk}$) of the critical clock paths. There are two possible ways to increase the clock skew: (1) increase the arrival time of the clock signal at the input of the capture cell and (2) decrease the arrival time of the clock signal at the input of the launch cell. If we are only allowed to control the wirelength of SFQ circuits, the first way requires increasing the wirelength of the critical clock nets, which is counter-intuitive and generally not preferred. Thus, the latter way is pursued in this chapter.

We locate the critical clock nets for wirelength reduction under the assumption that the clock distribution topologies in large-scale SFQ circuits are built with clockless cells (e.g., splitters). Applicable clock distribution topologies can be found in [12, 13, 15, 69]. The assumption can be relaxed for other topologies, but the details are not addressed here. Each clock signal travels from a primary input to a clocked cell after passing several clock nets and clockless cells. The whole traveling path is the clock path of the clocked cell. Figure 5.5(a) illustrates an interconnection between two clocked cells and the clock paths of these two clocked cells.
The black triangles represent the clockless cells along the clock path. In most SFQ clock distribution topologies, clock nets are shared between clock paths because the fanout of the clockless cells is typically more than 1. Thus, we only apply the ripup-and-reroute process to the clock net which directly connects to the launch cell of the critical data path. If the interconnection in Figure 5.5(a) is the critical path, the targeted clock net is the net connecting to the clock input of the left clocked cell. The applied ripup-and-reroute process is the same as the one for the critical data nets. We adopt a conservative optimization strategy here because even a minor wirelength reduction in some clock nets can result in many hold time violations due to the clock skew increase. Aggressive clock skew improvements for critical clock paths are deferred to the comprehensive path rectification stage to minimize the likelihood of hold time violation growth.

Figure 5.5: Interconnections between clocked cells. (a) An interconnection between two clocked cells. The black triangles represent clockless cells for forwarding clock signals. (b) Two interconnections between pairs of clocked cells.

Ripup-and-reroute Process for Non-Critical Nets

The objective of the ripup-and-reroute process for non-critical nets is to create wires for the nets whose wires were ripped up while re-routing critical nets. As in the ripup-and-reroute process for critical nets, the maze routing algorithm is used to find feasible wires. Unlike prior work [15], we do not use a comparison function to decide the routing order of the nets because the net count is small. The routing order of the nets is simply the order in which these nets were ripped up. Since wirelength control is not necessary for non-critical nets, we do not construct an explicit bounding box to limit the exploration space.
The exploration space is implicitly limited by the upper bound of the routing cost because wires with a routing cost higher than the upper bound are not accepted. Equation (5.1) is modified by replacing $g(b_j^{l_1}, n_i)$ with $h(b_j^{l_1}, n_i)$ for calculating the routing cost of building a wire for a non-critical net $n_i$ and is written as

$$C(b_j^{l_1}, n_i) = \min_{b_k^{l_2}} \left[ f(b_j^{l_1}, b_k^{l_2}) + h(b_j^{l_1}, n_i) + C(b_k^{l_2}, n_i) \right], \qquad (5.4)$$

where $f(b_j^{l_1}, b_k^{l_2})$ remains the same as that in Equation (5.1). $h(b_j^{l_1}, n_i)$ is zero if $b_j^{l_1}$ is not passed by a routing wire. Otherwise, $h(b_j^{l_1}, n_i)$ is

$$h(b_j^{l_1}, n_i) = c_{15}\,\mathrm{IDCount}(b_j^{l_1}) + c_{16}\,\frac{\mathrm{WL}(b_j^{l_1})}{\mathrm{MaxWL}} + c_{19}, \qquad (5.5)$$

where $c_{15}$ and $c_{16}$ are the same weighting coefficients used in Equation (5.2) and $c_{19}$ is a new weighting coefficient. Given Equation (5.4), the maze routing algorithm is run for up to 10 passes to find a feasible wire for a net. The upper bound of the routing cost in the $k$-th pass for routing a net $n_i$ is

$$\mathrm{Ubound}(n_i, k) = 2\,\mathrm{Ubound}(n_i, k-1), \qquad (5.6)$$

$$\mathrm{Ubound}(n_i, 1) = c_{20} + c_{21}\,\mathrm{ViaCost} + c_{22}\,\min(\mathrm{width}(n_i), \mathrm{length}(n_i))\,\mathrm{SegCost}, \qquad (5.7)$$

where $c_{20}$, $c_{21}$, and $c_{22}$ are weighting coefficients. $\mathrm{width}(n_i)$ and $\mathrm{length}(n_i)$ denote the Manhattan distance between the two pins in the y-direction and the x-direction, respectively. The ripup-and-reroute process for non-critical nets is repeated until there is no unrouted net. If the path length of the critical paths is improved, the ripup-and-reroute process for the critical data path is executed again. Otherwise, the framework proceeds to comprehensive path rectification.

5.2.4 Comprehensive Path Rectification

The objective of comprehensive path rectification is to reduce the delay of the critical clock path and resolve the hold time violations. The critical clock path in this subsection denotes the clock path that forwards a clock signal to the launch cell of the critical path.
All clock nets of the critical clock path are considered for wirelength reduction to achieve aggressive clock skew improvements. We fix the hold time violations at the end so that as many routing resources as possible remain reserved for the critical paths.

Critical Clock Net Selection

Clock net optimization is a delicate process because a change in clock delay can simultaneously increase the clock skew for one pair of cells and decrease the clock skew for another pair. Take Figure 5.5(b) as an example, in which cell i and cell k share all clock nets except the clock nets directly connecting to their clock inputs. Let the interconnection between cell i and cell j be the critical data path. To improve the working frequency, we decrease the clock arrival time of cell i by reducing the wirelength of multiple nets along the critical clock path. At the same time, the clock skew between cell k and cell l also increases. However, if the clock skew increase is too large, the hold time condition between cell k and cell l can be violated. To avoid possible hold time violations, we estimate the risk of optimizing a clock net $n_i$ by

$$\mathrm{risk}(n_i) = \mathrm{ChildNetCnt}(n_i) + c_{23}\,\frac{\mathrm{CurWL}(n_i)}{\mathrm{WLL}(n_i)}, \qquad (5.8)$$

where $\mathrm{ChildNetCnt}(n_i)$ is the number of child nets of $n_i$ in the clock tree, $\mathrm{CurWL}(n_i)$ is the current wirelength of $n_i$, $\mathrm{WLL}(n_i)$ is the wirelength lower bound of $n_i$, and $c_{23}$ is a weighting coefficient. We select the critical clock net with the lowest risk and then apply the ripup-and-reroute process to the selected net.

Ripup-and-reroute for the Selected Clock Net

The objective of the ripup-and-reroute process for the selected clock net remains wirelength reduction, so the same ripup-and-reroute process used for critical nets is applied to the selected clock net. The critical clock net selection and the ripup-and-reroute process are repeated a few times.
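The risk-driven selection of Equation (5.8) can be sketched in Python as follows. The sketch assumes that the second term of Equation (5.8) is the ratio CurWL/WLL scaled by c23, which is how the equation is read here and may differ from the original typesetting. The dictionary keys, the net names, and the `already_selected` argument are hypothetical; the latter only models the rule that a selected clock net is not selected again.

```python
def risk(child_net_cnt, cur_wl, wl_lower_bound, c23):
    """Risk of optimizing a clock net, assuming the ratio form of Eq. (5.8)."""
    return child_net_cnt + c23 * cur_wl / wl_lower_bound

def select_clock_net(nets, c23, already_selected=()):
    """Return the not-yet-selected clock net with the lowest risk, or None."""
    candidates = [n for n in nets if n["name"] not in already_selected]
    if not candidates:
        return None
    return min(candidates,
               key=lambda n: risk(n["children"], n["cur_wl"], n["wll"], c23))

# Hypothetical clock nets along one critical clock path (wirelengths in µm).
clock_nets = [
    {"name": "clk_a", "children": 4, "cur_wl": 300.0, "wll": 200.0},   # risk 5.5
    {"name": "clk_b", "children": 1, "cur_wl": 500.0, "wll": 250.0},   # risk 3.0
    {"name": "clk_c", "children": 0, "cur_wl": 1200.0, "wll": 300.0},  # risk 4.0
]
first = select_clock_net(clock_nets, c23=1.0)
second = select_clock_net(clock_nets, c23=1.0, already_selected={first["name"]})
print(first["name"], second["name"])
```

With these made-up numbers, the net near the root (`clk_a`) is avoided first because its four child nets spread any delay change over many clocked cells, which matches the intent of the ChildNetCnt term.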
The critical data and clock paths in an SFQ circuit are re-evaluated each time because the critical paths may change after the wirelength reduction of a clock net. A clock net that has been selected will not be selected again.

Fixing Hold Time Violations

There are three possible approaches to fix hold time violations of an SFQ circuit: decreasing clock skew, inserting clockless cells, and increasing data path delays. The first approach needs to increase the path length of the clock path connecting to the launch cell or decrease the path length of the clock path connecting to the capture cell. This approach, however, carries a high risk of losing control of the clock skew when the magnitude of the negative hold slack is very large. The cell insertion approach is usually implemented before or during the placement step because placement tools typically produce compact cell placement results without reserving space for extra cells. Moreover, timing adjustments using extra cells such as JTLs could cause operation margin reduction and even a drop in yield [65]. Thus, data path delay increase is considered here because it can fix the hold time violations with a low risk of compromising clock skew or operation margins.

We propose an innovative strategy for resolving the hold time violations of a given SFQ circuit. There are four steps in the proposed strategy, and we explain these steps using Figure 5.6.

1. Identify a data path with negative hold slack and select the shortest wire of the identified data path for re-routing. Figure 5.6(a) illustrates a wire as the selected wire;
2. Find the mid point of the selected wire and create a temporary blockage lane orthogonal to the wire segment passing the mid point, as shown in Figure 5.6(b). If the mid point is a turning point, the segment closer to the launch cell is chosen;
3. Rip up the selected wire and perform a maze routing algorithm to rebuild a new wire for the original connection, as shown in Figure 5.6(c).
If the hold time violation is not fixed, we repeat Steps 2 and 3 up to 10 times;
4. Remove the temporary blockage lanes created in Step 2 and repeat Step 1 until all data paths with negative hold slack have been processed.

Figure 5.6: Fixing hold time violations by temporary blockage placement.

The length of the blockage lane created in Step 2 is the equivalent bin length of the hold slack minus one bin length. In Figure 5.6, the hold slack is equivalent to 6 bin lengths, so a blockage lane of length 5 is placed. The length of a rebuilt wire is expected to be 6 if the rebuilt wire goes around the blockage lane. The maze routing algorithm performed in Step 3 follows Equations (5.4) and (5.5) except that $c_{20}$ is increased by ten times. After the comprehensive path rectification, the optimized routing result of the given SFQ circuit is generated in the design exchange format (DEF).

5.3 Experimental Results and Discussions

We implemented the post-routing optimization framework in about 6,000 lines of C. The hardware environment for the experimental results is a Linux machine with an Intel(R) Xeon(R) CPU E7-8837 @ 2.67 GHz. To test the proposed optimization framework in two-layer routing, we synthesize Kogge-Stone adders (KSA), array multipliers (Mul), and integer dividers (IntDiv) using state-of-the-art SFQ EDA tools. These SFQ EDA tools cover the steps of logic synthesis [33], cell placement [12], clock topology generation [69], and net routing [15, 64]. Moreover, some of the ISCAS c-series benchmarks with different net counts are generated by the same EDA tools to verify the robustness of our framework. The synthesized SFQ circuits are path-balanced after the logic synthesis step. We run the placement tool on the SFQ logic netlists to place the SFQ cells. An optimized H-tree with minimum clock skew is generated as the clock distribution topology of each SFQ circuit. More details are specified in Table 5.2.
IdealTotalWL is the ideal total wirelength obtained by routing all nets with L-shaped wires regardless of any blockages or routing resource limitations. All netlists of SFQ cells, including both data nets and clock nets, are routed using a general router for both CMOS and SFQ designs, Qrouter, and a specialized router for SFQ designs, qGDR. Neither Qrouter nor qGDR has the knowledge needed to identify SFQ paths and optimize SFQ timing behavior. The input file formats of the proposed framework are the open-standard library exchange format (LEF) and design exchange format (DEF). LEF files specify design rules and cell layouts, whereas DEF files describe circuit netlists and the corresponding circuit layouts.

Table 5.2: Circuit Specification

Circuit | #Cells | #Nets | Area (mm^2) | IdealTotalWL (µm)
4-bit KSA | 171 | 258 | 1.75 | 4.83e4
8-bit KSA | 534 | 776 | 3.87 | 1.44e5
16-bit KSA | 1215 | 1847 | 8.88 | 4.03e5
32-bit KSA | 3753 | 5311 | 30.9 | 1.62e6
4-bit Mul | 526 | 771 | 4.24 | 1.32e5
8-bit Mul | 3458 | 4815 | 25.1 | 9.51e5
4-bit IntDiv | 1092 | 1636 | 9.41 | 3.68e5
8-bit IntDiv | 7363 | 10555 | 86.0 | 2.87e6
c432 | 2291 | 3500 | 25.0 | 9.70e5
c499 | 2091 | 3045 | 18.2 | 9.52e5
c880 | 3649 | 5129 | 33.8 | 1.59e6
c1355 | 2130 | 3124 | 21.5 | 1.12e6
c1908 | 3706 | 5255 | 30.3 | 1.45e6

Table 5.3 reports the routing results generated by Qrouter without and with the post-routing optimization for the test SFQ circuits. Qrouter with the post-routing optimization is marked as Qrouter*. TotalWL is the total wirelength of PTLs. The values of TotalWL in the third column are almost the same as those in the second column, which suggests that the routing wires of most nets are not changed. #Vias denotes the total number of vias used for routing. If we compare the values of #Vias between the fourth and the fifth columns, we can observe a minor increase in #Vias for most circuits after the post-routing optimization. This increase is reasonable because extra vias might be needed for building detour wires.
For example, building the detour wire in Figure 5.6(c) requires 4 vias instead of zero because there are four turning points.

Table 5.3: Routing Results of Kogge-Stone Adders, Array Multipliers, Integer Dividers and C-series Circuits of ISCAS Benchmarks by Qrouter w/o and w/ Post-routing Optimization

Circuit | TotalWL (µm), Qrouter | TotalWL (µm), Qrouter* (Ratio) | #Vias, Qrouter | #Vias, Qrouter* (Ratio) | #MaxVia, Qrouter | #MaxVia, Qrouter* (Ratio) | Frequency (GHz), Qrouter | Frequency (GHz), Qrouter* (Ratio) | Running Time (s)
4-bit KSA | 5.72e4 | 5.64e4 (0.99) | 375 | 367 (0.99) | 8 | 6 (0.75) | 22.4 | 27.1 (1.22) | 0.55
8-bit KSA | 1.76e5 | 1.77e5 (1.00) | 1185 | 1199 (1.01) | 12 | 11 (0.91) | 23.9 | 29.3 (1.23) | 13.70
16-bit KSA | 5.12e5 | 5.32e5 (1.03) | 3776 | 3999 (1.05) | 20 | 34 (1.70) | 18.7 | 18.7 (1.00) | 519.4
32-bit KSA | 1.86e6 | 1.88e6 (1.01) | 9500 | 9678 (1.01) | 28 | 28 (1.00) | 11.6 | 12.3 (1.06) | 336.0
4-bit Mul | 1.57e5 | 1.59e5 (1.01) | 1000 | 1006 (1.00) | 10 | 14 (1.40) | 18.1 | 21.6 (1.19) | 3.31
8-bit Mul | 1.07e6 | 1.08e6 (1.00) | 5414 | 5494 (1.01) | 16 | 16 (1.00) | 15.3 | 17.4 (1.14) | 175.8
4-bit IntDiv | 4.35e5 | 4.49e5 (1.03) | 2578 | 2708 (1.05) | 20 | 20 (1.00) | 17.9 | 18.8 (1.05) | 232.1
8-bit IntDiv | 3.08e6 | 3.08e6 (1.00) | 11416 | 11416 (1.00) | 24 | 24 (1.00) | 7.4 | 7.5 (1.01) | 225.2
c432 | 1.09e6 | 1.10e6 (1.00) | 5346 | 5446 (1.01) | 16 | 20 (1.25) | 11.4 | 11.7 (1.03) | 44.65
c499 | 1.11e6 | 1.12e6 (1.00) | 6357 | 6447 (1.01) | 22 | 32 (1.45) | 11.4 | 12.2 (1.07) | 173.0
c880 | 1.79e6 | 1.80e6 (1.00) | 8317 | 8491 (1.02) | 26 | 20 (0.76) | 10.2 | 11.9 (1.17) | 479.0
c1355 | 1.29e6 | 1.29e6 (1.00) | 6424 | 6470 (1.00) | 28 | 28 (1.00) | 11.3 | 11.8 (1.04) | 75.19
c1908 | 1.65e6 | 1.68e6 (1.01) | 8238 | 8408 (1.02) | 22 | 22 (1.00) | 10.4 | 10.8 (1.04) | 208.4
Average Ratio | - | 1.00 | - | 1.01 | - | 1.09 | - | 1.09 | -

#MaxVia is the maximum number of vias used for routing a net. Similar to the prior work [15], the values of #MaxVia in the sixth and the seventh columns are large because there are only two routing layers for routing large-scale SFQ circuits. #MaxVia is prone to increase after the optimization for some circuits because a few interconnections are rebuilt with many vias to reduce their wirelength.
However, a significant increase in #MaxVia is observed for some circuits. We explain the cause using the 16-bit KSA, but the explanation also applies to other circuits with a large #MaxVia increase (e.g., c499). Referring to the routing result of the 16-bit KSA in the fourth row of Table 5.3, the large increase in #MaxVia is caused by re-routing a net with a significant hold time violation. The initial hold slack of this net is -6.2 ps and the pins of this net are in congested regions. Consequently, 34 vias are used to create a detour wire within the congested regions to fix the hold time violation of this net.

Frequency in Table 5.3 is the maximum working frequency of an SFQ circuit, i.e., the inverse of the minimum clock period specified in Equation (2.11). The ninth column in Table 5.3 shows the frequency improvement as a ratio for each SFQ circuit. The frequency improvement reaches 9% on average, and the maximum working frequency is boosted by more than 20% for small circuits (e.g., the 4-bit and 8-bit KSAs). These improvements confirm the effectiveness of the proposed post-routing optimization framework in improving maximum working frequencies. Moreover, they demonstrate that wirelength reduction for frequency improvement is possible because not all critical nets are routed with the shortest wires. The tenth column in Table 5.3 shows that the largest running time is 519 s, for the 16-bit KSA.
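The frequency ratios and their average can be recomputed from the frequency columns of Table 5.3. The short Python sketch below is only an illustration with our own variable names. Because the table lists rounded frequencies, the recomputed ratios may differ from the printed ones in the last digit.

```python
# Frequency (GHz) from Table 5.3 as (Qrouter, Qrouter*) pairs.
freq = {
    "4-bit KSA": (22.4, 27.1),
    "8-bit KSA": (23.9, 29.3),
    "16-bit KSA": (18.7, 18.7),
    "32-bit KSA": (11.6, 12.3),
    "4-bit Mul": (18.1, 21.6),
    "8-bit Mul": (15.3, 17.4),
    "4-bit IntDiv": (17.9, 18.8),
    "8-bit IntDiv": (7.4, 7.5),
    "c432": (11.4, 11.7),
    "c499": (11.4, 12.2),
    "c880": (10.2, 11.9),
    "c1355": (11.3, 11.8),
    "c1908": (10.4, 10.8),
}

# Improvement ratio = optimized frequency / initial frequency.
ratios = {name: after / before for name, (before, after) in freq.items()}
average = sum(ratios.values()) / len(ratios)
best = max(ratios, key=ratios.get)

print(f"average ratio = {average:.3f}")
print(f"largest improvement: {best} ({ratios[best]:.2f}x)")
```

The recomputed average lies close to the reported average ratio of 1.09, and the largest gains come from the small Kogge-Stone adders.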
Table 5.4: Routing Results of Kogge-Stone Adders, Array Multipliers, Integer Dividers and C-series Circuits of ISCAS Benchmarks by qGDR w/o and w/ Post-routing Optimization
Circuit | TotalWL (µm) qGDR | TotalWL (µm) qGDR* (Ratio) | #Vias qGDR | #Vias qGDR* (Ratio) | #MaxVia qGDR | #MaxVia qGDR* (Ratio) | Frequency (GHz) qGDR | Frequency (GHz) qGDR* (Ratio) | Running Time (s)
4-bit KSA | 5.55e4 | 5.52e4 (0.99) | 353 | 353 (0.98) | 6 | 6 (1.00) | 24.9 | 26.5 (1.11) | 0.14
8-bit KSA | 1.71e5 | 1.71e5 (1.00) | 1086 | 1076 (0.99) | 14 | 14 (1.00) | 24.3 | 24.6 (1.01) | 0.56
16-bit KSA | 4.72e5 | 4.76e5 (1.00) | 3025 | 3067 (1.01) | 16 | 16 (1.00) | 19.7 | 22.1 (1.12) | 5.97
32-bit KSA | 1.80e6 | 1.81e6 (1.00) | 7703 | 7785 (1.01) | 26 | 26 (1.00) | 10.5 | 11.9 (1.13) | 119.1
4-bit Mul | 1.49e5 | 1.49e5 (1.00) | 893 | 887 (0.99) | 8 | 9 (1.12) | 21.2 | 22.2 (1.05) | 0.52
8-bit Mul | 1.04e6 | 1.04e6 (1.00) | 4523 | 4555 (1.00) | 18 | 18 (1.00) | 15.7 | 16.2 (1.03) | 26.95
4-bit IntDiv | 4.15e5 | 4.18e5 (1.00) | 2187 | 2211 (1.01) | 19 | 19 (1.00) | 15.7 | 17.9 (1.14) | 12.49
8-bit IntDiv | 3.05e6 | 3.05e6 (1.00) | 9959 | 9959 (1.00) | 24 | 24 (1.00) | 6.8 | 7.1 (1.04) | 31.32
c432 | 1.06e6 | 1.07e6 (1.00) | 4560 | 4582 (1.00) | 16 | 16 (1.00) | 11.9 | 12.5 (1.05) | 14.53
c499 | 1.07e6 | 1.07e6 (1.00) | 5031 | 5069 (1.00) | 24 | 24 (1.00) | 11.7 | 12.6 (1.08) | 84.78
c880 | 1.75e6 | 1.75e6 (1.00) | 6969 | 7071 (1.01) | 24 | 24 (1.00) | 10.7 | 11.9 (1.11) | 162.1
c1355 | 1.25e6 | 1.25e6 (1.00) | 5575 | 5597 (1.00) | 28 | 28 (1.00) | 11.9 | 12.3 (1.03) | 16.79
c1908 | 1.60e6 | 1.60e6 (1.00) | 6783 | 6851 (1.00) | 24 | 24 (1.00) | 10.7 | 11.5 (1.07) | 37.78
Average Ratio | - | 1.00 | - | 1.00 | - | 1.00 | - | 1.07 | -

Table 5.4 reports the routing results generated by qGDR without and with the post-routing optimization for all generated SFQ circuits. qGDR with the post-routing optimization is marked as qGDR*. The values of TotalWL in the third column are almost the same as those in the second column. #MaxVia is unchanged for all circuits except the 4-bit Mul, whose #MaxVia increases by 1 after the post-routing optimization. Furthermore, the growth of #Vias in Table 5.4 after the optimization is less than 1% of the initial value.
Comparing the values of Frequency in the eighth and ninth columns, we see that the maximum working frequency of every circuit is improved. The frequency improvement is 7% on average instead of 9% because the frequency reached by qGDR is about 1.02X that reached by Qrouter on average. Thus, the frequency improvements on the Qrouter and qGDR routing results are close to each other, but the overheads of optimizing the latter are negligible based on our results. The tenth column in Table 5.4 shows that the whole optimization process finishes within 200 s for all circuits.

We achieve comparable frequency improvements with lower overheads for the qGDR routing results because, in general, qGDR produces routing results with better wire distributions than those produced by Qrouter. We use the 16-bit KSA as an example to illustrate our observations. The top two and bottom two panels of Figure 5.7 are the wire density graphs of the 16-bit KSA routing results generated by Qrouter and qGDR, respectively. Comparing the M1 wire densities in Figure 5.7(a) and (c), we see that the area of high wire density regions in the former is much larger than in the latter. These high wire density regions are the congested regions. In addition, more routing resources in the M3 routing layer are consumed by Qrouter than by qGDR, as shown in Figure 5.7(b) and (d). The inflation of congested regions and the overutilization of routing resources are also observed in other routing results generated by Qrouter. Our observations imply that CMOS-based routers might not be the best choice for some SFQ circuits because of the strict limit of two routing layers.

We further analyze the importance of each timing factor of a critical path. Specifically, we illustrate the relative scale of the timing factors using the smallest and largest SFQ circuits after optimizing the routing results generated by qGDR. The result is shown in Fig. 5.8.
The cell delay includes the clock-to-q delay and the setup time of the critical path, while the wire delay is the sum of the delays of all wire segments of the critical data signal path. The other timing factors are self-evident. The non-zero skew value in Fig. 5.8 is due to a minor path length difference after routing. Fig. 5.8(a) suggests that the critical timing factors for small circuits are the splitter delay and the wire delay because there can be multiple splitters on a data signal path. The former factor becomes less important as the number of nets increases, as shown in Fig. 5.8(b). The large value of the wire delay confirms that path length minimization for the critical path is crucial for realizing large-scale SFQ circuits with high performance.

Figure 5.7: Routing results of a 16-bit Kogge-Stone adder. (a) M1 wire density graph using Qrouter. (b) M3 wire density graph using Qrouter. (c) M1 wire density graph using qGDR. (d) M3 wire density graph using qGDR.

Figure 5.8: Pie charts of the timing factors of a critical path. (a) 4-bit KSA. (b) 8-bit IntDiv. All values are in picoseconds.

Given the benchmark results generated by qGDR, we report the execution time of our post-routing optimization in Fig. 5.9. This figure shows that the execution time generally increases with the number of cells (or nets). However, the execution time of some circuits is lower than expected (e.g., 8-bit IntDiv) or higher than expected (e.g., c499 and c880). We attribute this irregularity to the large variations in the M1 wire distribution and explain the details using Fig. 5.10. We focus on the M1 wire distribution because the routing resource demand of the M3 layer (used mainly for vertical wire segments) is far lower than that of the M1 layer (used mainly for horizontal wire segments). Fig. 5.10(a), the M1 wire density graph of the 8-bit Mul, serves as a typical M1 wire density graph for comparison. Fig.
5.10(b) is the M1 wire density graph of the 8-bit IntDiv. The area of high wire density regions is clearly smaller for the 8-bit IntDiv than for the 8-bit Mul. As a result, fewer nets are ripped up in the 8-bit IntDiv to build alternative short wires for critical data signal paths, and therefore the execution time is lower. Similarly, comparing Fig. 5.10(a) and Fig. 5.10(c), we observe that the number of re-routed nets in (c) is much higher than that in (a), which explains the long execution time of c499. The execution time of c880 is also considerably increased due to the large area of high wire density regions, as shown in Fig. 5.10(d). These results point to the large effect of the area of high wire density regions on the execution time of the post-routing optimization.

Figure 5.9: Execution time of benchmark circuits with different numbers of nets.

Figure 5.10: M1 wire density graph after qGDR. (a) 8-bit Mul. (b) 8-bit IntDiv. (c) c499. (d) c880.

Next, we report a full timing analysis of all routing results for the SFQ circuits in Table 5.5. The post-routing optimization reduces the number of hold time violations and increases the hold slack. In practice, hold time violations in some of the Qrouter routing results could not be fixed, as shown in the third column of this table. Consider the hold time violations of the 16-bit KSA in the fourth row as an example. There are 32 hold time violations and the worst hold slack is -10.7 ps. To fix the path with the worst hold slack, we would need to increase the data signal path length of this path by 1,070 µm. Adding such a long wire is undesirable and impractical in the post-routing step because it would change the routing result significantly, causing design convergence problems. A better approach is to insert clockless cells (e.g., JTLs or splitters) on this path in order to increase the signal path delay [13, 35, 65].
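The 16-bit KSA example pairs a worst hold slack of -10.7 ps with a required path length increase of 1,070 µm, which implies roughly 100 µm of extra PTL length per picosecond of missing slack. The Python sketch below applies that linear conversion. The constant is an assumption inferred from this single example rather than a stated PTL delay model, and the function name `required_detour_um` is hypothetical.

```python
UM_PER_PS = 100.0  # assumed: inferred from 1,070 um for a 10.7 ps violation


def required_detour_um(hold_slack_ps, um_per_ps=UM_PER_PS):
    """Extra data-path length (in um) needed to bring a hold slack up to zero."""
    if hold_slack_ps >= 0:
        return 0.0  # no violation, so no detour is needed
    return -hold_slack_ps * um_per_ps


# Worst hold slack of the 16-bit KSA before and after optimizing the Qrouter result.
for slack in (-10.7, -2.6, 0.0):
    print(f"slack {slack:+.1f} ps -> detour {required_detour_um(slack):.0f} um")
```

Under this assumption, even a residual violation of a few picoseconds translates into hundreds of micrometers of additional wire, which is why long detours are hard to fit into congested regions.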
As explained before, a cell insertion changes the circuit netlist and necessitates cell re-placement and re-routing, which is undesirable in the post-routing step (although it has been suggested; see, for example, [58]).

Table 5.5: Timing Analysis of Kogge-Stone Adders, Array Multipliers, Integer Dividers and C-series Circuits of ISCAS Benchmarks
Circuit | #HoldViolations Qrouter | #HoldViolations Qrouter* | #HoldViolations qGDR | #HoldViolations qGDR* | WorstHoldSlack (ps) Qrouter | WorstHoldSlack (ps) Qrouter* | WorstHoldSlack (ps) qGDR | WorstHoldSlack (ps) qGDR* | Frequency (GHz) Qrouter* | Frequency (GHz) qGDR* (Ratio)
4-bit KSA | 1 | 0 | 0 | 0 | -0.7 | 0 | 0 | 0 | 27.1 | 26.5 (0.98)
8-bit KSA | 4 | 0 | 4 | 0 | -4.3 | 0 | -2.2 | 0 | 29.3 | 24.6 (0.84)
16-bit KSA | 32 | 2 | 7 | 0 | -10.7 | -2.6 | -2.1 | 0 | 18.7 | 22.1 (1.18)
32-bit KSA | 24 | 0 | 16 | 0 | -6.3 | 0 | -6.8 | 0 | 12.3 | 11.9 (0.97)
4-bit Mul | 4 | 0 | 1 | 0 | -4.0 | 0 | -0.1 | 0 | 21.6 | 17.9 (0.83)
8-bit Mul | 22 | 0 | 8 | 0 | -8.4 | 0 | -4.3 | 0 | 17.4 | 17.9 (1.03)
4-bit IntDiv | 17 | 1 | 8 | 0 | -7.9 | -2.6 | -3.1 | 0 | 18.8 | 17.9 (0.95)
8-bit IntDiv | 74 | 4 | 14 | 0 | -12.7 | -3.8 | -3.2 | 0 | 7.5 | 7.1 (0.95)
c432 | 25 | 0 | 12 | 0 | -7.6 | 0 | -3.3 | 0 | 11.7 | 12.5 (1.07)
c499 | 12 | 0 | 1 | 0 | -8.2 | 0 | -1.0 | 0 | 12.2 | 12.6 (1.03)
c880 | 26 | 2 | 13 | 0 | -8.1 | -5.7 | -6.1 | 0 | 11.9 | 11.9 (1.00)
c1355 | 12 | 1 | 4 | 0 | -4.6 | -3.6 | -1.8 | 0 | 11.8 | 12.3 (1.04)
c1908 | 29 | 1 | 12 | 0 | -1.3 | 0 | -4.6 | 0 | 10.8 | 11.5 (1.06)
Average Ratio | - | - | - | - | - | - | - | - | - | 0.99

The fourth and eighth columns in Table 5.5 suggest that qGDR can effectively control both the hold time violations and the worst hold slack of all SFQ circuits. The largest number of hold time violations is 16 and the worst hold slack is -6.8 ps. Given the qGDR routing results with only slight congestion, our post-routing optimization framework successfully creates alternative detour wires that resolve all hold time violations without changing the routing results significantly, as indicated in the ninth column. Although the hold time violations are fully resolved only when the proposed framework is used together with the state-of-the-art SFQ router, the framework can still resolve most hold time violations and improve the worst hold slack of other SFQ routing results.
The last two columns in Table 5.5 confirm that the optimized Qrouter and optimized qGDR results reach comparable working frequencies. The frequency difference of an SFQ circuit between the two routing results is due to the different values of the clock skew, which are mainly determined by the routers rather than by the proposed framework.

5.4 Conclusion

We present a post-routing optimization framework for large-scale SFQ circuits fabricated in the MIT-LL SFQ5ee process technology. Given only 2 PTL routing layers, the proposed framework not only enhances the maximum working frequency of a routed SFQ circuit but also resolves the hold time violations. The framework comprises three stages: machine learning, critical path optimization, and comprehensive path rectification. Machine learning efficiently identifies congested regions in a circuit layout through clustering algorithms. With the congestion analysis result, critical path optimization reduces the path length of the critical paths in the circuit by a ripup-and-reroute process to improve the working frequency. Comprehensive path rectification controls the clock skew of the critical paths by an aggressive ripup-and-reroute process and fixes the hold time violations by building detour paths. Given 13 routing results generated by the state-of-the-art SFQ router, our post-routing optimization framework improves the working frequency by 7% on average and resolves all hold time violations. The running time for each of the 13 routing results is no more than 200 seconds, including an 8-bit integer divider with 10,555 nets.

Chapter 6
Status, Plans, and Conclusions

This chapter concludes the thesis. First, we summarize the achieved contributions and the work discussed in the thesis. Second, we describe problems that are worth studying in the future.
Third, we give a conclusion about the problems discussed in this thesis, including a generalized retiming, a complete routing procedure, and post-routing optimization for large-scale SFQ circuits.

6.1 Summary

SFQ technologies show the potential to meet the recent demand for electronics with lower power consumption and higher operation speeds than CMOS technologies. Many SFQ studies have worked on attaining the benefits of three orders of magnitude lower power and an order of magnitude higher performance. The rapid growth in the scale of SFQ circuits encourages researchers to reevaluate the weaknesses of conventional design approaches. While many traditional EDA tools for CMOS designs are utilized for SFQ designs, we argue that these EDA tools, especially logic synthesis and routing tools, are not the best candidates for large-scale SFQ designs because of the difference in working mechanisms. To support our viewpoint, we propose a generalized retiming transformation which aims at minimizing register energy consumption while meeting performance constraints. The energy minimization problem is formulated as an integer linear programming (ILP) problem, which we then solve through dual transformation. Moreover, we describe a two-stage routing tool, qGDR, which consists of a global routing stage and a detailed routing stage. The post-routing optimization framework is further introduced to empower qGDR with timing behavior optimization. We hereby provide a summary of the achieved contributions.

We describe a constrained register energy minimization (CREM) problem for SFQ designs in which register insertions are performed in the post-synthesis step to build a high-performance pipelined circuit with minimal register energy consumption. An arbitrary synthesized circuit without path-balancing registers is abstracted as a directed graph model, based on which we formulate the CREM problem as an integer linear programming (ILP) problem. The formulated ILP problem is shown, through dual problem transformation, to correspond to a well-known minimum cost flow problem that has polynomial-time algorithms. We test our optimization approach using 8 benchmark circuits including Kogge-Stone adders (KSA), array multipliers (Mul), integer dividers (IntDiv), ISCAS c-series benchmarks, and EPFL benchmarks. The experimental results show 4% and 5% improvements in register count and register energy consumption, respectively, on average compared to the state-of-the-art approach.

We extend the CREM problem to building an arbitrary circuit under an advanced dual clocking architecture with an imbalance bound. We prove the extended problem to be NP-complete by reducing the vertex cover problem to it in polynomial time and space. To tackle the extended CREM problem, we propose a polynomial-time algorithm with a proven bounded error. Given 14 benchmark circuits, our experimental results demonstrate that our approach reduces the register count by 38% and the register energy consumption by 50% on average compared to the prior work. Moreover, the solution obtained by our approximation algorithm is on average only 1.08X away from the optimal solution.

We present an integrated global and detailed router, qGDR, for large-scale SFQ circuits fabricated in the MIT-LL SFQ5ee process technology. qGDR realizes the full-chip routing procedure using only 2 PTL routing layers. The router comprises a global router and a detailed router. The global router efficiently allocates routing resources of PTL wires in a single-layer routing graph. Furthermore, the global router minimizes the total number of vias by a dynamic programming algorithm to mitigate impedance mismatch. The detailed router follows the global routing results to maximize the number of routed nets by a ripup-and-reroute process. Moreover, a feedback system in qGDR keeps adding more routing tracks in an automated manner until fully routed layouts are generated.
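The feedback system described above can be sketched as a simple loop: attempt routing, and if some nets remain unrouted, add routing tracks and try again. The toy Python model below is not qGDR. It replaces the router with a hypothetical capacity check (`try_route`) in which each net needs one track in the single channel it crosses, purely to show the control flow. All names and the example demand are assumptions made for illustration.

```python
from collections import Counter


def try_route(nets, tracks_per_channel):
    """Toy stand-in for a router: each net occupies one track in its channel.

    Returns the nets that could not be routed with the given track budget.
    """
    used = Counter()
    unrouted = []
    for name, channel in nets:
        if used[channel] < tracks_per_channel:
            used[channel] += 1
        else:
            unrouted.append((name, channel))
    return unrouted


def route_with_feedback(nets, initial_tracks=1, step=1, max_tracks=64):
    """Keep adding tracks until every net is routed or the budget is exhausted."""
    tracks = initial_tracks
    while tracks <= max_tracks:
        if not try_route(nets, tracks):
            return tracks  # fully routed layout reached
        tracks += step  # feedback: enlarge the routing resources and retry
    raise RuntimeError("could not route all nets within max_tracks")


# Hypothetical demand: three nets cross channel 0 and one net crosses channel 1.
nets = [("n1", 0), ("n2", 0), ("n3", 0), ("n4", 1)]
print(route_with_feedback(nets))
```

The loop stops at the smallest track count that leaves no net unrouted, so the layout grows only as much as the most congested channel demands.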
We propose a post-routing optimization framework to refine routing results of large-scale SFQ circuits. Given only 2 PTL routing layers, the proposed framework not only enhances the maximum working frequency of a routed SFQ circuit but also resolves the hold time violations. The framework comprises three stages: machine learning, critical path optimization, and comprehensive path rectification. Machine learning efficiently identifies congested regions in a circuit layout through clustering algorithms. With the congestion analysis result, critical path optimization reduces the path length of the critical paths in the circuit by a ripup-and-reroute process to improve the working frequency. Comprehensive path rectification controls the clock skew of the critical paths by an aggressive ripup-and-reroute process and fixes the hold time violations by building detour paths. We test qGDR with post-routing optimization using 13 SFQ circuits including Kogge-Stone adders (KSA), array multipliers (Mul), integer dividers (IntDiv), and some of the ISCAS c-series benchmarks. qGDR is capable of generating a compact layout for each circuit, including an 8-bit integer divider with more than 45,000 Josephson junctions, in no more than 0.5 CPU hours without any manual intervention. Afterward, the post-routing optimization framework improves the working frequency by 7% on average and resolves all hold time violations. The presented framework can optimize each routing result within 200 seconds with negligible overheads.

6.2 Future Work

We describe a few potential research directions which are worth exploring as future work.

A retiming transformation is applicable to optimizing circuits with interconnection delays, and we highly encourage the integration of interconnection delays into our retiming models. These delays can be acquired after the placement and routing step.
The most challenging part of the integration should be realistic delay approximation after registers are inserted in individual interconnections. To overcome this challenge, we expect the development of rigorous mathematical models or the implementation of reliable machine learning models. The fundamental factors for building these models may include, but are not limited to, local cell densities, local wire densities, and the distance between connected cells. Moreover, we anticipate the application of innovative retiming to advanced architecture designs such as a dual clocking architecture with interconnection delays.

While retiming transformations have reached a high level of maturity in CMOS designs, there are still potential research directions related to superconductive circuits for further exploration. The retiming transformation described in this thesis focuses on combinational circuits, but sequential circuits with feedback loops are also popular circuit designs. To handle sequential circuits with our retiming transformation, we would need to extend the definition of cell level (or vertex level in an abstracted graph model) and reformulate the definition of the minimum clock period of a superconductive circuit. Rigorous derivations of the minimum register energy consumption can be anticipated afterward. The reduction of the register energy consumption is expected to be significant even for FPB sequential circuits because sequential circuits usually contain many unnecessary registers generated by logic synthesis tools.

We present an integrated global and detailed router, qGDR, for large-scale SFQ circuits fabricated in the MIT-LL SFQ5ee process technology. While the running time is improved compared to the prior work, it is still much longer than that of commercial routing tools. Thus, large SFQ circuits with more than 20,000 nets (such as a 64-bit ALU) are unlikely to be routed in a few hours.
To resolve the running time issue, we would like to bring up some promising approaches to realizing an efficient detailed router, because the running time of the detailed router is far greater than that of the global router. The promising approaches include improving data structures, implementing line routing algorithms, and simplifying cost functions. Beyond these, we believe that there are still many promising and innovative routing approaches to explore, since detailed routing remains one of the most difficult problems in active EDA areas.

This thesis elaborates on the application of a clustering algorithm (an unsupervised learning algorithm) to post-routing optimization, which suggests the potential of applying other machine learning algorithms or even deep learning to diverse EDA areas. With the rapid evolution of machine learning and deep learning, many innovative spatial analysis approaches have been proposed to process large numbers of 2D or 3D images in a short time. A complicated circuit layout can be viewed as a 3D image. Note that most of these approaches are supervised learning approaches that require an enormous amount of training data. We thus encourage the exploration of semi-supervised or unsupervised learning approaches for improving conventional EDA approaches without requiring much training data. Moreover, these learning methods generally enjoy short running times because of their simplicity.

6.3 Conclusions

The expectation of building high-performance superconductive systems using SFQ designs gives birth to specialized EDA tools. This thesis covers two key EDA tools, a retiming tool and a routing tool. Retiming, a post-synthesis tool, relocates the registers of a circuit to optimize performance, area, or energy consumption.
We present a generalized retiming transformation, applicable to SFQ designs, which emphasizes achieving full path balancing of the circuit with the objective of minimizing register energy while satisfying performance constraints. We term this optimization problem a constrained register energy minimization (CREM) problem and formulate it precisely so that it can be solved with a well-developed optimization algorithm. We further extend the CREM problem formulation to retime the circuit under a dual clocking architecture. The extended CREM problem is proven to be NP-complete, so we propose a polynomial-time approximation algorithm with a bounded error to solve this retiming variant.

Routing tools, as one of the EDA tools, determine the exact path delays that limit the working frequencies of circuits. However, heavy wire routing tasks used to be handled by commercial routing tools with few physical considerations for SFQ circuits. Recently, the development of routing tools specialized for SFQ circuits has started to receive considerable attention due to the appearance of large-scale SFQ circuits. In this thesis, we describe an advanced routing tool, qGDR, which utilizes a multi-stage approach to route large-scale SFQ circuits in an automated manner. The objectives of the developed routing tool include via usage minimization, frequency maximization, and hold time violation minimization.

Given the extensive experimental results, this thesis sheds light on the successful development of scalable SFQ EDA tools for facilitating the progress of high-performance SFQ designs. In closing, we would like to point out that although the application target in this thesis is solely superconductive SFQ circuits, the proposed retiming and routing frameworks have general applicability, e.g., to CMOS wave-pipelined circuits and other (emerging) non-CMOS circuits.

References

[1] S. K. Tolpygo, "Superconductor digital electronics: Scalability and energy efficiency issues," Low Temp. Phys., vol. 42, pp.
361–379, May 2016.
[2] S. Nishijima, S. Eckroad, A. Marian, K. Choi, W. S. Kim, M. Terai, Z. Deng, J. Zheng, J. Wang, K. Umemoto, J. Du, P. Febvre, S. Keenan, O. Mukhanov, L. D. Cooley, C. P. Foley, W. V. Hassenzahl, and M. Izumi, "Superconductivity and the environment: A roadmap," Supercond. Sci. Technol., vol. 26, 2013.
[3] T. Duzer and C. Turner, Principle of Superconducting Devices and Circuits. Englewood Cliffs, NJ: Prentice-Hall, 1998.
[4] O. A. Mukhanov, "Energy-efficient single flux quantum technology," IEEE Trans. Appl. Supercond., vol. 21, no. 3, pp. 760–769, June 2011.
[5] A. F. Kirichenko, I. V. Vernik, J. A. Vivalda, R. T. Hunt, and D. T. Yohanne, "ERSFQ 8-bit parallel adders as a process benchmark," IEEE Trans. Appl. Supercond., vol. 25, no. 3, June 2015.
[6] D. S. Holmes, A. L. Ripple, and M. A. Manheimer, "Energy-efficient superconducting computing - power budgets and requirements," IEEE Trans. Appl. Supercond., vol. 23, no. 3, Feb. 2013.
[7] D. Kirichenko, S. Sarwana, and A. Kirichenko, "Zero static power dissipation biasing of RSFQ circuits," IEEE Trans. Appl. Supercond., vol. 21, no. 3, pp. 776–779, Jun. 2011.
[8] M. H. Volkmann, A. Sahu, C. J. Fourie, and O. A. Mukhanov, "Implementation of energy efficient single flux quantum digital circuits with sub-aJ/bit operation," Supercond. Sci. Technol., vol. 26, no. 1, 2013.
[9] M. H. Volkmann, I. V. Vernik, and O. A. Mukhanov, "Wave-pipelined eSFQ circuits," IEEE Trans. Appl. Supercond., vol. 25, no. 3, June 2015.
[10] I. I. Soloviev, N. V. Klenov, S. V. Bakurskiy, M. Y. Kupriyanov, A. L. Gudkov, and A. S. Sidorenko, "Beyond Moore's technologies: operation principles of a superconductor alternative," Beilstein J. Nanotechnol., vol. 8, pp. 2689–2710, June 2017.
[11] V. K. Semenov, Y. Polyakov, and S. Tolpygo, "AC-biased shift registers as fabrication process benchmark circuits and flux trapping diagnostic tool," IEEE Trans. Appl. Supercond., vol. 27, no. 4, pp. 1–9, June 2017.
[12] S. N. Shahsavani, T.-R.
Lin, A. Shafaei, C. J. Fourie, and M. Pedram, "An integrated row-based cell placement and interconnect synthesis tool for large SFQ logic circuits," IEEE Trans. Appl. Supercond., vol. 27, no. 4, June 2017.
[13] C. J. Fourie, "Digital superconducting electronics design tools - status and roadmap," IEEE Trans. Appl. Supercond., vol. 28, no. 5, Aug. 2018.
[14] K. Gaj, Q. P. Herr, V. Adler, A. Karniewski, E. G. Friedman, and M. J. Feldman, "Tools for the computer-aided design of multigigahertz superconducting digital circuits," IEEE Trans. Appl. Supercond., vol. 9, no. 1, March 1999.
[15] T.-R. Lin, T. Edwards, and M. Pedram, "qGDR: A via minimization oriented routing tool for large-scale superconductive single flux quantum circuits," IEEE Trans. Appl. Supercond., vol. 29, no. 7, Oct. 2019.
[16] C. E. Leiserson and J. B. Saxe, "Retiming synchronous circuitry," Algorithmica, vol. 5, pp. 5–35, 1991.
[17] J. Monteiro, S. Devadas, and A. Ghosh, "Retiming sequential circuits for low power," in IEEE Int. Conf. on Computer-Aided Design (ICCAD), Nov. 1993, pp. 398–402.
[18] N. Shenoy, "Retiming: Theory and practice," Integration, the VLSI journal, vol. 22, pp. 1–21, 1997.
[19] P. Pan, "Continuous retiming: algorithms and applications," in Int. Conf. on Computer Design (ICCD), Oct. 1997, pp. 116–121.
[20] C. Chu, E. F. Y. Young, D. K. Y. Tong, and S. Dechu, "Retiming with interconnect and gate delay," in IEEE Int. Conf. on Computer-Aided Design (ICCAD), Jan. 2003, pp. 221–226.
[21] G. Pasandi and M. Pedram, "PBMap: A path balancing technology mapping algorithm for single flux quantum logic circuits," IEEE Trans. Appl. Supercond., June 2019.
[22] A. P. Hurst, A. Mishchenko, and R. K. Brayton, "Fast minimum-register retiming via binary maximum-flow," in Formal Methods in Computer Aided Design (FMCAD), Nov. 2007, pp. 181–187.
[23] L.-T. Wang, Y.-W. Chang, and K.-T. Cheng, Electronic Design Automation: Synthesis, Verification, and Test (Systems on Silicon).
Morgan Kaufmann, 2009.
[24] G. Pasandi and M. Pedram, "An efficient pipelined architecture for superconducting single flux quantum logic circuits utilizing dual clocks," IEEE Trans. Appl. Supercond., Mar. 2020.
[25] C. J. Fourie, C. Shawawreh, I. V. Vernik, and T. V. Filippov, "High-Accuracy InductEx Calibration Sets for MIT-LL SFQ4ee and SFQ5ee Processes," IEEE Trans. Appl. Supercond., vol. 27, no. 2, pp. 1–5, 2017.
[26] A. M. Kadin, R. J. Webber, and D. Gupta, "Current leads and optimized thermal packaging for superconducting systems on multistage cryocoolers," IEEE Trans. Appl. Supercond., vol. 17, no. 2, pp. 975–978, 2007.
[27] A. Barone and G. Paterno, Physics and applications of the Josephson effect. Wiley Online Library, 1982, vol. 1.
[28] T. Gheewala, "The Josephson technology," Proceedings of the IEEE, vol. 70, no. 1, pp. 26–34, 1982.
[29] K. Likharev and V. Semenov, "RSFQ logic/memory family: A new Josephson-junction technology for sub-terahertz-clock-frequency digital systems," IEEE Trans. Appl. Supercond., vol. 1, no. 1, pp. 3–28, 1991.
[30] P. Bunyk, K. Likharev, and D. Zinoviev, "RSFQ technology: Physics and devices," International journal of high speed electronics and systems, vol. 11, no. 01, pp. 257–305, 2001.
[31] H. Suzuki, S. Nagasawa, K. Miyahara, and Y. Enomoto, "Characteristics of driver and receiver circuits with a passive transmission line in RSFQ circuits," IEEE Trans. Appl. Supercond., vol. 10, no. 3, pp. 1637–1641, Sept. 2010.
[32] N. Katam, A. Shafaei, and M. Pedram, "Design of Complex Rapid Single-Flux-Quantum Cells with Application to Logic Synthesis," in Superconductive Electronics Conference (ISEC), 2017 16th International. IEEE, 2017.
[33] G. Pasandi and M. Pedram, "A dynamic programming-based, path balancing technology mapping algorithm targeting area minimization," in IEEE Int. Conf. on Computer-Aided Design (ICCAD), Nov. 2019.
[34] K. Gaj, E. G. Friedman, and M. J.
Feldman, "Timing of multi-gigahertz rapid single flux quantum digital circuits," Journal of VLSI signal processing systems for signal, image and video technology, vol. 16, no. 2–3, pp. 247–276, 1997.
[35] K. Takagi, Y. Ito, S. Takeshima, M. Tanaka, and N. Takagi, "Layout-driven skewed clock tree synthesis for superconducting SFQ circuits," IEICE Trans. Electron., vol. E94-C, pp. 288–295, 2011.
[36] Y. Tukel, A. Bozbey, and C. A. Tunc, "Development of an optimization tool for RSFQ digital cell library using particle swarm," IEEE Trans. Appl. Supercond., vol. 23, no. 3, pp. 288–295, Jun. 2013.
[37] MIT, "Spotlight on superconducting integrated circuits," in Government Microcircuits Applications and Critical Technology Conference (GOMAC Tech), 2018.
[38] L. A. Abelson and G. L. Kerber, "Superconductor integrated circuit fabrication technology," in Proceedings of the IEEE, vol. 92, no. 10. IEEE, 2004, pp. 1517–1533.
[39] S. K. Tolpygo, V. Bolkhovsky, T. J. Weir, A. Wynn, D. E. Oates, L. M. Johnson, and M. A. Gouker, "Advanced fabrication processes for superconducting very large-scale integrated circuits," IEEE Trans. Appl. Supercond., vol. 26, no. 3, pp. 1–10, 2016.
[40] D. Yohannes, A. Kirichenko, S. Sarwana, and S. K. Tolpygo, "Parametric testing of HYPRES superconducting integrated circuit fabrication processes," IEEE Trans. Appl. Supercond., vol. 17, no. 2, pp. 181–186, 2007.
[41] H. Numata and S. Tahara, "Fabrication technology for Nb integrated circuits," IEICE transactions on electronics, vol. 84, no. 1, pp. 2–8, 2001.
[42] R. Harris, J. Johansson, A. Berkley, M. Johnson, T. Lanting, S. Han, P. Bunyk, E. Ladizinsky, T. Oh, I. Perminov, E. Tokacheva, S. Uchaikin, E. Chapple, C. Enderud, C. Rich, M. Thom, J. Wang, B. Wilson, and G. Rose, "Experimental demonstration of a robust and scalable flux qubit," Physical Review B, vol. 81, no. 13, p. 134510, 2010.
[43] A. El-Maleh, T. E. Marchok, J. Rajski, and W.
Maly, \Behavior and testability preservation under the retiming transformation," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 16, no. 5, pp. 528{542, May 1997. [44] S. Dey and S. Chakradhar, \Retiming sequential circuits to enhance testability," in Proc. IEEE VLSI Test Symposium, Apr. 1994, pp. 28{33. [45] S. Malik, E. M. Sentovich, R. K. Brayton, and A. Sangiovanni-Vincentelli, \Retim- ing and resynthesis: Optimizing sequential network with combinational techniques," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 10, no. 1, pp. 74{84, Jan. 1991. [46] R. K. Ranjan, V. Singhal, F. Somenzi, and R. K. Brayton, \On the optimiza- tion power of retiming and resynthesis transformation," in IEEE Int. Conf. on Computer-Aided Design (ICCAD), Nov. 1998, pp. 402{407. [47] P. Peichen, A. K. Karandikar, and C. L. Liu, \Optimal clock period clustering for sequential circuits with retiming," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 17, no. 6, pp. 489{498, June 1998. [48] J. Cong, H. Li, and C. Wu, \Simultaneous circuit partitioning/clustering with retim- ing for performance optimization," in Proc. DAC, June 1999, pp. 460{465. [49] J. Cong and S. K. Lim, \Physical planning with retiming," in IEEE Int. Conf. on Computer-Aided Design (ICCAD), Nov. 2000, pp. 2{7. 156 [50] O. Zografos, A. D. Meester, E. Testa, M. Soeken, P.-E. Gaillardon, G. D. Micheli, L. Amar u, and P. Ragha, \Wave pipelining for majority-based beyond-cmos tech- nologies," in Design, Automation and Test in Europe Conference and Exhibition (DATE), 2017, pp. 1306{1311. [51] E. Testa, O. Zografos, M. Soeken, A. Vaysset, M. Manfrini, and R. Lauwereins, \Inverter propagation and fan-out constraints for beyond-cmos majority-based tech- nologies," in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), July 2017, pp. 164{169. [52] (2020) Sun magnetics, rsfqlib. [Online]. Available: https://github.com/sunmagnetics/RSFQlib [53] C. J. Fourie, X. Peng, R. Numaguchi, and N. 
Yoshikawa, \Fabrication process and properties of fully-planarized deep-submicro nb/al-alo x /nb josephson junctions for vlsi circuits," IEEE Trans. Appl. Supercond., vol. 25, no. 3, June 2015. [54] C. E. Leiserson and J. B. Saxe, \Optimizing synchronous systems," in Proc. 22nd Annu. Symp. Foundations Comput. Sci. (FOCS), Oct. 1981, pp. 23{36. [55] J. B. Orlin, \A faster strongly polynomial minimum cost ow algorithm," J. Oper. Res., vol. 41, pp. 338{350, May 1993. [56] L. Amaru, P.-E. Gaillardon, and G. D. Micheli, \The ep combinational benchmark suite," in Int'l Workshop on Logic Synth. (IWLS), 2015. [57] J.-R. Gao, P.-C. Wu, and T.-C. Wang, \A new global router for modern designs," in Proc. ASP-DAC., 2008, pp. 232|-237. [58] M. Tanaka, K. Obata, Y. Ito, S. Takeshima, M. Sato, K. Takagi, N. Takagi, H. Akaike, and A. Fujimaki, \Automated passive-transmission-line routing tool for single- ux-quantum circuits based on a* algorithm," IEICE Trans. Electron, vol. E93.C, no. 4, pp. 435{439, Apr. 2010. [59] T. Jabbari, G. Krylov, S. Whiteley, E. Mlinar, J. Kawa, and E. Friedman, \Inter- connect routing for large scale rsfq circuits," IEEE Trans. Appl. Supercond., vol. 29, no. 5, Aug. 2019. [60] M. Tanaka, T. Kondo, N. Nakajima, T. Kawamoto, Y. Yamanashi, Y. Yamanashi, Y. Kamiya, A. Akimoto, A. Fujimaki, H. Hayakawa, N. Yoshikawa, H. Terai, Y. Hashimoto, and S. Yorozu, \Demonstration of a single- ux-quantum micropro- cessor using passive transmission lines," IEEE Trans. Appl. Supercond., vol. 15, no. 2, pp. 400{404, June 2005. [61] T.-H. Lee and T.-C. Wang, \Congestion-constrained layer assignment for via mini- mization in global routing," IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., vol. 27, pp. 1643{1656, Sep. 2008. 157 [62] C. Chu and Y.-C. Wong, \Flute: Fast lookup table based rectilinear steiner minimal tree algorithm for vlsi design," IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., vol. 27, no. 1, pp. 70{83, Jan. 2008. [63] K.-R. Dai, W.-H. 
Liu, and Y.-L. Li, \Ecient simulated evolution based rerouting and congestion-relaxed layer assignment on 3-d global routing," in Proc. Asia and South Pacic Design Automation Conf., Feb. 2009, pp. 570{575. [64] T. Edwards. (2016, Aug) Open circuit design. [Online]. Available: http://opencircuitdesign.com/qrouter/index.html [65] N. Kito, K. Takagi, and N. Takagi, \Automatic wire-routing of sfq digital circuits considering wire-length matching," IEEE Trans. Appl. Supercond., vol. 26, no. 3, Apr. 2016. [66] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, \A density-based algorithm for discovering clusters in large spatial databases with noise," in KDD, Aug. 1996, pp. 226{231. [67] J. Kleinberg and E. Tardos, Algorithm Design. Addison Wesley, 2005. [68] W. Scott and J. Ousterhout, \Plowing: Interactive stretching and compaction in magic," in Design Automation Con. Proc., June 1984, pp. 166{172. [69] S. Shahsavani and M. Pedram, \A minimum-skew clock tree synthesis algorithm for single ux quantum logic circuits," IEEE Trans. Appl. Supercond., vol. 29, no. 8, Dec. 2019. 158
Abstract
Single-flux-quantum (SFQ) circuits are a promising digital technology that offers high speed and extremely low power consumption. With the emergence of large-scale SFQ circuits, generalized electronic design automation (EDA) tools are needed to help designers preserve these advantages. In this thesis, we present two EDA tools: a post-synthesis tool and a routing tool. The former minimizes power using a generalized retiming transformation, whereas the latter performs efficient wire routing with timing optimization.

Retiming is a circuit transformation whereby registers are relocated to optimize performance, area, or energy consumption.
Asset Metadata
Creator
Lin, Ting-Ru (author)
Core Title
Development of electronic design automation tools for large-scale single flux quantum circuits
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
09/04/2020
Defense Date
08/05/2020
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
computer aided design, electronic design automation, OAI-PMH Harvest, retiming, routing, single flux quantum, superconducting integrated circuits
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Pedram, Massoud (committee chair), Gupta, Sandeep (committee member), Nakano, Aiichiro (committee member)
Creator Email
b98504030@ntu.edu.tw, tingruli@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-367242
Unique identifier
UC11666173
Identifier
etd-LinTingRu-8941.pdf (filename), usctheses-c89-367242 (legacy record id)
Legacy Identifier
etd-LinTingRu-8941.pdf
Dmrecord
367242
Document Type
Dissertation
Rights
Lin, Ting-Ru
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA