Electronic Design Automation Algorithms for Physical Design and Optimization of Single Flux Quantum Logic Circuits by Soheil Nazar Shahsavani A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Computer Engineering) May 2021 Copyright 2021 Soheil Nazar Shahsavani Dedication To my mom, my dad, and my brother for their endless love and support. ii Acknowledgements First and foremost, I would like to express my deep gratitude to my supportive and encouraging family. I am forever grateful for their undying love and unconditional support throughout my life. Thank you for always believing in me and giving me the strength and courage to make it here. I would like to express my special appreciation to my advisor, Professor Massoud Pedram, for his tremendous mentorship and continuous support of my Ph.D. studies. I truly appreciate the ideas and guidance he provided, which significantly helped my Ph.D. research. I would also like to thank my dissertation and qualifying exam committees, Professor Sandeep Gupta, Professor Aiichiro Nakano, Professor Peter A. Beerel, and Professor Pierluigi Nuzzo, for their insightful comments and consistent encouragement. A very special word of thanks goes to my friend and mentor Dr. Alireza Shafaei Bejestan for years of friendship and support. I am also sincerely grateful to my friends Mahdi Nazemi, Mohammad Saeed Abrishami, and Hassan Afzalikusha who have always been a major source of support. Lastly, I like to thank my friends and colleagues at the System Power Optimization and Regulation Technology (SPORT) Lab, for the helpful discussions throughout my Ph.D. studies. In particular, I am grateful to Arash Fayyazi, Amirhoseein Esmaeili, Dr. Mohammad Javad Dousti, Dr. Ghasem Pasandi, Dr. Ting-Ru Lin, Dr. Naveen Khatam, Dr. Ramy Tadros, and Mustafa Altay Karamuftuoglu. iii Table of Contents Dedication ii Acknowledgements iii List of Tables viii List of Figures x Abstract xiv Related Publications xvi Chapter 1: Introduction 1 1.1 Dissertation Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Dissertation Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Chapter 2: Preliminaries 15 2.1 Single Flux Quantum Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.1.1 Margin and Yield . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2 Physical Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2.2 Placement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.2.2.1 Objective Functions . . . . . . . . . . . . . . . . . . . . . . 20 2.2.2.2 Wirelength Modeling . . . . . . . . . . . . . . . . . . . . . . 23 2.2.2.3 Net Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.2.3 Global Placement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.2.3.1 Quadratic Placement . . . . . . . . . . . . . . . . . . . . . . 28 2.2.3.2 Non-convex Optimization based Placement . . . . . . . . . . 31 2.2.4 Legalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2.2.5 Detailed Placement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.2.6 Clock Network Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.2.6.1 Clocking Structures . . . . . . . . . . . . . . . . . . . . . . 
38 2.2.6.2 Clock Synthesis Algorithms . . . . . . . . . . . . . . . . . . 40 iv Chapter 3: Timing Driven Placement 42 3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.2.1 Basics of ADMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.2.2 Prior Work on Timing Driven Placement . . . . . . . . . . . . . . . . 46 3.3 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.4 Proposed solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.4.1 Net Upper-bounding . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.4.2 ADMM-based Timing Driven Placement . . . . . . . . . . . . . . . . 51 3.4.3 Application of TDP-ADMM to SFQ Placement . . . . . . . . . . . . 54 3.5 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 Chapter 4: Hybrid Clock Networks: Clocking Structures and Algorithms for Placement 58 4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.2.1 Standard Cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.3 Clock Network Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.3.1 Clock Tree Structures . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.3.1.1 H-tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.3.1.2 HL-tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.4 Customized Placement Algorithms . . . . . . . . . . . . . . . . . . . . . . . 66 4.4.1 Global Placement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.4.2 HL-tree Placement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.4.3 Same-level Cell Grouping . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.4.3.1 Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.4.3.2 Connectivity-based Graph Processing . . . . . . . . . . . . . 73 4.4.3.3 Distance-based Graph Processing . . . . . . . . . . . . . . . 77 4.4.3.4 Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.4.3.5 Super-cell Placement . . . . . . . . . . . . . . . . . . . . . . 79 4.4.3.6 Level Based Detailed Placement . . . . . . . . . . . . . . . . 80 4.4.3.7 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . 81 4.4.4 Generic Cell grouping . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Chapter 5: Physical Synthesis of Clock Networks 87 5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 5.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 5.2.2 Delay Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5.2.3 Prior Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 5.3 Proposed Min-Skew Clock Tree Synthesis Methodology . . . . . . . . . . . . 95 5.3.1 Overall Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.3.2 Clock Topology Generation . . . . . . . . . . . . . . . . . . . . . . . 
99 v 5.3.3 Clock Tree Embedding and Splitter Insertion . . . . . . . . . . . . . . 102 5.3.4 Min-Skew Clock Tree Placement and Legalization . . . . . . . . . . . 105 5.4 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 5.5 Clock Synthesis for Imbalanced Tree Topologies . . . . . . . . . . . . . . . . 117 5.5.1 Imbalanced Topology Generation . . . . . . . . . . . . . . . . . . . . 118 5.5.2 Splitter-Aware Clock Tree Embedding . . . . . . . . . . . . . . . . . . 119 5.5.3 Splitter-Aware Clock Tree Placement and Legalization . . . . . . . . 122 5.5.4 Simulation Results for Imbalanced Clock Tree Topologies . . . . . . . 122 5.6 Application to Asynchronous Clock Networks . . . . . . . . . . . . . . . . . 124 5.6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 5.6.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 5.6.3 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 5.6.4 (HC) 2 LC Topology Generation . . . . . . . . . . . . . . . . . . . . . 128 5.6.5 Clock Tree Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . 129 5.6.6 Timing-Aware Clock Network Placement . . . . . . . . . . . . . . . . 130 5.6.7 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 Chapter 6: Timing Uncertainty-Aware Clock Topology Generation 134 6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 6.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 6.2.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 6.2.2 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 6.3 qTopGen: A Timing Uncertainty-Aware Topology Generation Algorithm . . 143 6.3.1 Overall Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 6.4 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150 6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152 Chapter 7: Uncertainty-Aware Timing Closure 153 7.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 7.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156 7.2.1 Definitions and Notation . . . . . . . . . . . . . . . . . . . . . . . . . 156 7.2.1.1 Combinational and sequential cells in SFQ logic . . . . . . . 156 7.2.1.2 Data path delay . . . . . . . . . . . . . . . . . . . . . . . . 157 7.2.1.3 Insertion delay . . . . . . . . . . . . . . . . . . . . . . . . . 158 7.2.1.4 Clock skew . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 7.2.1.5 Hold time . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 7.2.1.6 Hold time margin . . . . . . . . . . . . . . . . . . . . . . . . 159 7.2.1.7 Clock tree topology . . . . . . . . . . . . . . . . . . . . . . . 159 7.2.1.8 Common clock path (CCP) . . . . . . . . . . . . . . . . . . 160 7.2.2 Timing Uncertainty-Aware Clock Topology Generation . . . . . . . . 160 7.3 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 7.3.1 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 7.3.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 7.3.3 Proposed Methodology - An Overview . . . . . . . . . . . . . . . . . 
163 vi 7.3.4 Variation Aware Common Path Pessimism Removal . . . . . . . . . . 164 7.3.5 Placement Methodology . . . . . . . . . . . . . . . . . . . . . . . . . 167 7.3.5.1 Incremental Placement . . . . . . . . . . . . . . . . . . . . . 168 7.3.5.2 Placement from Scratch . . . . . . . . . . . . . . . . . . . . 169 7.4 Evaluation Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 7.4.1 Variation Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 7.4.2 Dynamic Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 171 7.5 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172 7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176 Chapter 8: Margin and Yield Calculation 177 8.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177 8.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 8.3 Proposed Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 8.3.0.1 Individual Parameter Update (IPU) . . . . . . . . . . . . . 182 8.3.0.2 Simultaneous Parameter Update (SPU) . . . . . . . . . . . 183 8.3.0.3 Hybrid Margin Calculation (HMC) . . . . . . . . . . . . . . 187 8.3.1 Feasible Parameter Region Calculation . . . . . . . . . . . . . . . . . 191 8.3.1.1 Parametric Yield Evaluation Results . . . . . . . . . . . . . 194 8.3.1.2 Feasible Parameter Region Expansion . . . . . . . . . . . . 198 8.3.1.3 Yield Modeling Results . . . . . . . . . . . . . . . . . . . . 200 8.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 Chapter 9: Conclusion 202 References 207 vii List of Tables 3.1 Experimental Results for seven Benchmarks from EPFL Benchmark Suite [94]. 56 4.1 Widths of standard cells using MIT-LL SFQee5 [19]. All standard cells have a height of 120 μm, except for splitter cells used in the H-tree and HL-tree which have a height of 40 μm. . . . . . . . . . . . . . . . . . . . . . . . . . . 81 4.2 Comparison of total number of clock sink nodes, percentage of HPWL and area improvement for 8 different SFQ logic circuits using different placement strategies. KSA stands for Kogge-Stone adder and ID stands for integer divider. 82 5.1 Notations and definitions used for formulating the clock tree placement and legalization problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 5.2 Benchmark characteristics. KSA stands for Kogg-Stone adder [117], ArrMult stands for array multiplier, and ID stands for integer divider. Rest of the benchmarks are chosen from ISCAS85 benchmark suite [28]. Post-CTS columns report the number of cells and nets in the design after adding clock splitters and clock nets (using the proposed approach). . . . . . . . . . . . . . . . . . 115 5.3 Simulation results (clock skew, total negative hold slack (TNS), worst negative hold slack (WNS), and clock frequency) for several benchmarks using the proposed method and the baseline method [17]. Impr. stands for improvements over the baseline. Freq. denotes the max clock frequency. Post-pl. and post-rt. stand for post-placement and post-routing, respectively. Values for skew, WNS, and TNS are in ps. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
116 5.4 Results of the clock skew, total negative hold slack (TNS), worst negative hold slack (WNS) (in ps), the number of clock splitters, maximum clock frequency, imbalance degree (maximum level difference among sinks), and comparisons with the baseline solution [17]. Impr., Freq., and Imb. Deg. stand for improvement, maximum clock frequency, and the imbalance degree of the clock tree, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 5.5 Results of applying qHC2LC algorithm to five benchmarks from ISCAS85 benchmark suite [28]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 viii 6.1 Simulation results for ten benchmarks from ISCAS85 and EPFL benchmark suites [28] [94]. WL, Avg., and Impr. stand for wirelength, average, and improvement, respectively. We report the average total negative slack and the average number of hold-fixing buffers for all SSTA samples. . . . . . . . . . 151 7.1 Number of hold buffers with greedy topology generation vs ILP topology generation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 7.2 Number of hold buffers (JTLs) and timing Yield (%) for the fixed margin (7ps) and the proposed variation-based 3σ and 2σ hold margin with incremental placement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 7.3 Number of hold buffers (JTLs) and timing yield (%) for the fixed margin (7ps) and the proposed variation-based 3σ and 2σ hold margin with placement from scratch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 7.4 Number of hold buffers (JTLs) for placement from scratch vs incremental placement approaches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 7.5 Clock cycle time (CCT) and area comparison for placement from scratch vs incremental placement (3σ). . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 7.6 Runtime for the fixed margin (7ps) and the proposed variation-based 3σ and 2σ hold margin with incremental placement. . . . . . . . . . . . . . . . . . . 175 8.1 Results of yield calculation for different margin calculation algorithms and various SFQ cells. OR2 and OR3 represent 2-input and 3-input OR gates. . 191 8.2 The parametric yield values, number of simulations and volumes of the margin polyhedrons for various SFQ cells calculated using the MHCMC and HMC methods. OR2 and OR3 denote 2-input and 3-input OR gates. . . . . . . . . 196 8.3 The accuracy of the RDF classifier on the test data set for various SFQ cells and hyper-parameter distributions. . . . . . . . . . . . . . . . . . . . . . . . 201 ix List of Figures 2.1 Circuit diagrams for (a) splitter cell. (b) 2-input OR cell (OR2). . . . . . . . 17 2.2 Physical Design flow of VLSI circuits. . . . . . . . . . . . . . . . . . . . . . . 20 2.3 Wirelength models for a multi-pin net. (a) location of pins of a net (b) RMST (c) RSMT (d) HWPL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.4 Various net models for transforming a multi-pin net to two pin nets. (a) a multi-pin net (b) clique model creates a complete graph. (c) star model. Star node is shown in red. (d) Bound-to-bound model in x coordinates. Boundary pins are connected to all internal pins. . . . . . . . . . . . . . . . . . . . . . 26 2.5 A combinational logic connecting two flip-flops. . . . . . . . . . . . . . . . . 36 2.6 H-tree clock topology. A path from clock source to one of the sinks is shown in red. . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . 39 2.7 Counterflow clocking. Data and clock signals move in opposite directions. . 40 2.8 Concurrent flow clocking. Data and clock signals move in the same directions. 40 4.1 A sample standard cell composed of clock and logic parts. Clock part has different templates which are shown in Fig. 4.2. . . . . . . . . . . . . . . . . 59 4.2 Five templates for the clock part of our standard cells. . . . . . . . . . . . . 60 4.3 Proposed placement of logic cells (inside rows) and clock splitters (between rows). Rows are shown in red. . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.4 Proposed HL-tree clock structure. Global clock propagates the clock signal to all cell-groups. Local clock propagates clock signal to cells within a cell-group. 63 4.5 The maximum clock frequency evaluation for HL-tree clock network. . . . . . 64 4.6 The overall flow of the proposed clock tree synthesis algorithm, qCTS. . . . . 65 4.7 3 sequential elements with logic levels 1-3. Output of each cell is connected to the input of the previous cell. Clock signal is propagated using an L-tree clock network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 x 4.8 A placement solution compatible with HL-tree clock network obtained by groping cells of same logic level, for a group size of 4. . . . . . . . . . . . . . 72 4.9 Overall view of the proposed SFQ placement algorithm. LAP and B2B stand for linear assignment problem [99] and bound to bound net model [47], respectively. 73 4.10 Graph of nodes of level 3-6 for a 4-bit Kogge-Stone adder. Nodes of level 5 are shown in blue. Logic level of each node is added to its name. . . . . . . . . . 76 4.11 Graph of nodes of level 5 for a 4-bit Kogge-Stone adder after connectivity-based graph processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.12 Graph of nodes of level 5 for a 4-bit Kogge-Stone adder after distance-based graph processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.13 A placement solution compatible with HL-tree clock network obtained by grouping cells with different logic levels, for a group size of 4. . . . . . . . . . 84 4.14 Uniform grid over the layout area creating grid bins of size 500μm∗ 500μm. . 84 5.1 A pair of sequentially adjacent flip-flops i and j connected by a combinational gate and interconnects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.2 The overall flow of the proposed clock tree synthesis algorithm, qCTS. . . . . 97 5.3 The proposed placement of logic cells (blue rectangles, placed inside the rows) and clock splitters (black rectangles, placed between the rows) for a circuit with 32 logic gates and 8 rows. Rows are shown using red rectangles. . . . . 98 5.4 Clock tree topologies for 8 leaf nodes (shown in blue). An imbalanced tree with a max level difference of 2 among leaves is shown. . . . . . . . . . . . . 100 5.5 Clock tree topologies for 8 leaf nodes (shown in blue). A balanced tree with a max level difference of 0 among leaves is shown. . . . . . . . . . . . . . . . . 100 5.6 The MMM clock topology generation algorithm applied to an example with 10 sinks [116]. Blue rectangles and black circles show the splitter cells and sinks of the clock tree. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 5.7 An example of a zero-skew clock tree produced by the DME algorithm for a circuit with 4 sinks. The topology of the tree. . . . . . . . . . . . . . . . . . 
104 5.8 An example of a zero-skew clock tree produced by the DME algorithm for a circuit with 4 sinks. The location of sinks, merging segments, and the embedding points of the internal nodes of the tree. Black rectangles and yellow circles represent the clock sinks and splitters, respectively. . . . . . . . . . . 104 5.9 Illustration of a 4-bit Kogge-Stone adder circuit [117], after the placement and CTS steps. Logic cells, clock splitters, and I/O pads are shown using blue, red, and black rectangles, respectively. An illegal zero-skew solution is shown. . . 106 xi 5.10 Illustration of a 4-bit Kogge-Stone adder circuit [117], after the placement and CTS steps. Logic cells, clock splitters, and I/O pads are shown using blue, red, and black rectangles, respectively. A legal nonzero-skew solution is depicted. 106 5.11 An example of applying DME algorithm to a circuit with 3 sink nodes. An imbalanced tree topology is shown. . . . . . . . . . . . . . . . . . . . . . . . 120 5.12 An example of applying DME algorithm to a circuit with 3 sink nodes. The location of sinks, merging segments, and the embedding points of the internal nodes of the tree using (b) original DME algorithm is shown. . . . . . . . . . 121 5.13 An example of applying DME algorithm to a circuit with 3 sink nodes. The location of sinks, merging segments, and the embedding points of the internal nodes of the tree using the splitter-delay-aware DME algorithm are shown. . 121 5.14 The overall flow of the proposed clock network synthesis algorithm, qHC2LC. . . . 129 5.15 The graph model of HC 2 LC topology for circuit c1355, the cyclic graph [28]. 130 5.16 The graph model of HC 2 LC topology for circuit c1355, the reduced tree graph [28]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 6.1 A pair of sequentially adjacent flip-flops i and j connected by a combinational gate and interconnects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138 6.2 A balanced binary tree with 8 sinks. At the nominal condition, the max clock skew is zero. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 6.3 A balanced binary tree with 8 sinks. (a) At nominal condition, the max clock skew is zero. (b) If all the splitter delays on path S 0 →S 7 decrease to the min propagation delay and all the splitter delays on the S 0 →S 14 increase to the max propagation delay, a hold time violation occurs on data path S 7 →S 14 . A balanced binary tree with 8 sinks. If all the splitter delays on pathS 0 →S 7 decrease to the min propagation delay and all the splitter delays on the S 0 →S 14 increase to the max propagation delay, a hold time violation occurs on data path S 7 →S 14 . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 6.4 A balanced binary tree with 8 sinks. Two buffers are added to data path S 7 →S 14 to fix the timing violation. . . . . . . . . . . . . . . . . . . . . . . 142 6.5 A balanced binary tree with 8 sinks. The clock tree is designed such that the number of non-common splitters on the clock path to S 7 and S 14 is zero, therefore no timing violations occur on the data path S 7 →S 14 . . . . . . . . 142 6.6 The overall flow of the proposed clock tree topology generation algorithm (qTopGen). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 7.1 Example of a timing path in a SFQ circuit. . . . . . . . . . . . . . . . . . . . 157 xii 7.2 A balanced binary tree with 8 sinks. (a) At nominal condition, the max clock skew is zero. 
(b) If all the splitter delays on path S 0 →S 7 decrease to the min propagation delay and all the splitter delays on the S 0 →S 14 increase to the max propagation delay, a hold time violation may occur on data path S 7 →S 14 . (c) Two buffers are added to the path betweenS 7 →S 14 to increase its hold slack. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 7.3 Overview of the proposed variation-aware timing closure methodology. . . . . 163 7.4 Grid-based variation model with uniform grids. . . . . . . . . . . . . . . . . 170 8.1 Circuit diagram for AND2 cell. . . . . . . . . . . . . . . . . . . . . . . . . . 180 8.2 Margins for each parameter of AND2 (2-input AND) gate calculated using SPCMC method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 8.3 Margins for different parameters of AND2 gate calculated using HMC and SPCMC methods. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188 8.4 Feasible parameter regions calculated using the MHCMC algorithm for Inverter cell. Corner points are shown using black spheres. U I , U J , and U L are hyper-parameters corresponding to biasing currents, JJ critical currents, and inductances, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194 8.5 Feasible parameter regions calculated using the MHCMC algorithm for Two input AND cell. Corner points are shown using black spheres. U I ,U J , andU L are hyper-parameters corresponding to biasing currents, JJ critical currents, and inductances, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . 194 8.6 Feasible parameter regions calculated using the MHCMC algorithm for Two input OR cells. Corner points are shown using black spheres. U I , U J , and U L are hyper-parameters corresponding to biasing currents, JJ critical currents, and inductances, respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . 195 8.7 An example of the original feasibility region of a cell in two-dimensions (purple region), along with the first two contour bands (light blue and pink regions) around it. Each contour band expands the corners of the previous one by 10%.198 8.8 The parametric yield values inside the original feasible parameter region (OFR) and the 2 contour bands (CB)s around it for 5 SFQ cells. . . . . . . . . . . . 199 xiii Abstract Josephson Junction (JJ)-based superconducting logic families have been recognized as a potential “beyond-CMOS” technology for their extremely low energy dissipation and ultra-fast switching speed. In particular, single flux quantum (SFQ) technology has appeared as an extremely promising technology with a verified speed of 370GHz for simple digital circuits and switching energy per bit in the order of 10 −19 J at 4.2 Kelvins. This dissertation presents algorithms and methodologies for electronic design automation of superconducting circuits, especially algorithms for the physical design and optimization of SFQ logic family. In particular, novel algorithms for automated global and timing driven placement, clock network design, topology generation and synthesis, timing closure, and margin and yield calculation are introduced. First, a timing-driven placement methodology is presented that optimizes the critical path delays as well as the total wirelength, taking advantage of the alternating direction method of multipliers (ADMM) mathematical framework. 
The proposed algorithm models the placement problem as an optimization problem with constraints on the maximum wirelength delay of timing-critical paths and employs the ADMM algorithm to decompose the problem into two sub-problems, one minimizing the total wirelength of the circuit and the other minimizing the delay of timing-critical paths of the circuit. Through an iterative process, a placement solution is generated that simultaneously minimizes the total wirelength and satisfies the setup time constraints. Next, two clock tree structures for SFQ circuit are presented: (i) H-tree clock network which maximizes the frequency of a circuit by propagating a near zero-skew clock network to all the sequential elements and (ii) HL-tree clock network which employs a combination xiv of H-tree and linear clock networks to propagate the clock signal globally to groups of cells and locally within each cell-group. These two clock structures offer a trade-off between the performance of the circuit and total chip area. Two placement solutions are then presented to generate placements compatible with the aforementioned clock structures. The first placement solution employs a global placement algorithm to minimize the total wirelength of a circuit. The second solution creates an HL-tree compatible placement using a novel cell clustering approach. Subsequently, novel linear programming based methodologies for generating clock topolo- gies and physical design of clock networks are presented. In the proposed methodology, a combination of objective functions such as the maximum clock skew and timing slacks in addition to process and timing-induced uncertainties are considered, aiming at optimizing the power, performance, and area of superconducting circuits. Moreover, a novel timing closure algorithm is presented that takes advantage of the bal- anced nature of clock trees in the SFQ circuits, common path pessimism removal methodology, and hold buffer insertion technique to increase the timing yield and reduce the power and energy consumption of SFQ circuits. Finally, two novel margin calculation methods are introduced. These methods calculate a set of parameter margins for each SFQ logic cell such that if all parameter values lie within the boundary of the calculated margins, parametric yield values are near one. The proposed multiple hyper-parameter change margin calculation (MHCMC) method improves the state-of-the-art by accounting for global sources of variation, clustering cell parameters into hyper-parameters, and considering the co-dependency of these hyper-parameters when calculating a feasible parameter region. xv Related Publications 1. S. N. Shahsavani, Xi Li, Xuan Zhou, Massoud Pedram, and Peter A. Beerel, “A Variation-Aware Hold Time Fixing Methodology for Single Flux Quantum Logic Cir- cuits”, ACM Transactions on Design Automation of Electronic Systems, 2021. 2. S. N. Shahsavani and Massoud Pedram. “TDP-ADMM: A Timing Driven Placement Approach for Superconductive Electronic Circuits Using Alternating Direction Method of Multipliers”, Design Automation Conference (DAC), 2020. 3. S. N. Shahsavani, Bo Zhang, and Massoud Pedram. “A Timing Uncertainty-Aware Clock Tree Topology Generation Algorithm for Single Flux Quantum Logic Circuits”, Design, Automation & Test in Europe (DATE), 2020. 4. S. N. Shahsavani, Ramy N. Tadros, Peter A. Beerel, and Massoud Pedram. 
“A Clock Synthesis Algorithm for Hierarchical Chains of Homogeneous CloverLeaves Clock Networks for Single Flux Quantum Logic Circuits”, International Superconductive Electronics Conference (ISEC), 2019. 5. S. N. Shahsavani, and M. Pedram. “A Minimum-Skew Clock Tree Synthesis Al- gorithm for Single Flux Quantum Logic Circuits”, IEEE Transaction on Applied Superconductivity (TAS), 2019. 6. S. N. Shahsavani, and M. Pedram. “A Hyper-Parameter Based Margin Calculation Algorithm for Single Flux Quantum Logic Cells”, IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 2019. xvi 7. S. N. Shahsavani, A. Shafaei, and M. Pedram. “A placement algorithm for super- conducting logic circuits based on cell grouping and super-cell placement”, Design, Automation & Test in Europe (DATE), 2018. 8. S. N. Shahsavani, B. Zhang, and M. Pedram. “Accurate Margin Calculation for Single Flux Quantum Logic Cells”, Design, Automation & Test in Europe (DATE), 2018. 9. S. N. Shahsavani, T.R. Lin, A. Shafaei, CJ. Fourie, and M. Pedram. “An integrated row-based cell placement and interconnect synthesis tool for large SFQ logic circuits”, IEEE Transactions on Applied Superconductivity (TAS), 2017. xvii Chapter 1 Introduction The demand for high computing performance and energy efficiency has been driving the development of the semiconductor technology for decades [1]. Until recently, conventional computing technology based on CMOS devices and standard metal interconnects has been able to increase computing performance and energy efficiency fast enough to keep up with the increasing demand. Unfortunately, with increasing challenges to physical scaling of CMOS devices and the conclusive end of Moore’s law in sight, there is a significant need to search for new technologies and design methodologies that would allow continuation of performance and energy efficiency scaling to well beyond the end-of-scaling CMOS nodes (e.g., an all-around, 5nm gate-length transistor). Superconducting computing based on the Josephson Effect has the potential to be one such solution. This is because Josephson junctions (JJs), basic circuit elements in superconductor technology, have an ultra-fast switching speed of (∼ 1ps) and dissipate ultra-low switching energy (∼ 10 −19 J) at 4 Kelvins [2] . In particular, K. Likharev et al. introduced Rapid Single Flux Quantum (RSFQ) technology in the 1980s, which uses quantized voltage pulses in digital data generation, reproduction, amplification, memorization, and processing [3]. They demonstrated that RSFQ circuits are functional at operating frequencies of up to 750 GHz [4]. Recent developments include various approaches, such as new Single Flux Quantum (SFQ) logic families, including dual-rail RSFQ [5], Self-Clocked Complementary Logic (SCCL) [6], Reciprocal Quantum Logic (RQL) [7], re-design of the current biasing 1 network for RSFQ [8–10], and application of low supply voltage for RSFQ circuits [11]. These techniques significantly reduce the power and energy consumption of SFQ logic circuits [12]. Although extraordinary characteristics have been observed, many problems, including architectures, design automation methodologies and tools, and device fabrication require advanced solutions for the SFQ logic to become a realistic option for realizing large-scale, high-performance, and energy-efficient computing systems of the future [13]. 
A comprehensive study on the status and capabilities of software design tools for SCE (Superconductive Electronics) was published in 1999 [14], and some of the main shortcomings identified then were the lack of uniform tools and the lack of standardized data formats. Despite the design challenges of SFQ logic, tool development did not receive much attention in the decade thereafter, and a follow-up review published in 2013 [15] concluded that the status of SCE software tools was little better than in 1999. The increase in integration density of modern superconducting circuit processes allows the design of increasingly complex circuits. The layout of such large circuits requires automated placement, clock network synthesis, and routing tools.

1.1 Dissertation Contributions

The focus of this dissertation is to propose new design methodologies for electronic design automation of SFQ logic circuits, considering the objectives and constraints associated with this logic family. In particular, this dissertation presents novel methodologies on three main topics: (i) placement, (ii) clock network synthesis, and (iii) margin and yield calculation for superconducting circuits.

In the overall flow of electronic design automation tools for automatic chip design, the output of the logic synthesis step, which is a gate-level netlist, is the input to the placement tool. In addition, the placement tool requires the dimensions and pin locations of each cell (gate).
This limitation extremely complicates placement and routing of logic gates and synthesis of the clock network. 4. While in CMOS-based designs typically near 15% – 20% of the cells need a clock signal [20], in SFQ all the cells (except for splitter) receive a clock signal. 3 5. SFQ circuits typically require full path-balancing (i.e., each logic cell that provides an input signal to another logic cell which is at sequential level d is exactly at level d− 1). In SFQ circuits, the logic depth between clocked gates is much lower than CMOS, and thus the longest wirelength (typically implemented with PTLs) will largely determine whether the target > 50GHz clock frequency can be achieved. Additionally, full path-balancing increases the number of cells in the design, as D-flip-flop (DFF) gates should be inserted for balancing all the paths. 6. In contrast to CMOS technology in which each gate can drive multiple fan-outs, in current SFQ technology each cell can only drive one fan-out. Hence, splitter cells should be used to propagate the output signal to multiple fan-outs. Unlike CMOS technology in which splitting a signal (such as clock signal) does not have any cost in terms of area or timing, clock splitters used in the clock network increase the total area and clock insertion delay. 7. As a result of ultra-deep gate-level pipelining in SFQ circuits, a signal path is defined between two consecutive clocked logic gates and comprises only of the logic cell driver, possibly some splitters, point-to-point connections, and the receiver logic cell. As a result, the number of paths in the circuit is simply the number of primary inputs plus the number of clocked logic cells, plus the number of splitters. 8. Building a perfect H-tree (a common method for realizing a zero-skew clock network) is nearly impossible. Large number of clock splitters in the H-tree network increase the clock skew. Process variations result in increased clock skew and jitter. Furthermore, the layout rules are more constrained in SFQ technology than CMOS which lead to asymmetries in clock network [3]. 9. Given current cell libraries, physical synthesis of the designs including gate resizing is not a viable option. In SFQ technology, buffer insertion adds DFFs to the circuit which 4 increases the total number of nodes and the area significantly, due to full path-balancing requirement. Above reasons lead to high design complexity, large area consumption especially in clock networks, and significant performance degradation. The minimum clock period of a circuit is a function of the largest delay between two sequential elements (comprises delay of combinational gates, interconnect delay, setup-time, and clock-Q delay of flip-flops). In SFQ technology, most of the cells are sequential elements. Due to the large cell sizes, a few number of metal layers available for routing, and the large size of clock network, interconnect delay is the dominant element in the critical path delay. Therefore, minimizing the interconnect delay is an important objective function to achieve high performance. As the minimum feature size is scaled down and the chip sizes are increased, the length of interconnects and the ratio of wire delay to logic delay increase significantly. Therefore, it has become necessary to find cell locations that minimize not only the total wirelength but also wire delays of the timing-critical circuit paths, both for semiconductor as well as superconducting circuits. 
This setup has resulted in a two-stage placement process, where the first placement stage targets the total wirelength (or track density) minimization, whereas the second placement stage targets circuit timing optimization (or at least mitigation of timing violations) in an ad hoc manner. State-of-the-art global placement algorithms first generate a dense placement where the total wirelength has been minimized, and subsequently, improve the circuit timing, generally resulting in an increase in the total wirelength. The rationale is that although in the initial stages the timing information corresponding to interconnect lengths tend to be inaccurate, in the final stages where the relative ordering of the cells are determined and cell overlaps are removed, a more accurate estimation of the circuit timing (including a list of timing-critical paths in the circuit) can be obtained and used. In Chapter 3, we present a timing driven global placement algorithm targeting super- conductive electronic circuits. The proposed approach minimizes the total wirelength of 5 the circuit while targeting a specific clock frequency by imposing hard constraints on the maximum interconnect length for all the nets on the timing-critical paths of a circuit. This is achieved by utilizing the powerful framework of alternating direction method of multipliers (ADMM) [21]. Precisely, the proposed timing driven placement algorithm (TDP-ADMM) converts the placement problem with constraints on the path delays into an unconstrained problem by decomposing the timing driven problem into two sub-problems, one optimizing the total wirelength and the other optimizing the circuit timing. Through an iterative process, the solution of the two sub-problems are modified to close the gap between a wirelength driven and a timing driven placement. The cost of clock routing in SFQ logic circuits is quite significant and any placement algorithm which ignores this cost and focuses only on signal wirelength minimization will not perform well on SFQ logic designs. Construction of a perfect zero-skew H-tree in SFQ logic circuits is not a viable solution due to the resulting very high routing overhead and the in-feasibility of building exact zero-skew clock routing trees. Instead a hybrid clock tree may be used whereby higher levels of the clock tree (i.e., those closer to the clock source) are based on H-tree construction whereas lower levels of the clock tree follow a linear (i.e., chain-like) structure. In Chapter 4, we present a novel clock tree structure called HL-tree clock network. This network is a combination of H-tree (a zero-skew clocking method) and L-tree (a linear clock propagation mechanism) networks. In HL-tree, an H-tree is adopted to distribute the clock to cell-groups, and within each group, a linear path composed of splitters provides the clock to all cells in that group. Furthermore, cells in a group are horizontally abutted which reduces the total area significantly. To generate a placement solution compatible with the proposed HL-tree clock network, we propose a clock tree-aware placement approach that simultaneously minimizes the total wirelength of the signal nets and area overhead of the clock routing. The proposed placement algorithm, recognizing the need for the said hybrid clock tree realization, generates a cell placement that intrinsically matches the structure of the clock tree. 
It does so by first 6 grouping cells into a set of “super-cells”, building a modified netlist capturing dependencies among these super-cells, and then using a force-directed placement engine to find a global placement of these cells. The detailed placement is subsequently constructed by performing local reordering of cells within each super-cell so as to minimize a cost function comprising of the signal wirelength and the total chip area. Physical design of logic circuits, especially the placement and synthesis of clock distribution network (CDN), plays an important role in designing high-performance circuits robust to process-induced variations. Clock network synthesis is a crucial task in physical design of logic circuits as the clock network takes up substantial routing resources, consumes significant power, and determines the maximum frequency of the circuits. Minimizing the nominal clock skew (i.e., the maximum difference in the arrival time of the clock signal at two different clock sinks) is of great importance since the clock skew directly limits the maximum achievable frequency of a circuit [22]. In SFQ logic circuits, the clock signal should be delivered to nearly all logic cells in the design. Therefore, to maximize the performance, a well-balanced minimum-skew clock tree structure is an absolute requirement. Previous zero-skew clock tree synthesis methods for SFQ circuits fail to produce high- quality solutions because they do not consider the delay of splitter cells (which are required to distribute the clock signal to sequential gates) and placement blockages (already placed logic cells) [17]. Additionally, the population density of cells in different regions of the chip can be very different which can result in a highly-imbalanced clock tree topology, i.e., one where the maximum difference between splitter counts from the root of the clock tree to any pair of leaf nodes is large. In Chapter 5, we tackle this problem by developing an algorithm for a fully-balanced clock tree topology construction and a min-skew clock tree placement and legalization algorithm using a mixed integer linear programming (MILP) formulation to perform clock tree construction, splitter insertion, and skew minimization under the given placement blockages, considering both splitter and interconnect delays. The proposed clock tree topology generation 7 algorithm guarantees that the maximum difference between the number of splitters from the clock source to any pair of sinks is zero. The effectiveness of the proposed clock synthesis algorithm is demonstrated using multiple SFQ circuits. Furthermore, we extend the proposed algorithm to carry out the physical synthesis of asynchronous clock networks, indicating its capability in handling a variety of scenarios and clocking methodologies. In the physical design of logic circuits, especially the synthesis of a robust clock distribution network (CDN), which is resilient against process-induced sources of variability, timing convergence and timing yield are two of the most important design considerations. A CDN should be designed to not only eliminate timing violations (by controlling clock skews at various launch and capture flops under nominal conditions) but also to reduce mismatches between the target clock skews under nominal conditions and those achieved under process variations. 
Clock network synthesis algorithms aimed at reducing the process-induced timing uncertainties improve the timing yield of the circuit and require less timing margins to tolerate timing variations. Furthermore, they require lower effort in the final timing closure steps, which translates into savings in the area and wirelength utilization as well as the total power consumption and an overall faster design cycle. Clock network synthesis for SFQ circuits is more challenging than it is for state-of-the-art CMOS circuits because of the following reasons: (i) Clock frequencies in SFQ circuits are in 25-50GHz range, which call for much tighter control of the clock skew; (ii) Nearly all SFQ combinational gates (exceptions being splitter cells and Josephson junction (JJ) transmission line cells) receive a clock signal in order to pass out their internal state information as a voltage pulse to the output; (iii) Splitter cells are required to distribute the clock signal from the clock source to all combinational gates that receive the clock signal as an input. Process-induced variations in SFQ circuits include variations in: (a) inductance values inside the gates and those associated with the passive transmission lines, (b) biasing current levels that affect both gate and clock splitter delays, and (c) critical current levels of JJs which in turn affects the JJs’ switching characteristics. These variations can thus change the delay of 8 clock splitters employed in the CDN as well as delays of combinational gates in the circuit, which can in turn result in a large number of setup or hold time violations. Consequently, a large portion of timing margins should be dedicated to process-induced clock skews. These constraints translate into a large timing uncertainty in the clock insertion delays and a low timing yield. Synchronous clock networks, designed for high-performance systems, typically use tree structures in which there exists a unique path from the clock source to every clock sink (i.e., flip-flop). Although clock trees consume less power compared with the other clock topologies, they are more susceptible to variations in process parameters and operating conditions. Process-induced clock skews are not known until the clock tree synthesis is done. Accordingly, to increase the resiliency of clock trees to process variations and improve the timing yield, advanced clock topology generation algorithms are required which account for process-induced timing uncertainties. Chapter 6 presents a timing-aware clock tree topology generation algorithm which considers the signal flow and timing in the data path and the worst-case timing slacks to any clock sink considering both wire and splitter cell delays and the total wirelength of the clock tree, while also accounting for the process variations and timing uncertainties. More precisely, we propose an algorithm that generates a balanced binary clock tree topology (CTT), level by level, and in a bottom-up manner. Targeting a near zero-skew clock signal, generated clock trees are height balanced, i.e, there is an equal number of splitters from the root of the tree to any of the leaf nodes (note that in some cases a single output of an splitter cell is used i.e., the splitter will simply function as a delay element). At each level of the tree, the proposed algorithm solves an integer linear programming (ILP) problem to determine which nodes in the clock tree should be paired up (i.e., become siblings by assigning them the same parent node). 
The objective function of this ILP formulation is the weighted sum of the total wire length and total negative slack of the clock tree. By minimizing this objective function, we simultaneously minimize the routing cost of the clock tree and optimize the assignment 9 of sink nodes to appropriate branches of the clock tree to increase the efficacy of common path pessimism removal (CPPR) technique [23], which in turn help control the adverse and uncertain effects of process-induced variations on the worst-case clock skews. Even after deploying timing uncertainty-aware CDN generation algorithm such as the one presented in Chpater 6, EDA flows still needs to utilize efficient techniques to close the timing of the circuit, i.e., fix setup and hold timing violations. In particular, potential hold-time violations are typically resolved by adding clockless buffers into data paths with a negative hold slack. The high operating clock frequencies and gate-level pipelining in SFQ circuits complicates this task. Moreover, during addition of these hold-buffers, one should account for timing variations as well as incurred overheads through some form of static timing analysis (STA) [24] or classic Monte Carlo simulation [25]. In Chapter 7, we present a physical design methodology for timing closure (hold buffer insertion) in SFQ circuits. Our approach is timing variation-aware and reduces the total layout area and performance overheads by applying common-path pessimism removal (CPPR) to remove the pessimism associated with the common clock paths to pairs of sequentially adjacent gates [26] [27]. Furthermore, we present an incremental placement algorithm to place the added buffers and minimize the perturbations to the original placement solution to further preserve the layout area and minimize the overheads. In particular, (i) we develop the first timing variation-aware hold time fixing approach for SFQ circuits. Our approach considers both local and global timing uncertainties and worst- case scenarios in terms of hold slacks and effectively takes advantage of the common path pessimism removal technique to reduce the number of inserted hold buffers on each timing path. (ii) We present an incremental placement methodology for hold buffer to generate high quality solutions in terms of placement metrics, such as the layout area and maximum clock frequency. (iii) We evaluate the presented approach using dynamic timing analysis with a grid-based placement-aware variation model [25] on multiple ISCAS’85 benchmark circuits and the functionalities of circuits are verified via Monte Carlo co-simulation with 10 behavioral netlists. Our methodology enables a trade-off between timing yield and layout area by tuning algorithmic parameters. The efficacy of the proposed method is evaluated using ten benchmarks from ISCAS’85 benchmark suite [28]. Compared with a methodology utilizing fixed constant margins for fixing all timing paths [25], our method significantly reduces the number of inserted hold buffers with competitive timing yields. Robustness of an SFQ gate may be defined as the tolerance of the gate to variations in electrical parameters of various underlying components. Accordingly, margins are defined as the amount of acceptable variations in external parameters (e.g., bias current) and internal parameters (e.g., JJ critical current and inductance values) for which the cell continues to function correctly. 
In other words, margins of a parameter are upper and lower limits for which the cell will produce the correct output while all other parameters are held at their nominal values. Margin calculations can be used for optimizing parameter values of library cells, where the objective will be to design SFQ standard cells of minimum delay, area, or power dissipation with acceptable margins. Given the margin for a circuit parameter p, and equipped with the knowledge of the (marginal) probability distribution of parameter values, one can calculate the error rate of the SFQ standard cell i with respect to changes in the said cell parameter as err(i,p). SFQ gates are implemented using two-terminal Josephson junctions. This characteristic decreases the input and output impedance, increases the sensitivity of the circuit to variations of the component parameters and consequently complicates the design of a single gate [29]. Fabrication process is also significantly different from conventional CMOS process, and process variations are larger in current fabrication processes [29]. Additionally, in contrast to CMOS designs which primarily use standard cell libraries, SFQ logic requires custom design of cell structures and parameters for specific applications. Also, cell variants are required in cell libraries to allow automated place-and-route, with support for different layouts depending 11 on the configuration of inputs/outputs. Cell variants are used based on trade-offs between propagation speed, operating margins and bias current [17] [30]. To develop reliable cell structures prone to process variations, accurate methods for reliability evaluation are necessary. Critical margin calculation and yield analysis using Monte-Carlo (MC) based simulations have been extensively used to estimate the robustness of a logic cell [29] [31]. Yield of a cell is defined as the ratio of number of cells operating correctly to the total number of cells. Assuming each parameter in the cell structure to have a normal distribution with specified mean and standard deviation, yield can be quantified using MC simulations. Although calculated yield may be a good indicator of cell reliability, it is computationally expensive. Furthermore, using MC based yield estimation throughout the cell optimization process is inefficient. This is primarily due to the fact that a large number of simulations should be performed at each iteration of the optimization process and after each change in the component parameters. Therefore, it may only be used at the final stage of the optimization process to quantify the robustness of the optimized cells. A simpler method of robustness evaluation called Critical Margin calculation has been widely used in the literature [12] [29] [31]. In this method, a binary search over a predefined range of values for each parameter is performed while all other parameters are fixed at their nominal values. Every time a parameter is changed, cell is tested. Upper and lower bound values for each parameter calculated in this manner are denoted as parameter margins. Critical margin is calculated as the smallest margin among all parameters in the cell. However, as this method only considers alteration of each parameter while other parameters are fixed at their nominal values, it does not capture the interdependence among different parameters. Hence, critical margin fails to evaluate the robustness of a cell accurately [31]. 
The focus of Chapter 8 is to present new margin calculation methods for SFQ cells with a large number of parameters. The primary goal of these methods is to calculate a set of margins for which the cell yield will be nearly one if all parameters lie within the specified margins.

1.2 Dissertation Organization

This dissertation is organized as follows. Chapter 2 reviews some of the basic concepts and background material about single flux quantum logic, physical design, and margin and yield calculation. In Chapters 3–7, we present our methodologies for timing-driven and global placement, clock tree topology generation and physical synthesis, and timing closure for large-scale single flux quantum circuits. The proposed algorithms for margin calculation and yield estimation for superconducting circuits are presented in Chapter 8. The dissertation is concluded in Chapter 9.

Chapter 2
Preliminaries

2.1 Single Flux Quantum Logic

The main advantages of superconductor technology can be summarized as high operation speed combined with low power consumption. A basic T flip-flop has been demonstrated at 750 GHz. The power consumption of superconductor circuits is a few orders of magnitude lower than that of semiconductor circuits [32]. The switching energy of a typical 200 µA junction is $4\times 10^{-19}$ J. Superconductor technology has applications in ultra-fast digital signal processing (DSP) circuits, network switching, and supercomputing. A 20 GHz microprocessor based on the 4 kA/cm², 1.75 µm low-Tc niobium process, which includes 25,000 Josephson junctions on a 5 mm × 5 mm chip, has been designed [33]. Superconductor integrated circuits need to operate under special conditions (e.g., low-Tc superconductor (LTS) circuits operate at a few degrees Kelvin with a cryocooler or immersed in liquid helium) [33]. Information in SFQ technology is represented by short, picosecond voltage pulses of quantized area rather than dc voltage levels (as in CMOS). These SFQ pulses can be generated, reproduced, memorized, and amplified using elementary components called Josephson junctions (JJs) [12]. The Josephson junction, an active device in superconductor electronics, is the basic circuit element in SFQ logic. It is a two-terminal device consisting of an electrically weak contact between two superconductor electrodes. The area of the pulses in SFQ logic and the current-phase and voltage-phase relations in JJs can be quantified as follows:

$$\int V(t)\,dt = \Phi_0 \cong \frac{h}{2e} \cong 2.07\ \mathrm{mV\cdot ps} \qquad (2.1)$$

$$\frac{\partial \phi}{\partial t} = \frac{2\pi}{\Phi_0} V(t), \qquad J_s(\phi) = J_c \sin(\phi) \qquad (2.2)$$

In Eq. (2.1), $\Phi_0$ represents a single quantum of superconducting flux, $V(t)$ denotes the voltage across the junction, $h$ is the Planck constant, and $e$ is the electron charge. In Eq. (2.2), $\phi$ and $J_c$ denote the phase of the junction and the critical current density, respectively. If the current density through a JJ is greater than $J_c$, a voltage pulse $V(t)$ is formed across the junction. The JJ then exits the superconducting state and enters the normal state. Once the JJ returns to the superconducting state, its phase has gone through a 2π-leap. For a complete description of magnetic flux quantization and the structure and operation of SFQ cells, please refer to [12]. RSFQ circuits are composed of Josephson junctions, inductors, and bias current sources. Also, each junction is shunted with an external resistor. All the basic RSFQ circuit components can be divided into two categories: asynchronous and synchronous components.
Asynchronous components are not clocked and include simple elements such as active Josephson transmission lines (JTLs), splitters, buffers, and confluence buffers. They are used as the connections, forks, and mergers in the logic. The asynchronous components are transparent to the input signals; the signals ripple through them, and the outputs are generated shortly after the inputs arrive. They are used for connections and in sequential logic [32]. All the synchronous components are clocked and contain internal memory. The incoming data set the logic states of the internal memories. The information is stored there until the arrival of a clock pulse releases it to the output. The basic synchronous components are the latches. Most synchronous RSFQ gates are formed as combinational logic followed by a latch [32]. Fig. 2.1(a) depicts the schematic view of a splitter cell. It consists of several JJs (J), inductors (L), and DC bias currents (I). An input pulse triggers a 2π-leap in J1, which generates a voltage pulse. This pulse flows through J2 and J3 and generates pulses at the outputs B and C. The schematic diagram of a 2-input OR gate is shown in Fig. 2.1(b). This gate is more complicated than the splitter cell, as it consists of a merger and a flip-flop cell and receives a clock input.

Figure 2.1: Circuit diagrams for (a) the splitter cell and (b) the 2-input OR cell (OR2).

SFQ pulses can be transferred using either a Josephson transmission line (JTL) or a passive transmission line (PTL). A JTL is used to propagate SFQ pulses over short distances, while a PTL is used to transport signals over longer distances [3]. Hence, PTLs are used for data and global clock routing, while JTLs are used for short-distance signals, such as local clock networks. JTL cells must be abutted to each other to be able to transfer data. Additionally, JTL cells occupy area and cannot be placed on top of each other, as they contain active elements. Hence, they may only be used for short-distance data propagation.

2.1.1 Margin and Yield

Margins define the ranges of process and circuit parameter values under which a logic cell functions correctly. For instance, a splitter cell will operate correctly when the biasing current level of the JJ J1 is between 90 µA and 115 µA, with a nominal value of 100 µA. Margins are typically reported as the percentage of deviation from the nominal values, so for the splitter cell, the margins of J1 are specified as [−10%, +15%]. Assume a cell with n parameters ⟨v_1, ..., v_n⟩. Each parameter $v_i$ has a nominal value of $V_i$ and margins of $[NM_i, PM_i]$, where NM and PM stand for negative and positive margins, respectively. Then the positive critical margin (PCM) and negative critical margin (NCM) are defined as follows:

$$\mathrm{PCM} = \min_i PM_i \qquad \text{and} \qquad \mathrm{NCM} = \min_i |NM_i| \qquad (2.3)$$

For instance, consider the margins of the inductance L1 in the splitter cell to be [−8%, +17%]. The PCM and NCM for the splitter cell (assuming only the two parameters J1 and L1) are thus +15% and 8%, respectively. The parametric yield of a cell can be defined as the percentage of correctly functioning cell instances. In practice, each parameter tends to exhibit a probability distribution function (PDF) for its values. To calculate the parametric yield of a cell, Monte Carlo (MC) simulations are performed where the cell parameter values are randomly chosen according to their respective PDFs.
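As an illustration of this procedure, the following is a minimal Monte Carlo yield estimation sketch. It assumes independent Gaussian parameters and, as before, a hypothetical cell_passes(params) simulation callback; the sampling model and sample count are illustrative assumptions rather than the actual setup used later in this dissertation.

```python
import random
from typing import Callable, Dict

def mc_yield(cell_passes: Callable[[Dict[str, float]], bool],
             nominal: Dict[str, float], sigma_rel: Dict[str, float],
             n_trials: int = 10_000, seed: int = 0) -> float:
    """Estimate parametric yield: the fraction of sampled cell instances that
    still function when every parameter is drawn from N(nominal, sigma)."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(n_trials):
        sample = {p: rng.gauss(v, sigma_rel[p] * v) for p, v in nominal.items()}
        if cell_passes(sample):
            passed += 1
    return passed / n_trials
```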
Although the calculated yield based on MC simulations can estimate the robustness quite accurately, MC simulations are computationally expensive.

2.2 Physical Design

2.2.1 Introduction

Physical design is one of the main steps in the overall design flow of very large scale integrated (VLSI) circuits. It determines the locations of all design components (devices and interconnects) while preserving the overall functionality of the circuit and satisfying a set of manufacturing constraints [34]. The overall flow for physical design of VLSI circuits is depicted in Fig. 2.2. Logic synthesis generates a netlist comprising the circuit components (i.e., cells) and their connections (i.e., nets). This gate-level netlist, along with a technology file, is passed to the physical design flow. The technology file contains information about the circuit components, such as their shapes, dimensions, and the locations of their input/output pins, information about the metal layers, and a set of design rules and manufacturing constraints that must be satisfied. The output of the physical design flow is a layout which contains the locations of all cells and I/O pads and the routing of all interconnects. A common approach to decrease the design complexity of large-scale circuits is to partition the circuit into smaller modules while minimizing the number of connections among modules. Accordingly, each module (part) can later be placed and routed with some degree of independence, in parallel [35]. Circuits may also include pre-defined modules such as cache memories and IP cores. Each of these modules has a shape function, capturing various (horizontal and vertical) extent combinations for the module. Floorplanning determines the positions and shapes of modules so as to minimize the chip area and interconnects, and hence improve timing [36]. Placement determines the locations of the cells of each module (netlist) on a layout surface such that manufacturing constraints are satisfied [37]. Clock network synthesis connects the clock signal to all the sequential circuit components, and timing closure makes sure all the timing constraints are met [38]. Finally, routing lays out the nets in the netlist, trying to minimize the total wirelength while satisfying manufacturability constraints [39]. The inputs to a placement problem are the circuit elements (i.e., cells), their connections to each other (i.e., nets), their geometrical properties (i.e., width, height, and locations of their corresponding pins), and the properties of the layout surface (i.e., total width, height, number of rows, etc.). The goal is to find the best location for each circuit component such that some cost function (such as total wirelength) is minimized. Placement is a key step in the overall flow of physical design of VLSI circuits, as it directly impacts the routability, performance, and power consumption of a design [40]. In the next section the placement problem is formally defined and its steps are described in detail.

Figure 2.2: Physical design flow of VLSI circuits.

2.2.2 Placement

2.2.2.1 Objective Functions

Placement determines the locations of circuit elements on a layout surface by optimizing one or more objective functions while satisfying several constraints.
The main objective functions for placement can be categorized as follows:
• Total wirelength: This is the most commonly used objective function, as it indirectly affects the routability, timing, and power consumption of the design. By minimizing the total wirelength, the routing tool can finish laying out all the nets more easily. Additionally, lower wirelength improves the delay of different paths in the circuit and enhances the overall performance, since performance is a function of path delays. Moreover, minimizing wirelength reduces the total capacitive load of the wires, which in turn leads to a reduction in the dynamic power consumption of the circuit. It should be noted that minimizing the total wirelength may not necessarily reduce the congestion in all parts of the chip or reduce the length of critical timing paths. However, by adding weights to some nets (i.e., critical nets), minimizing the total weighted wirelength may lead to better results in terms of routability, timing, and power consumption [40].
• Routability: A basic characteristic of a good placement solution is that the router should be able to complete the routing. A good placement should distribute the components evenly to balance the cell congestion and routing demand throughout the layout area. However, estimating the routability of a placement solution is a difficult task, as it is router dependent and requires computationally expensive algorithms, such as global routing or probabilistic models, to estimate the congestion in different parts of the chip [40]. Therefore, routability is not used as the main objective function for placement and is typically optimized as a secondary objective using techniques such as cell bloating and whitespace allocation in the congested areas of the chip [41][42].
• Timing: The performance of a circuit depends on the delays of logic cells and interconnects. With advancements in VLSI technology and reductions in feature size, gate delay continues to decrease. As a result, interconnect delay becomes the determining factor in the overall performance of a circuit, and therefore minimizing the interconnect delay is crucial for improving the performance of a circuit. However, performance also depends on the sizes of the gates, the number and sizes of buffers inserted in a path, the widths of the interconnects, clock skew, etc. It is computationally expensive to include all these factors in an objective function. Additionally, by minimizing the total wirelength, the lengths of some of the nets are reduced, which indirectly improves timing. Therefore, timing becomes a secondary objective function and is typically improved after the placement and clock tree synthesis stages using timing-driven detailed placers [43][40].
The most common approach in modern placement is minimizing the total wirelength as the main objective function [37]. A wirelength-driven placement provides a good initial solution for a multi-objective problem with several constraints. Other objective functions are typically optimized in the next steps. The main constraint for a placement problem is the legality of the final solution: there should be no overlaps between the placed cells, as overlaps violate the manufacturability constraints. Placement is a computationally difficult problem [40]. Minimizing the total wirelength of a circuit given unit-size cells and 2-pin nets along a straight line is NP-complete [44]. Assume a netlist N = (V, E), where V denotes the set of all cells and E denotes the set of all nets connecting these cells together and to the I/O pins.
Placement seeks to find the locations of all the cells while minimizing the total wirelength, subject to all cells being placed within the boundaries of the layout area with no overlaps between them. The placement problem can be formulated as follows:

$$\text{minimize} \quad \sum_{j=1}^{|E|} WL_{n_j} \qquad (2.4)$$

$$\text{subject to} \quad 0 < x_i < W_A - W_i, \quad \forall c_i \in V \qquad (2.5)$$

$$0 < y_i < H_A - H_i, \quad \forall c_i \in V \qquad (2.6)$$

$$x_i + W_i \le x_j, \quad \forall c_i, c_j \in V \text{ where } i \ne j,\ x_j \ge x_i \qquad (2.7)$$

$$y_i + H_i \le y_j, \quad \forall c_i, c_j \in V \text{ where } i \ne j,\ y_j \ge y_i \qquad (2.8)$$

where $x_i$ and $y_i$ denote the location of the lower-left corner of cell i, and $W_i$ and $H_i$ denote the width and height of cell i, respectively. It is assumed that the layout area is a rectangular region with its lower-left corner at (0, 0) and its upper-right corner at $(W_A, H_A)$. The objective is to minimize the total wirelength (2.4). Constraints (2.5) and (2.6) ensure that all cells are placed inside the layout area. Constraints (2.7) and (2.8) ensure that there is no overlap between cells. Note that constraints (2.7) and (2.8) are crucial for the legality of the solution: without them, assuming there are no fixed cells (i.e., macros), the optimal solution would be to place most of the cells on top of each other at the same location (the wirelength of most of the nets would be 0).

2.2.2.2 Wirelength Modeling

The exact wirelength of a net cannot be predicted, as it depends on the routing algorithm, congestion, etc. Therefore, multiple approaches have been proposed to model the wirelength. One approach is based on the rectilinear minimum spanning tree (RMST). Given a set of nodes and two-pin connections between them, an RMST connects them with minimum total wirelength in Manhattan distance. Thus, wirelength can be modeled by constructing such a tree and calculating its total wirelength. However, this method is computationally expensive ($O(n \log n)$, where n is the number of nodes) and is inexact for multi-pin nets [45]. A highly accurate method is based on the rectilinear Steiner minimal tree (RSMT). An RSMT is a tree that connects a set of n points with minimum total edge length in Manhattan distance [40]. The RSMT is the preferred way to route a net, as it minimizes the wirelength. Therefore, the RSMT wirelength is highly correlated with the routed wirelength in the absence of heavily congested areas and the need for detour wires in routing. RSMT construction is NP-complete [46], and the proposed heuristics are time-consuming. One simple method for wirelength modeling is the half-perimeter wirelength (HPWL), which is equal to half of the perimeter of the smallest bounding rectangle that encloses all the pins of a net. If the set of cells connected to a net N is denoted as C, and each cell $c_i$ has coordinates $(x_i, y_i)$, the HPWL of net N is defined as:

$$HPWL_N = HPWL_{N_x} + HPWL_{N_y} \qquad (2.9)$$

where

$$HPWL_{N_x} = \max_{i=1}^{|C|}(x_i) - \min_{i=1}^{|C|}(x_i) \qquad (2.10)$$

$$HPWL_{N_y} = \max_{i=1}^{|C|}(y_i) - \min_{i=1}^{|C|}(y_i) \qquad (2.11)$$

This method calculates the wirelength of L-shaped routed nets with two and three pins exactly, and its time complexity is O(k), where k is the number of pins connected to the net. However, HPWL can significantly underestimate the wirelength of nets with more than three pins [40]. To estimate the HPWL for multi-pin nets (k ≥ 4), various approaches have been proposed, which are explained in the next subsection. An example of a multi-pin net, along with different wirelength estimation techniques for that net, is depicted in Fig. 2.3.

Figure 2.3: Wirelength models for a multi-pin net: (a) locations of the pins of a net, (b) RMST, (c) RSMT, (d) HPWL.
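Because HPWL is used repeatedly in the rest of this chapter, a minimal sketch of Eqs. (2.9)–(2.11) is given below. The pin-coordinate container is an assumption for illustration only.

```python
def hpwl(pins):
    """Half-perimeter wirelength of one net, per Eqs. (2.9)-(2.11).
    `pins` is an iterable of (x, y) coordinates of the pins on the net."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def total_hpwl(nets):
    """Sum of per-net HPWL over a list of nets (each a list of pin coordinates)."""
    return sum(hpwl(net) for net in nets)

# Example: a 4-pin net
print(hpwl([(0, 0), (2, 3), (5, 1), (4, 4)]))  # (5 - 0) + (4 - 0) = 9
```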
2.2.2.3 Net Models

Various net models have been proposed to transform multi-pin nets into two-pin nets to improve the HPWL estimation. Three of these models are explained below.
• Clique Model: In this model, a k-pin (k ≥ 4) hyper-net is transformed into a complete graph with k nodes and $\frac{k(k-1)}{2}$ edges. Assuming the original hyper-net has a weight of c, the weight of each 2-pin net in the clique model is set to $\frac{c}{k-1}$ or $\frac{2}{k}$. This model creates a large number of edges, especially for nets with k ≥ 5 pins, and its runtime is a quadratic function of k (i.e., $O(k^2)$). The problem with the clique model is the connections between inner pins and the fact that the lengths of these connections contribute to the clique length, whereas the HPWL is just the distance between the boundary pins [47].
• Star Model: The star model transforms a k-pin hyperedge into k edges by introducing a star node and connecting it to all the nodes. This model creates all the edges in linear time (i.e., O(k)).
• Bound-to-Bound (B2B) Model: The basic idea of the Bound-to-Bound model is to remove all inner two-pin connections (as created by the clique model) and to utilize only connections to the boundary pins. This approach models the HPWL better than the two previous approaches, since the HPWL is determined by the boundary pins of a net [47]. Bound-to-Bound creates different graphs (with the same number of edges, but different connections and edge weights) in the x and y directions. In each direction, it first identifies the nodes at the extreme coordinates (e.g., the nodes with $x_{\min}$ and $x_{\max}$); we refer to these two pins as the boundary pins. The B2B model then creates edges from all internal nodes to each of these two boundary pins (2(k−2) edges) and one edge between the two boundary pins (a total of 2k−3 edges). The edge weights in this model are calculated as follows (see the short code sketch given below):

$$w^x_{ij} = \frac{2}{k-1} \cdot \frac{1}{|x_i - x_j|} \qquad (2.12)$$

$$w^y_{ij} = \frac{2}{k-1} \cdot \frac{1}{|y_i - y_j|} \qquad (2.13)$$

With these connection weights, the quadratic cost function of a net is exactly the HPWL in each coordinate [47]. Examples of a multi-pin net and the clique, star, and B2B net models in the x direction are depicted in Fig. 2.4.

Figure 2.4: Various net models for transforming a multi-pin net into two-pin nets: (a) a multi-pin net, (b) the clique model, which creates a complete graph, (c) the star model (the star node is shown in red), (d) the Bound-to-Bound model in the x coordinate (the boundary pins are connected to all internal pins).

To overcome the complexity of solving the placement problem, row-based placement techniques are widely used. These techniques place (fixed-height, but variable-width) cells in rows on the chip. Furthermore, to efficiently place netlists with millions of cells, the following three-step placement algorithm is typically used:
• Global placement: the non-overlapping constraint is ignored in this step, and approximate locations of cells are obtained by placing cells in global bins. The main focus of the global placer is to optimize the cost function by iterating between the solution of some mathematical program (e.g., a constrained quadratic or linear programming problem) and a cell spreading (or bi-partitioning) step.
• Legalization: the output of the global placement must be legalized to remove any cell overlaps.
• Detailed placement: the legalized solution is further refined by using local adjustments such as cell movement or swapping to reduce wirelength. The placement legality must be preserved during detailed placement. Legalization and detailed placement are sometimes grouped as one step.
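Returning to the B2B weights of Eqs. (2.12)–(2.13), the following is a minimal sketch of how the B2B edges for one net might be constructed in the x direction. The data layout and the small epsilon used to avoid division by zero are illustrative assumptions, not the construction used in Kraftwerk2 [47] verbatim.

```python
def b2b_edges_x(xs, eps=1e-6):
    """Bound-to-Bound edges and weights in the x direction for one k-pin net
    (Eq. (2.12)). Returns (i, j, w_ij) tuples over pin indices."""
    k = len(xs)
    lo = min(range(k), key=xs.__getitem__)      # boundary pin at x_min
    hi = max(range(k), key=xs.__getitem__)      # boundary pin at x_max
    if lo == hi:                                # degenerate net (all pins at same x)
        return []
    def w(i, j):
        return 2.0 / ((k - 1) * max(abs(xs[i] - xs[j]), eps))
    edges = [(lo, hi, w(lo, hi))]               # edge between the two boundary pins
    for i in range(k):
        if i not in (lo, hi):                   # every inner pin -> both boundary pins
            edges.append((i, lo, w(i, lo)))
            edges.append((i, hi, w(i, hi)))
    return edges                                # 2(k-2) + 1 = 2k - 3 edges
```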
The global placement step is the most important of the three, as it has the largest impact on the quality of the final solution. Legalization aims at removing overlaps between the cells while perturbing the global solution as little as possible. Detailed placement improves the legalized solution via local moves. Each of these steps is described in the following sections.

2.2.3 Global Placement

Global placement tries to generate a solution by minimizing the objective function (i.e., total wirelength) while allowing a small amount of overlap between the cells. The locations of some of the large cells (i.e., macros), such as memories or IP cores, may be fixed on the layout area, which further constrains the problem. State-of-the-art wirelength-driven global placement algorithms are classified into three main categories: stochastic, partitioning-based, and analytical approaches.
• Stochastic methods such as TimberWolf utilize non-deterministic algorithms such as simulated annealing to minimize the total wirelength [48]. TimberWolf uses three kinds of moves: moving a cell to a new location, swapping two cells, and mirroring a cell's horizontal location. Simulated annealing based algorithms such as TimberWolf produce high-quality placement solutions for small circuits (with up to a few thousand cells). However, they tend to be increasingly inefficient for larger circuits. To address this issue, a bottom-up hierarchical structure based on recursive clustering is employed, which combines the simulated annealing algorithm with partitioning to improve the runtime.
• Partitioning-based methods such as CAPO recursively divide a circuit into two sub-circuits (and consequently the placement region into two sub-regions) while trying to minimize the cut cost (i.e., the total cost of the nets between the two parts) [49]. They then assign each sub-circuit to one sub-region of the placement area and continue partitioning in each sub-region until the total number of cells in each part is less than a specified number. Finally, a high-quality placer is used to place the small number of cells in each region.
• Analytical methods transform the placement problem into a mathematical program by defining the objective function and constraints as functions of the x and y coordinates of the cells and I/O pins. Analytical methods are the most popular approaches to cell placement due to their scalability and quality of solution [37]. Analytical methods can further be categorized into force-directed quadratic placers, such as SimPL [50], and non-convex optimization techniques, such as mPL6 [51]. In the quadratic placers, the non-overlapping constraint is relaxed to minimize the total wirelength, and then cells are gradually moved towards the empty regions of the chip area to remove the overlaps. On the contrary, non-linear optimization techniques solve the wirelength minimization problem in the presence of the non-overlapping (density) constraints, using more sophisticated numerical analysis methods. Although non-linear techniques achieve higher quality results compared with quadratic techniques, their runtime is significantly longer [37]. In the next subsections these two categories are described in more detail.

2.2.3.1 Quadratic Placement

Using the HPWL to model the wirelength (cf. Section 2.2.2.2), the objective function is not differentiable. Additionally, there are $O(n^2)$ constraints, with the number of cells n being up to millions in modern designs. Therefore, Equation (2.4) is not a practical formulation of the placement problem. In practice, the wirelength is approximated by differentiable and convex functions, and the non-overlapping constraints are relaxed by simpler constraints that make the cell and whitespace distribution roughly even throughout the layout area [40]. In quadratic placement, the objective function of Equation (2.4) is transformed into a quadratic function of the x and y coordinates of the cells. This leads to a sequence of convex quadratic programs, i.e., mathematical programs with a convex quadratic objective function and linear constraints [40]. Quadratic wirelength formulations are used in many placers, such as SimPL [50], ComPLx [52], Kraftwerk2 [47], and POLAR [53]. The wirelength of a two-pin net n connecting cells i and j can be expressed as a quadratic function as follows:

$$WL_{n_j} = (x_i - x_j)^2 + (y_i - y_j)^2 \qquad (2.14)$$

As a result, the placement formulation can be re-written as follows:

$$\text{minimize} \quad WL(x, y) = \sum_{k=1}^{|E|} w_k \left[ (x_i - x_j)^2 + (y_i - y_j)^2 \right] \qquad (2.15)$$

where $w_k$ denotes the weight of net k, and $x_i$ and $x_j$ denote the x coordinates of the cells connected by net $n_k$. Note that multi-pin nets in a design can be transformed into two-pin nets using the models presented in Section 2.2.2.3. This quadratic wirelength function is convex and differentiable, in contrast with the linear wirelength function, which is non-differentiable:

$$WL_{n_j} = |x_i - x_j| + |y_i - y_j| \qquad (2.16)$$

Additionally, since in the quadratic formulation of Equation (2.14) there are no common terms between the x and y variables, the problem can be solved separately for the x and y coordinates. Assume cells 1, ..., m are movable and cells m+1, ..., |V| are fixed. Let D be a diagonal matrix of size m×m where $d_{ii} = \sum_{k=1}^{|V|} w_{ik}$, and let C be the connectivity matrix among movable cells such that $c_{ij} = c_{ji} = w_{ij}$. Define $Q = D - C$, which represents the connections among movable modules. Let B be a vector of size m×1 such that $B_{i1} = -\sum_{k=m+1}^{|V|} w_{ki} x_k$ (this vector represents the connections between movable and fixed cells). The formulation of (2.15) can be re-written in the x coordinate as follows:

$$\text{minimize} \quad WL(\vec{x}) = \vec{x}^T Q \vec{x} + B^T \vec{x} + e \qquad (2.17)$$

where e is a constant term. The objective function in the above formulation is convex, and the global minimum can be obtained by setting the partial derivatives to zero and solving a system of linear equations. The x coordinates of all the movable cells can be calculated as follows:

$$\frac{\partial WL(\vec{x})}{\partial \vec{x}} = 0 \qquad (2.18)$$

$$Q\vec{x} + B = 0 \qquad (2.19)$$

As shown, by relaxing the non-overlapping constraints, the placement problem is reduced to solving a system of linear equations. Q is a symmetric positive definite matrix (provided each movable cell is connected, directly or indirectly, to at least one fixed cell) and thus invertible, which results in a unique globally optimal solution. The quadratic formulation is similar to the classical mechanics problem of finding the equilibrium configuration for a system of objects attached to zero-length springs. In this system, springs represent the nets between nodes. Movable nodes create attractive forces while fixed nodes create repulsive forces on movable objects. Minimizing the quadratic wirelength function is equivalent to finding the minimum potential energy configuration for a system of springs. Extra forces can also be applied to discourage overlap between cells. These algorithms are referred to as force-directed placement algorithms [54]. For more information, please refer to Section 11.5.2.2 of [40].
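A minimal sketch of assembling and solving this linear system (x direction, two-pin nets) is shown below; it corresponds to Qx = −B in the sign convention of Eq. (2.19). The net and fixed-pin data layout is an assumption for illustration, and the diagonal accumulates the weights of connections to both movable and fixed cells, which is what makes Q invertible.

```python
import numpy as np

def solve_quadratic_x(num_movable, nets, fixed_x):
    """Solve the x-direction quadratic placement implied by Eq. (2.19) for
    two-pin nets.  `nets` is a list of (i, j, w); cell indices >= num_movable
    are fixed, and `fixed_x` maps each fixed index to its x coordinate."""
    Q = np.zeros((num_movable, num_movable))
    b = np.zeros(num_movable)
    for i, j, w in nets:
        for a, c in ((i, j), (j, i)):
            if a < num_movable:
                Q[a, a] += w                        # degree (diagonal) term
                if c < num_movable:
                    Q[a, c] -= w                    # movable-movable connection
                else:
                    b[a] += w * fixed_x[c]          # movable-fixed connection
    # Minimizing x^T Q x - 2 b^T x  =>  Q x = b (unique when Q is positive definite)
    return np.linalg.solve(Q, b)

# Two movable cells (0, 1) and two fixed pins (2 at x=0.0, 3 at x=10.0):
nets = [(0, 2, 1.0), (0, 1, 1.0), (1, 3, 1.0)]
print(solve_quadratic_x(2, nets, {2: 0.0, 3: 10.0}))   # -> [3.333..., 6.666...]
```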
Although using this method one can quickly calculate the optimal solution in terms of wirelength, the generated cell locations include a large amount of overlap. Therefore, to discourage overlaps between the cells in an analytical framework, an even distribution of cells is targeted [40]. There are two methods of distributing the modules throughout the layout area. The first is to add center-of-mass constraints to prevent modules from clustering together [55]. The second is to add forces that pull modules from dense regions to sparse regions [50]. In both methods, the constraints/forces are added gradually, in an iterative fashion, to spread out the modules. The basic requirement for the additional constraints/forces is that they should allow the wirelength minimization problem to still be formulated as a convex quadratic program.

2.2.3.2 Non-convex Optimization based Placement

The second category of analytical methods formulates the placement problem as a nonlinear programming problem by substituting exact wirelength metrics and/or non-overlapping constraints with approximate metrics and/or constraints [40]. For instance, the non-overlapping constraints may be replaced by bin density constraints. In this method, the placement region is divided into multiple bins by a uniform grid, and a target density, in terms of the ratio of cell area to whitespace area, is specified. Examples of placers in this category are mPL [51] and NTUPlace [56]. These methods use the log-sum-exponential wirelength function [57] to calculate the total wirelength more accurately compared with the popular HPWL method. To smooth the density function, NTUPlace uses a bell-shaped function and mPL uses an inverse Laplace transformation [40]. With an objective function and constraints that are continuously differentiable, the placement problem can be solved using nonlinear programming techniques. As mentioned earlier, these methods suffer from high runtime and are not scalable to millions of cells. Therefore, [51] proposes a multi-level approach that clusters the cells and creates a coarse-grained netlist with fewer cells than the original netlist, places the cell clusters using the non-linear programming approach, and then gradually un-clusters the netlist and places the cells in each cluster to minimize the total wirelength.

2.2.4 Legalization

Legalization is performed to remove the remaining overlaps between the cells while minimizing the perturbation to the global placement and satisfying several constraints. Legalization is required for all three global placement approaches mentioned earlier. In the stochastic methods such as simulated annealing, overlaps are allowed, especially in the initial steps. In the partitioning-based methods, after multiple rounds of partitioning, modules that end up in the same part (sub-region of the original region) may still overlap with each other. In the analytical approaches, the non-overlapping constraints are replaced by density constraints. Legalization methods are categorized into two groups: (i) heuristic methods that solve the problem using a local view and (ii) formal approaches that formulate the problem as a mathematical program. Tetris uses a greedy approach to legalize the cells in a design [58]. It first assigns cells to rows such that the total displacement is minimized. It then sorts the cells in each row by their x coordinates and greedily packs the cells from left to right.
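The following is a minimal sketch of this greedy, Tetris-style idea: cells are visited in order of their x coordinates and each is placed in the row that minimizes its displacement, at the leftmost non-overlapping, site-aligned position. The cell/row representation and the exact packing rule are simplifying assumptions; the actual algorithm in [58] and production legalizers handle cell widths, blockages, and densities more carefully.

```python
import math

def tetris_legalize(cells, num_rows, row_height, row_width, site=1.0):
    """Greedy Tetris-style legalization sketch.
    `cells` is a list of (x, y, width); returns a legalized (x, y) per cell."""
    order = sorted(range(len(cells)), key=lambda i: cells[i][0])   # left to right
    frontier = [0.0] * num_rows              # leftmost free x in each row
    placed = [None] * len(cells)
    for i in order:
        x, y, w = cells[i]
        best = None
        for r in range(num_rows):
            new_x = math.ceil(max(frontier[r], x) / site) * site   # snap up to a site
            if new_x + w > row_width:
                continue                      # no room left in this row
            disp = abs(new_x - x) + abs(r * row_height - y)
            if best is None or disp < best[0]:
                best = (disp, r, new_x)
        if best is None:
            raise ValueError("row capacity exceeded")
        _, r, new_x = best
        frontier[r] = new_x + w
        placed[i] = (new_x, r * row_height)
    return placed
```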
The authors in [59] further improved Tetris by discouraging uneven rows (in terms of the total number of cells in each row). They enable moving a cell to a different row by adding a penalty function for vertical displacement and gradually reducing that penalty to avoid getting stuck in the presence of completely full rows. Heuristic methods may lead to large cell movements as a result of trying to resolve violations locally while ignoring cell density and timing. Formal approaches formulate legalization as a mathematical program using methods such as network flow [60] or diffusion [61][62]. In BonnPlaceLegal [60], a modified version of the successive shortest path algorithm computes the shortest paths from source (overflowed) bins to sink bins. In EhPlacer [62], a flow-based approach is presented that legalizes the design while minimizing the maximum and average cell movements.

2.2.5 Detailed Placement

In the detailed placement step, the solution is further refined by locally rearranging the standard cells (using cell movements and/or cell swaps). Detailed placement should maintain the legality of the placement solution (i.e., cell overlaps are not allowed): cells should remain aligned to the sites in the rows, the required space between adjacent cells should be maintained, and density constraints should not be violated. Depending on the quality of the global placement and legalization, there may be room for wirelength improvement in a legalized solution for several reasons [40]. First, global placement typically uses inaccurate wirelength models (e.g., cut cost, quadratic wirelength with the clique net model, or the log-sum-exponential function). Second, global placement algorithms often place each cell into a subregion without paying much attention to the location of the cell within that subregion. Third, during legalization, the wirelength is likely to be worsened by the perturbations [40]. Algorithms such as simulated annealing and branch-and-bound have been used to refine the quality of the placement [63]. However, these methods are slow and can only handle a small number of cells (typically fewer than 10) at a time. A fast and scalable analytical detailed placement algorithm using cell shifting and iterative local refinement is presented in [64]. It uses four major moves to refine the placement. The global swap move first calculates the optimal location of a cell, which is a function of the locations of all the cells connected to that cell. If this optimal location is empty and moving the target cell there introduces no overlap, FastPlace places the target cell there. Otherwise, it assigns a cost function for swapping the cell with an existing cell as a function of the wirelength and the introduced overlap. Vertical swap is similar to global swap, but only moves a cell vertically to the row above or below. Local reordering calculates the optimal ordering of cells inside a small window. Finally, the single-segment clustering technique minimizes the horizontal wirelength by shifting the cells in a segment without changing the cell order. A combination of these moves iteratively improves the quality of the solution.

2.2.6 Clock Network Synthesis

In a synchronous system, a globally distributed clock signal coordinates the order in which data are processed. The clock signal (distributed to all sequential elements such as flip-flops or latches) synchronizes the circuit operation by allowing all data to pass through the sequential elements simultaneously [40].
Every signal transition between sequential elements is referenced to a specific clock edge. If a sequential element receives a clock edge at the wrong time, it may capture the wrong data and the system will fail. As the complexity of a system increases, the number of sequential elements increases, and the clock network (including buffers and nets) becomes larger. The clock signal in modern synchronous circuits has the largest fan-out and travels over the longest distances, which imposes many challenges in delivering the clock to all flip-flops at the required times. Furthermore, the clock signal is affected by technology scaling, in that long global interconnect lines become much more resistive as line dimensions are decreased. Additionally, the clock network (which typically includes a large number of buffers used to improve the timing) is prone to process variations, which affect both interconnect and gate delays. Process variations create reliability issues, make the circuit performance deviate from the design specification, and cause timing yield losses. The above reasons complicate the task of clock network synthesis, which is crucial to the operation of a circuit. The clock tree is the most commonly used structure for clock distribution. Although clock meshes or clock grids are used in some microprocessor designs [65], they are less preferred due to their high routing cost and power consumption. The input to a clock tree synthesis (CTS) engine is the location of all sequential elements and their capacitive loads, as well as the characteristics of the interconnects, such as capacitance and resistance values (specific to a technology), and information about the available buffers (such as delay, area, and input capacitance). The first step is to generate a topology by partitioning the clock sinks and assigning each sequential element to a leaf node in a binary tree. It is then followed by clock tree embedding, which determines the locations of the splitting points of the clock tree in the layout area, buffer insertion to balance the delay to different nodes, and finally routing and optimization steps. A clock distribution network can be characterized by the following criteria.
• Clock phase delay is the delay from the clock source to any of the clock sinks (i.e., sequential elements such as flip-flops or latches). Phase delay increases as the feature size decreases and the chip size increases. Phase delay is typically a combination of gate delay (buffers, clock gating elements, clock dividers) and interconnect delay. As the feature size decreases, the effect of process variations on phase delay increases, which in turn increases the clock uncertainty.
• Clock skew is defined as the difference between the clock arrival times (phase delays) at any two sink nodes i and j in a clock network:

$$\mathrm{skew}_{i,j} = D_i - D_j \qquad (2.20)$$

$$\mathrm{skew}_{\max} = \max_{1 \le i,j \le n} (D_i - D_j) \qquad (2.21)$$

where $D_i$ denotes the phase delay at sink i. Clock skew reduces the available positive time slack and directly limits the maximum clock frequency of a circuit. Additionally, minimizing skew is crucial to prevent hold time violations. Hold time violations cause the flip-flops to operate in a meta-stable state and result in random circuit failures; moreover, these violations cannot be fixed by increasing the clock period.
• Jitter is the random variation in the clock cycle time caused by the clock generation source and the delay variation of clock buffers induced by power supply noise.
Although the skew caused by static variations stays constant from cycle to cycle, time-varying variations in a circuit create temporal variation of the clock period at a given point on the chip. In particular, the worst-case deviation (absolute value) of the arrival time of a clock edge at a given flip-flop with respect to an ideal reference clock edge is called the absolute jitter.
• Timing constraints: Assume the circuit of Fig. 2.5, in which a block of combinational logic connects two flip-flops (FF$_i$ launching and FF$_j$ capturing). Setup and hold times are defined as follows.

Figure 2.5: A combinational logic block connecting two flip-flops.

Setup time: this is the amount of time that the input to the capturing flip-flop should stay valid before the next triggering clock edge arrives. The following inequality summarizes the relation between the clock skew, the clock period, and the setup time:

$$T_p \ge \mathrm{skew}_{i,j} + t^{\max}_{c2Q} + t^{\max}_{comb} + t^{\max}_{setup} \qquad (2.22)$$

where $t^{\max}_{c2Q}$ denotes the maximum clock-to-Q delay of a flip-flop, $t^{\max}_{comb}$ accounts for the maximum delay through the combinational logic (which also includes the interconnect delay), and $t^{\max}_{setup}$ denotes the maximum setup time of a flip-flop. As shown, a positive clock skew increases the required cycle time. On the other hand, a negative clock skew (when the clock signal is received at the launching flip-flop sooner than at the capturing flip-flop) increases the effective clock period. In other words, the propagation delay of the combinational logic block between two sequentially adjacent flip-flops may be longer than the given clock period.
Hold time: to ensure proper propagation of the input signal through a flip-flop, the input must remain valid, or hold steady, for a short duration after the clock edge, referred to as the hold time [40]. The hold time of a capturing flip-flop imposes an additional constraint on the total propagation delay of a signal through the launching flip-flop and the combinational logic, as follows:

$$\mathrm{skew}_{i,j} \ge t^{\max}_{hold} - t^{\min}_{c2Q} - t^{\min}_{comb} \qquad (2.23)$$

In the worst case, the input signal at the capturing flip-flop (j) should remain stable for $t^{\max}_{hold}$ after the clock edge of the same clock cycle arrives. In other words, if the clock signal arrives at FF$_i$ earlier than at FF$_j$, it may cause the input signal of FF$_j$ to change before FF$_j$ can capture it.
• Process variations can be classified into two categories: global systematic and local random variations. In the presence of variations, interconnect and gate delays change. Therefore, although the clock skew may be zero under nominal conditions, unpredictable variations may induce clock skew, which may result in setup/hold time violations. Common approaches to deal with the process variation problem in clock design include sensitivity minimization and non-tree clock routing [66].
Clock network synthesis methods can be divided into three categories: tree networks, non-tree (i.e., mesh) networks, and hybrid networks that combine both the tree and non-tree approaches [66][67]. Non-tree structures are generally robust against process variations because several paths from the clock source to the same sequential element exist that compensate for delay differences. However, these networks consume significantly more power than traditional trees because of their large wiring overhead. Additionally, clock gating (turning off parts of the circuit) is more difficult in these structures, since there exist several paths to each clock sink.
Consequently, to reduce the power consumption of clock networks (which is a large portion of the total chip power), tree structures are more popular for clock distribution, as they save power and routing resources compared with non-tree networks. In the next section, the main clock tree structures and synthesis methods are described. The clock signal can be distributed by a tree network, characterized by the unique paths that deliver the clock signal to every sequential element in the design. A balanced clock tree is simple to construct and to analyze using mathematical models. Additionally, the unique source-to-sink clock path in tree topologies enables skew to be used intentionally to improve performance (i.e., useful skew [68]). With no redundancy in the source-sink paths (unlike non-tree topologies), a clock tree introduces less wiring capacitance and therefore consumes less power to distribute the clock signal, while also enabling clock gating. However, the tree topology is vulnerable to jitter and process variation induced skew. The authors in [69] show that even an exactly zero-skew clock tree can exhibit a large skew induced by the variability of the interconnects and buffers along its paths. In the next section, popular synchronous clock tree structures are introduced.

2.2.6.1 Clocking Structures

Assuming a binary tree structure, a clock tree with h levels of internal nodes can reach $2^h$ flip-flops at the leaf nodes. The most popular clock tree topology is the H-tree. An H-tree uses a recursive H-shaped routing pattern to balance the delay from the source to all the clock sinks. As shown in Fig. 2.6, the paths from the clock source to all sink nodes in an H-tree are perfectly balanced, resulting in a zero-skew clock signal in the absence of process variations. However, this regularly shaped H-tree is only suitable for a uniform sink distribution in the layout area and equal capacitive loads for all sinks. In practice, this is never the case, as sink loads and locations are rarely uniform. Hence, H-trees are typically only useful for the top level of the clock distribution hierarchy. The minimum clock period in a circuit using an H-tree topology can be calculated as follows:

$$T_{\min} = t^{\max}_{c2Q} + t^{\max}_{setup} + t^{\max}_{prop} + \mathrm{skew}_{\max} \qquad (2.24)$$

Figure 2.6: H-tree clock topology. A path from the clock source to one of the sinks is shown in red.

An alternative to H-tree clocking is straight-line clocking, in which the clock path is distributed in parallel to the data path [70][71]. There are two methods for straight-line clocking: (i) counterflow and (ii) concurrent flow clocking. The counterflow clocking structure is shown in Fig. 2.7. In this structure, a skewed clock signal is passed to all the cells in the reverse order of the data flow. In other words, in an array of n gates in which the output of gate i−1 is connected to the input of gate i, the clock is distributed in the reverse direction of data propagation. This clocking scheme has the advantage of high robustness to timing parameter variations, as there will not be any hold time violations: the launching flip-flop always receives the clock signal later than the capturing flip-flop. Therefore, the circuit timing should always be correct at a frequency low enough to satisfy the setup time constraint, even if there are large timing parameter variations [70]. However, this comes at the expense of a drop in throughput, which degrades the performance. In concurrent flow clocking, the clock and the data flow in the same direction (cf. Fig. 2.8), in contrast with counterflow clocking.
The data released by the clock from the first cell of the data path travels simultaneously with the clock signal in the direction of the second cell. The clock arrives at the second cell earlier than the data. The clock releases the result of the cell operation computed during the last clock cycle, preparing the cell for the arrival of the new data [70]. This clocking structure permits a higher clock frequency compared with counterflow clocking; however, it is prone to timing violations.

Figure 2.7: Counterflow clocking. Data and clock signals move in opposite directions.

Figure 2.8: Concurrent flow clocking. Data and clock signals move in the same direction.

2.2.6.2 Clock Synthesis Algorithms

Clock tree synthesis is done in two phases: (i) topology generation, in which an abstract topology is generated that maps the leaf nodes (sinks) of a design to the nodes of a binary tree, and (ii) embedding, in which the exact locations of the internal nodes of the abstract topology are determined. In the following, two popular clock tree synthesis algorithms are described. Clock topology generation is the process of assigning the sink nodes of a design (sequential elements) to the nodes of a binary tree. Various clock tree topology generation methods have been proposed; two of these methods are briefly explained below. The geometric matching algorithm (GMA) for clock tree synthesis uses a recursive bottom-up approach to merge the best geometrically matching candidates during clock tree construction. Assuming n sink nodes, GMA selects n/2 pairs of nodes using bipartite matching algorithms. The algorithm then connects each pair of nodes, creating a set of n/2 segments such that no two segments share an endpoint. On every constructed segment, a tapping point is determined such that the skew between the pair of nodes on that segment is minimized. These tapping points become the new endpoints for the next iteration of the GMA algorithm. The worst-case runtime complexity of the GMA algorithm is $O(n^2 \log n)$, where n denotes the number of clock sinks. The Deferred Merge Embedding (DME) algorithm was developed based on the observation that there are multiple locations for a merging node that satisfy the skew specifications [72]. DME constructs the clock tree in two phases: a bottom-up pass performs merging of the sinks to find all potential zero-skew merging locations. In this step, instead of committing a merging node to a particular location, DME identifies the locus of points for merging two nodes such that the skew between these two nodes is zero (called merging regions). Next, in a top-down traversal of the tree, DME picks one location among the possible options on the merging segments for every internal node such that the total wirelength is minimized. The same basic concept was later generalized to bounded-skew clock tree construction [72]. The DME algorithm requires an initial topology along with the set of input sinks. The algorithm has linear time complexity given the input topology. However, a greedy clustering and matching algorithm is often used to find the topology at the same time as the bottom-up phase is performed.

Chapter 3
Timing Driven Placement

This chapter presents a novel timing driven global placement approach utilizing the alternating direction method of multipliers (ADMM), targeting superconductive electronic circuits.
The proposed algorithm models the placement problem as an optimization problem with constraints on the maximum wire delay of timing-critical paths and employs the ADMM algorithm to decompose the problem into two sub-problems, one minimizing the total wirelength of the circuit and the other minimizing the delay of the timing-critical paths of the circuit. Through an iterative process, a placement solution is generated that simultaneously minimizes the total wirelength and satisfies the setup time constraints. Compared to a state-of-the-art academic global placement tool, the proposed method (called TDP-ADMM) improves the worst and total negative slack for seven single flux quantum benchmark circuits by an average of 26% and 44%, respectively, with an average overhead of 1.98% in terms of the total wirelength.

3.1 Overview

Placement is a key step in the physical design of electronic systems. Algorithms for global placement, the core step in the placement process, typically minimize the total wirelength of a design as the main objective, as it indirectly affects the routability, power consumption, and timing of circuits [73][74]. Although minimizing the total wirelength improves the timing of the circuit in general, it does not directly target optimizing the delay of timing-critical paths. Timing and routability driven placement methods are, therefore, needed. Timing and routability driven optimizations are typically performed after global placement, as incremental steps, to improve the solution quality. The reason is that integrating timing and routability constraints into the global placement flow greatly complicates the problem. As the minimum feature size is scaled down and the chip sizes are increased, the length of interconnects and the ratio of wire delay to logic delay increase significantly. Therefore, it has become necessary to find cell locations that minimize not only the total wirelength but also the wire delays of the timing-critical circuit paths. This setup has resulted in a two-stage placement process, where the first placement stage targets the total wirelength (or track density) minimization, whereas the second placement stage targets circuit timing optimization (or at least mitigation of timing violations) in an ad hoc manner. Such a two-stage placement process cannot generate a globally optimal solution in terms of the maximum clock frequency [75]. Timing driven placement techniques may be divided into two groups: (i) global techniques that focus on reducing the length of the timing-critical nets in the design, through methods such as net-weighting or delay budgeting, and (ii) incremental techniques that modify the locations of cells on timing-critical paths, aiming to reduce timing metrics such as the total and worst negative slack [76][77][78][79]. Generally speaking, path-based techniques that directly target a set of timing-critical paths have proven to be more effective than net-weighting techniques. However, they suffer from the need to deal with a large number of critical paths in circuits (it is well known that the number of circuit paths is exponential in the circuit size) and the fact that the timing criticalities of circuit paths tend to change as a result of the placement itself, resulting in a "timing convergence" issue. State-of-the-art global placement algorithms first generate a dense placement in which the total wirelength has been minimized and subsequently improve the circuit timing, generally resulting in an increase in the total wirelength.
The rationale is that although in the initial stages the timing information corresponding to interconnect lengths tends to be inaccurate, in the final stages, where the relative ordering of the cells is determined and cell overlaps are removed, a more accurate estimation of the circuit timing (including a list of timing-critical paths in the circuit) can be obtained and used. In this work, we present a timing driven global placement algorithm targeting superconductive electronic circuits. The proposed approach minimizes the total wirelength of the circuit while targeting a specific clock frequency by imposing hard constraints on the maximum interconnect length of all the nets on the timing-critical paths of the circuit. This is achieved by utilizing the powerful framework of the alternating direction method of multipliers (ADMM) [21]. Precisely, the proposed timing driven placement algorithm (TDP-ADMM) converts the placement problem with constraints on the path delays into an unconstrained problem by decomposing the timing driven problem into two sub-problems, one optimizing the total wirelength and the other optimizing the circuit timing. Through an iterative process, the solutions of the two sub-problems are modified to close the gap between a wirelength driven and a timing driven placement. With a small number of ADMM steps performed during the final stages of the global placement, the proposed algorithm generates a high-quality solution in terms of the total wirelength that meets the timing requirements in terms of a target clock frequency. The primary contributions of this chapter can be summarized as follows.
• We present a timing driven global placement algorithm that generates a high-quality solution in terms of the total wirelength while meeting timing constraints.
• We present a heuristic delay-budgeting algorithm that transforms the target frequency of the circuit into a set of upper-bound values for the wirelength of each net in the design.
• The proposed algorithm improves the worst and total negative (setup) slack for seven benchmarks by 26% and 44%, on average, with a 1.98% overhead in terms of the total wirelength.
The remainder of the chapter is organized as follows. Background material on ADMM and the prior art in timing driven placement are discussed in Section 3.2. The timing driven placement problem is formulated in Section 3.3. The proposed algorithm, along with the results of applying the TDP-ADMM method to multiple benchmark circuits, are detailed in Sections 3.4 and 3.5, respectively. The chapter is summarized in Section 3.6.

3.2 Preliminaries

3.2.1 Basics of ADMM

The alternating direction method of multipliers (ADMM) is an extremely powerful algorithm for solving constrained optimization problems, especially those that are difficult to solve directly or have non-convex and combinatorial constraints [21]. It decomposes an optimization problem into two sub-problems, solves each sub-problem separately, and, through an iterative process, ensures that the solutions of the two sub-problems converge to a high-quality (feasible) solution. ADMM can be employed for both convex and non-convex problems, even with combinatorial constraints [21]. Generally, ADMM solves an optimization problem of the following form:

$$\min_{x,z} \; f(x) + g(z) \quad \text{s.t.} \quad Ax + Bz = c \qquad (3.1)$$

where f and g are convex functions, $x \in \mathbb{R}^{n_1 \times 1}$, $z \in \mathbb{R}^{n_2 \times 1}$, $A \in \mathbb{R}^{m \times n_1}$, $B \in \mathbb{R}^{m \times n_2}$, and $c \in \mathbb{R}^{m \times 1}$.
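Before walking through the augmented-Lagrangian derivation below, the following toy instance of (3.1) illustrates the alternating updates concretely. It takes $f(x) = \frac{1}{2}\|x - a\|^2$, $g$ the indicator of a box, $A = I$, $B = -I$, $c = 0$, and uses the scaled dual variable $u = \lambda/\rho$; this is purely illustrative and is not the TDP-ADMM decomposition developed later in this chapter.

```python
import numpy as np

def admm_box_constrained_ls(a, lo, hi, rho=1.0, iters=100):
    """Toy ADMM for min_x 0.5*||x - a||^2 s.t. lo <= x <= hi, written in the
    form of (3.1) with f(x) = 0.5*||x - a||^2, g = indicator of the box,
    A = I, B = -I, c = 0 (scaled dual variable u = lambda / rho)."""
    x = np.zeros_like(a)
    z = np.zeros_like(a)
    u = np.zeros_like(a)
    for _ in range(iters):
        x = (a + rho * (z - u)) / (1.0 + rho)   # x-update: smooth sub-problem
        z = np.clip(x + u, lo, hi)              # z-update: projection onto the box
        u = u + x - z                           # dual update
    return z

# The result matches the direct projection of a onto the box.
a = np.array([-2.0, 0.3, 5.0])
print(admm_box_constrained_ls(a, lo=-1.0, hi=1.0))   # approximately [-1.0, 0.3, 1.0]
```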
Using the augmented Lagrangian function, the above problem can be broken down into two sub-problems, one of the form $f(x) + q_1(x)$ and the other of the form $g(z) + q_2(z)$, where $q_1$ and $q_2$ are quadratic functions. Finally, by updating the Lagrangian multipliers, the solutions x and z are modified such that the algorithm converges to a globally optimal solution. Assume a generic convex constrained optimization problem:

$$\min_x \; f(x) \quad \text{s.t.} \quad x \in X \qquad (3.2)$$

ADMM transforms this problem into an unconstrained problem using the indicator function defined as follows:

$$I_X(x) = \begin{cases} 0 & x \in X \\ +\infty & x \notin X \end{cases} \qquad (3.3)$$

Consequently, problem (3.2) is transformed into the problem $\min_x f(x) + I_X(x)$. ADMM introduces an auxiliary variable z and decomposes the objective function into two parts:

$$\min_{x} \; f(x) + I_X(z) \quad \text{s.t.} \quad x = z$$

Finally, the Lagrangian multipliers are added to the problem to remove the constraints, and the problem becomes minimizing the augmented Lagrangian function $L_\rho$:

$$L_\rho(x, z, \lambda) = f(x) + I_X(z) + \langle \lambda, x - z \rangle + \frac{\rho}{2}\|x - z\|_2^2 \qquad (3.4)$$

where $\lambda$ is the Lagrangian multiplier and $\rho$ is a positive scalar multiplying a quadratic penalty term that increases the objective function when the x and z variables differ from one another. The problem is solved iteratively for x and z, until the solution cannot be further improved. At each iteration r, the first sub-problem minimizes $L_\rho(x, z^{r-1}, \lambda^{r-1})$, using the z and $\lambda$ values obtained in the previous iteration (i.e., $z^{r-1}$ and $\lambda^{r-1}$). The second sub-problem solves $L_\rho(x^r, z, \lambda^{r-1})$ with the updated x values, i.e., $x^r$. Finally, the $\lambda$ values are updated.

3.2.2 Prior Work on Timing Driven Placement

Ref. [80] presents an ADMM-based placement algorithm that considers fogging and proximity effects as well as the total wirelength. This approach does not target a specific bound for the fogging or proximity measures. A global placement algorithm is proposed in [81] that uses the augmented Lagrangian function to transform the bin density constraints into terms of the objective function and solves an unconstrained problem. However, this approach does not adopt the ADMM algorithm. A wide variety of algorithms for incremental timing driven placement (ITDP) have been proposed in [76][77][78][79][82]. Recently, the authors of [83] have proposed an ITDP algorithm that relocates and replicates latches to improve the timing of circuits with a small overhead in terms of the added area. An ITDP approach is presented in [84] that uses Lagrangian relaxation-based timing optimization, local moves, and flip-flop (FF) clustering to iteratively improve the timing of the circuit. All the previous methods focus on optimizing the timing of the circuit with ad hoc optimizations and postpone the timing adjustments until the final stages of the RTL-to-GDSII flow, which limits their effectiveness in achieving a highly optimized solution in terms of timing or total wirelength.

3.3 Problem Definition

To formulate the timing driven placement problem, we model a netlist as a graph G whose nodes (V) and edges (E) correspond to the cells and nets in the netlist, respectively. The timing driven global placement is formulated as:

$$\min \; \sum_{e \in E} WL(e; x, y) + \eta^T B(x, y) \qquad (3.5)$$

$$\text{s.t.} \quad D(p; e; x, y) \le T_{\min}, \quad \forall p \in P \qquad (3.6)$$

where the function WL(e; x, y) estimates the wirelength of each edge e, as a function of the x and y coordinates of the nodes connected by that edge, using the weighted-average model [85], similar to state-of-the-art approaches such as [86][87][88].
3.3 Problem definition

To formulate the timing driven placement problem, we model a netlist as a graph G whose nodes (V) and edges (E) correspond to cells and nets in the netlist, respectively. The timing driven global placement is formulated as:

\min_{x,y} \; \sum_{e \in E} WL(e; x, y) + \eta^{T} B(x, y) \qquad (3.5)
\text{s.t.} \quad D(p; e; x, y) \le T_{min}, \quad \forall p \in P \qquad (3.6)

where the function WL(e; x, y) estimates the wirelength of each edge e, as a function of the x and y coordinates of the nodes connected by that edge, using the weighted-average model [85], similar to state-of-the-art approaches such as [86][87][88]. The weighted-average wirelength model calculates the wirelength of an edge e_k in the horizontal direction as follows:

WA_{e_k}(x) = \frac{\sum_{v_i \in e_k} x_i e^{x_i/\gamma}}{\sum_{v_i \in e_k} e^{x_i/\gamma}} - \frac{\sum_{v_i \in e_k} x_i e^{-x_i/\gamma}}{\sum_{v_i \in e_k} e^{-x_i/\gamma}} \qquad (3.7)

where \gamma is a smoothing parameter [85]. The weighted-average function is smooth and convex and is therefore a suitable option for modeling the wirelength [85]. To generate a legal solution, the layout area is divided into grid bins of equal size and the placement algorithm aims to keep the occupation density of each grid bin below a predefined threshold, similar to [86]. In the above formulation, B(x, y) is a penalty term on the density of the grid bins which controls the overlaps, thereby generating a nearly legal placement solution. D(.) calculates the delay of each timing path (a path starts at a primary input or the output of one DFF and ends at the input of another DFF or a primary output) as a function of the gate delays along the path and the wire delays of the edges on that path. P and T_min denote the set of timing paths and the minimum clock cycle time for the circuit, respectively. Note that the wire delay of an edge is a function of the locations of the gates connected by the edge (in our formulation, the wire delay is calculated as a technology-specific constant multiplied by the half-perimeter length of the bounding box of the corresponding edge).

In the above formulation, if we assume that the gate delays are scalar values that are independent of the wire loads, then the delay of a path can be calculated as a weighted summation of the lengths of all connections (edges) on that path. Each edge length involves an absolute-value calculation (i.e., an HPWL calculation), which can be approximated and made into a smooth function using equation (3.7). One can use any mathematical programming package to solve the aforesaid constrained nonlinear optimization problem. However, in practice, this approach cannot be used for CMOS logic circuits. The reason is that the number of paths in a CMOS logic circuit is, in general, an exponential function of the number of logic gates in that circuit, which implies an exponential number of constraints and hence an intractable problem. One solution to this problem is to limit the number of paths for which explicit timing constraints are included in the above mathematical programming problem. This can be done, for example, by performing a static timing analysis on an un-placed CMOS circuit (or even on a circuit placement obtained by solving (3.5) without the timing constraints of (3.6)) to identify the potential timing-critical circuit paths and then solve (3.5) s.t. (3.6), where the timing constraints are written only for the said timing-critical paths. Additionally, the depth of each timing-critical path (i.e., the number of gates on a path) is still a large number. Consequently, as pointed out in previous work [76] and based on our own experience, the timing constraint set is highly degenerate, which tends to result in standard nonlinear program solvers floundering for many steps without being able to improve the cost function. Fortunately, in our target technology (which is a gate-level pipelined circuit, see Section 3.4.3), the situation we encounter is one where each combinational path in the circuit consists of wires and at most a few combinational splitters. Consequently, the number of paths in the circuit is linear in the number of logic gates and each path delay is simply the sum of a small number of wire delays.
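As an aside, the weighted-average estimate of (3.7) for one net can be evaluated directly. The sketch below is illustrative only: the pin coordinates and the value of gamma are made-up, and the max/min shift before exponentiation is a numerical-stability detail not discussed above.

// Illustrative evaluation of the weighted-average (WA) wirelength model of Eq. (3.7)
// for a single net in the horizontal direction; gamma and the pin coordinates are assumed.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

double wa_wirelength_x(const std::vector<double>& x, double gamma) {
    double xmax = x[0], xmin = x[0];
    for (double xi : x) { xmax = std::max(xmax, xi); xmin = std::min(xmin, xi); }
    double np = 0, dp = 0, nn = 0, dn = 0;
    for (double xi : x) {
        double ep = std::exp((xi - xmax) / gamma);   // shifted exponentials cancel in the ratios
        double en = std::exp((xmin - xi) / gamma);
        np += xi * ep;  dp += ep;
        nn += xi * en;  dn += en;
    }
    return np / dp - nn / dn;                        // smooth approximation of max(x) - min(x)
}

int main() {
    std::vector<double> pins = {12.0, 47.5, 31.0};   // assumed pin x-coordinates (um)
    std::printf("WA estimate: %.3f  (exact HPWL_x: 35.5)\n", wa_wirelength_x(pins, 1.0));
    return 0;
}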
In this case, the number of constraints in (3.6) is linear, and the number of absolute-value terms in each constraint is bounded by the maximum depth of any splitter tree in the circuit (which is typically a small number) plus one. This then allows us to replace equation (3.6) with the following equation:

D(p; e; x, y) \le u_p, \quad \forall p \in P, \; u_p \in U \qquad (3.8)

where u_p denotes the upper-bound delay on circuit path p, which is in turn calculated as follows:

u_p = T_{min} - t_{c2Q,drvr} - \sum_{s \in \text{spls on } p} d(s) - t_{setup,rcvr} - t_{clk,skw} \qquad (3.9)

where d(s) denotes the delay through a splitter. Let us define U = \{u_1, u_2, \ldots, u_{|P|}\} as the set of upper-bound delays for all timing paths in the circuit.

In solving (3.5), one of the challenges is to calculate the maximum length for each edge as a function of the target frequency of the circuit, the number of edges on each path from one clocked logic cell (or D flip-flop) to another, and the degree of each edge. To transform the constrained optimization problem of (3.5) into an unconstrained problem, we adopt the powerful mathematical optimization framework of the alternating direction method of multipliers (ADMM) [21]. Additionally, we use a net upper-bounding algorithm to assign upper-bound limits to the net lengths on timing-critical paths. In this way, the setup time constraints are satisfied and the total wirelength of the circuit is minimized, simultaneously.

3.4 Proposed solution

3.4.1 Net Upper-bounding

To address the delay-budgeting problem, we employ a heuristic algorithm. First, all the timing paths and their corresponding edges are detected. For a target frequency f, an upper-bound value is found which corresponds to the maximum delay of each timing path, which may include multiple nets and combinational logic. Using a linear delay model (i.e., a path-length delay model in which the delay of each net is a linear function of its HPWL), the upper-bound delay value for each path is transformed into an upper-bound value on the summation of the half-perimeter lengths of the bounding boxes of the nets on that timing path. To decompose the delay of each path into the delays corresponding to horizontal and vertical wire lengths, we divide the delay equally between the horizontal and vertical segments. Note that although this approach does not consider the initial placement of the gates and the distribution of delays over horizontal and vertical segments (i.e., approaches similar to [79]), the proposed solution assigns the maximum budget to both the horizontal and vertical segments of each net. This approach facilitates the projection of the placement solution onto the feasible region, set by the maximum length of each edge, satisfying all the timing constraints. Finally, this approach enables a uniform distribution of the total wirelength in the x and y directions, which facilitates routing. If there are n nets on a timing path p with a maximum length l, the maximum length for each net is set to be l/n. If a net is on multiple timing paths, to ensure all the timing constraints are met, the minimum of all the upper-bound values for that net is imposed as its maximum length. Accordingly, a set of constraints on the delays of all the paths is formed.
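A minimal sketch of this budgeting heuristic is shown below. The container types, the example paths, and the per-path length budgets are invented for illustration and do not reflect the tool's actual data structures; the even horizontal/vertical split follows the description above.

// Illustrative net upper-bounding: each path's length budget is split evenly over its
// nets, and a net that lies on several paths keeps the tightest (minimum) bound.
#include <algorithm>
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Each path: list of net names plus an upper-bound length (um) derived from Eq. (3.9)
    // via the linear (length-proportional) wire-delay model; all values are made up.
    std::vector<std::pair<std::vector<std::string>, double>> paths = {
        {{"n1", "n2", "n3"}, 900.0},
        {{"n2", "n4"},       400.0},
    };
    std::map<std::string, double> bound;            // per-net maximum HPWL
    for (const auto& p : paths) {
        double per_net = p.second / p.first.size(); // l / n
        for (const auto& net : p.first) {
            auto it = bound.find(net);
            if (it == bound.end()) bound[net] = per_net;
            else it->second = std::min(it->second, per_net);
        }
    }
    for (const auto& kv : bound)                    // split evenly into x and y budgets
        std::printf("%s: max HPWL %.1f (x: %.1f, y: %.1f)\n",
                    kv.first.c_str(), kv.second, kv.second / 2, kv.second / 2);
    return 0;
}

Here net n2 lies on both paths and therefore keeps the tighter 200 um bound rather than the 300 um it would receive from the first path alone.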
3.4.2 ADMM-based Timing Driven Placement

To solve the timing driven placement problem (3.5) in the horizontal direction, we use the indicator function to move the constraints into the objective function and form the augmented Lagrangian function:

L_\rho(x, z, \lambda) = f(e; x) + I_U(WL(e; z)) + \langle \lambda, x - z \rangle + \frac{\rho}{2}\,\|x - z\|_2^2 \qquad (3.10)

where x denotes the horizontal coordinate vector of the logic cells (i.e., nodes in the circuit graph), z is an auxiliary vector, f(x) is the objective function in problem (3.5), and \langle \cdot,\cdot \rangle denotes the dot product of the two operand vectors. Using the proposed net upper-bounding algorithm, separate constraints for the maximum wirelength of nets in the horizontal and vertical directions are produced. Thereby, the placement problem in the vertical direction can be formed similarly to determine the y coordinates of all the nodes, independently of their x coordinates. The steps to minimize the augmented Lagrangian function (3.10) are:

x^{r+1} = \arg\min_{x} \; f(e; x) + \frac{\rho}{2}\,\|x - z^r + \rho^{-1}\lambda^r\|_2^2 \qquad (3.11)
z^{r+1} = \arg\min_{z} \; \|x^{r+1} + \rho^{-1}\lambda^r - z\|_2^2 \quad \text{s.t.} \quad WL(e; z) \le U \qquad (3.12)
\lambda^{r+1} = \lambda^r + \rho\,(x^{r+1} - z^{r+1}) \qquad (3.13)

The first sub-problem (i.e., (3.11)) can be solved using standard global placement algorithms utilizing techniques such as stochastic gradient descent (SGD) or Nesterov's algorithm [86][87][88]. Note that the second term in the objective function of (3.11) is convex. The first sub-problem minimizes the total wirelength and the maximum bin density similar to (3.5), with the difference that it tries to generate x coordinates that are close to the z coordinates so as to also minimize the second term in the objective function.

Solving the second sub-problem (i.e., (3.12)) ensures that the placement solution corresponding to the z vector meets the timing constraints. We transformed the indicator function (i.e., I_U) in (3.10) back into a set of constraints on the lengths of the nets. Essentially, sub-problem (3.12) solves a minimum perturbation problem that generates a z vector with minimal changes to the x^r + \rho^{-1}\lambda^r vector and projects this vector into the feasible region of wire lengths. Note that although a single iteration of ADMM generates a solution (i.e., a z vector) that meets the timing requirements, the solution may be sub-optimal in terms of the total wirelength or the maximum bin density. Therefore, multiple iterations of the ADMM algorithm ensure that a high-quality solution in terms of both the total wirelength and timing is obtained.

For estimating the wirelength of each net, i.e., WL(e; z), there are two possible cases. (i) A net is connected to two cells (pins) i and j; then WL(e; z) = |z_i - z_j| and each constraint in (3.12) can be decomposed into two separate constraints:

z_i - z_j \le u_e \qquad (3.14)
-u_e \le z_i - z_j \qquad (3.15)

(ii) A net is connected to more than two pins. In such a case, we use the weighted-average function to estimate the wirelength [85]. Additionally, we can use Lagrangian multipliers to move the constraints into the cost function and solve an unconstrained problem. To accelerate the solution of sub-problem (3.12), one can use algorithms such as gradient descent (GD) to generate an approximate solution with a small degradation in the objective function, while meeting the wirelength constraints. Note that if there are no feasible solutions to sub-problem (3.12), we should decrease the target clock frequency and modify the set of upper-bound values, i.e., U, accordingly.
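For intuition, the projection performed by (3.12) can be illustrated on a single two-pin net in isolation: if |w_i - w_j| exceeds the bound u_e, the two endpoints are pulled symmetrically toward each other by just enough to satisfy (3.14)-(3.15). The sketch below shows only this per-net step with made-up coordinates; in the actual flow the nets share cells, so the coupled minimum-perturbation problem is solved as one quadratic program rather than net by net.

// Illustrative minimum-perturbation projection for one isolated two-pin net:
// given w = x + lambda/rho, find the closest (z_i, z_j) with |z_i - z_j| <= u_e.
#include <cmath>
#include <cstdio>

void project_two_pin(double& zi, double& zj, double ue) {
    double d = zi - zj;
    if (std::fabs(d) <= ue) return;                  // already feasible, no change
    double excess = (std::fabs(d) - ue) / 2.0;       // move both endpoints equally
    if (d > 0) { zi -= excess; zj += excess; }
    else       { zi += excess; zj -= excess; }
}

int main() {
    double zi = 180.0, zj = 20.0, ue = 100.0;        // assumed coordinates (um) and bound
    project_two_pin(zi, zj, ue);
    std::printf("z_i = %.1f, z_j = %.1f, |z_i - z_j| = %.1f\n", zi, zj, std::fabs(zi - zj));
    return 0;
}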
To find the maximum frequency at which the circuit works, one only needs to perform a binary search on the target frequency and, for each target frequency, check the feasibility of sub-problem (3.12). This can be done efficiently, before the global placement algorithm is solved, by solving sub-problem (3.12) with the objective function set to a constant value.

The third sub-problem updates the \lambda values based on the obtained x and z vectors to enforce the convergence of the algorithm. If x_i^{r+1} and z_i^{r+1}, corresponding to the coordinates of node i, are close to each other, then \lambda_i^{r+1} will be close to \lambda_i^r. Otherwise, by updating \lambda_i^{r+1} we try to minimize the divergence of the x_i and z_i values in the next iteration of the algorithm. The stopping criterion for the TDP-ADMM algorithm is defined as follows:

\|x^{r+1} - x^{r}\|_2^2 \le \epsilon, \quad \|z^{r+1} - z^{r}\|_2^2 \le \epsilon \qquad (3.16)

where \epsilon is a small convergence threshold. Alternatively, one can monitor the gap between the x and z solutions in terms of the total wirelength and terminate the algorithm once the gap drops below a certain threshold. In our experience, 20-40 iterations of the ADMM algorithm are enough to generate high-quality solutions in terms of the total wirelength, maximum bin density, overflow, and the critical path delay.

Intuitively, in the z-update step (cf. sub-problem (3.12)) the algorithm contracts certain nets to satisfy the timing requirements and modifies the corresponding z elements with minimal changes to the x vector. On the other hand, the x-update step (cf. sub-problem (3.11)) generates an optimized solution in terms of the total wirelength and bin density, which is similar to the z solution but may violate the timing constraints. Over time, the critical path delay corresponding to the x vector should converge to that of the z vector. Similarly, the total wirelength corresponding to the z vector should converge to that of the x vector. In the final step of the TDP-ADMM algorithm, if the z vector is chosen as the final solution, the timing requirements are met, yet the solution may be sub-optimal in terms of the global placement metrics. Alternatively, the x vector gives an optimized solution in terms of the global placement metrics but with negative timing slacks. At this step, since the output placement corresponding to the x or z vectors is not legalized yet, a timing driven legalization algorithm may be utilized to ensure that the final legal solution also satisfies the timing requirements.

3.4.3 Application of TDP-ADMM to SFQ Placement

Single flux quantum (SFQ) technology is one of the most promising replacements for CMOS technology, with advantages such as ultra-fast picosecond-range switching speed and ultra-low energy consumption, with a promise of at least an order of magnitude improvement in terms of both clock frequency and energy consumption [89]. To fully exploit the extraordinary characteristics of the SFQ logic family, including rapid single flux quantum (RSFQ) and energy-efficient rapid single flux quantum (ERSFQ) [89][90], it is crucial to develop a timing driven global placement framework that optimizes the timing of circuits while minimizing the total wirelength. Some of the unique characteristics of SFQ logic, enumerated below, motivate rethinking and reformulating the timing driven global placement problem for SFQ circuits: (i) The delay of the passive transmission lines (PTLs) used as interconnects in SFQ circuits is a linear function of their length [16]; therefore, a linear wire-length model captures the wire propagation delay in SFQ circuits very well.
(ii) As a result of ultra-deep gate-level pipelining in SFQ circuits, a signal path is defined between two consecutive clocked logic gates and comprises only the driving logic cell, possibly some splitters, point-to-point connections, and the receiving logic cell. As a result, the number of paths in the circuit is simply the number of primary inputs plus the number of clocked logic cells, plus the number of splitters. (iii) SFQ circuits require full path-balancing (i.e., each logic cell that provides an input signal to another logic cell at sequential level d is exactly at level d-1). In SFQ circuits, the logic depth between clocked gates is much lower than in CMOS, and thus the longest wirelength (typically implemented with PTLs) will largely determine whether the target >50 GHz clock frequency can be achieved. (iv) Given current cell libraries, physical synthesis of the designs, including gate resizing, is not a viable option. In SFQ technology, buffer insertion adds DFFs to the circuit, which increases the total number of nodes and the area significantly due to the full path-balancing requirement. Accordingly, a timing driven placement that optimizes the maximum path delay is the most effective approach to maximizing the performance of circuits in terms of clock frequency.

Due to the aforementioned characteristics of the SFQ family, all the nets in an SFQ netlist are two-pin nets. Therefore, assuming an edge e to be connected to cells i and j, the wirelength in the horizontal direction is calculated as WL(e; x) = |x_i - x_j|. Accordingly, sub-problem (3.12) is an instance of quadratic programming with linear constraints, which can be solved efficiently using state-of-the-art quadratic programming solvers.

3.5 Simulation Results

We added support for the TDP-ADMM algorithm to the open-source global placement tool DREAMPlace [88] and used the IBM CPLEX v12.8 package for solving the quadratic programming (QP) problems [91]. As the global placement algorithms generate a nearly legal solution, we used the FastPlace 3.0 detailed placement tool and Capo for legalization [92][93]. For each benchmark, chosen from the EPFL benchmark suite [94], we used the qSyn tool for logic synthesis and the qSTA tool for performing the static timing analysis after a legal solution is generated and before the clock tree is synthesized [95][96]. Table 3.1 lists the benchmarks, the total number of cells, the number of clocked logic gates, the number of DFF-to-DFF paths, the target clock period, the half-perimeter wirelength (HPWL), and the total and worst negative (setup) slack for all the benchmarks. The target clock period for each design is set as the delay of the longest critical path after logic synthesis and prior to placement. We compare our results to those of DREAMPlace [88]. We use the same set of configurations for both the proposed and baseline designs (i.e., target overflow, number of grid bins, target density). In our experiments, we set \rho = 1e-6. We set the target overflow and target density values to 0.07 and 1.0, respectively. For our proposed method, we start the ADMM steps when the target overflow is reached and continue until the stopping criteria are met, for a maximum of 40 ADMM iterations. We run the DREAMPlace placer [88] for the same number of iterations to have a fair comparison. As shown in Table 3.1, the average worst negative slack (WNS) and the average total negative slack (TNS), over seven benchmarks, are reduced by 26% and 44%, respectively, compared with the baseline solution.
The increase in the average wirelength is 1.98%. Additionally, among the seven benchmarks, the maximum improvements in WNS and TNS are 34.9% and 47.4%, respectively. Note that in some of the benchmarks, the maximum number of splitters on a path is large, which results in a large post-synthesis clock period. Moreover, the large size of the logic cells in the current cell library (the area of each AND gate is about 50 μm × 120 μm) increases the layout area and the maximum wirelength significantly, which yields large negative slacks after the placement step. It should be mentioned that in all of our experiments, the generated z solution, prior to the detailed placement and legalization steps, meets the target (post-placement) clock frequency.

The proposed approach can be further extended to also minimize the negative hold slacks. SFQ circuits require all input-to-output paths to have the same sequential depth (i.e., they require full path-balancing). Therefore, a large number of DFF-to-DFF paths containing only a single wire are created in the netlist, which are prone to hold-time (i.e., short-path) violations. A timing driven placement approach that considers hold-time constraints during placement can generate a solution where the distance between any pair of directly connected DFFs is larger than the required hold time, eliminating the need for inserting timing-fix buffers during the timing closure flow.

Table 3.1: Experimental results for seven benchmarks from the EPFL benchmark suite [94]. HPWL, WNS, and TNS are reported for the DREAMPlace [88] baseline and the proposed TDP-ADMM, together with the relative improvement (%).

Benchmark | # Paths | # Cells | # Clocked Cells | Clk Period (ps) | Baseline HPWL | Baseline WNS (ps) | Baseline TNS (ps) | TDP-ADMM HPWL | TDP-ADMM WNS (ps) | TDP-ADMM TNS (ps) | HPWL impr. (%) | WNS impr. (%) | TNS impr. (%)
dec | 1264 | 1521 | 512 | 102.0 | 8.12E+05 | -84.7 | -3219.5 | 9.40E+05 | -63.1 | -2484.7 | -15.75 | 25.5 | 22.8
cavlc | 2301 | 2313 | 1559 | 91.7 | 1.14E+06 | -70.5 | -705.4 | 1.13E+06 | -45.9 | -501.5 | 1.02 | 34.9 | 28.9
priority | 13755 | 13764 | 12859 | 26.0 | 2.68E+06 | -80.1 | -5103.2 | 2.81E+06 | -78.9 | -4896.5 | -5.01 | 1.5 | 4.1
voter | 24853 | 24855 | 16806 | 32.0 | 2.70E+06 | -71.8 | -3091.9 | 2.95E+06 | -61.1 | -2689.9 | -9.11 | 14.9 | 13.0
adder | 27329 | 27459 | 26112 | 30.0 | 5.62E+06 | -95.7 | -3697.1 | 5.68E+06 | -94.3 | -1943.7 | -0.98 | 1.5 | 47.4
sin | 29856 | 29882 | 29832 | 187.0 | 2.05E+07 | -215.4 | -12496.8 | 2.21E+07 | -197.8 | -7163.6 | -7.87 | 8.2 | 42.7
max | 66463 | 66594 | 62661 | 246.0 | 4.01E+07 | -588.6 | -48799.6 | 3.74E+07 | -393.6 | -31922.4 | 6.68 | 33.1 | 34.6
Average | - | - | - | - | 1.30E+07 | -192.1 | -18107.0 | 1.32E+07 | -140.4 | -10089.7 | -1.98 | 26.9 | 44.3

3.6 Summary

In this chapter, we presented a novel timing driven global placement algorithm (called TDP-ADMM) that optimizes the total wirelength of circuits and reduces the critical path delays simultaneously. This is achieved by using the powerful ADMM framework in an iterative process in which two sub-problems are solved to close the gap between two placement solutions, one minimizing the total wirelength and one minimizing the critical path delays. We also presented a heuristic delay-budgeting algorithm that transforms the path-delay requirements into maximum net-length constraints. Using a small number of ADMM iterations, the TDP-ADMM algorithm generates solutions that outperform those obtained by a state-of-the-art global placement tool, on average, by 26% and 44% in terms of the worst and total negative (setup) slacks, respectively. The overhead of the proposed solution in terms of the average wirelength is 1.98%.
Chapter 4

Hybrid Clock Networks: Clocking Structures and Algorithms for Placement

4.1 Overview

In this chapter, we present novel ideas for generating clock network topologies as well as efficient and enhanced techniques for the placement of both logic and clock elements realizing the said clock networks for superconducting circuits. The rest of this chapter is organized as follows. Our SFQ design methodology, including placement, different clock tree synthesis methods, and signal routing, is introduced and discussed. The subsequent subsections present analytical models for calculating the maximum clock frequency of different clocking schemes, accompanied by simulation results.

4.2 Preliminaries

4.2.1 Standard Cell

Our proposed placement algorithm utilizes a row-based methodology. That is, the chip is partitioned into several rows for placing standard cells. Each SFQ standard cell in our library is composed of two parts (cf. Fig. 4.1): (i) a logic design part (logic part for short), which implements a Boolean function such as AND, OR, INV, etc., and (ii) a built-in clock distribution part (clock part for short), which is placed above the logic part. The clock part contains a splitter which provides the clock signal to the corresponding logic part and also passes the clock pulse to the next cell. When the logic part does not need a clock signal (e.g., splitter or merger cells), the clock part implements a JTL to pass the clock pulse to the next cell. As a side note, if the cells in a row are sorted based on their logic level, the built-in clock part can distribute the clock signal to all cells in that row without requiring a PTL.

Figure 4.1: A sample standard cell composed of clock and logic parts (inputs, output, clock-in, and clock-out pins; the clock part contains a splitter or JTL). The clock part has different templates, which are shown in Fig. 4.2.

For the clock part, five different templates exist, which are shown in Fig. 4.2. This built-in clock is used to linearly (sequentially) propagate the clock pulse to all cells in a group. Accordingly, in a group with k cells, from left to right, Fig. 4.2(c), Fig. 4.2(a), and Fig. 4.2(e) are picked for the clock parts of the first cell, the k-2 intermediate cells, and the last cell, respectively. Fig. 4.2(d) shows the clock part for the special case where k = 1 (i.e., when a group has only one cell). Furthermore, we do not use JTLs for data signal routing, since (i) JTLs require JJs and hence occupy the active layer, which in turn complicates the placement problem, and (ii) JTLs are slower than PTLs, especially for long-distance communications. Therefore, data nets are routed using PTLs.

Figure 4.2: Five templates, (a)-(e), for the clock part of our standard cells (pass-through and end-of-line templates, built from JTLs, splitters, and JTL/PTL pins).

4.3 Clock Network Topologies

Clock network synthesis is a process that makes sure that the clock signal is properly distributed to all sequential elements in a circuit. Clock routing is sometimes performed before signal routing to avoid competition for resources occupied by signal nets. In SFQ circuits, a D flip-flop (DFF) is attached to each cell (except for splitters and mergers). As a result, the clock tree network is significantly larger in SFQ circuits compared with that of CMOS circuits. Accordingly, clock tree synthesis must be given a very high priority in SFQ designs. In this section, we propose methods for synthesizing a clock network for SFQ circuits.
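Before describing the clock structures, the clock-part template-selection rule of Section 4.2.1 can be made concrete with a small sketch; the enum identifiers below are invented for illustration and do not correspond to actual library cell names.

// Illustrative selection of the clock-part template (cf. Fig. 4.2) based on the position
// of a cell inside a k-cell group; the ClockTemplate identifiers are made up here.
#include <cstdio>

enum class ClockTemplate { A, C, D, E };             // (a) pass-through, (c) first cell, (e) last cell, (d) single-cell group

ClockTemplate pick_template(int position, int k) {   // position is 1-based, 1 <= position <= k
    if (k == 1) return ClockTemplate::D;             // a group containing only one cell
    if (position == 1) return ClockTemplate::C;      // first cell of the group
    if (position == k) return ClockTemplate::E;      // last cell: end-of-line template
    return ClockTemplate::A;                         // the k-2 intermediate cells: pass-through
}

int main() {
    const int k = 4;
    for (int p = 1; p <= k; ++p)
        std::printf("cell %d -> template %c\n", p,
                    "ACDE"[static_cast<int>(pick_template(p, k))]);
    return 0;
}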
Our adopted row-based design methodology not only simplifies the placement algorithm but also results in a well-structured clock tree network. In our row-based design, standard cells are placed in fixed-height rows. Clock splitters are placed in the routing channel, i.e., the empty space between the rows. Note that in a circuit with n cells (where n = 2^m), there is a total of n-1 clock splitters (assuming a balanced clock tree). Therefore, placing the clock splitters in the channels between rows reduces the total width of the rows.

4.3.1 Clock Tree Structures

In this section we propose two clock tree structures for SFQ logic circuits.

Figure 4.3: Proposed placement of logic cells (inside rows) and clock splitters (between rows). Rows are shown in red.

4.3.1.1 H-tree

The H-tree clock structure propagates the clock signal to all the cells in the design. Such a network can be synthesized using zero-skew clocking methods (for a complete description, refer to Section 2.2.6.1). In this clocking structure, we add horizontal space between each pair of consecutive cells in a row. This spacing helps reduce the routing congestion in different parts of the chip and improves the routability of the circuit. An example of the H-tree clock structure is shown in Fig. 4.3. Using the H-tree clock structure, one may directly use the output of any CMOS global placement for synthesizing the clock tree. Since almost all SFQ cells require a clock signal, a huge H-tree network is needed, especially in large circuits, which in turn decreases the chip density (i.e., the portion of the chip used for logic cells). On the other hand, such a clock structure minimizes the skew, as the clock tick arrives almost simultaneously at all registers.

The minimum clock period (T_CLK) using the H-tree clock structure can be calculated using Equation 2.22 configured for SFQ logic as follows:

T_{CLK}^{min} = \max_{i,j \in V} \{ skew_{i,j}^{max} + t_{c2Q_i}^{max} + t_{comb}^{max} + t_{PTL}^{max} + t_{setup_j}^{max} \} \qquad (4.1)

where i and j are any two sequentially adjacent elements (i.e., flip-flops), t_{c2Q}^{max} denotes the maximum delay of any sequential gate (such as AND, OR, etc.), t_{comb}^{max} represents the maximum delay of the splitter (trees) between two sequential gates, and t_{PTL}^{max} accounts for the delay of the longest PTL between sequential elements. The setup time of the capturing FF is denoted by t_{setup}^{max}.

4.3.1.2 HL-tree

To alleviate the high cost of the clock network in large SFQ circuits, we present a novel clock network called the HL-tree. The HL-tree clock structure reduces the area dedicated to the clock network by grouping cells together and abutting the cells in each cell-group. After grouping at most k cells in each cell-group, the global clock is simultaneously transferred to the first cell of each group using a partial H-tree. Inside each group, the clock signal is propagated from the initial cell to the rest of the cells in that group using the built-in splitter on the top portion of the cells, similar to the concurrent-flow clocking structure (cf. Section 2.2.6.1). Using the proposed HL-tree structure, the total chip area is reduced since the horizontal spaces between cells in a group are removed. As a result, the total wirelength may also decrease as the total width of the chip is reduced. This is especially important if the maximum wirelength, which results in the longest interconnect delay, is decreased. However, due to the sequential distribution of the clock inside groups, a clock skew is introduced.
Hence, the number of cells in each group, k, creates a trade-off between chip area and maximum clock frequency. Essentially, this clock structure is a combination of an H-tree that globally propagates the signal to cell-groups and a skewed linear clock network (L-tree) that propagates the clock signal to cells within a group. Due to the fact that cells within a cell-group in an HL-tree network receive a skewed clock signal, constraints should be set to avoid race conditions (i.e., hold-time violations). An example of an HL-tree clock network is shown in Fig. 4.4.

Figure 4.4: Proposed HL-tree clock structure. The global clock propagates the clock signal to all cell-groups. The local clock propagates the clock signal to cells within a cell-group.

Note that for k = 1 the HL-tree is equivalent to an H-tree clock structure. The minimum clock period (T_CLK) of the HL-tree clock network is calculated as follows (cf. Fig. 4.5):

T_{CLK} = \max_{\substack{u,v \in G,\; i,j \in C,\; (i,j) \in N \\ u \ne v,\; i \ne j}} \{ skew_{u,v} + (p_{u,i} - 1)\,t_{split} + t_{c2Q_i} + t_{pd_{i,j}} + t_{setup_j} \} \qquad (4.2)

where G, C, and N denote the sets of all groups, cells, and nets, respectively, in the netlist, skew_{u,v} is the difference between the clock arrival times of the first cells in groups u and v, p_{u,i} is the position of cell i in group u (assuming that the first cell from the left in a group is in position 1), t_{split} denotes the delay of propagating the clock through the built-in clock part of a cell, which is equal to the splitter delay, t_{c2Q_i} represents the clock-to-Q delay of cell i, t_{pd_{i,j}} is the propagation delay (including the PTL receiver and transmitter delays) from cell i to cell j, and t_{setup_j} represents the setup time of cell j. The aforesaid equation calculates the propagation delays for all pairs of connected cells and returns the maximum value as the minimum achievable clock period. In Equation (4.2), t_{split} and t_{c2Q_i} are much smaller than t_{pd_{i,j}}, because t_{pd_{i,j}} is composed of the PTL transmitter and receiver delays as well as the propagation delay through the PTL wire, which depends on the longest path in the design, whereas t_{split} is a constant value, i.e., the propagation delay through a splitter cell. Hence, the bottleneck for the maximum achievable frequency is the longest path in the design, which should be reduced as much as possible. On the other hand, if a complete H-tree is used to propagate the clock signal to all cells in the design, p_{u,i} is equal to 1 for all cells, eliminating the term (p_{u,i} - 1)\,t_{split} from Equation (4.2).

Figure 4.5: The maximum clock frequency evaluation for the HL-tree clock network (launching cell i in group u, capturing cell j in group v).

The overall design flow of our algorithm for physical synthesis of the clock networks (called qCTS), shown in Fig. 4.6, is explained in detail in Chapter 5. qCTS generates a minimum-skew clock tree such that all the clock splitters are mapped to the routing channels between the placement rows (i.e., the logic cells are placed inside the rows and the clock splitters are placed between the rows). The presented methodology can be employed for both H-tree and HL-tree clock structures.

Figure 4.6: The overall flow of the proposed clock tree synthesis algorithm, qCTS (splitter insertion after logic synthesis; placement; clock tree topology generation; clock tree embedding; min-skew clock tree placement and legalization; STA; and timing closure).
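To make Eq. (4.2) concrete, the following sketch evaluates the minimum clock period for a toy netlist. The data structures (group arrival times, per-net propagation delays) and all delay values are invented for illustration, and the skew is taken as the launching group's arrival time minus the capturing group's, which is an assumed sign convention.

// Illustrative evaluation of Eq. (4.2) for a toy HL-tree design. Each net (i, j) carries a
// precomputed propagation delay t_pd; skew(u, v) is derived from the clock arrival times of
// the first cells of the launching and capturing groups. All numbers are assumed values (ps).
#include <algorithm>
#include <cstdio>
#include <vector>

struct Cell { int group; int pos; };                  // pos: 1-based position inside its group

int main() {
    const double t_split = 5.0, t_c2q = 6.0, t_setup = 3.0;
    std::vector<double> group_arrival = {0.0, 1.0};   // clock arrival at the first cell of each group
    std::vector<Cell> cells = {{0, 1}, {0, 2}, {1, 1}};
    struct Net { int src, dst; double t_pd; };
    std::vector<Net> nets = {{0, 2, 40.0}, {1, 2, 55.0}};
    double t_clk = 0.0;
    for (const Net& n : nets) {
        const Cell& i = cells[n.src];
        const Cell& j = cells[n.dst];
        double skew = group_arrival[i.group] - group_arrival[j.group];
        double period = skew + (i.pos - 1) * t_split + t_c2q + n.t_pd + t_setup;
        t_clk = std::max(t_clk, period);
    }
    std::printf("minimum clock period: %.1f ps\n", t_clk);
    return 0;
}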
4.4 Customized Placement Algorithms

A placement tool takes as input the dimensions and pin locations of each cell (gate) and their connections to each other. It then assigns cells to positions on the chip such that no two cells overlap with each other and a cost function (e.g., chip area, total wirelength, or critical path delay) is minimized. An important consideration in the placement problem is that the placement solution must be routable. Hence, to avoid multiple costly iterations between the placement and routing steps, routing-aware placement algorithms are of significant interest.

To simplify the design automation of the placement step, row-based placement techniques are widely used in the VLSI design community and semiconductor industry. These techniques place (fixed-height, but variable-width) cells in rows on the chip. Furthermore, power interconnects run horizontally through the top and bottom of the cells. Therefore, when cells are placed adjacent to each other, the power interconnects form two continuous parallel tracks in each row. Input and output pins of cells are available at the top and/or bottom sides of the cell and are connected by interconnects routed in the routing channels between adjacent rows. Connections from one row to another are made either through the surrounding ring or by using feed-through cells. A cell that complies with these features will be referred to as a standard cell (cf. Section 4.2.1).

To efficiently place netlists with millions of cells, we use the following three-step algorithm. (i) Global placement: the non-overlapping cell constraint is ignored in this step and approximate locations of cells are obtained by placing cells in global bins. The main focus of the global placer is to optimize the cost function by iterating between the solution of some mathematical program (e.g., a constrained quadratic or linear programming problem) and a cell-spreading (or bi-partitioning) step. (ii) Legalization: the output of the global placement must be legalized to remove any cell overlaps. (iii) Detailed placement: the legalized solution is further refined by using local adjustments such as cell movement or swapping to reduce wirelength while preserving the legality of the solution.

The global placement problem has been studied for more than 50 years, and it has played an important role in the overall design flow of integrated circuits [37]. A wide variety of objective functions have been defined and numerous multi-objective algorithms have been introduced. Reducing the total wirelength, which can be modeled by the half-perimeter wirelength (HPWL), along with routability and performance, are among the main objective functions. State-of-the-art wirelength driven placement algorithms can be categorized into two categories [37]: force-directed quadratic placers, such as SimPL [50], and non-convex optimization techniques, such as mPL6 [51]. Force-directed techniques model the net length by a quadratic function of the cell locations and try to minimize the total wirelength by solving a system of linear equations [50]. Non-convex optimization techniques solve the wirelength minimization problem using more sophisticated numerical analysis methods. Although non-convex techniques achieve higher-quality results compared with force-directed techniques, their run-time is significantly slower than that of their counterparts [50].

For the SFQ placement problem, we assume that the input netlist is path-balanced (in a path-balanced netlist, all paths from any primary input to any primary output have the same logical depth [97]; any netlist can be path-balanced by inserting D flip-flops (DFFs) on paths with a smaller depth) and all fan-outs are implemented with splitters.
Our placement tool then starts by running a global placement (using SimPL [50]) followed by detailed placement and legalization, which together generate a legal solution with the minimum total wirelength. To reduce the chip area for large SFQ circuits and enable automated placement and routing of such circuits, a row-based design methodology is presented in this report. More specifically, we present design methodologies based on fixed-height but variable-width logic cells, where these cells are placed in pre-defined rows on the chip to improve the packing efficiency and automation capabilities of layout tools. The approaches presented here are implemented within a tool called qPlace.

Here, we present two placement solutions compatible with the clock tree structures outlined in Section 4.3. The first solution is based on the H-tree clock network. As mentioned earlier, an H-tree clock network can be built on top of any CMOS-based global placement solution. The second solution creates a placement compatible with the HL-tree clock network. That is, we generate a placement solution in which cells are grouped together, and the clock is propagated to cell-groups of size k, rather than to all the cells in the design. The rest of this section is organized as follows. The proposed global placement algorithm is described in Section 4.4.1. The algorithms used for generating a placement solution compatible with the HL-tree clock structure are introduced and simulation results are reported in Section 4.4.2.

4.4.1 Global Placement

To implement a global placement solution, we adopt a methodology similar to the SimPL placement algorithm [50]. SimPL solves large and sparse systems of linear equations (formulated using force-directed placement) by the Conjugate Gradient method [98]. More specifically, the force-directed method reduces the placement problem to that of solving a set of simultaneous linear equations to determine equilibrium (i.e., zero-force) locations for cells based on a Hooke's law analogy. Next, cell spreading is performed by inserting additional forces that pull cells away from dense regions toward carefully placed anchors (pseudo-fixed pins). The main purpose of the global placement is to reduce the total interconnect length, which can be approximated by the half-perimeter wirelength (HPWL):

HPWL = HPWL_x + HPWL_y \qquad (4.3)

where

HPWL_x = \sum_{e \in E} \left( \max_{i \in e} x_i - \min_{i \in e} x_i \right) \qquad (4.4)

and HPWL_y is calculated in a similar manner. A set of cells with connections to each other represents a graph G(V, E) with a set of edges E, where e denotes a hyperedge (a multi-pin net), and a set of vertices V denoting cells, with edge weights w_{ij} representing the cost of each net. Consequently, the total wirelength, \theta, can be calculated as:

\theta = \sum_{i,j} w_{i,j} \left( (x_i - x_j)^2 + (y_i - y_j)^2 \right) \qquad (4.5)

which can be rewritten in each of the x and y coordinates as follows [50]:

\theta_x = \frac{1}{2}\,\vec{x}^{\,T} A_x \vec{x} + B_x^T \vec{x} + const \qquad (4.6)

The Hessian matrix A represents the connections between movable cells and B captures the connections between movable and fixed cells. Based on [50], minimizing (4.6) reduces to solving:

A_x \vec{x} = -B_x \qquad (4.7)
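As a quick reference, the HPWL objective of (4.3)-(4.4) can be computed with a single pass over the nets; the sketch below uses an invented pin-list representation of nets and made-up coordinates.

// Illustrative HPWL computation per Eqs. (4.3)-(4.4): for each net, the half-perimeter of
// the bounding box of its pins is accumulated. The coordinates and nets are assumed values.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Pin { double x, y; };

double hpwl(const std::vector<std::vector<Pin>>& nets) {
    double total = 0.0;
    for (const auto& net : nets) {
        double xmin = net[0].x, xmax = net[0].x, ymin = net[0].y, ymax = net[0].y;
        for (const Pin& p : net) {
            xmin = std::min(xmin, p.x);  xmax = std::max(xmax, p.x);
            ymin = std::min(ymin, p.y);  ymax = std::max(ymax, p.y);
        }
        total += (xmax - xmin) + (ymax - ymin);       // bounding-box half-perimeter of this net
    }
    return total;
}

int main() {
    std::vector<std::vector<Pin>> nets = {
        {{0, 0}, {120, 40}},                          // a two-pin net
        {{30, 10}, {90, 200}, {60, 50}},              // a three-pin (hyper)net
    };
    std::printf("total HPWL = %.1f\n", hpwl(nets));
    return 0;
}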
For the placement problem, an initial random placement of the cells is generated. Then, based on the Bound2Bound (B2B) net model [47] and the initial locations of the fixed and movable cells, the A and B matrices are calculated and Equation (4.7) is solved using the Conjugate Gradient method [50][98]. Once the new locations are calculated, the A and B matrices are updated and Equation (4.7) is solved again. This step continues until the HPWL reduction decreases to less than a preset threshold value, and is usually completed in 5-7 iterations, depending on the number of cells in the design. Since there are no constraints in the problem, in this solution most of the cells are placed on top of each other in the middle of the layout area. Next, based on an algorithm similar to look-ahead legalization (LAL) [50], grids are formed throughout the chip and, in each grid cell, pseudo pins are located based on the locations of the fixed and movable cells and the matrices A and B are updated. Solving Equation (4.7) generates the new locations of the cells, and cells are moved toward the lower-density areas of each grid cell to remove overlaps. The LAL algorithm is performed until the ratio of cell area to whitespace area in each grid cell is less than a preset threshold value (called the overfill ratio). Legalization and detailed placement are then performed to generate the solution with the minimum wirelength.

The solution generated by standard CMOS placement algorithms can be directly passed to the clock tree synthesis engine to create an H-tree clock network. However, such a solution will have a large cell area due to the large size of the H-tree clock network. In the next section we present our algorithms for cell-grouping and super-cell placement to generate a solution compatible with the HL-tree clock network.

4.4.2 HL-tree Placement

Different placement algorithms, ranging from force-directed to min-cut placements, tend to produce high-quality solutions (in terms of the minimization of the total wirelength) with very different cell placements. This degeneracy of placement solutions implies that one can optimize another objective function (e.g., the routing cost of the clock tree) while only minimally affecting the primary objective function (e.g., the total wirelength). In this section we present two cell-grouping approaches with the goal of generating placement solutions compatible with the proposed HL-tree clock network.

Considering that cells within a cell-group in an HL-tree network receive a skewed clock signal, constraints should be set to avoid race conditions. Assume three sequential elements with logic levels 1-3. The output of each cell is connected to the input of the previous cell. The clock signal is propagated using an L-tree clock network as shown in Fig. 4.7. In this configuration, if the delay of the splitter (the top portion of the cells, which splits the clock signal and propagates it to the next cell) is less than the stage delay of each cell, then cell 2 evaluates its input (the output of cell 1) before it is generated by cell 1 (i.e., it receives a stale input, namely the output of cell 1 generated in the previous cycle). One constraint that avoids such a scenario is that all the cells within a cell-group should have the same logic level. In SFQ designs, the input netlist should be path-balanced (in a path-balanced netlist, all paths from any primary input to any primary output have the same
Clock signal is propagated using an L-tree clock network. logical depth.) Therefore, cells of the same logic level do not have any direct connections to each other. Hence, passing an skewed clock signal from one of them to the other ones (in a cell-group) does not create race conditions. Consequently, a cell grouping approach is proposed based on same-level cell grouping. The second proposed approach relaxes the constraint in the first solution. The only constraint in the second cell grouping solution is that none of the cells in the same cell-group should have a direct connection to each other. We call this approach generic cell-grouping method. In the next subsections, each of these approaches are explained in details, and placed netlists using these two approaches are compared with the H-tree compatible solution. 4.4.3 Same-level Cell Grouping The goal of this part is to generate a placement similar to Fig. 4.8, in which cells of the same logic level form super-cells of max size k. 71 2 2 2 2 3 3 3 3 1 1 1 1 1 1 1 1 4 4 4 4 1 1 1 1 4 4 4 4 2 2 2 2 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 2 2 2 2 3 3 3 3 3 3 3 3 1 1 1 1 CLK k = 4 Figure 4.8: A placement solution compatible with HL-tree clock network obtained by groping cells of same logic level, for a group size of 4. 4.4.3.1 Design Flow Our placement algorithm consists of three main phases, as shown in Fig. 4.9. Phase 1 (Initial Cell Placement) starts by placing cells using a CMOS-based global placement algorithm. Legalization and detailed placement are used to produce a high quality legal solution (cf., Fig. 4.9 Phase 1). In phase 2 (Cell Grouping) circuit is transformed to a graph G(V,E), in which V denotes the set of all vertices (cells), and E represents set of all hyperedges (nets). In case of multi-pin nets, we use b2b net model [47], as it captures the HPWL objective function better than the clique and star net models [47]. Additionally, for each logic leveli, a sub-graph of the original graph, only containing the cells of that logic level, namely G i , is created. Initially, there are no connections between cells of same logic level. Furthermore, nodes of each sub-graph G i are processed in two steps, namely connectivity-based and distance-based graph processing (cf., Fig. 4.9 Phase 2). The goal of these two steps is to add connections between cells of the same logic level (nodes of graph G i ) to facilitate grouping of same level cells. Subsequently, each sub-graph G i is partitioned, and cell-groups consisting of strongly connected cells of same logic level are formed. In the final phase (Super-cell Placement), cell-groups (super-cells) are placed on the placement grid, and a detailed placement algorithm 72 Global Placement Legalization Detailed Placement 1. Initial Cell Placement 2. Cell Grouping Distance-based Graph Processing Netlist Graph (B2B Net model) Connectivity-based Graph Processing Global Placement for super- cells Legalization and Detailed Placement Solving a LAP to refine cell placements in each super-cell 3. Super-cell Placement Global Placement Legalization Detailed Placement 1. Initial Cell Placement 2. Cell Grouping Distance-based Graph Processing Netlist Graph (B2B Net model) Connectivity-based Graph Processing Global Placement for super- cells Legalization and Detailed Placement Solving a LAP to refine cell placements in each super-cell 3. Super-cell Placement Graph Partitioning Figure 4.9: Overall view of the proposed SFQ placement algorithm. 
LAP and B2B stand for the linear assignment problem [99] and the bound-to-bound net model [47], respectively. Details of the proposed design flow are described in the following sections.

4.4.3.2 Connectivity-based Graph Processing

Once the initial placement of the cells is produced by the first phase of the proposed approach (cf. Fig. 4.9, Initial Cell Placement), the netlist is transformed into a graph and the sub-graphs G_i are formed. In this step, weights are added to the edges between the nodes of sub-graph G_i solely based on the connectivity of the nodes to their adjacent-level neighbors (as initially there are no connections between nodes of G_i, a zero-weight edge is added between each pair of nodes). The intuition for this step is that if two nodes of logic level i have many common neighbors, it is desirable for the global placement to place them close to each other. As a result, to reduce the total HPWL, we may group them together. This step runs for each of the sub-graphs G_i independently; it is therefore performed L times, where L is the total number of logic levels. Details are provided below.

A pre-processing step finds the neighbor nodes of all cells in the design using Algorithm 1. This algorithm takes the base node and maxSearchLevel (a parameter which determines the search scope and is the same for all the nodes) as input and returns a two-dimensional vector, namely neighbors, consisting of the neighbors whose level is within maxSearchLevel (mSL) of the base node. Algorithm 1 starts by initializing an empty queue. Initially, the base node is pushed into the queue. In each iteration, the front element of the queue is chosen and added to the two-dimensional vector neighbors, with the difference between its logic level and that of the base node as the index of the first dimension. Next, the children (parents) of the front element of the queue whose level is within mSL of the base node are added to the queue. This is accomplished by the findChildren (findParents) function, which returns the child (parent) nodes of the front node that have a logic level greater (less) than that of the front node. Finally, once all neighbors of the base node within its mSL are added to the neighbors vector, the algorithm terminates. For instance, assume the base node is of level i and mSL is equal to 2. The algorithm returns all nodes with level greater than i-3 (parents) and less than i+3 (children) of the base node which have a direct connection toward the base node. A direct connection from a child (parent) to its base node means that it can only pass through nodes of logic level lower (higher) than its own logic level, only from output to input (input to output). Additionally, if mSL is equal to 0, only the splitter cells connected to the base node are returned as its neighbors. This is due to the fact that splitters are the only cells which do not receive a clock signal; therefore, they are given the same logic level as their driver.

Once this pre-processing step is completed for all nodes, each sub-graph G_i is processed as described in Algorithm 2. At each step a pair of nodes of level i, namely u and v, is processed. Initially, their common neighbors are found by intersecting their neighbors vectors. If there is a common neighbor of level p, a weight equal to α/|p-i| is added to the edge between them. α represents a normalization factor, directly proportional to mSL.
As the search scope for common neighbors expands toward distant nodes, the edge weights of the sub-graph G_i increase, as more common neighbors are found for each pair of nodes. It should be noted that if the output of a node is connected to a splitter cell (a neighbor of level 0), an edge weight of α + 1 is added to the edge between them.

Algorithm 1 Level-order traversal (neighbor search)
Input: Node* base, int maxSearchLevel
Output: vector<vector<Node*>> neighbors   // indexed by |level(node) - level(base)|
1: queue<Node*> Q = ∅; neighbors.resize(maxSearchLevel + 1)
2: Q.push(base)
3: while (!Q.empty()) do
4:   Node* front = Q.front(); Q.pop()
5:   levelDiff = abs(level(front) - level(base))
6:   if (levelDiff > maxSearchLevel) then continue
7:   neighbors[levelDiff].push_back(front)
8:   vector<Node*> children = findChildren(front)   // children with level greater than front's
9:   vector<Node*> parents = findParents(front)     // parents with level less than front's
10:  foreach node in {children, parents} do
11:    if (abs(level(node) - level(base)) <= maxSearchLevel) then Q.push(node)
12:  end foreach
13: end while
14: return neighbors

Algorithm 2 Connectivity-based graph processing (for sub-graph G_i)
Input: vector<Node*> sameLevelNodes = G_i(V)
1: foreach u in sameLevelNodes do
2:   foreach v in sameLevelNodes do
3:     commonNeighbors = intersect(neighbors(u), neighbors(v))
4:     foreach neighbor in commonNeighbors do
5:       levelDiff = abs(level(u) - level(neighbor))
6:       if (levelDiff == 0) then
7:         extraEdgeWeight = α + 1          // common neighbor is a splitter at the same level
8:       else
9:         extraEdgeWeight = α / levelDiff
10:      addEdgeWeight(u, v, extraEdgeWeight)
11:    end foreach
12:  end foreach
13: end foreach

Fig. 4.10 shows the nodes of levels 3-6 for a 4-bit Kogge-Stone adder circuit with 6 logic levels. Algorithm 2 processes the nodes of level 5 (shown in blue in Fig. 4.10) with an mSL of 2 and an α value of 2. The reduced graph containing only the nodes of level 5 and their connections is illustrated in Fig. 4.11. As can be observed, nodes c60 and c48 have a common neighbor of level 3, namely c66; consequently, the weight of the edge between them increases by 1. Additionally, c51 is a common incident neighbor for nodes c49 and c32, which results in an edge weight of 2 between these two nodes. Finally, since c63 does not have any common neighbors with the other nodes (it is one of the DFFs inserted for path-balancing), it remains unconnected. This step is performed for all the sub-graphs G_i. After this step, the sub-graphs are passed to the next step (Distance-based Graph Processing).

Figure 4.10: Graph of the nodes of levels 3-6 for a 4-bit Kogge-Stone adder. Nodes of level 5 are shown in blue. The logic level of each node is appended to its name.

Figure 4.11: Graph of the nodes of level 5 for a 4-bit Kogge-Stone adder after connectivity-based graph processing.

4.4.3.3 Distance-based Graph Processing

In this step, the initial cell placement information (cf. Fig. 4.9, Phase 1) is used to add edge weights between the nodes of sub-graph G_i as a function of their relative distance. The edge weight between each pair of same-level cells (u, v) is calculated using the following formula:

W_x(u, v) = \beta \cdot \left( 1 - \frac{|X_u - X_v| - \Delta X_{i,min}}{\Delta X_{i,max} - \Delta X_{i,min}} \right) \qquad (4.8)

where \Delta X_{i,max} and \Delta X_{i,min} represent the maximum and minimum distances between cells of logic level i, X_u and X_v denote the x coordinates of u and v, and \beta is a normalization factor. The edge weight in the y direction is calculated in a similar manner. The final edge weight between nodes u and v is calculated as follows:

W(u, v) = \left\lfloor \sqrt{W_x(u, v)^2 + W_y(u, v)^2} \right\rfloor \qquad (4.9)

where W_x(u, v) and W_y(u, v) denote the edge weights corresponding to the x and y directions. As a consequence of this phase, none of the nodes of G_i will be unconnected.
Hence, cell-grouping will be more deterministic, rather than random, compared with the case where there are unconnected nodes that could be placed in any of the cell-groups. The sub-graph of level 5 for the 4-bit Kogge-Stone adder example, after distance-based graph processing, is shown in Fig. 4.12. As can be observed, the edge weight of 4 between nodes c60 and c19 shows that they are placed near each other in the global placement, whereas an edge weight of 1 shows that the locations of c60 and c63 are relatively far from each other. This step is performed for all the sub-graphs G_i.

Figure 4.12: Graph of the nodes of level 5 for a 4-bit Kogge-Stone adder after distance-based graph processing.

4.4.3.4 Partitioning

Partitioning has been one of the most prevalent techniques for solving the placement problem, since it can be used to drastically reduce the complexity of the design process with a divide-and-conquer strategy [100]. Various partitioning methods have been introduced to solve the problem efficiently. Multilevel techniques such as Fiduccia-Mattheyses (FM) [101] and hMETIS [102] are examples of fast and efficient partitioning algorithms.

Once the sub-graph containing the nodes of level i (G_i) is processed through the connectivity-based and distance-based steps, a partitioning algorithm is used to partition the sub-graph to generate cell-groups. As a result, cells with higher edge weights end up in the same part and are grouped to form the super-cells. Assuming the total number of cells in level i to be m and the group size to be k, we seek to obtain p = ⌈m/k⌉ cell-groups. For this purpose, a p-way partitioning should be performed to generate p parts of size k. Consequently, G_i is passed to the hMETIS partitioner [102] for a p-way partitioning. After the partitioning, the generated parts are processed as follows. For each part, a super-cell is created and all nodes in that part are added to that super-cell. Once the partitioning of all sub-graphs (G_i) is completed and the super-cells are created, connections among nodes in the original graph are transformed into connections among super-cells in the reduced graph.

It should be noted that as the number of cells in each group (k) increases, the overall chip area decreases. The reason is that more cells of the same logic level are grouped (abutted) and the total number of clock sink nodes decreases. On the other hand, using larger values of k increases the delay of the linear clock propagation inside each group [17]. Therefore, increasing the group size creates a trade-off between total chip area and maximum clock frequency. However, in the current technology and for large circuits, the critical path delay is much larger than that of the linear clock distribution. Consequently, increasing the number of cells in each group leads to a smaller chip area with a negligible performance penalty.

4.4.3.5 Super-cell Placement

Once the Cell Grouping phase (cf. Fig. 4.9, Phase 2) is completed and the super-cells are generated, the original netlist is transformed into a smaller netlist in terms of the total cell count. If the total number of cells in the original netlist is N and the group size is set to be k, the new netlist contains approximately N/k cells. Each super-cell now represents a cell with a width equal to the sum of the widths of its cells and the same height as all of them. Additionally, the connections among cells are now transformed into connections among super-cells. Hence, each super-cell may have multiple input and output pins. In this phase, namely Super-cell Placement (cf. Fig.
4.9), the reduced netlist is placed using a global placement algorithm, followed by legalization and detailed placement. Once the positions of all super-cells are determined, the original netlist is retrieved, and the positions of the original cells are updated based on the position of their corresponding super-cell. They are initially placed in the center of each super-cell, on top of each other. It should be noted that due to the smaller size of the reduced netlist, the global placement runtime is lower than that of the original netlist.

4.4.3.6 Level-Based Detailed Placement

After the global placement is performed and the positions of the super-cells are determined, the cells inside each super-cell should also be placed such that the minimum HPWL is achieved. We present a linear assignment problem (LAP) [99] based solution to find the optimal placement of each cell inside its corresponding cell-group. Since super-cells only contain cells of the same level, there are no connections among the cells inside each cell-group (except for the splitter cells, which are grouped with their driver). The problem is defined as follows. Each cell in a cell-group can be placed in any of the possible placement slots inside that group. Each slot is assumed to have a width equal to the average width of all the cells in that cell-group. For instance, if there are 4 cells in a cell-group, the first cell can be placed in any of the 4 possible slots. Furthermore, there is a cost associated with placing each cell in each one of the slots. This cost can be defined in terms of the total HPWL of the signal nets connecting cells inside a super-cell to the ones outside that super-cell. For instance, if there are k cells in a cell-group, a cost matrix of size k×k is formed, where cost(i, j) represents the cost of placing cell i in slot j while the cells connected to that cell (outside the cell-group) are fixed. Consequently, the linear assignment problem is formulated and its solution determines the position of each cell inside its corresponding cell-group. Since this step is performed for the cells in a super-cell while the other cells are fixed, the order of processing super-cells may affect the final solution. Consequently, this step is repeated multiple times for each super-cell, until no significant improvement in the total HPWL is achieved. Finally, the generated placement is suitable for synthesizing an HL-tree clock network, as cell-groups only contain cells of the same logic level.
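For the small group sizes used here (k ≤ 4), the per-group assignment can even be illustrated by exhaustive enumeration, as in the sketch below. The cost matrix values are invented; a real flow would populate cost(i, j) from the HPWL of the nets crossing the super-cell boundary and would typically use a dedicated LAP solver instead.

// Illustrative solution of the per-super-cell assignment: cost[i][j] is the (assumed) HPWL
// cost of placing cell i in slot j; for small k an exhaustive search over permutations
// finds the optimal assignment, standing in for a dedicated LAP solver.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    const std::vector<std::vector<double>> cost = {   // k = 3, made-up costs
        {10.0, 40.0, 25.0},
        {30.0, 15.0, 20.0},
        {50.0, 35.0, 12.0},
    };
    const int k = static_cast<int>(cost.size());
    std::vector<int> slot(k), best;
    std::iota(slot.begin(), slot.end(), 0);
    double best_cost = 1e300;
    do {
        double c = 0.0;
        for (int i = 0; i < k; ++i) c += cost[i][slot[i]];
        if (c < best_cost) { best_cost = c; best = slot; }
    } while (std::next_permutation(slot.begin(), slot.end()));
    for (int i = 0; i < k; ++i) std::printf("cell %d -> slot %d\n", i, best[i]);
    std::printf("total cost = %.1f\n", best_cost);
    return 0;
}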
The effectiveness of our approach was evaluated using 8 different SFQ benchmarks obtained from [28], with group sizes of k = 2 and k = 4. HL-tree clocking scheme was used to synthesize the clock network [17]. Once the proposed placement algorithm terminates, first cell of each cell-group is marked as the clock sink. Next, BST/DME algorithm [72] is used to create an H-tree clock network to propagate clock signal to all the sink nodes. 81 Table 4.2: Comparison of total number of clock sink nodes, percentage of HPWL and area improvement for 8 different SFQ logic circuits using different placement strategies. KSA stands for Kogge-Stone adder and ID stands for integer divider. Benchmark # Cells # Levels Group size (k) # Clock sinks % HPWL improv. % Area improvement baseline proposed prop. vs. base. prop. vs. base. HL-k vs. Htree KSA16 500 10 2 353 317 3.7 1.3 23.1 4 225 161 1.0 11.6 37.0 KSA32 1208 13 2 893 829 8.1 6.2 24.1 4 523 422 6.4 5.7 35.1 ISCAS/c499 704 14 2 505 443 3.0 2.1 21.3 4 643 224 -2.5 7.2 36.4 ISCAS/c880 1299 24 2 906 776 13.0 0.5 20.9 4 613 394 3.6 7.0 31.2 ISCAS/c1908 1310 26 2 923 781 6.2 8.6 26.2 4 649 398 -2.9 11.0 34.8 ISCAS/c432 1569 43 2 1101 867 8.5 13.2 25.4 4 854 443 -2.5 14.6 36.4 ID8 5431 89 2 3487 2871 45.8 4.8 26.9 4 2475 1458 44.9 5.5 35.9 ID16 15315 305 2 11936 8242 56.3 16.4 21.4 4 10619 4196 52.4 23.6 33.7 Average(%) - - - - - 15.3 8.6 27.8 Splitters on the top portion of each cell within a cell-group are used to create L-tree clock network. Finally, total HPWL was calculated using the GSRC Bookshelf Evaluator [104]. It should be noted that total HPWL has been reported as sum of data and clock signal nets. The approach in [17] was implemented and used as the baseline solution. Table 4.2 compares the results of our method to the baseline solution. The proposed approach improves the total HPWL and total area by 15% and 8%, respectively. This is primarily due to the Cell Grouping phase (cf. Fig. 4.9). Cell-grouping in the baseline solution is only limited to same level cells within each row, while the proposed solution considers all the cells of same logic level globally, irrespective of the row at which they are placed during global placement. Consequently, a fewer number of clock sink nodes are created which in turn reduces the total area and total HPWL, simultaneously. The proposed approach reduces the total number of clock sink nodes approximately to N/k in the case of group size of k (cf. Table 4.2). However, the baseline solution generates lots of cell-groups with fewer cells than k in each cell-group. Consequently, the total area of the proposed approach is always smaller than that of [17]. Moreover, Table 4.2 shows that the proposed methodology reduces the overall area significantly, using the HL-tree clocking scheme, when compared with a global 82 placement accompanied by H-tree clock network as proposed in [17]. Comparison of the total chip area for different benchmarks, and different group sizes (k), shows that the total area can be reduced significantly, 27.8% on average, using the proposed clock-tree aware placement approach, which in turn can reduce the critical path delay and improve the performance of the circuits in terms of clock frequency. 4.4.4 Generic Cell grouping Although same-level cell grouping approach reduces the area compared with the baseline solu- tion, it increases the total and maximum wirelength, which leads to performance degradation and complicates the routability. 
Additionally, it over-constraints the cell grouping as the only constraint is that no two cells (no matter what their logic level is) in a super-cell should have a direct connection to each other. One other problem with this method is that the number of cells in each logic level is a fixed number generated by the logic synthesis tool. In large benchmarks, this number is huge which results in a large graph to be partitioned. Hence, runtime increases significantly. To address the above issues, we present a new cell-grouping algorithm, called generic cell grouping. The only constraint for the cells in a group is that they are not directly connected to each other. In other words, output signal of none of the gates in a cell-group should be input to the other cells in that group. An example of such placement solution for group size 4 is shown in Fig. 4.13. In this approach, after the global placement, chip area is partitioned geometrically using uniform grids over the layout area. An example of grid bins of size 500μm∗ 500μm formed over the layout area of a placed benchmark is shown in Fig. 4.14. Cells that end up within a grid bin are considered for cell-grouping. First, a graph is created for each grid bin. In this graph nodes represent cells and edges represent nets between cells. To avoid putting cells with direct connection to each other in the same cell-group a weight of−∞ is added to the existing edges in the graph. The intuition for this part is that if a large negative edge weight is added between two nodes, to minimize the total cut cost, 83 2 4 5 3 3 6 3 1 1 4 1 1 5 1 2 1 4 2 4 3 1 1 1 1 3 4 2 4 2 4 2 3 1 6 5 1 2 2 2 2 3 4 3 6 4 2 4 6 1 3 3 2 3 1 4 3 3 2 3 3 2 1 1 1 CLK k = 4 Figure 4.13: A placement solution compatible with HL-tree clock network obtained by grouping cells with different logic levels, for a group size of 4. Figure 4.14: Uniform grid over the layout area creating grid bins of size 500μm∗ 500μm. 84 the partitioner tries to put these two cells in different parts. Hence, two cells with direct connection end up in different groups. Next, similar to Distance-based graph processing approach presented in Section 4.4.3.3, edge weights are added to the edges of the graph. Each graph is then partitioned and generated parts are transformed to super-cells. Finally, similar to Phase 3 in Fig. 4.9, super-cells are placed and the final location of each group cell is determined. In this approach, thhe total number of cells that are considered to be clustered is a tunable parameter (a function of the area of the grid bins). Larger grid bins include a large number of cells, and the resulting number of cell-groups tends to be smaller as more cells can be grouped together without violating the direct connection constraint. On the other hand, runtime of the partitioner increases as the total number of nodes and edges of the graph increases. Therefore, there is a trade-off between the quality of the final placement in terms of area and total wirelength vs. the runtime of the algorithm. 4.5 Summary Special attributes of single flux quantum (SFQ) technology such as fast switching and low energy consumption, makes it one of the best replacements for CMOS-based design. However, large cell sizes, limited number of metal layers in current technology [19] and the fact that all the cells in SFQ technology need a clock signal, require special algorithms to improve the performance and reduce the total area in large SFQ circuits, especially the area dedicated to clock network. 
In this section, we first presented a novel hybrid clocking structure which combines an H-tree with linear clock trees to propagate a skewed clock signal to logic cells of the same logic level, within cell-groups. The presented clocking structure reduces the layout area by abutting nearby logic cells, which can further reduce the total wirelength of the design. Next, two novel cell-grouping based placement approaches are introduced which are customized to generating placement solutions compatible with the said HL-tree 85 clocking structure. Finally, the effectiveness of the presented methods are compared with state-of-the-art techniques. 86 Chapter 5 Physical Synthesis of Clock Networks This chapter outlines a synchronous minimum-skew clock tree synthesis algorithm for single flux quantum circuits considering splitter delays and placement blockages. The proposed methodology improves the state-of-the-art by accounting for splitter delays and creating a fully-balanced clock tree structure in which the number of clock splitters from the clock source to all the sink nodes is identical. Additionally, a mixed integer linear programming (MILP) based algorithm is presented that removes the overlaps among the clock splitters and placed cells (i.e., placement blockages) and minimizes the clock skew, simultaneously. Using the proposed method, the average clock skew for 17 benchmark circuits is 4.6ps, improving the state-of-the-art algorithm by 70%. Finally, a clock tree synthesis algorithm for imbalanced topologies is presented that reduces the clock skew and the number of clock splitters in the clock network by 56% and 37%, respectively, compared with a fully-balanced clock tree solution. 5.1 Overview Conventional computing based on CMOS technology and metal interconnects has faced sub- stantial issues in terms of total power consumption and energy efficiency [1]. Superconducting computing based on the Josephson effect is a promising replacement for CMOS technology aiming at high-performance and energy-efficient computing [3]. Josephson junctions (JJs), 87 basic circuit elements in single flux quantum (SFQ) technology, have a rapid switching speed (∼ 1ps) and low switching energy (∼ 10 −19 J/bit) at temperatures about 4K [2], [13]. Rapid single flux quantum (RSFQ) technology was introduced in the 1980s. It uses quantized voltage pulses in digital data generation and memorization [3]. RSFQ circuits have been shown to be functional at operating frequencies of up to 770 GHz [4]. Recent developments introduce new SFQ logic families, such as energy-efficient single flux quantum technology (ERSFQ/eSFQ) [12], dual-rail RSFQ [5], self-clocked complementary logic (SCCL) [6], reciprocal quantum logic (RQL) [7], novel approaches including re-design of the current biasing network for RSFQ [8],[9],[10], and application of low supply voltage for RSFQ circuits [11]. In spite of extraordinary characteristics of SFQ logic (including but not limited to high frequency and low energy dissipation), design automation methodologies and tools are less sophisticated than those of CMOS technology, preventing the SFQ logic to become a realistic option for realizing large-scale, high-performance, and energy-efficient computing systems of the future [13]. Although many advanced techniques have been developed for computer-aided design (CAD) for CMOS technology, these techniques cannot be directly applied to the design of SFQ circuits due to key differences between the two technologies. 
Some of these differences are (i) different active and passive components (JJs and inductors vs. transistors and capacitors), (ii) various types of logic gates and clocking structures, and (iii) the need for path-balancing D flip-flips (DFF), splitters, and biasing networks which increases the total cost of integration in terms of area and power consumption [105]. To address the aforementioned issues, researches have started focusing on the development of front-end and back-end tools and methodologies for design automation of superconducting electronics to enable very large scale integration (VLSI) design and verification of superconductive electronics (SCE) as a step toward the development of energy-efficient and high-performance computers [106]. 88 Physical design of logic circuits, especially the synthesis of clock distribution network (CDN), plays an important role in designing high-performance circuits robust to process- induced variations. The layout of large circuits requires automated placement, clock network synthesis, and routing tools. Recent efforts by researches have introduced effective techniques for placement, design of CDNs, and routing for large SFQ circuits [107–109]. Clock network synthesis is a crucial task in physical design of logic circuits as the clock network takes up substantial routing resources, consumes significant power, and determines the maximum frequency of the circuits. Minimizing the clock skew (i.e., the maximum difference in the arrival time of the clock signal at two different clock sinks) is of great importance since the clock skew directly limits the maximum achievable frequency of a circuit [22]. In SFQ logic circuits, the clock signal should be delivered to nearly all logic cells in the design. Therefore, to maximize the performance, a well-balanced minimum-skew clock tree structure is an absolute requirement. Previous zero-skew clock tree synthesis methods for SFQ circuits fail to produce high- quality solutions because they do not consider the delay of splitter cells (which are required to distribute the clock signal to sequential gates) and placement blockages (already placed logic cells) [17]. Additionally, the population density of cells in different regions of the chip can be very different which can result in a highly-imbalanced clock tree topology, i.e., one where the maximum difference between splitter counts from the root of the clock tree to any pair of leaf nodes is large. This chapter presents an algorithm for a fully-balanced clock tree topology construction and a min-skew clock tree placement and legalization algorithm using a mixed integer linear programming (MILP) formulation to perform clock tree construction, splitter insertion, and skew minimization under the given placement blockages, considering both splitter and interconnect delays. The proposed clock tree topology generation algorithm guarantees that the maximum difference between the number of splitters from the clock source to any pair of 89 sinks is zero. The effectiveness of the proposed CTS algorithm is verified using multiple SFQ circuits. The main contributions of this chapter can be summarized as follows. • An algorithm is presented that creates a fully-balanced clock tree in which the maximum difference between the number of splitters from the root of the tree to any pair of leaf nodes is zero. 
• A min-skew clock tree placement and legalization algorithm is presented that places the clock splitters in the routing channels, i.e., the empty spaces between the placement rows, and eliminates the overlaps among the clock splitters and logic cells while minimizing the skew.
• Using the proposed technique, the average clock skew for 17 benchmark circuits is reduced to 4.6ps. This approach improves the baseline method by 70%.
• The proposed CTS algorithm is extended to generate a minimum-skew solution given imbalanced clock tree topologies. The modified algorithm reduces the clock skew and the number of clock splitters in the clock network by 56% and 37%, respectively, compared with a fully-balanced clock tree solution.

The rest of the chapter is organized as follows. Background and prior work are discussed in Section 5.2. Our SFQ-specific clock tree synthesis methodology, including the topology construction, splitter placement, and legalization algorithms, is discussed in Section 5.3. Simulation results obtained by applying the proposed method to multiple benchmark circuits are reported in Section 5.4. A clock tree synthesis methodology for imbalanced topologies is presented in Section 5.5. In Section 5.6, we modify and apply the proposed clock synthesis algorithm to asynchronous clock networks. Finally, the chapter is summarized in Section 5.7.

5.2 Preliminaries

5.2.1 Definitions

In this section, we summarize some definitions and notations used throughout this chapter.

• Clock phase delay refers to the delay from the clock source to any of the clock sinks (i.e., sequential elements such as flip-flops or latches). Phase delay, also known as insertion delay, increases as the feature size decreases and the chip size increases. The phase delay is typically a combination of gate delay (e.g., buffers, clock gating elements, and clock dividers) and interconnect delay. As the feature size decreases, the effect of process and on-chip variations (OCV) on the phase delay increases, which in turn affects the clock uncertainty [22]. Accordingly, minimizing the phase delay values is beneficial in reducing the clock uncertainty.

• Clock skew: Two flip-flops i and j connected by combinational gates and interconnects are called sequentially adjacent flip-flops (cf. Fig. 5.1). The clock skew between nodes i and j is defined as the difference between the clock arrival times (phase delay values) at these two nodes. In this chapter, the clock skew for a circuit is defined as the maximum skew between any two flip-flops. The equations for calculating the clock skew are as follows.

$\mathrm{skew}_{i,j} = T_i - T_j$   (5.1)

$\mathrm{skew}_{\max} = \max_{1 \leq i,j \leq n} |T_i - T_j|$   (5.2)

where $T_i$ and $n$ denote the clock arrival time at sink $i$ and the total number of clock sinks, respectively. Timing constraints can be categorized as setup and hold time constraints, defined as follows.

Setup time: the amount of time that the input to the capturing flip-flop (FF_j) should stay valid before the next triggering clock edge arrives [110]. The following inequality summarizes the relation between the clock skew, the clock period, and the setup time.

$T_p \geq \mathrm{skew}_{i,j} + t^{\max}_{c2Q} + t^{\max}_{comb} + t^{\max}_{setup}$   (5.3)

Figure 5.1: A pair of sequentially adjacent flip-flops i and j connected by a combinational gate and interconnects.
where $t^{\max}_{c2Q}$ denotes the maximum clock-to-Q delay of a flip-flop, $t^{\max}_{comb}$ accounts for the maximum delay through the combinational logic (which also includes the interconnect delay), $T_p$ represents the clock cycle time, and $t^{\max}_{setup}$ denotes the maximum setup time for a flip-flop. As shown, a positive clock skew increases the required clock cycle time. On the other hand, a negative clock skew (i.e., the clock signal is received at the launching flip-flop earlier than at the capturing flip-flop) decreases the required clock period.

Hold time: To ensure the proper propagation of an input signal through a flip-flop, the input must remain valid, or hold steady, for a short duration after the clock edge, referred to as the hold time [40]. The hold time of the capturing flip-flop imposes an additional constraint on the total propagation delay of a signal through the launching flip-flop and the combinational logic, as follows.

$\mathrm{skew}_{i,j} \geq t^{\max}_{hold} - t^{\min}_{c2Q} - t^{\min}_{comb}$   (5.4)

In the worst case, the input signal at the capturing flip-flop (j) should remain stable for $t^{\max}_{hold}$ after the clock edge of the same clock cycle arrives at node j. If the clock signal arrives at FF_i earlier than at FF_j, it may cause the input signal to FF_j to change before FF_j can capture it.

As shown above, the clock skew directly limits the maximum clock frequency of a circuit and reduces the available positive time slack for setup constraints. Additionally, a negative clock skew may result in hold time violations. Unlike setup time violations, hold time violations cannot be fixed by increasing the clock period.

Actual Arrival Time (AAT): the latest transition time at a given node in a circuit, measured from the beginning of the clock cycle [110].

Required Arrival Time (RAT): the latest time at which a signal should arrive at a given node such that the circuit works correctly, given the setup or hold constraints.

Timing Slack: For each node in a circuit (e.g., pins or gates), the timing slack is calculated as the difference between the RAT and the AAT at that node. While a positive slack means that the timing constraint is satisfied (i.e., the signal arrives earlier than it is required), a negative slack is an indicator of a (setup or hold) timing violation (i.e., the signal arrives after its required time) [110].

• Clock Tree Topology: A clock tree topology is defined as a binary tree G in which each node has at most 2 children, which is rooted at the clock source R, and which has |S| leaf nodes representing the set of clock sinks S. We define this tree topology to be a directed graph in which edges are directed from parents to children. The level of a node i is defined as the number of nodes on the longest path from the root of the tree to node i, denoted by L_i. The height of a node is the number of nodes on the longest path from that node to a leaf node, denoted by H_i. The height of a tree is defined as the height of the root node of the tree.

• Clock Tree Embedding: A clock tree embedding determines the location of each internal (non-sink) node v of the clock tree topology, denoted by pl(v), in the Manhattan plane. If there is a connection between a parent node p and a child node c, the cost of the edge e_{p,c}, denoted by l_{p,c}, is defined as the Manhattan distance between pl(p) and pl(c). The total wirelength of a tree is calculated as the sum of the costs of all the edges of the tree.
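As a quick illustration of how these definitions are used later in the chapter, the following self-contained sketch evaluates Equations (5.1)-(5.4) for a list of sequentially adjacent flip-flop pairs. The structure and all numbers are hypothetical and do not come from the qSTA tool; they merely show how the skew and the setup/hold slacks follow from the clock arrival times.

```cpp
// Minimal sketch (not the dissertation's qSTA tool) of the skew and slack definitions
// above. All names and the sample numbers are illustrative assumptions.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// One launching/capturing flip-flop pair (FFi -> FFj), cf. Fig. 5.1.
struct TimingPath {
    double Ti;     // clock arrival time (phase delay) at launching FFi, in ps
    double Tj;     // clock arrival time (phase delay) at capturing FFj, in ps
    double dComb;  // combinational + interconnect delay between FFi and FFj, in ps
};

int main() {
    // Hypothetical arrival times, path delays, and cell parameters (in ps).
    std::vector<TimingPath> paths = {{12.0, 14.5, 30.0}, {14.5, 12.0, 8.0}};
    const double tC2Q = 6.0, tSetup = 2.0, tHold = 3.0, Tp = 50.0;

    double skewMax = 0.0;
    for (const auto& p : paths) {
        const double skew = p.Ti - p.Tj;                      // Eq. (5.1)
        skewMax = std::max(skewMax, std::fabs(skew));         // Eq. (5.2)

        // Setup: Tp >= skew + tC2Q + dComb + tSetup, Eq. (5.3); negative slack = violation.
        const double setupSlack = Tp - (skew + tC2Q + p.dComb + tSetup);

        // Hold: skew >= tHold - tC2Q - dComb, Eq. (5.4); negative slack = violation.
        const double holdSlack = skew + tC2Q + p.dComb - tHold;

        std::printf("skew=%.2f  setup slack=%.2f  hold slack=%.2f (ps)\n",
                    skew, setupSlack, holdSlack);
    }
    std::printf("max clock skew = %.2f ps\n", skewMax);
    return 0;
}
```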
Based on the above definitions, reducing the maximum clock skew increases the maximum frequency of the circuit, reduces the number of hold time violations, and facilitates the timing closure of the design (i.e., fixing the timing violations of the circuit, which is typically done after the placement and clock tree synthesis steps).

5.2.2 Delay Model

Single flux quantum pulses are typically propagated over long distances using passive transmission lines (PTLs). PTL micro-strips transmit the pulses with extremely low losses, at a speed of approximately 1/3 of the speed of light in a vacuum [13]. Equation (5.5) models the propagation delay as a function of the length of the PTL.

$D = \dfrac{L}{\tfrac{1}{3}c} \approx \dfrac{L\,(\mu\mathrm{m})}{100\ \mu\mathrm{m/ps}}$   (5.5)

In Equation (5.5), D represents the delay over the PTL, L represents the length of the PTL, and c denotes the speed of light in a vacuum. As a result, the phase delay from the root of the tree R to a sink node C_i over a path path(R, C_i) is calculated as follows.

$D_{R,C_i} = \dfrac{1}{100} \times \sum_{e_{j,k} \in path(R,C_i)} l_{j,k}$   (5.6)

where l_{j,k} denotes the Manhattan distance between clock nodes C_j and C_k. Although SFQ signals can also be propagated using Josephson transmission lines (JTLs), we do not use JTLs in the global clock network, due to their low propagation speed and the difficulties they introduce in the routing stage. The introduced delay model for PTLs is similar to the path-length delay model used in CTS algorithms for early CMOS technology nodes [72]. Although this linear delay model is used throughout the chapter, other delay models can be integrated into the proposed design flow.

5.2.3 Prior Work

Multiple clock topologies have been proposed and their trade-offs are discussed in [111]. Synchronous clock tree synthesis using an H-tree structure has been proposed as the best option for large circuits in terms of the maximum clock frequency [111]. In [112], a layout-driven CTS method is presented that groups cells by logic level and propagates a skewed clock signal to each logic group. For the first logic level, a clock tree is built to propagate the clock signal to each gate such that the timing constraints are met. Then, the clock signal is passed to the root of the clock tree for the next logic level. This work employs splitter and JTL insertion and replacement of the logic cells in the same logic level for timing adjustments. An earlier SFQ design methodology presented in [113] first synthesizes a zero-skew clock network utilizing an H-tree structure. The proposed algorithm then places the cells on predefined rectangular grid bins at the leaves of the clock tree, using a min-cut placement algorithm. However, since the placement slots are limited to grid bins and the placement is done after the CTS, the quality of the placement in terms of the total wirelength and routability is degraded significantly. In [17], a CTS algorithm for H-tree and HL-tree clock structures was presented and results were discussed. In HL-tree structures, an H-tree is used for global clock distribution and a linear tree (L) for local clock distribution [17]. The authors in [107] provide an algorithm for optimizing the placement of logic cells such that HL-tree clock structures can be utilized efficiently.

5.3 Proposed Min-Skew Clock Tree Synthesis Methodology

Conventional clock tree synthesis methods for CMOS circuits do not consider the placement of the splitters and their associated delay in clock skew minimization, as they use zero-area and zero-delay branching points for splitting the global clock signal.
However, in SFQ circuits, 95 the splitter delay adds to the overall insertion delay at the sink nodes. Additionally, the placement of these clock splitters determines the delay of each edge in the clock tree and therefore affects the insertion delays. Moreover, clock splitters should be placed in legal locations, i.e., should not have any overlaps with the placed logic cells and should not violate the layout rules. To address the aforementioned challenges, we propose a minimum-skew clock tree synthesis considering splitter delays and placement blockages. In the following subsections, the overall flow of the proposed algorithm is presented and details of each step are explained. 5.3.1 Overall Design Flow The overall design flow of our clock tree synthesis algorithm (called qCTS) is shown in Fig. 5.2. The proposed approach generates a minimum-skew clock tree such that all the clock splitters are mapped to the routing channels between the placement rows (i.e., the logic cells are placed inside the rows and the clock splitters are placed between the rows). The proposed methodology can be employed for both H-tree and HL-tree clock structures. A sample output of the proposed algorithm for a placed netlist is shown in Fig. 5.3. The inputs to the qCTS algorithm are (i) a placed netlist, (ii) a list of clock sink nodes and their locations, and (iii) a delay model. There are 4 steps in the proposed algorithm. • In the first step (cf. Fig. 5.2, Topology Generation), a fully-balanced tree topology is generated to minimize the maximum level difference among the clock sinks to zero. After this stage, it is guaranteed that all the sink nodes have the same level. • In the second step (cf. Fig. 5.2, Clock Tree Embedding), the clock tree embedding algorithm generates a zero-skew clock network and calculates the location of all the internal nodes of the clock network, given the tree topology and the location of the sink nodes. • In the third step (cf. Fig. 5.2, Splitter Insertion), the splitter cells in the clock tree are placed at the location of the embedding points of the clock network. 96 Placement Topology Generation Clock Tree Embedding Min-Skew Clock Tree Placement & Legalization STA Placement Topology Generation Clock Tree Embedding Min-Skew Clock Tree Placement & Legalization STA Splitter Insertion Logic Synthesis Timing Closure qCTS Figure 5.2: The overall flow of the proposed clock tree synthesis algorithm, qCTS. 97 • In the final step (cf. Fig. 5.2, Min-Skew Clock Tree Placement and Legalization), a MILP based approach is used to map the clock splitters to the routing channels and to remove the horizontal overlaps between the clock splitters. Cell Splitter Row Clk Source Figure 5.3: The proposed placement of logic cells (blue rectangles, placed inside the rows) and clock splitters (black rectangles, placed between the rows) for a circuit with 32 logic gates and 8 rows. Rows are shown using red rectangles. 98 Once the qCTS algorithm generates the CDN, the static timing analysis (STA) tool calculates the maximum clock frequency and hold/setup time slacks and the timing closure flow tries to solve all the timing violations. In the following subsections, each step is explained in detail. 5.3.2 Clock Topology Generation In SFQ logic circuits, the clock signal is distributed to nearly all the logic cells as most of the cells are sequential elements, i.e., need a clock signal for synchronization. 
Conventional clock topology generation methods do not consider the delay of splitter cells (needed to distribute the clock signal to multiple fan-outs). Additionally, the population density of cells in different regions of the chip can result in a highly-imbalanced clock tree topology with a large level difference among sink nodes. Consider an example with 8 leaf nodes as depicted in Fig. 5.4. As shown, the leaf nodes have different levels (e.g., nodes 9 and 14 have levels 4 and 3, respectively). The maximum level difference among leaves is 2. Fig. 5.5 depicts a balanced tree topology in which all the leaves have the same level. Insertion delay at each leaf node is a combination of all the splitter delays from the clock source to each leaf node and interconnect delays. By creating a balanced tree topology, the portion of the insertion delay corresponding to splitter delays is balanced out among the leaf nodes, which helps reducing the clock skew. Some of the clock topology generation algorithms for CMOS circuits, such as greedy-DME or geometric matching, create imbalanced topologies as they allow merging two sub-trees with different heights [114][115]. We intend to create a fully-balanced binary tree in which the maximum level difference among leaf nodes is equal to 0. For this purpose, we propose using an algorithm similar to the method of means and medians (MMM) [116]. The MMM algorithm, proposed as one of the initial minimum-skew clock tree synthesis algorithms, heuristically minimizes both the clock skew and total wirelength of the CDN. The MMM method performs topology generation and clock embedding simultaneously, in a top-down manner. Note that in this work, we only use the topology generation part of this algorithm. 99 Figure 5.4: Clock tree topologies for 8 leaf nodes (shown in blue). An imbalanced tree with a max level difference of 2 among leaves is shown. Figure 5.5: Clock tree topologies for 8 leaf nodes (shown in blue). A balanced tree with a max level difference of 0 among leaves is shown. 100 The MMM algorithm recursively bi-partitions the set of sinks in a region, creates a tree node for each of the sub-regions, and assigns these tree nodes as the children of the parent node (corresponding to the original region) [116]. In each step, sinks are sorted based on their x or y coordinate. Assuming the sinks are ordered in x (y) coordinate, half of the sinks are assigned to the left (bottom) sub-region and the other half are assigned to the right (top) sub-region. For each of the created sub-regions, a new tree node is created and the root of the current tree (node corresponding to the original region) is assigned as the parent of the two newly created nodes (i.e, nodesleft andright). Additionally, nodes corresponding to the two sub-regions are assigned as children of the root node. The same procedure is repeated for the two created sub-regions, recursively, until the number of sinks in each sub-region becomes less than 2. The MMM algorithm terminates in logn steps and its complexity in terms of run-time is O(nlogn), where n is the number of clock sinks [116]. An example is illustrated in Fig. 5.6. As shown, with 10 sink nodes, the maximum level difference among the leaf nodes is 1. The MMM algorithm always creates a topology in which the max level difference among leaves is at most 1. The reason is it initially creates a fully-balanced binary tree with a height ofdlog ne− 1, without adding any sink nodes to the tree. 
At this stage, in each created sub-region, there are either 2 or 1 sinks left. If there is only 1 sink left, that sink is assigned as a leaf node of the tree, therefore, it does not increase the height of the tree. If there are 2 sinks left, another bi-partitioning adds two children to one of the leaf nodes of the tree and increases the height of the tree todlog ne. Consequently, leaves of the tree have a max level difference of 1. To further reduce the max level difference to 0, we use JTL cells. If the max sink level is l max , we find all the sinks with level l max − 1 and add a JTL cell as their parent. Hence, their level becomes l max and the output clock tree becomes fully balanced. To balance the insertion delay at the sink nodes and reduce the max clock skew, we design special JTL cells to have the same propagation delay as splitters. 101 (a) (b) (a) (b) (c) (d) (c) (d) Figure 5.6: The MMM clock topology generation algorithm applied to an example with 10 sinks [116]. Blue rectangles and black circles show the splitter cells and sinks of the clock tree. Given the location of all the sink nodes, the topology of the clock tree is generated using the outlined method. Next, the generated topology is passed to the clock tree embedding step, which calculates the location of the embedding points of the internal nodes of the clock tree in the Manhattan plane. 5.3.3 Clock Tree Embedding and Splitter Insertion The generated clock tree topology along with the locations of the sink nodes are the inputs to the clock tree embedding step. In this step, the location of clock splitters is determined. The goal is to construct a zero-skew clock tree, while minimizing the total wirelength of the clock network. We use the deferred merge embedding (DME) algorithm to embed the clock 102 tree as it generates a zero-skew solution with minimum cost in terms of wirelength, assuming a linear delay model [72]. The DME algorithm was developed according to the observation that there are multiple locations for an internal node in a given topology which satisfy the skew specifications [72]. The DME algorithm constructs a clock tree in two phases: (i) a bottom-up pass that finds all potential zero-skew merging locations for two nodes, called merging segments (ms), as a function of the distance between the child nodes and the downstream delay of each child node. Downstream delay is defined as the max delay from a node to its leaf nodes. (ii) a top-down tree traversal in which the DME picks one location on each merging segment. The DME algorithm has linear time complexity given the input topology. For a complete description of this algorithm please refer to [72]. An example of applying DME algorithm to a clock tree with 4 sinks is redrawn from [72] and shown in Fig. 5.8. Fig. 5.7 shows a balanced clock tree topology. The goal is to find the location of internal nodes of the tree, a, b, and r. Fig. 5.8 shows the location of sinks, merging segments, and the embedding points of the internal nodes of the clock tree in the Manhattan plane. In the bottom-up pass, the DME algorithm finds the merging segments for the internal nodes. For instance, the Manhattan distance between sinks s 3 and s 4 is 4. Therefore, to generate a zero-skew solution, the distance between node b and each of its two children should be 2. Accordingly, a merging segment (ms b ) is formed within a distance 2 from s 3 and s 4 . Similarly, ms a is formed with a distance 3 from s 1 and s 2 . 
Note that the downstream delay of any node on ms a and ms b is 3 and 2, respectively, assuming a path-length delay model. Finally, the merging segment for node r is calculated such that the maximum clock skew becomes 0. The clock tree embedding step generates the exact location of internal nodes of the clock tree. Subsequently, we place the clock splitters at these embedding points and add nets between each splitter node and its children. In contrast to CMOS that the embedding points are locations of the branching points of the clock signal (with zero area), in SFQ technology, 103 s 1 s 2 s 3 s 4 a b r s 1 s 2 s 3 s 4 a b r Figure 5.7: An example of a zero-skew clock tree produced by the DME algorithm for a circuit with 4 sinks. The topology of the tree. s 4 s 3 s 2 ms b ms a ms r s 1 s 4 s 3 s 2 ms b ms a ms r s 1 r a b Figure 5.8: An example of a zero-skew clock tree produced by the DME algorithm for a circuit with 4 sinks. The location of sinks, merging segments, and the embedding points of the internal nodes of the tree. Black rectangles and yellow circles represent the clock sinks and splitters, respectively. 104 splitter cells (with non-zero area) are placed in these locations. Generated locations for clock splitters may already be occupied by logic cells or other clock splitters. Hence, after placing the clock splitters in these locations, the clock tree should be legalized. In other words, the overlaps among clock splitters and cells should be removed. Note that due to overlaps among clock splitters and placed cells, after the legalization step, the clock skew may not be zero anymore. To remove the overlaps among the clock splitters and the placed cells, we employ a two-step approach: (i) map the clock splitters to routing channels (adjusting y coordinates) (ii) remove the overlaps among splitters in each routing channel (modifying x coordinates). Note that the legalization step changes the insertion delays of the leaf nodes and results in clock skew. Accordingly, the primary objective is to minimize the introduced clock skew. In the next step, we present our algorithms for finding the best routing channel and the best x coordinate for each clock splitter, such that the clock skew is minimized. The output of this step (an illegal solution) along with the a legal placement of the clock splitters that yields a minimum skew solution for a 4-bit Kogge-Stone adder circuit [117] are shown in Figures 5.9 and 5.10, respectively. 5.3.4 Min-Skew Clock Tree Placement and Legalization In this section, we present a minimum skew clock tree placement and legalization algorithm. This algorithm removes the overlaps among logic cells and clock splitters in two steps. In the first step, clock splitters are mapped to the routing channel while their x coordinates are fixed (displacement in vertical direction). The motivations for moving the clock splitters to the routing channels are as follows. (i) we will not need to change the placement of logic cells already placed in the placement rows. This helps ensure the routability of the circuit is not affected much. (ii) we will not need to increase the width of the chip to accommodate the placement of clock splitters inside placement rows. Note that in a circuit with n = 2 m clock sinks, a total number of n− 1 splitters are needed to build a fully-balanced clock 105 Figure 5.9: Illustration of a 4-bit Kogge-Stone adder circuit [117], after the placement and CTS steps. 
Logic cells, clock splitters, and I/O pads are shown using blue, red, and black rectangles, respectively. An illegal zero-skew solution is shown.

Figure 5.10: Illustration of a 4-bit Kogge-Stone adder circuit [117], after the placement and CTS steps. Logic cells, clock splitters, and I/O pads are shown using blue, red, and black rectangles, respectively. A legal nonzero-skew solution is depicted.

tree. Therefore, by adding splitters to the placement rows, the width of the chip may increase significantly.

The main objective in this step is to minimize the clock skew, i.e., the difference between the largest and smallest insertion delay values at the sink nodes. As explained in Section 5.2, reducing the phase delay values increases the robustness to on-chip variations. Additionally, considering the delay model for PTLs, reducing the phase delay values also reduces the total wirelength of the clock tree and facilitates the clock routing. Consequently, minimizing the sum of the phase delays is considered a secondary objective.

The variables in this problem are the assignments of splitter cells to the routing channels. For each splitter cell i, the index of the assigned routing channel (y_i) is an integer value between 1 and the number of available routing channels (n_R). The parameters used for formulating the problem are listed in Table 5.1. The constraints are defined as follows.

• The total number of clock splitters in each routing channel should be less than the capacity of the routing channel (i.e., the summation of the widths of the splitters should be less than the width of a routing channel).
• The mapping of the cells to the routing channels should not differ much from the solution generated by the embedding algorithm, as it already provides a good initial solution.

Subsequently, we present a mathematical formulation of the min-skew clock tree placement problem in the vertical direction as follows.

\begin{align*}
\text{minimize}\quad & \max_{i=1\dots n_S} D_i \;-\; \min_{j=1\dots n_S} D_j \;+\; \lambda \cdot \sum_{k=1}^{n_S} D_k \\
\text{subject to}\quad & y_i \in \{1 \dots n_R\} \\
& D_i = \sum_{e_{k,j} \in path(C_0, C_i)} \delta_{k,j} = \sum \left( \delta^x_{k,j} + \delta^y_{k,j} \right) \\
& \delta^y_{k,j} = \alpha \cdot |y_k - y_j| \cdot H_{r+ch}
\end{align*}   (5.7)

In the above formulation, the phase delay of sink node i, denoted by D_i, is calculated as the sum of the wire delays on the path from the clock source (C_0) to the sink node i. The delay constant α is used to convert PTL length to delay, as described in Section 5.2.2. The main objective is to reduce the difference between the largest and smallest phase delay values, i.e., the clock skew. The phase delay value at each sink node (D_i), multiplied by a constant value (λ), is added to the objective function as a regularization term. In this formulation, the x coordinates of all the clock splitters are kept constant (i.e., the δ^x_{k,j} terms are constant values, a function of the locations calculated in the previous step). Since the absolute values add non-linearity to the problem, the following transformation is used to linearize the constraints.

$|z| \;\Rightarrow\; \eta^+ + \eta^-$   (5.8)

Table 5.1: Notations and definitions used for formulating the clock tree placement and legalization problem.
Term | Definition
C_i | Clock cell i, including sinks
C_0 | Clock source (root of the clock tree)
L_i | Level of the clock cell i
D_i | Phase delay at the sink node i
(x_i, y_i) | Lower-left coordinates of the clock cell i
e_{k,j} | An edge in the clock tree connecting cells C_k and C_j
δ_{k,j} | Delay of the edge e_{k,j}
n_C | The total number of clock cells, excluding the sink nodes
n_S | The total number of clock sinks
n_R | The total number of placement rows (same as the number of routing channels)
W_a, H_a | Width and height of the layout area
W_ch, H_ch | Width and height (≥ 40μm) of the routing channels
W_r, H_r | Width and height (120μm) of the placement rows
H_{r+ch} | The sum of the height of a placement row and a routing channel
m_y (m_x) | The max difference between the y (x) coordinates of the clock cells
W_spl, H_spl | Width and height of the clock splitter cells (40μm)
P | The minimum distance between adjacent clock cells
d_spl | Splitter delay (5.5ps)
λ | Regularization constant (10^{-3})
α | Delay constant (10^{-2} ps/μm)

Using the above transformation, the following constraints should be added to the problem.

\begin{align*}
z &= \eta^+ - \eta^- \\
0 &\leq \eta^+ \leq b \cdot m \\
0 &\leq \eta^- \leq (1-b) \cdot m \\
b &\in \{0,1\}
\end{align*}   (5.9)

As shown, a new boolean variable b is added to the problem, and the parameter m represents the maximum possible value of |z|. Additionally, in order to transform the max and min functions into linear functions, two new auxiliary variables D_max and D_min are introduced and the following constraints are added to the problem.

$D_i \leq D_{max} \quad \forall i \in \{1 \dots n_S\}, \qquad D_{min} \leq D_i$   (5.10)

Furthermore, to control the channel density, i.e., to limit the number of clock splitters mapped to each routing channel, separate constraints are added to the problem. The capacity of a channel, defined as the maximum number of splitters in each channel, is calculated using the following formula.

$n_{ch} = \left\lfloor \dfrac{W_{ch}}{W_{spl} + P} \right\rfloor$   (5.11)

To model the assignment of each cell i to each routing channel j, a boolean variable c_{i,j} is defined and the following set of constraints is added to the problem.

$y_i = j \iff c_{i,j} = 1 \quad \forall i \in \{1 \dots n_C\},\ \forall j \in \{1 \dots n_R\}$   (5.12)

$\sum_{j=1}^{n_R} c_{i,j} = 1 \quad \forall i \in \{1 \dots n_C\}$   (5.13)

$\sum_{i=1}^{n_C} c_{i,j} \leq n_{ch} \quad \forall j \in \{1 \dots n_R\}$   (5.14)

Constraint (5.12) ensures that if cell i is assigned to channel j, then c_{i,j} is one. If-then constraints can be transformed into linear constraints in a similar way to the absolute value constraints [118]. Constraint (5.13) ensures that each cell is assigned to exactly one channel. Constraint (5.14) controls the channel density by ensuring that the total number of cells assigned to each channel is less than the channel capacity (calculated by Equation (5.11)).

The initial placement generated by the embedding algorithm is a good starting point for the final mapping of clock splitters to routing channels. This initial solution (y^0_i) simply maps each cell to the nearest channel below its original location. Accordingly, we restrict the final mapping of cell i to either the same channel as the initial solution or the channel above it, using the following constraint.

$y^0_i \leq y_i \leq y^0_i + 1 \quad \forall i \in \{1 \dots n_C\}$   (5.15)

Eventually, using the proposed transformations, objective function, constraints, and variables, the problem formulation is summarized as follows.
\begin{align*}
\text{minimize}\quad & D_{max} - D_{min} + \lambda \cdot \sum_{i=1}^{n_S} D_i \\
\text{subject to}\quad & D_i \leq D_{max}, \quad D_{min} \leq D_i && \forall i \in \{1 \dots n_S\} \\
& D_i = \sum_{e_{k,j} \in path(C_0, C_i)} \left( \delta^x_{k,j} + \delta^y_{k,j} \right) \\
& \delta^y_{k,j} = \alpha \cdot \left( \eta^+_{k,j} + \eta^-_{k,j} \right) \cdot H_{r+ch} \\
& y_k - y_j = \eta^+_{k,j} - \eta^-_{k,j} \\
& 0 \leq \eta^+_{k,j} \leq b_{k,j} \cdot m_y \\
& 0 \leq \eta^-_{k,j} \leq (1 - b_{k,j}) \cdot m_y \\
& \sum_{j=1}^{n_R} c_{i,j} = 1 && \forall i \in \{1 \dots n_C\} \\
& \sum_{i=1}^{n_C} c_{i,j} \leq n_{ch} && \forall j \in \{1 \dots n_R\} \\
& y_i = j \iff c_{i,j} = 1 && \forall i \in \{1 \dots n_C\},\ \forall j \in \{1 \dots n_R\} \\
& y^0_i \leq y_i \leq y^0_i + 1 && \forall i \in \{1 \dots n_C\} \\
\text{variables}\quad & y_i \in \{1 \dots n_R\} && \forall i \in \{1 \dots n_C\} \\
& b_{k,j} \in \{0,1\} && \forall k,j \ \text{such that}\ \exists\, e_{k,j} \\
& c_{i,j} \in \{0,1\} && \forall i \in \{1 \dots n_C\},\ \forall j \in \{1 \dots n_R\} \\
& D_i \in \mathbb{R}^+ && \forall i \in \{1 \dots n_S\} \\
& \delta_{k,j} \in \mathbb{R}^+ && \forall k,j \ \text{such that}\ \exists\, e_{k,j}
\end{align*}   (5.16)

As observed, there are integer (y_i), binary (b_{k,j} and c_{i,j}), and real-valued (D_i and δ_{k,j}) variables in this formulation. Therefore, this problem is an instance of mixed integer linear programming (MILP). Solving this problem yields the optimum assignment of splitter cells to routing channels such that the clock skew is minimized.

Once this step is completed, a similar problem can be formulated to minimize the skew while eliminating all the horizontal overlaps among the cells allotted to the same routing channel. Using the horizontal ordering imposed by the embedding algorithm, we can set constraints on the location of each cell in the routing channel such that no two adjacent cells overlap. Accordingly, after the assignment of cells to routing channels is determined, the cells in each routing channel are sorted based on their x coordinate and the following constraints are used to eliminate the horizontal overlaps among cells.

$W_{spl} + P \leq x_j - x_i \quad \forall i,j \in \{1 \dots n_C\} \mid y_i = y_j,\ x^0_i \leq x^0_j$   (5.17)

Since the ordering of the cells is determined, we can make use of transitive relationships. For example, if there are three cells x, y, and z, the two constraints x ≤ y and y ≤ z imply that x ≤ z. Therefore, if there are a total of n cells in a channel, only n − 1 constraints are required to ensure the legality of the placement solution in that channel. Using the transformations and variables defined for the legalization in the vertical direction, the min-skew placement and legalization problem in the horizontal direction is formulated as follows.

\begin{align*}
\text{minimize}\quad & D_{max} - D_{min} + \lambda \cdot \sum_{i=1}^{n_S} D_i \\
\text{subject to}\quad & D_i \leq D_{max}, \quad D_{min} \leq D_i && \forall i \in \{1 \dots n_S\} \\
& D_i = \sum_{e_{k,j} \in path(C_0, C_i)} \left( \delta^x_{k,j} + \delta^y_{k,j} \right) \\
& \delta^x_{k,j} = \alpha \cdot \left( \eta^+_{k,j} + \eta^-_{k,j} \right) \\
& x_k - x_j = \eta^+_{k,j} - \eta^-_{k,j} \\
& 0 \leq \eta^+_{k,j} \leq b_{k,j} \cdot m_x \\
& 0 \leq \eta^-_{k,j} \leq (1 - b_{k,j}) \cdot m_x \\
& x_i + W_{spl} + P \leq x_j && \forall i,j:\ y_i = y_j,\ x^0_i \leq x^0_j \\
\text{variables}\quad & x_i \in [0, W_a - W_{spl}] && \forall i \in \{1 \dots n_C\} \\
& b_{k,j} \in \{0,1\} && \forall k,j \ \text{such that}\ \exists\, e_{k,j} \\
& D_i \in \mathbb{R}^+ && \forall i \in \{1 \dots n_S\} \\
& \delta_{k,j} \in \mathbb{R}^+ && \forall k,j \ \text{such that}\ \exists\, e_{k,j}
\end{align*}   (5.18)

Similar to formulation (5.16), the min-skew placement and legalization problem in the horizontal direction is also an instance of MILP. Note that, assuming the lower-left corner of the layout area to be at (0,0), the x_i coordinates are constrained to be within the boundaries of the layout. Solving the above problem yields the final placement of the clock splitters, such that the clock skew is minimized and a legal placement (with no overlaps) is produced. Note that the legalization in the vertical direction should be done before the legalization in the horizontal direction. The reason is that the assignment of cells to channels and their initial ordering in the horizontal direction determine the necessary constraints for removing overlaps during the horizontal legalization. Once the horizontal and vertical legalization problems are solved, a legal minimum-skew solution similar to Fig. 5.10 is produced.
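For illustration, the following sketch shows how a reduced version of formulation (5.16) could be assembled with the CPLEX Concert C++ API (the IBM CPLEX package is used for the experiments reported in Section 5.4). It only models the skew objective, the linearization of |y_k − y_j| via (5.8)-(5.9), and the one-channel-up restriction (5.15); the channel-capacity and indicator constraints, as well as the constant δ^x terms, are omitted for brevity. The toy tree, all container names, and the numeric values are hypothetical, and exact parameter names may differ across CPLEX versions.

```cpp
// Hedged sketch of a reduced channel-assignment MILP, assembled with CPLEX Concert.
#include <ilcplex/ilocplex.h>
#include <algorithm>
#include <iostream>
#include <vector>
ILOSTLBEGIN

int main() {
    // Hypothetical toy clock tree: parent[i] is the parent of clock cell i (-1 for the source).
    std::vector<int> parent = {-1, 0, 0, 1, 1, 2, 2};
    std::vector<int> sinks  = {3, 4, 5, 6};            // leaf (sink) indices
    std::vector<int> y0     = {4, 3, 5, 2, 4, 5, 6};   // initial channel from the embedding step
    const int nR = 8;                                  // number of routing channels
    const double alpha = 1e-2, Hrc = 160.0, lambda = 1e-3, mY = nR;
    const int n = static_cast<int>(parent.size());

    IloEnv env;
    IloModel model(env);

    // y_i: channel index, restricted to {y0_i, y0_i + 1}, cf. constraint (5.15).
    IloIntVarArray y(env);
    for (int i = 0; i < n; ++i)
        y.add(IloIntVar(env, y0[i], std::min(y0[i] + 1, nR)));

    // Vertical edge delay delta^y = alpha * |y_k - y_j| * H_{r+ch}, linearized per (5.8)-(5.9).
    IloNumVarArray dY(env, n, 0.0, IloInfinity);
    for (int i = 1; i < n; ++i) {
        IloNumVar etaP(env, 0.0, IloInfinity), etaM(env, 0.0, IloInfinity);
        IloBoolVar b(env);
        model.add(y[parent[i]] - y[i] == etaP - etaM);
        model.add(etaP <= mY * b);
        model.add(etaM <= mY * (1.0 - b));
        model.add(dY[i] == alpha * Hrc * (etaP + etaM));
    }

    // Sink phase delays, D_max, D_min, and the objective of (5.16) (delta^x treated as 0 here).
    IloNumVar Dmax(env, 0.0, IloInfinity), Dmin(env, 0.0, IloInfinity);
    IloExpr sumD(env);
    for (int s : sinks) {
        IloExpr Di(env);
        for (int v = s; parent[v] != -1; v = parent[v]) Di += dY[v];
        model.add(Di <= Dmax);
        model.add(Dmin <= Di);
        sumD += Di;
    }
    model.add(IloMinimize(env, Dmax - Dmin + lambda * sumD));

    IloCplex cplex(model);
    cplex.setParam(IloCplex::Param::TimeLimit, 1800.0);   // 30-minute cap, as in Section 5.4
    if (cplex.solve())
        for (int i = 0; i < n; ++i)
            std::cout << "splitter " << i << " -> channel " << cplex.getValue(y[i]) << "\n";
    env.end();
    return 0;
}
```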
In the next section, our simulation framework along with the results of applying the proposed CTS algorithm to multiple SFQ benchmarks are presented. 5.4 Simulation Results We used the qPlace package for placing the logic cells in the layout area [107], [17]. We added the support for our proposed delay model (cf. Section 5.2.2) to the implementation of the DME algorithm for embedding the clock trees [72]. We implemented the clock topology generation and the rest of the proposed algorithms in C++ and used the IBM CPLEX v12.8 package for solving the MILP problems [91]. The qSTA tool was used for static timing analysis. We used the clock tree synthesis approach in [17] as the baseline for comparison. This approach essentially uses the DME algorithm for clock tree synthesis. There are two major differences between the proposed method and the baseline approach. • The proposed method maps each splitter to the channel above or below the initial location, such that clock skew is minimized. However, the baseline approach maps the clock splitters to the closest routing channel (either above or below) greedily, minimizing the displacement of each individual splitter, while ignoring the clock skew or the total wirelength of the clock tree. • The proposed approach moves all the cells in the horizontal direction aiming at minimizing the skew and removing the overlaps. Conversely, the baseline approach only moves the cells that have overlap with each other, by shifting the overlapping cell(s), ignoring the effect of displacement on the clock skew. We assume the baseline approach uses a fully-balanced topology to minimize the skew, similar to our proposed approach. Note that this is an advantage for the baseline solution as using an imbalanced clock topology while ignoring the splitter delays increases the clock skew. On the other hand, using a fully-balanced clock tree topology, which makes sure all the sink 114 nodes in the clock tree have the same level, effectively removes the impact of clock splitter delays on the skew. Note that, in this work, it is assumed that all the splitter cells have the same delay and the process variations do not change the delay of splitters. The characteristics of the benchmark circuits, including the number of I/O pads, sink nodes, cells, nets, before and after the clock tree synthesis are listed in Table 5.2. Since the same topology generation algorithm is used for both the proposed and baseline approaches, the number of cells after clock synthesis is equal for both solutions. The clock skew, total negative hold slack, worst negative hold slack, and the maximum achievable clock frequency for each design, after the placement and clock tree synthesis are reported in Table 5.3. We have also listed the clock skew values after the clock routing using qGDR routing tool and 4 metal layers for routing [119]. The clock period for each circuit Table 5.2: Benchmark characteristics. KSA stands for Kogg-Stone adder [117], ArrMult stands for array multiplier, and ID stands for integer divider. Rest of the benchmarks are chosen from ISCAS85 benchmark suite [28]. Post-CTS columns report the number of cells and nets in the design after adding clock splitters and clock nets (using the proposed approach). 
Pre-CTS Post-CTS Benchmark #I/O pads #Clk Sinks #Cells #Nets #Cells #Nets KSA4 15 59 87 124 150 246 KSA8 27 318 230 318 485 732 KSA16 51 414 592 803 1103 1728 KSA32 99 1049 1486 1988 3533 5084 ArrMult8 33 1404 1875 2296 3922 5747 ArrMult16 65 4798 6206 7646 14397 20635 ID4 17 420 570 694 1081 1625 ID8 33 2703 3192 3697 7287 10495 c432 44 976 1186 1432 2209 3431 c499 74 566 875 1225 1898 2814 c880 87 1133 1469 1865 3516 5045 c1355 74 618 922 1267 1945 2908 c1908 59 1100 1516 1965 3563 5112 c2670 206 1713 2195 2832 4242 6592 c3540 73 2679 3936 5097 8031 11871 c5315 285 4483 5931 7557 14122 20231 c6288 65 5546 7236 8958 15427 22695 115 Table 5.3: Simulation results (clock skew, total negative hold slack (TNS), worst negative hold slack (WNS), and clock frequency) for several benchmarks using the proposed method and the baseline method [17]. Impr. stands for improvements over the baseline. Freq. denotes the max clock frequency. Post-pl. and post-rt. stand for post-placement and post-routing, respectively. Values for skew, WNS, and TNS are in ps. Proposed Baseline Skew Hold Hold Benchmark Post-Pl. Impr. (%) Post-RT. TNS Impr. (%) WNS Impr. (%) Freq. Skew TNS WNS Freq. KSA4 2.7 66.3 3.6 0 N/A 0 N/A 37.6 8.0 0.0 0.0 34.6 KSA8 5.3 47.0 8.4 0 100 0 100 26.7 10.0 -1.4 -1.4 21.9 KSA16 4.5 59.5 6.7 -1.3 38.1 -1.3 -8.3 25.2 11.1 -2.1 -1.2 26.5 KSA32 4.6 52.1 9.1 -0.6 76.0 -0.6 45.5 14.2 9.6 -2.5 -1.1 13.4 ArrMult8 4.0 70.4 8.3 0 100 0 100 17.2 13.5 -10.0 -2.8 16.0 ArrMult16 6.1 63.5 14.2 -3.2 89.9 -1.8 62.5 9.4 16.7 -31.6 -4.8 9.1 ID4 4.4 60.0 7.2 0 100 0 100 19.5 11.0 -3.3 -2.2 18.7 ID8 4.4 73.7 9.7 -0.5 98.9 -0.5 88.6 6.5 16.7 -45.9 -4.4 6.4 c432 3.6 75.5 5.9 -0.4 94.2 -0.4 83.3 4.8 14.7 -6.9 -2.4 4.8 c499 4.9 84.5 9.5 -11 87.5 -2.7 77.1 6.5 31.6 -88.3 -11.8 6.5 c880 3.9 78.7 7.0 -6 73.8 -3.8 0.0 10 18.3 -22.9 -3.8 9.8 c1355 5.6 78.7 8.6 -17.3 77.6 -2.1 85.8 6.9 26.3 -77.1 -14.8 7.0 c1908 4.8 64.4 9.9 -3.6 76.8 -1.6 15.8 4.3 13.5 -15.5 -1.9 4.3 c2670 4.1 72.7 6.6 -13.7 79.6 -2.9 59.7 3.6 15.0 -67.3 -7.2 3.7 c3540 4.3 69.1 8.0 -18.1 20.3 -5.2 -188.9 4.4 13.9 -22.7 -1.8 4.5 c5315 5.0 70.9 9.0 -13 84.1 -2.9 58.0 2.6 17.2 -81.9 -6.9 2.6 c6288 5.7 67.8 15.3 -14.8 71.2 -2.6 43.5 4.5 17.7 -51.3 -4.6 4.5 Average 4.6 70.5 8.6 -6.1 80.4 -1.7 60.5 - 15.6 -31.2 -4.3 - is calculated as the smallest value such that there are no setup time violations. For solving each of the MILP problems, a time limit of 30 minutes is established. As shown in Table 5.3, the average clock skew value for 17 benchmarks is 4.6ps. The proposed method improves the average clock skew by 70%, compared with the baseline approach. Additionally, the total and worst negative hold slack values are improved by 80% and 60%, respectively. The average values for the total and worst negative hold slack for all the benchmarks are−6.1ps and−1.7ps, respectively. As observed, even the worst case hold time violation among all the benchmarks (cf. Table 5.3, benchmark c3540) can be solved by the post-CTS timing closure flow, without the need for extensive refinement of the circuit and by insertion of a small number of hold buffer/JTL cells. Finally, it should be mentioned that if the time limits for solving MILP problems are increased or movement of clock splitters is not restricted to the top or bottom channels only, lower skew values may be achieved. Post-routing maximum clock skew results are listed in Table 5.3. The average clock skew increases to 8.6ps after the clock routing. 
The main reason behind this increase in the maximum clock skew is that the routing tool’s main objective is finishing the routing of all the nets while reducing the total via count used for routing [119]. Therefore, it ignores the 116 propagation delay along the nets, and consequently, the maximum clock skew values. Aside from the routing tool itself, we point out that the competition for limited available routing resources in large benchmarks (e.g., 16-bit Array Multiplier with more than 14,000 logic cells and 13,000 clock sub-nets) results in an increase in the length of routed nets, compared with the ideal Manhattan distance, and hence the clock skew may be increased. Note that the maximum post-routing clock skew among all benchmarks is 15.3ps, which is rather small, and the resulting negative slack values can be easily eliminated by a timing closure flow, which selectively adds a small number of gates on data propagation paths. A timing-driven routing tool will address the increase in maximum clock skew. Aside from post-routing results reported in Table 5.3, throughout the chapter, all the delay values are reported after the placement and clock tree synthesis steps (and before routing), assuming that the net lengths are equal to the Manhattan distance between the corresponding logic gates. Although the proposed clock embedding, placement, and legalization algorithms (cf. steps 2-4 Fig. 5.2) minimize the skew given a fully-balanced clock topology, these algorithms can also be applied to imbalanced topologies with fewer number of splitters, to produce minimum-skew clock trees with a fewer number of JJs, compared with a fully-balanced tree topology. In the next section, we present the necessary modifications to the proposed methodology in Section 5.3 to minimize the clock skew given imbalanced clock tree topologies. 5.5 Clock Synthesis for Imbalanced Tree Topologies In RSFQ circuits, the static power dissipation in the resistive bias network is about 100× larger than the dynamic power dissipation of the Josephson junctions [13]. Additionally, the clock network may require large amounts of current, exceeding 10A for large circuits. Consequently, the current delivery is a significant problem in large circuits with more than 100K JJs [120]. 117 One of the possible ways to reduce the required amount of current delivered to the circuit and the static power consumption is to decrease the number of splitters (i.e., JJ count) in the clock network. Although creating a fully-balanced clock tree in which all the sink nodes have the same level reduces the skew significantly, it may introduce a large overhead in terms of the total area, biasing current, and static power consumption of the clock network. To address this issue, imbalanced tree structures with a fewer number of splitters compared with the fully-balanced solution may be utilized. As a consequence of imbalance in the clock tree topology, to minimize the skew, the clock tree embedding, placement, and legalization algorithms should be modified to account for the delay of clock splitters. In the following subsections, first we propose using an algorithm for imbalanced clock tree topology generation [114]. Next, we present splitter-delay-aware zero-skew clock tree embedding and splitter-delay-aware min-skew clock tree placement and legalization algorithms in detail. The goal is to minimize both the clock skew and the number of JTLs in the clock network. 
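To make the overhead of full balancing concrete, the short sketch below (hypothetical data structures) counts the padding JTL cells that the fully-balanced flow of Section 5.3.2 would insert, namely one JTL parent for every sink sitting one level above the deepest sinks (recall that the topology generator guarantees a level difference of at most one). An imbalanced topology, as generated in the next subsection, avoids this padding altogether.

```cpp
// Sketch (illustrative names) of counting the JTL padding required for full balancing.
#include <cstdio>
#include <vector>

struct TreeNode {
    int level;                   // number of nodes on the path from the root to this node
    bool isSink;                 // true for clock sinks (leaves)
    std::vector<int> children;   // indices of child nodes (empty for sinks)
};

// One JTL parent is inserted for every sink whose level is l_max - 1 (Section 5.3.2),
// so the padding cost of full balancing is simply the number of such sinks.
int countPaddingJtls(const std::vector<TreeNode>& tree) {
    int maxLevel = 0;
    for (const TreeNode& n : tree)
        if (n.isSink && n.level > maxLevel) maxLevel = n.level;
    int jtls = 0;
    for (const TreeNode& n : tree)
        if (n.isSink && n.level == maxLevel - 1) ++jtls;
    return jtls;
}

int main() {
    // Tiny hand-built topology: root (level 1) -> {a, s3}, a -> {s1, s2}.
    std::vector<TreeNode> tree(5);
    tree[0] = {1, false, {1, 2}};   // root
    tree[1] = {2, false, {3, 4}};   // internal node a
    tree[2] = {2, true,  {}};       // sink s3, one level shallower than s1/s2
    tree[3] = {3, true,  {}};       // sink s1
    tree[4] = {3, true,  {}};       // sink s2
    std::printf("padding JTLs needed for full balancing: %d\n", countPaddingJtls(tree));
    return 0;
}
```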
5.5.1 Imbalanced Topology Generation For topology generation, we use the greedy-DME algorithm [114]. In this approach, topology generation is performed in a bottom-up fashion (in contrast to our proposed approach which was done in a top-down manner, cf. Section 5.3.2). Assume that initially there are n sink nodes. The greedy-DME algorithm starts with a set of nodes, representing the sinks of the clock tree. The algorithm iteratively finds the nearest neighbors u andv in the set of possible nodes, where the distance between nodes u and v is smaller than the distance between any other pairs of nodes. A parent node is then created by merging nodes u and v and these two child nodes are removed from the set of nodes. The next pair of nearest neighbors are detected and the same procedure is repeated until the total number of remaining nodes is 1 (this process takes n− 1 steps). A time complexity of O(nlogn) can be obtained for this algorithm, where n denotes the number of sinks [114]. 118 As observed, the greedy-DME algorithm merges the nearest neighbors, which may be sink nodes or roots of partial-trees with different heights. Therefore, as a result of this bottom-up approach which tries to minimize the total wirelength heuristically, the generated topology has a fewer number of internal nodes (i.e., clock splitters) compared to a fully-balanced topology which always merges the sub-tree roots with the same height. Consequently, the total number of JTLs in the clock tree generated by the greedy-DME algorithm is smaller than the tree produced by the proposed approach in Section 5.3.2 (i.e., MMM [116]). 5.5.2 Splitter-Aware Clock Tree Embedding In the proposed zero-skew clock tree embedding algorithm in Section 5.3.3, we did not need to account for the splitter delays during the embedding of the tree. The reason was that the generated clock topology was already fully-balanced, so all the source-sink paths had the same number of splitters. Consequently, all the phase delays included an identical delay value associated with the sum of the delay of the clock splitters in each source-sink path, which did not affect the clock skew. In the DME algorithm, the location of a merging segment is a function of the location and downstream delay of its child merging segments [72]. Accordingly, assuming an imbalanced tree topology, splitter delays play an important role in determining the location of merging segments and embedding points of a clock tree. However, the DME algorithm does not consider the delay of splitter cells in the clock tree. To address this issue, we modify the formulation of the DME algorithm as follows. Similar to the original algorithm, clock tree embedding is done in two phases. In the bottom-up pass, after merging two child nodes, we add the delay of the splitter cell to the total downstream delay of the parent node. Consequently, when merging two sub-trees, the algorithm accounts for both wire delay of each sub-tree and the delay associated with the splitters inserted in each sub-tree. Accordingly, the location of the merging segments is different than the one produced by the original DME 119 s 1 s 2 s 3 a r s 1 s 2 s 3 a r Figure 5.11: An example of applying DME algorithm to a circuit with 3 sink nodes. An imbalanced tree topology is shown. algorithm. The top-down phase of the DME algorithm remains the same. Finally, a zero-skew embedding is generated that accounts for both splitter and interconnect delays. 
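To make the modified bottom-up pass concrete, the following is a minimal, one-dimensional sketch of a splitter-delay-aware zero-skew merge under a linear (path-length) wire delay model; the names (Subtree, merge_splitter_aware, d_spl) are illustrative and do not correspond to the actual qCTS implementation. This is the behavior exercised by the three-sink example discussed next.

```python
from dataclasses import dataclass

@dataclass
class Subtree:
    pos: float      # 1-D position of the subtree tapping point
    delay: float    # downstream delay from this point to every sink below it

def merge_splitter_aware(a: Subtree, b: Subtree, d_spl: float) -> Subtree:
    """Zero-skew merge of two subtrees whose parent node is a splitter of delay d_spl."""
    dist = abs(a.pos - b.pos)
    diff = b.delay - a.delay
    if abs(diff) <= dist:
        ea = (diff + dist) / 2.0   # tapping point lies between a and b
    elif diff > dist:
        ea = diff                  # b is much slower: tap at b, detour wire toward a
    else:
        ea = 0.0                   # a is much slower: tap at a, detour wire toward b
    direction = 1.0 if b.pos >= a.pos else -1.0
    pos = a.pos + direction * min(ea, dist)
    # Add the splitter delay to the downstream delay of the new parent node,
    # as described for the modified bottom-up pass above.
    return Subtree(pos=pos, delay=a.delay + ea + d_spl)
```

In the example that follows (splitter delay of 2 units), the downstream delay of ms_a becomes 3 + 2 = 5 after the merge, which is what pulls the parent merging segment ms_r toward node a.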
Figures 5.11 – 5.13 depict an imbalanced tree topology and the placement of internal nodes, using the original and the modified DME algorithms for an example with 3 sinks. The delays associated with the different edges are also shown. Assume the delay of splitter cells to be 2 units (of delay) and a path-length delay model. As depicted in Fig. 5.12, which is generated using the original DME algorithm that ignores the splitter delays, the phase delays of sinks s_1–s_3 are 10, 10, and 8, respectively. Hence, the clock skew is 2. Conversely, Fig. 5.13 depicts the location of merging segment ms_r considering the splitter delay values. Once ms_a is formed, the downstream delay of ms_a, which is originally set to 3, is modified: the delay of the splitter cell corresponding to node a is added to it. Hence, the downstream delay of node a becomes 5. Accordingly, the merging segment ms_r is formed further away from node s_3 and closer to node a, to create a zero-skew merging segment. Consequently, the phase delays of sinks s_1–s_3 are all equal to 9 and the clock skew becomes 0. The splitter-delay-aware clock tree embedding algorithm can be applied to the imbalanced topology to calculate the location of clock splitters. In the next subsection, we describe the modifications made to the placement and legalization algorithm to properly handle imbalanced tree topologies.
Figure 5.12: An example of applying the DME algorithm to a circuit with 3 sink nodes. The location of sinks, merging segments, and the embedding points of the internal nodes of the tree using the original DME algorithm is shown.
Figure 5.13: An example of applying the DME algorithm to a circuit with 3 sink nodes. The location of sinks, merging segments, and the embedding points of the internal nodes of the tree using the splitter-delay-aware DME algorithm are shown.
5.5.3 Splitter-Aware Clock Tree Placement and Legalization
The clock tree placement and legalization algorithms should also account for splitter delays, while minimizing the clock skew and removing the overlaps. To do so, we modify the formulas calculating the insertion delay at each sink node (i.e., formulations (5.16) and (5.18)) as follows:
D_i = L_i \cdot d_{spl} + \sum_{e_{k,j} \in path(C_0, C_i)} \left( \delta^{x}_{k,j} + \delta^{y}_{k,j} \right)   (5.19)
where L_i denotes the level of sink i and d_{spl} denotes the delay of splitter cells. As observed, the first term in the above formula is a constant value, a function of the tree topology, and not one of the variables of the MILP formulation for clock tree legalization.
Once the above modifications are made to the flow of Fig. 5.2, the proposed clock tree synthesis algorithm can be used for both balanced and imbalanced tree topologies, possibly generating min-skew solutions with fewer clock splitters compared with the approach proposed in Section 5.3. The splitter-delay-aware CTS flow tries to increase the length of the wires along the source-sink paths that have fewer splitters, to minimize the difference between the maximum and minimum insertion delays, and hence to minimize the maximum clock skew. In the next subsection, the results of applying the CTS algorithm for imbalanced topologies to multiple benchmark circuits are presented and compared to the baseline solution.
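The insertion-delay bookkeeping of Eq. (5.19) can be sketched as follows. This is an illustrative evaluation routine with assumed names and an assumed delay-per-unit-wirelength constant, not the MILP-based legalizer itself.

```python
class ClockNode:
    def __init__(self, x, y, parent=None, is_splitter=False):
        self.x, self.y = x, y
        self.parent = parent
        self.is_splitter = is_splitter

def insertion_delay(sink, d_spl, delay_per_unit):
    """Eq. (5.19): level-dependent splitter term plus Manhattan wire delays to the root."""
    total, node = 0.0, sink
    while node.parent is not None:
        p = node.parent
        total += delay_per_unit * (abs(node.x - p.x) + abs(node.y - p.y))
        if p.is_splitter:
            total += d_spl          # accumulates L_i * d_spl one level at a time
        node = p
    return total

def max_clock_skew(sinks, d_spl, delay_per_unit):
    """Maximum difference between sink insertion delays (the quantity being minimized)."""
    delays = [insertion_delay(s, d_spl, delay_per_unit) for s in sinks]
    return max(delays) - min(delays)
```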
5.5.4 Simulation Results for Imbalanced Clock Tree Topologies We used the greedy-DME algorithm for imbalance clock topology generation [72] and added the aforementioned modifications to the qCTS flow. Table 5.4 lists the clock skew, total negative hold slack (TNS), worst negative hold slack (WNS) (in ps), the number of clock 122 Table 5.4: Results of the clock skew, total negative hold slack (TNS), worst negative hold slack (WNS) (in ps), the number of clock splitters, maximum clock frequency, imbalance degree (maximum level difference among sinks), and comparisons with the baseline solution [17]. Impr., Freq., and Imb. Deg. stand for improvement, maximum clock frequency, and the imbalance degree of the clock tree, respectively. Benchmark Skew Impr. (%) Hold TNS Impr. (%) Hold WNS Impr. (%) #Clk Spl. Impr. (%) Freq. Imb. Deg. KSA4 4.9 38.8 -1.4 N/A -1.4 N/A 58 7.9 38.8 2 KSA8 5.2 48.0 -3.1 -121.4 -1.7 -21.4 158 38.0 22.7 4 KSA16 5.5 50.5 0 100 0 100 413 19.2 26.3 3 KSA32 6.0 37.5 -2.6 -4.0 -1.3 -18.2 1048 48.8 14.1 8 ArrMult8 6.0 55.6 -7.6 24.0 -1.8 35.7 1403 31.5 15.3 5 ArrMult16 8.6 48.5 -42.5 -34.5 -4.3 10.4 4797 41.4 9.5 8 ID4 5.0 54.5 0 100 0 100 419 18.0 19.2 3 ID8 6.0 64.1 -14.8 67.8 -2.6 40.9 2702 34.0 6.5 6 c432 5.8 60.5 -2.7 60.9 -0.7 70.8 975 4.7 4.9 6 c499 10.1 68.0 -31.2 64.7 -5.5 53.4 565 44.8 6.4 5 c880 6.8 62.8 -18.2 20.5 -2.9 23.7 1132 44.7 9.9 6 c1355 5.8 77.9 -17.4 77.4 4.3 70.9 617 39.7 6.6 4 c1908 6.3 53.3 -12.4 20.0 -2.4 -26.3 1099 46.3 4.3 8 c2670 9.2 38.7 -34.9 48.1 -3.5 51.4 1712 16.4 3.6 5 c3540 7.5 46.0 -35.9 -58.1 -5.3 -194.4 2678 34.6 4.4 6 c5315 8.9 48.3 -84.4 -3.1 -4.6 33.3 4482 45.3 2.6 6 c6288 8.1 54.2 -48.5 5.5 -2.8 39.1 5545 32.3 4.5 6 Average (%) 6.8 56.4 -21.0 32.7 -2.1 51.2 1753.1 37.1 splitters, the maximum clock frequency, and imbalance degree (the maximum level difference among sinks) for several benchmarks obtained by applying the proposed algorithm and compares them with the baseline solution [17]. As shown, the average number of clock splitters and average clock skew value are reduced by 37% and 56%, respectively, compared with the baseline solution [17] described in Section 5.4. The average clock skew value over all 17 benchmarks is 6.8ps. Additionally, the average total negative hold slack and average worst negative hold slack values are improved by 32% and 51%, respectively. Table 5.4 also lists the imbalance degree of the clock trees for different benchmarks. As shown, some of the tree topologies have an imbalance degree as large as 8, i.e., there exists a source-sink path that has 8 splitters fewer than a path with the maximum number of splitters. Such large differences can potentially cause a large skew (i.e., 8×d spl = 44ps), if a CTS algorithm ignores the splitter delays. However, the proposed embedding and legalization algorithms modify the location of splitters and the delay of interconnects along all the source-sink paths in a way that the effect of splitter delays on the maximum clock skew is balanced out by the wire delays. Consequently, the maximum clock skew among all the benchmarks is limited to 10.1ps, approximately equal to the delay of only 2 splitters. 123 As shown in Tables 5.3 and 5.4, using the same clock tree embedding and legalization algorithms, imbalanced tree topologies create a trade-off between the number of clock splitters (also the total static power consumption and the total biasing current delivered to the network) and the total negative hold slack values, compared with balanced topologies (cf. Section 5.3). 
This suggests that the timing closure flow may need more hold buffers to fix all the hold time violations for imbalanced topologies, compared with balanced trees. 5.6 Application to Asynchronous Clock Networks SFQ is a promising option for high performance and low power supercomputing platforms. Nevertheless, timing uncertainty represents an obstacle to the design of high-frequency clock distribution networks. The hierarchical chains of homogeneous clover-leaves clocking, (HC) 2 LC, was proposed as an innovative solution to this challenge [121]. This part presents a novel algorithm for the physical implementation of (HC) 2 LC networks. The proposed method models the (HC) 2 LC network as a directed graph with multiple cycles representing the synchronizing feedback signals. This graph is then transformed to a directed acyclic graph (DAG) by eliminating feedback edges. The physical location of the nodes in the generated DAG (such as splitters and C-junctions) in the Manhattan plane is calculated using a zero-skew clock embedding algorithm. Additionally, a novel mixed integer linear programming (MILP) based approach minimizes the maximum clock skew among the sinks of the clock network and the sum of the delay of the edges in feedback loops, simultaneously. Experimental results show that using the proposed approach, the average clock skew for five benchmark circuits is 4.6ps. 5.6.1 Overview In general, physical implementation of the clock distribution network (CDN) is a crucial task in the physical design of logic circuits. In particular, the quality of the CDN, in terms 124 of clock skew and insertion delay, determines the circuit’s performance (maximum clock frequency) in addition to the circuit’s robustness against timing uncertainties. Moreover, clock networks take up a large area, a significant portion of the routing resources, and a large share of the total power consumption of the circuits. In particular, due to the significant timing uncertainty in SFQ, the imperfection of CDN design can lead to timing violations and chip failures. In contrast to CMOS, where usually a small portion of the gates are sequential elements, in SFQ circuits, almost all of the cells receive a clock signal. Additionally, splitter cells are needed to distribute the clock signal, which increases the overall chip area. Asymmetries in the splitter delays in the CDN affect the timing metrics such as clock skew. Therefore, designing high-speed and low-power CDNs which are robust to timing uncertainties is essential in achieving SFQ technology’s full potential of high-speed and low-power. Recently, an innovative strategy called hierarchical chains of homogeneous clover-leaves clocking, or (HC) 2 LC for brief, was introduced, promising robust and self-adaptive clocking for generic SFQ gate-level pipelined circuits [121]. This approach leverages the spatial correlation of different sources of uncertainties and takes advantage of the counter-flow clocking methodology to design robust and high-performance CDNs. In these CDNs, the maximum clock frequency is not only a function of the data path delay, but also the insertion delays of the clock network, the delay associated with longest chain, and the maximum clock skew. However, no algorithms have been proposed for the physical implementation of such clock networks. 
Moreover, the proposed (HC) 2 LC topology generation algorithm does not consider the wire delays which play an important role in calculating the timing metrics such as worst negative slack (WNS) and total negative slack (TNS) for both setup and hold constraints. In this part, we present a novel physical implementation algorithm for the (HC) 2 LC networks. The proposed method starts by transforming the cyclic-graph model of the (HC) 2 LC network to a directed acyclic graph (DAG). It then determines the physical location of the nodes in the generated DAG in the Manhattan plane utilizing a zero-skew clock 125 embedding algorithm. Finally, a novel mixed integer linear programming (MILP) based clock network synthesis algorithm is used to legalize and place clock elements while minimizing the maximum clock skew among the sinks of the clock network and the sum of the delay of the edges in the feedback loops of the original cyclic graph, simultaneously. Minimizing both the clock skew and delay of feedback loops affects the performance of the circuits [121]. To quantify the benefits of the proposed algorithm, we applied the proposed methodology to various benchmark circuits from ISCAS85 benchmark suite [28]. Experimental results show that the average clock skew for five benchmark circuits is 4.6ps. Additionally, the proposed method reduces the sum of the delay of the feedback loops by 28.6% compared to an algorithm that does not account for the delays associated with chain lengths. As a result of such small clock skew values, the cost of fixing timing violations (i.e., setup/hold time violations) will be negligible. The contributions of this section can be summarized as follows. • The proposed CDN placement algorithm accounts for the delay of the gates in the clock network, as well as the interconnect delays, and the location of the placed logic cells, to ensure the legality of the final solution. • The proposed mixed-integer-linear-programming algorithm accounts for the delay associ- ated with feedback loops and generates a min-skew clock placement solution such that the maximum achievable frequency is not reduced due to the feedback loop delays. 5.6.2 Problem Definition The placement algorithm calculates the physical location of the logic gates in the circuit. Sequential elements, called the sinks of a clock network, are denoted by S ={s 1 ,s 2 ,...,s n } assuming a total number of n sequential gates. The clock network synthesis (CNS) problem is defined as the design of a hierarchy of clock gates (which connects the source of a CDN (s 0 ) to all the clock sinks (s i )), placement of clock gates, and routing of clock nets. 126 The CNS is typically done in two steps: (i) topology generation, and (ii) topology embedding. A clock network topology (CNT) specifies the hierarchy of the clock gates, branching points, and connections between clock gates and sinks within a CDN. This step specifies which clock gates and nets connect the clock signal to a specific logic gate in the circuit. As a result of this step, m clock gates are added to the CDN, transforming the set of clock gates to V ={s 0 ,s 1 ,...,s n ,s n+1 ,...,s m+n }. A clock embedding algorithm calculates the exact location of clock gates and branching points of the clock topology (i.e., pl(v)∀v∈{s n+1 ,...,s m+n }) in the Manhattan plane. The main objective of CNS algorithms is to minimize the maximum clock skew or the total negative slack. 
Additionally, minimizing the total wirelength and the total area of the CDN are considered as secondary objectives. After CNS is done, the legality of the placement should be preserved; the CNS algorithm should make sure there are no overlaps between logic gates and clock gates. The (HC) 2 LC network is an asynchronous clock topology that connects a clock signal to clovers of logic gates using clock gates (i.e., splitters and C-junctions) and feedback loops. Tadros and Beerel [121] first introduced the (HC) 2 LC networks and presented an algorithm for CNT design based on the placement of clock sinks and the criticality of the data paths. This approach tries to minimize the timing violations using feedback loops, C-junctions, and a counter-flow clocking structure for local clock propagation within leaves. In (HC) 2 LC networks, the maximum achievable clock frequency is determined by the delay of the clock paths in the clovers, the delay of the feedback loops in the clock network, and the delay of the data paths. The embedding of the clock gates in the Manhattan plane determines the clock insertion delays and clock skews, the maximum frequency of the circuit, and the total number of logic gates needed for fixing the timing of the circuit, during the timing closure flow [122]. In this work, we present a CNS algorithm, called qHC2LC, which embeds the (HC) 2 LC network in the Manhattan plane such that the clock insertion delays, the maximum skew between clovers, and the delay of the feedback loops are minimized. 127 5.6.3 Proposed Method In this section, we describe the overall flow of the qHC2LC algorithm and explain each step in detail. The overall flow of the proposed CNS algorithm is outlined in Fig. 5.14. The input to the qHC2LC algorithm is a placed and legalized netlist. The output contains the locations of all the clock gates and connections among clock gates and logic gates (i.e., an embedded clock network). In the proposed method, first, the (HC) 2 LC network is generated; the number of clovers, leaves, and the assignment of the clock sinks to the leaves are determined. Afterwards, logic cells mapped to the same leaf node are clustered, super-cells corresponding to each leaf node are formed, and super-cells are placed and legalized. In the second step, the feedback loops in the (HC) 2 LC topology are removed, the cyclic graph is transformed to a DAG, and the deferred merge embedding (DME) [123], a zero-skew clock tree synthesis (CTS) algorithm, is used to place the clock gates. Subsequently, the feedback nets are added to the netlist. In the final step, a novel timing-aware MILP based algorithm is utilized to minimize the maximum clock skew among leaves and the delay of feedback loops while removing the overlaps between clock and logic gates. In the following subsections, each step of the algorithm is further elaborated. 5.6.4 (HC) 2 LC Topology Generation In the first step of the qHC2LC algorithm, a (HC) 2 LC network topology is generated [122]. This CNT can be modeled as a cyclic directed graph G(V,E). In this graph, nodes represent the clock gates and edges represent the clock nets (all the nets are two terminal nets). Afterwards, the logic cells mapped to the same leaf nodes are grouped together (these cell groups are called super-cells) and super-cells are mapped to the placement rows and legalized to remove the overlaps. At this point, the netlist contains super-cells (comprising of multiple sequential gates), splitters and C-junctions (non-clocked gates), and I/O pads. 
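Before the embedding step, it helps to fix a concrete data structure for the (HC)2LC topology: a directed graph whose edges are tagged as tree edges (the set Ê used for zero-skew embedding) or feedback edges (E − Ê, which are removed before embedding and later restored). This is a minimal sketch with illustrative names, not the actual qHC2LC data model.

```python
from collections import defaultdict

class ClockNetworkGraph:
    """Directed-graph model of an (HC)2LC topology with tagged feedback edges."""
    def __init__(self):
        self.tree_children = defaultdict(list)   # forward (tree) edges, the set E-hat
        self.feedback_edges = []                 # synchronizing feedback edges, E - E-hat

    def add_edge(self, parent, child, is_feedback=False):
        if is_feedback:
            self.feedback_edges.append((parent, child))
        else:
            self.tree_children[parent].append(child)

    def dag_edges(self):
        """Edges of the acyclic portion that is handed to the DME-based embedder."""
        return [(u, v) for u, kids in self.tree_children.items() for v in kids]
```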
128 Placement (HC) 2 LC Topology Gen. Clock Network Embedding Timing-Aware Clock Network Placement Timing Analysis Logic Synthesis qHC2LC Placement (HC) 2 LC Topology Gen. Clock Network Embedding Timing-Aware Clock Network Placement Timing Analysis Logic Synthesis qHC2LC CDN Hierarchy Const. Super-Cell Placement CDN Hierarchy Const. Super-Cell Placement Clock Tree Construction Tree Embedding Clock Tree Construction Tree Embedding Min-Skew Legalization Min-Skew CDN Placement Min-Skew Legalization Min-Skew CDN Placement Figure 5.14: The overall flow of the proposed clock network synthesis algorithm, qHC2LC. 5.6.5 Clock Tree Embedding Conventional CNS algorithms typically handle trees rather than general cyclic graphs. Ac- cordingly, we remove the feedback edges in this graph and transform the graph to a DAG, denoted by ˆ G(V, ˆ E) where ˆ E⊂E. In the generated graph, each node has only one parent and at most two children. Therefore, the generated DAG is also a binary tree in which the root of the tree denotes the clock source, the leaves represent the clock sinks, and each node has at most two children. Figures 5.15 and 5.16 depict the original graph along with the generated clock tree for a small benchmark from ISCAS85 benchmark suite [28]. The generated clock tree is balanced, i.e., the maximum level difference among leaves is at most 1 (the level of each node i in a tree is defined as the number of nodes in the longest path from the root of the tree to node i). Next, the DME algorithm embeds the generated clock tree using a two step zero-skew embedding algorithm. Accordingly, the location of 129 Figure 5.15: The graph model of HC 2 LC topology for circuit c1355, the cyclic graph [28]. Figure 5.16: The graph model of HC 2 LC topology for circuit c1355, the reduced tree graph [28]. all the clock gates are calculated such that the maximum difference between the insertion delay at the leaves of the clock tree (i.e., the maximum clock skew) is 0. Additionally, DME algorithm minimizes the total wirelength of the clock network, which in turn minimizes the insertion delays [123] 5.6.6 Timing-Aware Clock Network Placement Once the location of branching points of the clock tree is calculated by the DME algorithm, the clock gates (clock splitters and C-junctions) are placed in these locations. Feedback edges, removed in the previous step, are added to the graph. In this step, the goal is to remove the overlaps among logic and clock gates and to minimize the maximum clock skew. Additionally, since the delay of the feedback loops in the clock network may affect the maximum clock frequency, we intend to minimize such delays by placing nodes in the feedback loops close to each other. Accordingly, we propose a two step MILP formulation. In the first step, the clock 130 gates are mapped to the routing channels between the placement rows, similar to the design approach presented in [124]. In the second step, the horizontal overlaps among clock gates are removed. The variables in the first and second steps are the y and x coordinates of all the clock gates, respectively. The first objective for both steps is to minimize the maximum clock skew, as it limits the maximum achievable frequency and leads to timing violations. The second objective is to minimize the delay of the feedback edges. The third objective is to minimize the sum of insertion delays, as it facilitates the routing of the clock nets. Therefore, the objective function can be defined as follows. 
f = \alpha \cdot skew_{max} + \beta \cdot \sum_{i \in E - \hat{E}} e_i + \gamma \cdot \sum_{i \in \hat{E}} e_i   (5.20)
where α, β, and γ are user-defined constants determining the importance of each objective. The maximum clock skew can limit the frequency of the circuit and lead to hold time violations. Therefore, minimizing the clock skew improves the timing yield of the (HC)2LC networks and reduces the overhead in terms of the area occupied by the gates which are required for fixing the timing of the circuits [122]. The constraints ensure the legality of the final solution (i.e., there should be no overlap among logic and clock gates in the final solution).
5.6.7 Simulation Results
We implemented the qHC2LC tool using Python and C++ and used IBM CPLEX v.12.8 for solving instances of the MILP problems. Table 5.5 lists the benchmark statistics and results. In these experiments, the α and γ values are set to 1000 and 1, respectively. As shown, by setting β = 10, with a small increase in the average clock skew (15%), the sum of the delays of the feedback loops is reduced by 28.6%, on average. In the future, we plan to evaluate the yield of the circuits considering process variations and wire delays, quantify the effect of feedback loops on the frequency of the circuits, and compare the maximum frequency and yield of (HC)2LC networks to zero-skew clock trees.
Table 5.5: Results of applying the qHC2LC algorithm to five benchmarks from the ISCAS85 benchmark suite [28].
Benchmark | #Clk gates | #Leaves | Skew (ps), β=0 | Skew (ps), β=10 | Skew Inc. (%) | Feedback loop delays (ps), β=0 | Feedback loop delays (ps), β=10 | Impr. (%)
c499 | 300 | 120 | 3.6 | 3.6 | 0.0 | 484.4 | 381.8 | 21.2
c432 | 662 | 264 | 4.6 | 6.4 | 39.1 | 936.7 | 693.3 | 26.0
c880 | 550 | 220 | 4.0 | 5.0 | 25.0 | 841.9 | 576.2 | 31.6
c1355 | 300 | 120 | 3.6 | 3.6 | 0.0 | 481.4 | 380.3 | 21.0
c2670 | 554 | 380 | 4.4 | 4.4 | 0.0 | 1960.0 | 1327.5 | 32.3
Average | - | - | 4.0 | 4.6 | 15 | 940.9 | 671.8 | 28.6
In summary, this section presents a clock network synthesis algorithm for the physical implementation of a self-adaptive and robust clock network topology: the hierarchical chains of homogeneous clover-leaves clocking, (HC)2LC. The presented method transforms the clock topology with loops to a DAG, places the nodes in the DAG using a zero-skew clock tree synthesis algorithm, restores the feedback edges, and minimizes the insertion delays, the maximum clock skew, and the sum of the lengths of the edges in the feedback loops, in order to reduce the routing overhead of the clock network and increase the maximum achievable frequency of the circuits. Preliminary experimental results show that the average skew for five benchmarks is 4.6ps and that the proposed method reduces the sum of the delays of the feedback loops by 28.6%, on average.
5.7 Summary
In this chapter, we presented a minimum-skew clock tree synthesis methodology for single flux quantum logic circuits, called qCTS. The qCTS algorithm first builds a fully-balanced tree topology considering the placement of the sequential elements in a circuit, such that there is an equal number of splitters from the clock source to any clock sink. The fully-balanced tree topology removes the effect of splitter delays on the clock skew. The location of the clock splitters is then calculated using a zero-skew embedding algorithm. Finally, using a novel mixed integer linear programming based method, overlaps among clock splitters and logic cells are removed while the clock skew is minimized. The qCTS method improves the state-of-the-art by accounting for splitter delays and placement blockages and reduces the clock skew by 70% on average, over 17 benchmarks.
Subsequently, this methodology is extended to minimize the clock skew given an imbalanced topology in which there are different number of splitters from the root of the tree to every sink node. The modified algorithm generates solutions in which the average number of splitters and the average clock skew are reduced by 56% and 37%, respectively, compared with a fully-balanced clock tree synthesis and a greedy legalization algorithm. 133 Chapter 6 Timing Uncertainty-Aware Clock Topology Generation In this chapter, we present a low-cost, timing uncertainty-aware synchronous clock tree topology generation algorithm for single flux quantum (SFQ) logic circuits. The proposed method considers the criticality of the data paths in terms of timing slacks as well as the total wirelength of the clock tree and generates a (height-) balanced binary clock tree using a bottom-up approach and an integer linear programming (ILP) formulation. The statistical timing analysis results for ten benchmark circuits show that the proposed method improves the total wirelength and the total negative hold slack by 4.2% and 64.6%, respectively, on average, compared with a wirelength-driven state-of-the-art balanced topology generation approach. 6.1 Overview Superconducting electronics is a promising replacement for CMOS technology, aimed at high- performance and low-power computing. Phenomenal characteristics of Josephson junctions (JJs), basic circuit elements in single flux quantum (SFQ) technology, such as ultra-fast switching speed and ultra-low energy consumption, promise enhancements of at least two orders of magnitude in terms of the product of the energy-efficiency and operation frequency, compared with CMOS technology [13]. 134 Physical design of logic circuits, especially the synthesis of a robust clock distribution network (CDN), plays an important role in designing high-performance circuits which are resilient against process-induced sources of variability. A CDN should be designed to not only eliminate timing violations (by controlling clock skews at various launch and capture flops under nominal conditions) but also to reduce mismatches between the target clock skews under nominal conditions and those achieved under process variations. Clock network synthesis for SFQ circuits is more challenging than it is for state-of-the-art CMOS circuits because of the following reasons: (i) Clock frequencies in SFQ circuits are in 25-50GHz range, which call for much tighter control of the clock skew; (ii) Nearly all SFQ combinational gates (exceptions being splitter cells and Josephson junction (JJ) transmission line cells) receive a clock signal in order to pass out their internal state information as a voltage pulse to the output; (iii) Splitter cells are required to distribute the clock signal from the clock source to all combinational gates that receive the clock signal as an input. (iv) Due to layout constraints (e.g., limited number of routing layers and the need to match driver/receiver impedances to the line impedances to avoid signal bounce and oscillations) and SFQ process design rules (e.g., lack of a stacked via), designing a CDN that would simultaneously minimize various timing metrics such as the maximum clock skew, the insertion delay, and the total negative slack is a difficult undertaking. (v) As a result of ultra-deep (gate-level) pipelining in SFQ circuits, the ratio of the clock insertion delay to the combinational path delay is large. 
Process-induced variations in SFQ circuits include variations in: (a) inductance values inside the gates and those associated with the passive transmission lines, (b) biasing current levels that affect both gate and clock splitter delays, and (c) critical current levels of JJs which in turn affects the JJs’ switching characteristics. These variations can thus change the delay of clock splitters employed in the CDN as well as delays of combinational gates in the circuit, which can in turn result in a large number of setup or hold time violations. Consequently, a large portion of timing margins should be dedicated to process-induced clock skews. These 135 constraints translate into large timing uncertainty in the clock insertion delays and a low timing yield. This chapter presents a timing-aware clock tree topology generation algorithm which considers the signal flow and timing in the data path and the worst-case timing slack (s) to any clock sink considering both wire and splitter cell delays and the total wirelength of the clock tree, while also accounting for the process variations and timing uncertainties. More precisely, the proposed algorithm generates a balanced binary clock tree topology (CTT), level by level, and in a bottom-up manner. Targeting a zero clock skew, generated clock trees are height balanced, i.e, there is an equal number of splitters from the root of the tree to any of the leaf nodes (note that in some cases a single output of an splitter cell is used i.e., the splitter will simply function as a delay element). At each level of the tree, the proposed algorithm solves an integer linear programming (ILP) problem to determine which nodes in the clock tree should be paired up (i.e., become siblings by assigning them the same parent node). The objective function in this ILP formulation is the weighted sum of the total wire length and total negative slack of the clock tree. By minimizing this objective function, we simultaneously minimize the routing cost of the clock tree and optimize the assignment of sink nodes to appropriate branches of the clock tree to increase the efficacy of common path pessimism removal (CPPR) techniques [23], which in turn help control the adverse and uncertain effects of process-induced variations on the worst-case clock skews. Experimental results using statistical timing analysis show that the proposed approach improves the total negative slack and the overhead of timing-closure in terms of the number of timing-fix buffers by 64.6% and 52.2%, respectively, compared with an state-of-the-art wirelength-driven approach. Additionally, the total wirelength of the clock tree is reduced by 4.2%. The remainder of this chapter is organized as follows. The preliminaries including a formal definition of clock topology generation problem are discussed in sections 6.2. Our approach 136 along with simulation results are detailed in sections 6.3 and 6.4, respectively. The chapter is summarized in section 6.5. 6.2 Preliminaries 6.2.1 Definitions In this section, we summarize some notations and definitions used throughout this chapter. We define the clock tree topology (CTT) to be a directed binary tree G (each node has a maximum of 1 incoming edge and 2 outgoing edges). The set of nodes for G consists of the root of the treeS 0 (corresponding to the clock source) and a set of leaf nodesS ={S 1 ,...,S n }, representing the clock sinks. 
As the clock topology is generated, a number of nodes are added to graph G, extending the set of nodes of the tree to S = {S_1, ..., S_n, ..., S_{n+m}}. In G, an edge e_{j,k} corresponds to a connection from a parent node S_j to a child node S_k. The height of a node is the number of nodes on the path from that node to a leaf node, denoted by H_i. The height of a tree is defined as the height of the root node S_0. The depth of a node i is defined as the number of nodes in the (unique) path from the root of the tree to node S_i, denoted by D_i. We assume that leaf nodes are at level 0 of the tree and the root node is at the highest level of the tree. The common clock path to a pair of leaf nodes refers to the portion of the clock tree (nodes and edges) that is common between those two nodes. Similarly, the non-common clock path refers to nodes and edges that are only on the clock path to one of the leaf nodes.
Consider a pair of sequentially adjacent flip-flops (FFs) connected by combinational gates and interconnects where the signal flows from FF_i (launching FF) to FF_j (capturing FF), as depicted in Fig. 6.1. The clock skew between FFs i and j is defined as the difference between the clock arrival times (phase delay values) at these two nodes and is calculated as t^{skew}_{i,j} = T_i - T_j, where T_i and T_j denote the clock arrival times at sinks i and j, respectively. In this chapter, the max clock skew for a circuit is defined as the maximum absolute value of the clock skew between any two FFs.
Figure 6.1: A pair of sequentially adjacent flip-flops i and j connected by a combinational gate and interconnects.
The insertion delay at a node S_i is calculated as the sum of the delays of the nodes and edges on the path from the root of the tree to S_i, as defined below:
T_i = D_i \times t_{spl} + \sum_{e_{k,j} \in path(R, S_i)} \delta_{k,j}   (6.1)
where t_{spl} and δ_{k,j} denote the delay of a splitter cell and the wire delay associated with edge e_{k,j}, respectively.
For each path between two sequentially adjacent FFs, two timing constraints are defined. The setup time constraint specifies the amount of time the input data to the capturing FF should stay steady before the arrival of the next clock edge [110]. The following inequality summarizes the relation between the maximum clock-Q delay of a FF (t^{max}_{c2Q}), the maximum combinational path delay (t^{max}_{comb}), the maximum clock skew (t^{max}_{skew_{ij}}), and the maximum setup time of the capturing FF (t^{max}_{setup}):
T_p \geq t^{max}_{c2Q_i} + t^{max}_{comb} + t^{max}_{skew_{ij}} + t^{max}_{setup_j}   (6.2)
The hold time of the capturing FF, defined as the amount of time the input data signal should stay valid after the clock edge, imposes an additional constraint on the clock insertion delays and the total combinational path delay over the data path, as follows:
t^{min}_{c2Q_i} + t^{min}_{comb} + t^{min}_{skew_{ij}} \geq t^{max}_{hold_j}   (6.3)
In the worst case, the input signal at the capturing FF should remain stable for t^{max}_{hold} after the clock edge of the same clock signal arrives. The timing criticality of a path in terms of setup or hold constraints can be defined as the difference between the actual and the required arrival times at the capturing FF. For setup and hold time constraints, the setup and hold slacks at a node (e.g., FF_j) corresponding to a path (e.g., FF_i → FF_j) are defined as follows:
slack^{setup}_{j} = T_p - t^{max}_{c2Q_i} - t^{max}_{comb} - t^{max}_{skew_{ij}} - t^{max}_{setup_j}
slack^{hold}_{j} = t^{min}_{c2Q_i} + t^{min}_{comb} + t^{min}_{skew_{ij}} - t^{max}_{hold_j}   (6.4)
We define slack_{i→j} to be equal to slack_j corresponding to a path from node i to node j. Given the above definitions, a positive slack means the timing constraint is satisfied (i.e., the signal arrives earlier than required) while a negative slack is an indicator of a (setup or hold) time violation (i.e., the signal arrives later than required) [110].
As a consequence of variations in the fabrication process, the clock-Q delay, setup, and hold times of the FFs, and the propagation delays of combinational gates (e.g., splitters) in both data and clock paths deviate from their nominal values. Accordingly, the setup and hold slacks at different nodes may become negative, resulting in timing violations. As observed, hold time violations are generally more important as they cannot be fixed by reducing the clock frequency.
6.2.2 Motivating Example
Consider a balanced clock tree with eight leaf nodes, as depicted in Fig. 6.2. We make the following assumptions: (i) each internal node of the clock tree is a splitter gate with a nominal delay of 2ps; (ii) the process variations result in a ±0.5ps variation in the propagation delay of each splitter in the clock tree; (iii) in the nominal condition, the hold slack at S_14 is +1ps; and (iv) the process variations do not change the setup, hold, and clock-Q delays. Assume that due to process variations, the delay of all the splitters in the clock tree on the path S_0 → S_7 decreases to 1.5ps and the delay of all splitters on the path S_0 → S_14 increases to 2.5ps (cf. Fig. 6.3). As a result of the introduced clock skew of -2ps, the hold slack at S_14 decreases to -1ps, which results in a timing violation (i.e., a short-path violation). One way to fix this violation is to insert hold buffers on the data path (S_7 → S_14) to avoid the short-path violation (cf. Fig. 6.4). However, this method increases the total number of gates in the design, which in turn increases the total area and power consumption and complicates the routing. The alternative approach would be to pair up leaves S_7 and S_14, i.e., assign them the same parent node in the clock tree. As a consequence, the total number of splitters on the non-common clock path to nodes S_7 and S_14 reduces to zero, eliminating the timing uncertainty on the t^{skew}_{7,14} variable (cf. Fig. 6.5).
In SFQ circuits, balanced tree topologies are typically utilized to balance the clock insertion delays among all sink nodes (i.e., to minimize the maximum clock skew). In a circuit with n clock sinks, using a balanced binary tree, the maximum number of non-common splitters on the clock path to a pair of sequentially adjacent flip-flops is equal to 2·⌈log2 n⌉. On the other hand, the total number of splitters between sequential gates in the data path is typically limited to a few splitters (i.e., < 5). Therefore, the ratio of clock insertion delay to data path delay is relatively large. For instance, in circuits with more than 1000 sequential elements, there can be more than twenty splitters on the non-common clock path to a pair of flip-flops. Moreover, due to ultra-deep pipelining in SFQ logic, a large number of DFFs are required to path-balance every input-output path.
Connections from DFFs to DFFs are typically short, with relatively small contamination delays. Therefore, such connections are prone to hold time violations in the presence of timing uncertainty on the clock skew. Accordingly, accounting for the timing criticality of the data paths during clock topology generation is a key factor in improving the timing yield of SFQ circuits. Additionally, timing uncertainty-aware topology generation paves the way to reducing the total number of gates required for fixing the timing of the circuits, and hence leads to savings in terms of the total area and power consumption. Therefore, it is crucial for exploiting the full potential of the SFQ logic family for building complex digital systems running at tens of GHz.
Figure 6.2: A balanced binary tree with 8 sinks. At the nominal condition, the max clock skew is zero.
Figure 6.3: A balanced binary tree with 8 sinks. If all the splitter delays on path S_0 → S_7 decrease to the min propagation delay and all the splitter delays on path S_0 → S_14 increase to the max propagation delay, a hold time violation occurs on data path S_7 → S_14.
Figure 6.4: A balanced binary tree with 8 sinks. Two buffers are added to data path S_7 → S_14 to fix the timing violation.
Figure 6.5: A balanced binary tree with 8 sinks. The clock tree is designed such that the number of non-common splitters on the clock path to S_7 and S_14 is zero; therefore, no timing violations occur on the data path S_7 → S_14.
6.3 qTopGen: A Timing Uncertainty-Aware Topology Generation Algorithm
The timing uncertainty-aware topology generation problem is defined as follows.
• Objective: Minimize the weighted sum of (i) the total negative slack induced by timing uncertainties in a clock tree and (ii) the total wirelength of a clock tree.
• Input: (i) a placed netlist, (ii) a list of sink nodes, and (iii) the timing characteristics of the logic gates in terms of clock-Q delay, setup time, hold time, and combinational delay.
• Output: A CTT connecting the clock source to all the clock sinks.
• Constraints: The clock topology should be a binary height-balanced tree; each node should have at most two child nodes, as the maximum fan-out for splitters in the current SFQ cell library is two [125]. Additionally, the total number of clock splitters on each source-sink path should be identical, i.e., all the sinks should have the same depth.
6.3.1 Overall Flow
The overall flow of the proposed CTT generation algorithm is shown in Fig. 6.6. Once the logic synthesis and placement steps are done, a netlist comprising the gates and their connections, along with the locations of all the sequential elements (leaves of the clock tree), is passed to the clock topology generation algorithm (qTopGen).
Figure 6.6: The overall flow of the proposed clock tree topology generation algorithm (qTopGen).
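The first step of this flow is the timing analysis discussed next; below is a minimal sketch of the worst-case hold-slack bookkeeping it relies on (Eq. (6.4), with the clock-skew uncertainty contributed by non-common splitters). The function names and the numeric values in the usage example are illustrative only, chosen so that the nominal hold slack matches the +1 ps of the motivating example.

```python
def hold_slack(t_c2q_min, t_comb_min, t_skew, t_hold_max):
    """Hold slack at the capturing FF, per Eq. (6.4)."""
    return t_c2q_min + t_comb_min + t_skew - t_hold_max

def worst_case_skew(noncommon_launch, noncommon_capture, delta_spl):
    """Most negative process-induced skew: launch clock early, capture clock late."""
    return -(noncommon_launch + noncommon_capture) * delta_spl

# Motivating example of Section 6.2.2: +1 ps nominal slack, two non-common
# splitters per clock branch, +/-0.5 ps per splitter -> 1 ps - 2 ps = -1 ps.
assert hold_slack(3.0, 1.0, worst_case_skew(2, 2, 0.5), 3.0) == -1.0
```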
The clock topology generation algorithm (qTopGen) starts by performing timing analysis on the circuit to calculate the timing criticality of the paths (in terms of setup and hold time slacks), assuming an ideal zero-skew clock network. In the following step, the hierarchy of the clock tree is formed; the number of nodes in each level of the clock tree is calculated. Next, the connections between parent and child nodes are formed level by level, in a bottom-up manner. In each step of the clock topology generation, an integer linear programming (ILP) problem is solved that determines which nodes in the clock tree should be paired up. Once the clock topology is generated, the locations of all internal nodes of the tree on the Manhattan plane are determined. Subsequently, timing analysis is performed to calculate the timing slacks after the clock tree synthesis, considering the effect of clock skew and timing uncertainties. At this step, the overhead of fixing all timing violations is estimated. In the following, each step of the qTopGen algorithm is detailed.
In the timing analysis step, we consider the effect of process variations on the timing metrics of the logic gates and calculate the worst-case timing slack for each path considering the maximum and minimum values of the clock-Q delay, setup time, hold time, and combinational gate delay using equation (6.4). Consequently, we identify the paths that are inherently prone to timing violations. As this analysis is performed after placement and prior to synthesizing a clock tree, we assume an ideal zero-skew clock tree is employed. The goal is to design the clock tree topology such that the number of clock splitters on the non-common clock path from the clock source to pairs of sequentially adjacent FFs (connected by timing-critical paths) is minimized. The timing-critical paths (i.e., the ones with small positive or negative slacks at the capturing FF) may easily violate the timing constraints when accounting for the process-induced clock skews. Consequently, to minimize the overhead of timing closure and maximize the timing yield, the timing uncertainty on the clock path to FFs connected by the timing-critical paths should be minimized.
Assuming a circuit has n sink nodes, we design the clock tree to have a height of h_max = ⌈log2 n⌉. Additionally, assuming the sinks to be at level 0 of the tree, the number of nodes at level i (0 < i ≤ h_max) is calculated based on the balanced binary structure of the tree. The updateSlacks routine (Algorithm 4) propagates the data-path slacks from one level of the tree to the next; its final steps are:
    if slack^{i}_{j→k} > 0 then
        V^{i+1}_m ← parent(v^{i}_j);  V^{i+1}_n ← parent(v^{i}_k)
        if V^{i+1}_m ≠ V^{i+1}_n then
            slack^{i+1}_{V^{i+1}_m → V^{i+1}_n} ← γ + slack^{i}_{j→k}
    return Slacks^{i+1}
If two nodes connected by a data path are not paired at a given level, the number of splitters on the non-common clock path to them increases. The constraints are as follows: (i) each node can only be paired with one other node, and (ii) based on the number of nodes at levels i and i + 1 of the hierarchy, the total number of pairs is fixed. Accordingly, an integer linear programming problem is formulated to determine the connections between nodes at levels i + 1 and i as follows:
minimize \sum_{j=1}^{n_i} \sum_{k=j+1}^{n_i} \alpha \cdot dist^{i}_{j,k} \cdot pair^{i}_{j,k} + \sum_{j=1}^{n_i} \sum_{k=j+1}^{n_i} (1-\alpha) \cdot slack^{i}_{j \to k} \cdot (1 - pair^{i}_{j,k})
subject to \sum_{k=j+1}^{n_i} pair^{i}_{j,k} \leq 1 \quad \forall j \in \{1, \dots, n_i\}
\sum_{j=1}^{n_i} \sum_{k=j+1}^{n_i} pair^{i}_{j,k} = m_i
variables: pair^{i}_{j,k} \in \{0, 1\} \quad \forall j, k \in \{1, \dots, n_i\}, \; j < k   (6.5)
In this formulation, n_i denotes the number of nodes at level i of the clock tree. α is a constant (0 < α < 1) setting the ratio of importance between the total wirelength and the timing criticality. m_i represents the total number of pairs at each level of the tree. An illustrative sketch of this per-level pairing ILP is given below.
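The following is a minimal sketch of the per-level pairing ILP of Eq. (6.5), written with the open-source PuLP modeler (the dissertation's implementation uses IBM CPLEX). Names are illustrative; path criticality is assumed to be supplied as a non-negative penalty (e.g., the magnitude of the negative hold slack), so that leaving a critical pair unpaired increases the objective, and the at-most-one-pair constraint is written over both index positions for clarity.

```python
import pulp

def pair_level(dist, penalty, m_i, alpha=0.67):
    """Pair nodes of one tree level; dist[j][k] and penalty[j][k] are given for j < k."""
    n = len(dist)
    prob = pulp.LpProblem("level_pairing", pulp.LpMinimize)
    pair = {(j, k): pulp.LpVariable(f"pair_{j}_{k}", cat="Binary")
            for j in range(n) for k in range(j + 1, n)}
    # Objective: wirelength cost of formed pairs plus criticality cost of unpaired ones.
    prob += pulp.lpSum(alpha * dist[j][k] * pair[j, k]
                       + (1 - alpha) * penalty[j][k] * (1 - pair[j, k])
                       for (j, k) in pair)
    for v in range(n):                        # each node participates in at most one pair
        prob += pulp.lpSum(pair[j, k] for (j, k) in pair if v in (j, k)) <= 1
    prob += pulp.lpSum(pair.values()) == m_i  # exactly m_i pairs are formed at this level
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [(j, k) for (j, k), var in pair.items() if var.value() > 0.5]
```

The default alpha of 0.67 mirrors the value chosen experimentally later in this chapter; in practice the solver and the cost matrices would be supplied by the surrounding flow.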
At level 0 of the tree (i.e., all nodes are leaf nodes), m i =n 0 −n 1 as there may be single output nodes at level 1. For all other levels, in which each parent node has exactly two child nodes, m i =n i /2. dist i j,k and slack i j,k are constant values, elements of the nDist i and nSlack i matrices, respectively. In the above formulation, the cost function comprises of the total cost of the edges from a level to another (in terms of wirelength) and the total (negative) slack for all the paths among nodes in a level. Since the clock tree is not embedded yet, the cost of each edge (in terms of wirelength) is unknown. In zero-skew embedding algorithms such as the deferred merge embedding (DME) [123], the location of the internal nodes of a clock tree are determined based on the observation that the sum of the cost of the edges between a parent node p and its two child nodes j and k is a constant value (ignoring the detour wires). If nodes j and k 148 are at locations loc(k) and loc(j), respectively, the total cost of edges from a parent node p to j and k is equal to the Manhattan distance between loc(k) and loc(j), independent of the loc(p). Consequently, thedist i j,k value models the increase in the total wirelength of the clock tree if nodes j and k become siblings. Similarly, if nodes j and k are paired up, the number of splitters on the non-common clock path to nodes j to k becomes 0. Accordingly, slack at node k can not further decrease (become more negative) as a result of timing uncertainty in the clock network. Intuitively, if nodes j,k are close to each other or there exists a large negative slack on the path from j to k, pairing up j and k reduces the cost function. Using this hierarchical approach, we jointly minimize the uncertainty on the clock skew and routing cost for each level of the tree. After the ILP problem is solved, the locations of the nodes in level i + 1 are estimated as a function of the locations of their child nodes (cf. Algorithm 3 line 6). It is important to note that these locations are not the final embeddings of the internal nodes, but estimations required for evaluating the total wirelength of the next level of the tree. We estimate the location of a parent node to be the center of the merging segment (MS) of the two child nodes. This is motivated by the well-known DME algorithm that creates a zero-skew MS for each parent node as a function of the merging segments of its child nodes, such that the total wirelength of the clock tree is minimized [123]. Once the location of nodes are estimated, the Dist i+1 matrix is updated. The slack on the data paths between non-leaf nodes is calculated using the updateSlacks function (cf. Algorithm 4 lines 9 - 15). In this function, γ represents a (negative) constant value, modeling the uncertainty on the clock path caused by not pairing up nodes j and k, calculated as (t min spl −t max spl ). The intuition is that if two nodes with a data path connection are not paired in this level, the number of clock splitters on the non-common clock path to them increases by two. in the worst case, one of the splitter delays may decrease to the min 149 possible value while the other one may increase to the max possible value. Therefore, the uncertainty on the clock skew between them increases by γ. Fig. 6.6 shows a design with 4 sink nodes. The timing critical paths among sinks (i.e., the ones with negative slacks) are shown in red. As depicted, after the topology generation, node 1 is paired up with node 3. 
Similarly, nodes 2 and 4 become siblings. Each of these pairs are assigned a parent node from the upper level nodes, i.e, nodes 5 and 6, respectively. At this point, the total number of splitters on the non-common clock path to nodes 0 and 1 becomes 2. Solving the ILP for all the levels of the tree generates a height balanced binary tree topology. In the next section, we present the simulation setup, the metrics used for evaluating of the quality of results, and the simulation results. 6.4 Simulation Results We implemented the proposed clock topology generation algorithm in C++ and used the IBM CPLEX v. 12.8 package for solving the ILP problems [91]. We used the qPlace tool for placement and qSSTA tool for statistical static timing analysis [126]. We implemented the CTT generation algorithm proposed in [127], which uses a variation of MMM method for generating balanced tree topologies with low routing cost [116], as the baseline of our comparison. Note that well-known topology generation algorithms such as greedy deferred merge embedding [114], balanced bi-partitioning [128], and geometric matching [115] either result in unbalanced clock topologies or are wirelength-driven and do not consider the timing uncertainties. Therefore, are not suitable options for generating high-quality clock trees for SFQ circuits. We used the DME algorithm for clock tree embedding [123] and the approach proposed in [127] for clock splitter legalization. Note that although the target is a zero-skew clock tree, due to mapping splitters to routing channels and legalization of clock splitters, the 150 Table 6.1: Simulation results for ten benchmarks from ISCAS85 and EPFL benchmark suites [28] [94]. WL, Avg., and Impr. stand for wirelength, average, and improvement, respectively. We report the average total negative slack and the average number of hold-fixing buffers for all SSTA samples. #Sinks Max Clock Skew (Nominal) Total Clock Tree WL Avg. Total Negative Hold Slack (ps) (SSTA) Avg. #Hold-fixing buffs (SSTA) Benchmark Baseline ILPTopGen Impr (%) Baseline ILPTopGen Impr (%) Baseline ILPTopGen Impr (%) Baseline ILPTopGen Impr (%) EPFL/router 830 3.2 4 -25 331040 326940 1.2 -5.7 -5.2 8.8 7.1 5.3 25.4 EPFL/cavlc 1559 5.6 3.7 33.9 625100 673600 -7.8 -8.2 -2.2 73.2 8 3.4 57.5 EPFL/ctrl 253 3.1 3.7 -19.4 91370 86900 4.9 -2.8 -0.5 82.1 1.4 1.3 7.1 ISCAS/c432 992 4.6 4.1 10.9 370570 344570 7 -7.2 -1.5 79.2 7.7 2.3 70.1 ISCAS/c499 630 3.3 1.8 45.5 281200 298600 -6.2 -2.8 -1.3 53.6 3.5 1 71.4 ISCAS/c1908 1206 7.8 5.6 28.2 633300 654900 -3.4 -8.5 -2.4 71.8 6.8 3.4 50 ISCAS/c2670 3220 6.5 4.9 24.6 1332700 1285400 3.5 -17.9 -6.8 62 18.6 8.5 54.3 ISCAS/c3540 2679 4.2 5.1 -21.4 1275200 1239300 2.8 -17.4 -6.5 62.6 16.2 9.1 43.8 ISCAS/c5315 5286 21.2 12.5 41 2698000 2302000 14.7 -126.1 -47.9 62 76.6 34.7 54.7 ISCAS/c6288 5567 5.9 5.6 5.1 2511900 2516500 -0.2 -40.1 -9.8 75.6 33.9 12.8 62.2 Average 6.5 5.1 21.5 1015038 972871 4.2 -23.7 -8.4 64.6 18.4 8.8 52.2 final clock skew may be non-zero [127]. Based on experimental analysis, we chose parameter α to be 0.67. We evaluated the quality of results using statistical static timing analysis (SSTA). For this purpose, in each simulation, we generated random values for all the parameters in a logic cell (i.e, critical current, and inductor values) assuming normal distribution and standard deviation values of 3% and 8% for inductors and JJ area, respectively. 
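A small sketch of the per-sample parameter perturbation described above is shown below, assuming normally distributed inductance (sigma = 3%) and JJ area (sigma = 8%) around their nominal values; the sampled values then feed the cell timing re-evaluation mentioned next. Function and variable names are illustrative, not those of the qSSTA tool.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_cell_parameters(nominal_inductance, nominal_jj_area):
    """Draw one Monte-Carlo sample of a cell's inductance and JJ area."""
    inductance = rng.normal(nominal_inductance, 0.03 * nominal_inductance)  # sigma = 3%
    jj_area    = rng.normal(nominal_jj_area,    0.08 * nominal_jj_area)     # sigma = 8%
    return inductance, jj_area

# One such sample is drawn per cell for each SSTA run (6000 runs per circuit),
# after which the clock-Q, setup, hold, and propagation delays are re-evaluated.
```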
Then we calculated the clock-Q delay, setup, and hold time values, and the propagation delays for both logic and clock gates, and performed timing analysis based on the generated values. We used the RSFQ cell library from [125] and extracted the timing parameters for each gate. As a consequence of process variations, the delay of a splitter gate can deviate from the nominal value of 5.5ps to the minimum and maximum values of 4.1ps and 9.89ps, respectively. We performed 6000 Monte-Carlo simulations for each circuit. For each design, we report the total negative slack and the total number of logic gates required to fix the timing violations on all paths, divided by the total number of SSTA samples. In this dissertation, we only considered the hold time slacks during clock tree generation, and we report the quality of results considering the hold constraints only.
Experimental results in terms of the maximum clock skew (in the nominal condition), the total wirelength of the clock tree, the total negative slack, and the total number of required hold-fixing buffers are shown in Table 6.1. As shown, the proposed method reduces the nominal clock skew and the total wirelength of the clock trees by 21.5% and 4.2%, respectively. As mentioned earlier, having more flexibility in terms of the placement of clock splitters helps reduce the nominal clock skew and the total wirelength values for most of the benchmarks. Additionally, the qTopGen algorithm reduces the average negative hold slack and the average number of required hold buffers by 64.6% and 52.2%, respectively, compared with the baseline method. The proposed method consistently improves the total negative hold slack and the number of required hold-fixing buffers for all the benchmarks. The key factor enabling the success of this approach is considering the criticality of the data paths and the total wirelength of the clock tree, simultaneously, during clock topology generation. The proposed approach leads to low-cost clock trees which require a small number of gates for fixing timing violations in the presence of process variations, which in turn reduces the total area and power consumption of the circuits.
6.5 Summary
In this chapter, we presented a hierarchical bottom-up clock tree topology generation algorithm for superconducting electronic circuits that considers the timing criticality of the data paths as well as the total wirelength of the clock network using a novel integer linear programming formulation. The qTopGen algorithm generates a height-balanced binary clock tree and optimizes the assignment of sink nodes to appropriate branches of the clock tree, aimed at reducing the process-induced total negative slack and the total wirelength of the clock tree, simultaneously.
Chapter 7
Uncertainty-Aware Timing Closure
In this chapter,¹ we present a physical design methodology for timing closure for superconducting circuits. In particular, we present a timing uncertainty-aware hold time fixing methodology using the hold buffer insertion technique, which takes advantage of the common path pessimism removal (CPPR) phenomenon and an incremental placement algorithm to minimize the overhead of timing closure in terms of area and power consumption.
Compared to a state-of-the-art methodology that employs fixed hold time margins, across ten benchmark circuits from the ISCAS'85 benchmark suite, our proposed methodology reduces the number of inserted hold buffers by 8.4% with a 6.2% increase in timing yield (3σ margins), and by 21.9% with a 1.7% increase in timing yield (2σ margins).

¹ The materials presented in this chapter represent a collaborative work with Xi Li from the University of Southern California. Both authors contributed equally to this research.

7.1 Overview

Superconductive electronics, and single flux quantum (SFQ) logic [129] in particular, is a promising replacement for complementary metal–oxide–semiconductor (CMOS) technology for exascale supercomputing. With the increasing need for big data and supercomputing, the hundreds of megawatts of power needed by current exascale computing platforms are a growing concern [130]. Rapid SFQ technology was introduced back in the late 1980s [129], with a theoretical potential of meeting high-performance needs with three orders of magnitude lower power compared to state-of-the-art semiconductor technologies [131]. Nevertheless, the potential of SFQ is yet to be realized for complex designs such as microprocessors for a variety of reasons. Most notably, the lack of an established, reliable computer-aided design (CAD) tool flow has been a long-term obstacle for SFQ technology to scale [132] [133]. More specifically, SFQ circuits often have ultra-high clock frequencies and significant timing uncertainties [134] [135], which make the design of clock distribution networks (CDNs) and timing closure extremely challenging. In particular, managing clock skew in high-frequency CDNs is extremely important [136] because, in addition to the storage gates, all logic gates are clocked and all fanouts are implemented with splitters, which increases the size and insertion delay of the clock network significantly. Moreover, due to variations in the fabrication process, biasing, or temperature, timing uncertainties in the clock trees are significant. Therefore, developing methodologies for the generation of high-quality clock networks and robust, timing uncertainty-aware timing closure algorithms is paramount.

In the previous chapter, we presented a clock network topology generation algorithm using a fully balanced tree structure which accounts for the timing criticalities in the data paths and the total wirelength of the clock network, and which minimizes the total negative slack in the presence of timing uncertainties by assigning different sink nodes to different branches of the clock network. This approach tries to minimize the timing uncertainties and the overhead of timing closure by modifying the structure of the CDN. Even when deploying such CDN algorithms, EDA flows need to utilize efficient techniques to close timing, i.e., fix setup and hold timing issues. In particular, potential hold time violations are typically mitigated by adding clockless buffers into data paths with a negative hold slack. The high operating clock frequencies and gate-level pipelining in SFQ circuits make this task particularly challenging. Moreover, during the addition of these hold buffers, one should account for timing variations as well as incurred overheads through some form of static timing analysis (STA) [24] or classic Monte Carlo simulation [25]. To the best of our knowledge, this is the first work to propose a physical design methodology for hold buffer insertion in SFQ circuits.
Our approach is variation-aware and reduces area and performance overheads by applying common path pessimism removal (CPPR) to remove the pessimism associated with the common clock paths to pairs of sequentially adjacent gates [26, 27]. Furthermore, we employ an incremental placement methodology to place the added buffers and minimize the perturbations to the original placement solution, further preserving the layout area and minimizing the overheads. The efficacy of the proposed method is evaluated using ten benchmarks from the ISCAS'85 benchmark suite [28]. Compared with a methodology utilizing fixed, constant margins for fixing all timing paths [25], our method significantly reduces the number of inserted hold buffers while achieving a competitive timing yield. The key contributions of this chapter can be summarized as follows:

• We develop the first timing variation-aware hold time fixing approach for SFQ circuits. Our approach considers both local and global timing uncertainties and worst-case scenarios in terms of hold slacks, and effectively uses the common path pessimism removal technique to reduce the number of inserted hold buffers on each timing path.

• We present an incremental placement methodology for hold buffers that generates high-quality solutions in terms of placement metrics such as the layout area and maximum clock frequency.

• The approach is evaluated using dynamic timing analysis with a grid-based, placement-aware variation model [25] on multiple ISCAS'85 benchmark circuits. The functionality of the circuits is verified via Monte Carlo co-simulation with behavioral netlists. Comparing our variation-aware approach to a fixed-margin baseline, the average number of hold buffers is reduced by 8.4% with a 6.2% increase in timing yield, or by 21.9% with a 1.7% increase in timing yield. Our methodology enables a trade-off between timing yield and layout area by tuning algorithmic parameters.

Compared with prior work in CMOS, there are two key differences in our approach. First, state-of-the-art CMOS CPPR algorithms use complex pre-processing steps to transform the problem of finding the least common ancestor (LCA) of sequentially adjacent gates into a range minimum query (RMQ) problem [27, 137, 138]. In SFQ circuits, however, the clock tree must be implemented with splitter circuitry that is often limited to binary forks. Moreover, since every logic gate is clocked, the clock tree is usually far deeper than in CMOS and is typically balanced to minimize the clock skew. Its balanced nature enables our LCA detection algorithm to be significantly simpler and more efficient. Secondly, the combinational logic between sequentially adjacent gates is limited to splitters that route datapath signals to multiple clocked sinks. It is therefore sufficient to insert hold buffers at the input port of any clocked gate with a hold violation.

The rest of this chapter is organized as follows. Background and prior work are summarized in Section 7.2. Section 7.3 presents the proposed approach for hold time fixing and incremental placement. Section 7.4 details the experimental flow, including the CAD tools, the grid-based variation model, and the dynamic simulation methodology. Simulation results are presented in Section 7.5, followed by the summary in Section 7.6.

7.2 Preliminaries

7.2.1 Definitions and Notation

In this section, we introduce the definitions and notation used throughout this chapter.
7.2.1.1 Combinational and sequential cells in SFQ logic

In CMOS logic, combinational logic refers to digital circuits that determine their outputs as Boolean functions of their inputs. In SFQ circuits, however, gates such as AND, OR, XOR, and NOT are clocked [139]. They process pulses on their inputs by changing the states of their internal current loops and produce their output pulses in response to an input clock pulse, and thus are also sequential. Moreover, unlike CMOS, splitters are needed to actively copy input pulses to multiple outputs; in particular, splitters are needed to distribute the clock and to implement the fanout of logic gates. In addition, buffers that simply copy their single input to their single output are also sometimes needed. Neither of these two gate types needs to be clocked, and they are therefore considered combinational SFQ gates.

Figure 7.1: Example of a timing path in an SFQ circuit.

7.2.1.2 Data path delay

In conjunction with interconnect delays, combinational gate delays make up the data path delay between a pair of sequentially adjacent SFQ gates. For example, Fig. 7.1 illustrates two sequential SFQ gates G_1 and G_2, while S_1–S_5 are clock splitters that distribute the clock signal from a clock source to each sequential gate in the circuit. The data path delay of the path from G_1 to G_2 refers to the summation of the clock-to-Q delay of G_1, the delays of the splitters, and the interconnect delays between the gates.

7.2.1.3 Insertion delay

The insertion delay at each gate (pin) refers to the total delay from the clock source to the clock input of that sequential element (pin) [140]. This is also known as the arrival time of the clock signal at the clock pin of each sequential gate. For G_1 in Fig. 7.1, it refers to the delay from the clock source (Clk Src) through splitters S_1, S_3, and S_4, as well as the delay of the interconnects connecting the clock source to G_1.

7.2.1.4 Clock skew

Two sequential elements connected by data path splitters and interconnect are called sequentially adjacent gates. In particular, G_1 and G_2 are called the launching and capturing gates, respectively. The difference between the clock arrival times at the launching and capturing gates is defined as the clock skew between the pair of gates. For the pair of gates illustrated in Fig. 7.1, the clock skew is the difference between the delay from Clk Src through splitters S_1, S_3, and S_4 plus the delay of the interconnects connecting the clock source to G_1, and the delay from the clock source through splitters S_1, S_3, and S_5 plus the delay of the interconnects connecting the clock source to G_2.

7.2.1.5 Hold time

Each sequential element requires some time to reliably capture the input data at its data pin. Hold time is defined as the minimum amount of time after the clock pulse during which the input data must remain stable. The hold time imposes a timing constraint on a pair of sequentially adjacent elements, such as G_1 and G_2, as follows:

T_{skew}(G_1, G_2) + T^{min}_{DP}(G_1, G_2) \geq T^{max}_{hold}    (7.1)

where T_{skew}(G_1, G_2) represents the clock skew between G_1 and G_2, and T^{min}_{DP}(G_1, G_2) denotes the minimum data path delay. The data that propagates from G_1 to G_2 should remain stable at G_2 for T^{max}_{hold} after the sampling clock edge. Hold slack is defined as the difference between the data arrival time and the data required time:

Slack_{hold}(G_1, G_2) = T_{skew}(G_1, G_2) + T^{min}_{DP}(G_1, G_2) - T^{max}_{hold}    (7.2)

A negative hold slack leads to a hold time violation.
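To make the sign conventions in (7.1) and (7.2) concrete, the short C++ sketch below evaluates the hold slack for one launch/capture pair; the struct, its field names, and the numbers in the example are hypothetical and serve only to illustrate the arithmetic.

#include <cstdio>

// Illustrative timing quantities for one launch/capture pair (G1 -> G2),
// all in picoseconds; the names are hypothetical.
struct HoldArc {
    double t_clk_launch;   // clock arrival time (insertion delay) at G1
    double t_clk_capture;  // clock arrival time at G2
    double t_dp_min;       // minimum data path delay from G1 to G2
    double t_hold_max;     // maximum hold time requirement of G2
};

// Eq. (7.2): Slack_hold = T_skew + T_DP_min - T_hold_max, where T_skew is
// the launch arrival time minus the capture arrival time.
double holdSlack(const HoldArc& a) {
    double t_skew = a.t_clk_launch - a.t_clk_capture;
    return t_skew + a.t_dp_min - a.t_hold_max;
}

int main() {
    HoldArc a{31.5, 33.0, 6.2, 7.0};   // hypothetical values in ps
    double s = holdSlack(a);
    std::printf("hold slack = %.2f ps (%s)\n", s, s < 0 ? "violation" : "met");
    return 0;
}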
As shown in (7.2), the hold slack is not a function of the clock cycle time. Therefore, it cannot be fixed by adjusting the clock frequency, and even a single hold time violation may lead to circuit malfunction. Unlike CMOS technology, where multiple hold time refinement techniques such as gate downsizing, switching between low-threshold and high-threshold gates, and wire sizing are commonly used, such options are not available in the current SFQ technology and existing cell libraries. Hence, hold buffer insertion is one of the most applicable approaches for timing closure in SFQ circuits.

7.2.1.6 Hold time margin

The hold time margin (i.e., safety margin) is the extra delay added to data paths to avoid hold violations, accounting for variations in the timing parameters in (7.2). Providing this "cushion" to the design can help make it more robust to timing variations and possible hold time violations. Accounting for the hold margin, (7.1) is modified as follows:

Slack_{hold}(G_1, G_2) \geq T^{margin}_{hold}(G_1, G_2)    (7.3)

7.2.1.7 Clock tree topology

In this chapter, we define the clock tree topology (CTT) as a directed binary tree T connecting the clock source to all the clock sinks. The root of T denotes the clock splitter S_0 connected to the clock source, while the nodes {S_{n+1}, S_{n+2}, ..., S_{n+k}} represent the leaf nodes (clock sinks), i.e., the sequential elements in a circuit. A clock topology generation algorithm creates the nodes {S_1, ..., S_n} representing the clock splitters, adds them to the clock tree, and connects the clock source to all the sink nodes. We assume the leaf nodes are at level 0, while the root is at the highest level of the tree. In Fig. 7.1, the CTT refers to the binary tree including Clk Src, S_1 through S_5, and the edges between clock splitters S_4, S_5 and sink nodes G_1, G_2, respectively.

7.2.1.8 Common clock path (CCP)

A clock path (CP) is a path in T from the root node through tree nodes to a leaf node (clock sink). For two leaf nodes in T, the common clock path (CCP) refers to the portion of the clock tree that is common between the CPs to each gate, while the non-common clock path (NCP) represents the non-common portion of the CPs. In Fig. 7.1, the common clock path includes nodes S_1 and S_3 and the edges connecting Clk Src to S_3.

7.2.2 Timing Uncertainty-Aware Clock Topology Generation

The topology of a clock network is one of the key factors that determine the robustness of a circuit against variations in timing parameters and the timing yield of the circuit. One of the key goals in achieving a high timing yield is to maximize the CCP to pairs of gates that are on timing-critical paths; that is, to ensure that variations in the arrival times of the clock signal at the launching and capturing gates do not lead to negative slacks and timing violations. Consequently, when generating a CTT, accounting for the criticality of the data paths and the timing yield is crucial. In the previous chapter, we presented a low-cost, timing uncertainty-aware balanced CTT generation algorithm for SFQ circuits. This algorithm considers the criticality of the data paths in terms of timing slacks as well as the total wirelength of the clock tree, and generates a (height-)balanced binary clock tree using a bottom-up approach and an integer linear programming (ILP) formulation.
The results show significant improvement in terms of timing metrics (i.e., the total and worst negative slack) over the baseline algorithm, which is essentially a wirelength-driven approach, i.e., the method of means and medians (MMM) [116, 141].

In this chapter, we utilize this method to generate high-quality balanced clock topologies, which improve the timing yield and reduce the need for hold-fixing buffers and the associated area overhead. Table 7.1 lists the results of applying the proposed ILP-based topology generation algorithm to several benchmarks from the ISCAS'85 benchmark suite [28] and compares them with those of the MMM algorithm in terms of the number of required hold buffers after placement and clock tree synthesis. To account for worst-case conditions, it is assumed that, due to timing variations, the delay of all the gates on the launch (capture) CP decreases (increases) by 3σ, where σ denotes the standard deviation of the gate delay values. It is also assumed that the delay of all gates on each data path is set to its minimum possible value, and that the delay of a hold buffer is 5.5 ps. As shown in Table 7.1, using the proposed approach in [141], the total number of hold buffers required to fix the hold time violations is reduced by an average of 6% over the seven benchmarks, and by as much as 14.8%, showing the importance of minimizing the worst-case negative slack.

Table 7.1: Number of hold buffers with greedy topology generation vs. ILP topology generation (3σ).

Design   # buffers MMM [116]   # buffers ILP [141]   Saving (%)
c432     1188                  1012                  14.8
c499     447                   426                   4.7
c880     828                   757                   8.6
c1355    421                   431                   -2.4
c1908    805                   739                   8.2
c3540    1490                  1416                  5.0
c5315    3249                  3055                  6.0
AVG.     1204                  1119                  6

7.3 Proposed Method

In this section, we formally define the problem, provide a motivating example, and illustrate the proposed methodology in detail.

7.3.1 Motivating Example

Consider the clock tree topology depicted in Fig. 7.2, where a clock signal is propagated to eight sink nodes using a balanced binary tree. Assume the propagation delay of the internal nodes (i.e., splitters) as well as of the hold buffers to be 5 ps, with a ±20% variation on the delay due to timing variations. Consider a data path connecting node 7 to node 14. In the worst-case scenario in terms of hold slack, due to timing variations the delays of all clock splitter nodes on the launch CP decrease to 4 ps and all the splitter delays on the capture CP increase to 6 ps. As a result, the hold slack may be reduced by 6 × 1 ps = 6 ps. However, such an estimate is overly pessimistic, as node 0 is on the common CP to nodes 7 and 14. Therefore, the worst-case reduction in the hold slack cannot exceed 4 ps. Consequently, accounting for CPPR during timing closure can reduce the number of hold buffers required on the path connecting node 7 to node 14 from 3 to 2, assuming each hold buffer adds a delay of 2 ps. This reduces the area overhead by 33% while achieving the same timing yield.

Figure 7.2: A balanced binary tree with 8 sinks. (a) At the nominal condition, the maximum clock skew is zero. (b) If all the splitter delays on path S_0 → S_7 decrease to the minimum propagation delay and all the splitter delays on path S_0 → S_14 increase to the maximum propagation delay, a hold time violation may occur on data path S_7 → S_14. (c) Two buffers are added to the path from S_7 to S_14 to increase its hold slack.
7.3.2 Problem Formulation

The input to this problem is a netlist graph (G) comprising the logic cells (i.e., nodes), their connections (i.e., edges), the location of each node in the layout area, a description of the clock network, including the clock tree topology (T) and the locations of the splitter nodes (S_0 ... S_n), and a variability model that defines the variations of the gate delays in the presence of process, biasing, or temperature variations. The objective is to minimize the total number of required hold buffers such that the timing of the circuit with respect to hold constraints is satisfied or a target timing yield is reached. Therefore, the total negative hold slack in the presence of timing variations should be less than a predefined threshold value. In other words, we aim to satisfy the constraints in terms of the worst negative hold slack while minimizing the number of inserted hold buffers, i.e., the area overhead of timing closure.

7.3.3 Proposed Methodology - An Overview

The overall flow of our proposed method is depicted in Fig. 7.3. The first phase follows the algorithms outlined in [142] and [25]. The Berkeley open-source logic synthesis tool, ABC [143], is used to synthesize the netlist. The qPlace tool, part of the qPALACE CAD tool suite for SFQ circuits [142] [141], is employed to place the logic gates, construct a minimum-skew clock network, and place the clock splitters [141] [142] to minimize the nominal clock skew, reducing the total negative slack associated with all sequentially adjacent gates. ABC then parses and translates the generated netlist and clock tree information, extracts the delays associated with the logic gates, splitters, and interconnects using a linear delay model [141], and performs timing analysis for all setup and hold constraints. Using our proposed variation-aware hold fixing algorithm, ABC then inserts hold buffers, i.e., Josephson transmission line (JTL) cells, between sequentially adjacent pairs in the netlist as needed.

Figure 7.3: Overview of the proposed variation-aware timing closure methodology.

As shown in Fig. 7.3, in the second phase of the flow, the updated circuit, including the newly inserted hold buffers, is physically placed by the placement tool, either using our proposed incremental placement technique or by re-placing the entire circuit. The IBM CPLEX v12.10 package [144] is used for solving the mixed integer linear programming (MILP) problems for clock tree placement and legalization, as described in Chapter 5. Since this step may modify the locations of the logic gates, the clock tree is re-synthesized to minimize the nominal clock skew and generate a high-quality clock network, although we force it to use the same clock tree topology as in phase one. This is motivated by the variation-aware hold fixing algorithm, which takes advantage of the topology of the clock tree and the common clock paths to minimize the area overhead of timing fixes. Although the clock skews and hold slacks in the nominal condition may change, the timing variations will not lead to additional timing violations, as the worst-case variations as a function of the clock topology remain the same. Additionally, to both mitigate the routing congestion caused by the relatively small number of routing layers and facilitate a final pass of hold buffer insertion, during this second placement phase we reserve empty space next to each logic cell and each existing hold buffer.
After this second placement, we parse the re-placed circuit into ABC and run a final round of timing analysis. If any additional hold buffers are required, they are inserted into the reserved spaces without the need to move other gates. The next subsections detail our novel hold time fixing and incremental placement techniques. The third (evaluation) phase of the flow is described in Section 7.4.

7.3.4 Variation-Aware Common Path Pessimism Removal

Common path pessimism removal (CPPR) refers to the removal of unnecessary pessimism in the timing analysis of the launching and capturing clock paths by accounting for the common portion of the launch and capture CPs [26] [27]. In particular, the timing uncertainty of the propagation delays of the elements on the CCP affects the launching and capturing clock path delays in the same way. Thus, we opt to remove this unnecessary timing uncertainty from our timing analysis to more accurately capture the worst-case timing scenario. To account for the remaining timing variations, we add hold margins to each timing path, thereby making the design robust to hold time violations.

Assume G_1 and G_2 are the launching and capturing logic gates, respectively. The hold slack from G_1 to G_2, including the margin, must then satisfy:

Slack_{hold}(G_1, G_2) = T_{skew}(G_1, G_2) + T^{min}_{DP}(G_1, G_2) - T^{max}_{hold} - T^{margin}_{hold}(G_1, G_2) \geq 0    (7.4)

Instead of applying a constant hold margin to all hold slacks as proposed in [25], we present a variation-aware strategy that determines the hold margin required for each timing path and applies CPPR to eliminate unnecessary pessimism. For all the gates in the circuit, we assume the gate delay follows a Gaussian distribution with the same standard deviation σ, expressed as a fraction of the nominal delay. In the worst-case scenario, a hold-critical (short) path can be assumed to have the following characteristics: the delays of the clock splitters in the launching clock path are biased toward the lowest possible value, while the delays of the clock splitters in the capturing clock path are biased toward the highest possible value. At the same time, the delays of the gates in the data path are biased toward their minimum values. We apply the three-sigma rule [145], which states that 99.7% of the possible data lie within three standard deviations of the mean, and assume a worst-case delay change of 3σ for each gate. Considering the worst-case scenario in (7.4), the variation-aware hold margin can be set to the maximum reduction of Slack_{hold}(G_1, G_2) caused by gate delay variations:

T^{margin}_{hold}(G_1, G_2) = 3σ · (D_{NCP}(G_1, G_2) + D_{DP}(G_1, G_2))    (7.5)

where D_{NCP}(G_1, G_2) represents the sum of the delays of the clock splitters that are not common to the launch and capture clock paths, while D_{DP}(G_1, G_2) refers to the sum of the delays of the gates along the data path from G_1 to G_2.

D_{NCP}(G_1, G_2) is determined by finding the lowest common ancestor in the balanced binary clock tree. We record the routes from the clock source to the clock sinks when building the clock tree. For any two sequentially adjacent gates, the algorithm traverses both the launching and capturing clock paths level by level, starting at the leaves (level 0), until reaching a common node, the lowest common ancestor (LCA).² The number of splitters on the NCP is simply the number of nodes met during the traversal, which is equal to twice the quantity (level of the LCA minus one). Thus, we can compute D_{NCP}(G_1, G_2) in (7.5) as:

D_{NCP}(G_1, G_2) = 2 · (level(LCA(G_1, G_2)) - 1) · D_{sp}    (7.6)

where D_{sp} denotes the nominal splitter delay.
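The level-by-level LCA search and Eqs. (7.5)-(7.6) can be sketched in a few lines of C++, as shown below; the node structure, the parent pointers, and the helper that converts a margin deficit into a buffer count are illustrative assumptions and do not reproduce the actual tool code.

#include <cmath>
#include <cstdio>

// Hypothetical clock-tree node: sinks (leaves) are at level 0, the root is at
// the highest level, and every node stores a pointer to its parent splitter.
struct ClkNode {
    const ClkNode* parent = nullptr;
    int level = 0;
};

// Because the tree is height-balanced, both sinks sit at level 0, so walking
// the two clock paths upward in lockstep reaches the LCA directly.
int lcaLevel(const ClkNode* a, const ClkNode* b) {
    while (a != b) { a = a->parent; b = b->parent; }
    return a->level;
}

// Eq. (7.6): total nominal splitter delay on the two non-common clock paths.
double dNcp(const ClkNode* g1, const ClkNode* g2, double d_sp) {
    return 2.0 * (lcaLevel(g1, g2) - 1) * d_sp;
}

// Eq. (7.5): variation-aware hold margin; d_dp is the nominal data path delay
// (clock-to-Q of G1 plus data splitters and wires) and sigma is the relative
// standard deviation of the gate delays.
double holdMargin(const ClkNode* g1, const ClkNode* g2,
                  double d_sp, double d_dp, double sigma) {
    return 3.0 * sigma * (dNcp(g1, g2, d_sp) + d_dp);
}

// Hold buffers needed to cover a margin deficit, assuming each buffer adds a
// nominal delay d_buf to the data path (the small extra margin contributed by
// the added buffers themselves is ignored in this sketch).
int buffersNeeded(double slack, double margin, double d_buf) {
    double deficit = margin - slack;
    return deficit <= 0.0 ? 0 : static_cast<int>(std::ceil(deficit / d_buf));
}

int main() {
    // Mini example mirroring Fig. 7.1: S1 (level 3) -> S3 (level 2) ->
    // S4/S5 (level 1) -> sinks G1/G2 (level 0); all numbers are hypothetical.
    ClkNode s1{nullptr, 3}, s3{&s1, 2}, s4{&s3, 1}, s5{&s3, 1};
    ClkNode g1{&s4, 0}, g2{&s5, 0};
    double margin = holdMargin(&g1, &g2, 5.0, 9.0, 0.08);
    std::printf("level(LCA) = %d, hold margin = %.2f ps, buffers = %d\n",
                lcaLevel(&g1, &g2), margin, buffersNeeded(-1.5, margin, 5.0));
    return 0;
}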
For each timing path, the hold time constraint is checked, and in case of a negative slack, hold buffers are added before the corresponding input pin of G_2 until the constraint is met. Consider as an example the circuit in Fig. 7.1 and the hold time for the path from G_1 to G_2. The clock path of G_1 consists of the following clock segments: the clock source and splitters S_1, S_3, and S_4. Similarly, the clock path of G_2 comprises the clock source and splitters S_1, S_3, and S_5. The CCP contains S_1 and S_3, while the NCP contains S_4 and S_5. level(LCA(G_1, G_2)) is 2 in this case. The hold time margin for G_2 is thus

T^{margin}_{hold}(G_1, G_2) = 3σ · (2 · D_{sp} + D_{G_1} + N^{data}_{sp} · D_{sp})    (7.7)

where D_{G_1} denotes the nominal clock-to-Q delay of G_1, and N^{data}_{sp} denotes the number of splitters on the data path from G_1 to G_2. T^{margin}_{hold}(G_1, G_2) can then be used to guide hold buffer insertion for the timing path from G_1 to G_2.

Assuming there are N clock sinks in the whole circuit, the height H of the clock tree, a fully balanced binary tree, is ⌊log_2 N⌋ + 1. For a single pair of sinks, the time complexity of finding the LCA is O(H). Denoting the total number of pairs of sequentially adjacent gates as E, the time complexity of finding the LCAs of all sequentially adjacent gates is O(E · log_2 N). The fact that the clock tree is a fully balanced binary tree simplifies the LCA detection algorithm. In particular, for traditional CMOS circuits, state-of-the-art LCA detection algorithms typically require a reduction to an instance of the range minimum query (RMQ) problem via an Euler walk of the clock tree and the storage of several extra tables [137, 138, 146].

² Note that the common node must be at the same level in both the launch and capture paths because the clock tree is balanced.

7.3.5 Placement Methodology

After adding the hold buffers to the netlist, the locations of the logic cells and the inserted hold buffers must be determined. Note that since the timing closure algorithm is based on the original clock topology and the inserted hold buffers are not sequential elements, the same clock topology can be utilized after placement and legalization of the hold buffers. However, to minimize timing metrics such as the maximum clock skew and the negative timing slacks, the locations of the gates are properly adjusted. Similar to [142], we adopt the deferred merge embedding (DME) algorithm for calculating the locations of the tapping points of the clock network in the layout area, and we employ an integer linear programming algorithm to map the clock splitters to the routing channels and remove the overlaps among the clock splitters and logic gates such that the maximum clock skew is minimized, as explained in Chapter 5. Finally, once a legal, high-quality solution in terms of placement and timing metrics is generated, another iteration of timing fixes resolves the remaining timing violations.

Once the total number of required hold buffers for each timing path is determined using the algorithm presented in Section 7.3.4, the inserted hold buffers must be physically placed. Accordingly, the locations of the placed logic cells and clock splitters are modified to accommodate the placement of the hold buffers. The total number of inserted hold buffers is a function of the degree of timing uncertainty and can be a large fraction of the size of the original netlist. Therefore, to optimize the quality of results in terms of total wirelength and timing metrics, we propose two placement strategies: (i) Incremental Placement and (ii) Placement from Scratch.
167 7.3.5.1 Incremental Placement The incremental placement methodology helps preserving the original placement solution by constraining the placement of logic cells and data splitters to the same rows as they are initially placed. When the number of inserted hold buffers is relatively small, an incremental placement approach helps minimizing the displacement of original netlist elements, eliminates the need for extensive modifications to the clock network, and facilitates converging to a high quality solution in presence of timing uncertainties and without the need for multiple iterations of placement and clock synthesis. Alternatively, when the number of inserted hold buffers is large, executing the placement algorithm from scratch may yield better overall placement metrics such as the total wirelength, layout area, and routing congestion. As the number of required hold buffers depends on the structure of circuit and the degree of variation, in this chapter we consider both of the options. In our incremental placement flow, the logic cells and data splitters remain in their initially placed logic row, i.e., their y coordinates are preserved. However, their x coordinates are adjusted to accommodate the placement of hold buffers. First, the ideal number of gates per row is determined by computing the total number of gates (including hold buffers, logic cells, data splitters and clock splitters) and dividing this number by the number of rows. The algorithm tries to assign hold buffers to logic rows to minimize the final width of the layout area thereby the total layout area after hold buffer insertion, using the number of cells assigned per row as a proxy for row width. This is achieved by setting a threshold value for the number of gates per row and trying to distribute the hold buffers among rows such that the final distribution of logic gates per row is below this threshold for all rows. Initially, this threshold value is set to the average number of gates per row and each row is marked as full or partially-filled by comparing the existing gate count with this threshold. We then assign each hold buffer to the nearest partially-filled row to its fanout gate. In particular, if the row of the fanout gate is partially-filled, we assign the hold buffer to the same row with its x coordinate shifted from its fanout gate by the width of one hold buffer. Otherwise, the 168 hold buffer is assigned to the nearest partially-filled row with x coordinate set to the same as the fanout gate. There may be scenarios in which all close-by rows for a hold buffer are full which can increase the wire delay more than expected and force an increase in the clock cycle time. To avoid degrading the clock frequency when this situation is encountered, we instead increase the threshold by one buffer and accept a small increase in area. Once the assignment of the hold buffers to the logic rows are determined, each logic row is legalized to remove all cell overlaps using [141]. Then the algorithm synthesizes and places the clock tree and produces the final legal placement [142]. Because the logic cells are mapped to their original logic rows, the location of the sink nodes of the clock network are minimally modified. Therefore, the location of the tapping points of the clock network, i.e., the placement of the clock splitters, will be similar to those of the original clock network. 
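The row-assignment heuristic described above can be summarized with the C++ sketch below; the data structures, the distance metric, and the tie-breaking are illustrative assumptions and do not reproduce the actual qPlace implementation.

#include <cstdio>
#include <cstdlib>
#include <vector>

// One inserted hold buffer and the logic gate (fanout) it feeds.
struct Buffer { int fanoutRow; double fanoutX; int row = -1; double x = 0.0; };

// Current occupancy of each logic row and the per-row gate threshold.
struct RowState { std::vector<int> gatesPerRow; int capacity; };

// Nearest row to 'fromRow' whose occupancy is still below the threshold;
// returns -1 if every row is already full.
int nearestPartialRow(const RowState& rs, int fromRow) {
    int best = -1, bestDist = 1 << 30;
    for (int r = 0; r < static_cast<int>(rs.gatesPerRow.size()); ++r) {
        if (rs.gatesPerRow[r] >= rs.capacity) continue;
        int d = std::abs(r - fromRow);
        if (d < bestDist) { bestDist = d; best = r; }
    }
    return best;
}

void assignHoldBuffers(std::vector<Buffer>& buffers, RowState& rs,
                       double bufferWidth) {
    for (Buffer& b : buffers) {
        int row = nearestPartialRow(rs, b.fanoutRow);
        if (row < 0) {          // all rows full: relax the threshold and accept
            ++rs.capacity;      // a small area increase rather than stretching
            row = b.fanoutRow;  // wires across distant rows
        }
        b.row = row;
        // Same row as the fanout gate: shift by one buffer width; otherwise
        // keep the fanout gate's x coordinate and let row legalization remove
        // any residual overlap.
        b.x = (row == b.fanoutRow) ? b.fanoutX + bufferWidth : b.fanoutX;
        ++rs.gatesPerRow[row];
    }
}

int main() {
    RowState rs{{12, 10, 13, 9}, 12};               // 4 rows, threshold = 12
    std::vector<Buffer> bufs{{0, 35.0}, {2, 80.0}}; // hypothetical buffers
    assignHoldBuffers(bufs, rs, 10.0);
    for (const Buffer& b : bufs)
        std::printf("buffer -> row %d, x = %.1f\n", b.row, b.x);
    return 0;
}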
In summary, the incremental placement algorithm tries to fulfill three objectives: (i) preserve the layout area by distributing the hold buffers among partially-filled rows; (ii) minimize the perturbation to the original placement solution to facilitate the clock tree synthesis and control the clock skew; and (iii) minimize the adverse effect of additional cell and interconnect delays, due to the insertion of hold buffers, on the maximum clock frequency of the circuit. 7.3.5.2 Placement from Scratch For some circuits and variation settings, the number of inserted hold buffers per row can be comparable to the initial number of gates. Thus, the incremental placement algorithm is forced to make substantial modifications to the original placement solution resulting in a significant increase in the total wirelength. For these cases, it is beneficial to re-place the updated netlist without any row restrictions. Once the placement and clock tree synthesis are completed, the timing closure flow adds a final round of hold buffer insertion to ensure timing requirements under nominal conditions 169 Figure 7.4: Grid-based variation model with uniform grids. are satisfied. Simulation results show that the total number of required hold buffers in this phase is much smaller than the size of the netlist and can be placed in the reserved empty spaces next to existing logic cells without requiring multiple iterations between placement, clock synthesis, and timing closure steps. 7.4 Evaluation Flow We evaluated the timing yield of our final circuits by translating them to System Verilog (SV) netlists using Python scripts, similar to [147], and running dynamic Monte Carlo (MC) co-simulations in Cadence NCsim, checking all setup and hold constraints. This section first presents the variation model used and then details of the co-simulation Monte Carlo flow. 7.4.1 Variation Model In this subsection, we summarize the variation model utilized to generate the random gate delays used in the MC simulations for evaluating timing yield [25]. To account for the spatial correlation of timing uncertainties, we employ a placement-aware variation model, utilizing a grid-based model similar to [25, 148] illustrated in Fig. 7.4. In this model, the entire layout area is considered to be inside one grid bin (level 0) where we assume 170 global variations affect all the gate delays. Subsequently, to account for the local variations on the chip, we divide the layout area level by level, each bin is subdivided into smaller bins. At each level, it is assumed that all process parameters and variations within the same bin have the same characteristics. The differences of bins are determined by the local variation of the level and different hierarchy levels are assumed to be statistically independent [25]. By adding up the time variations induced by all levels, the delay factor of each gate can be defined. The delay of all gates produce a data set which follows a Gaussian distribution. We set the targeted standard variation of this Gaussian distribution data set under process variations based on the process control monitor (PCM) data for 350nm fabrication process SFQ5ee developed by at MIT Lincoln Laboratory (MIT LL) [149]. Then the parameters for global and local variations can be set according to the targeted standard variation [25] [150] [151]. 7.4.2 Dynamic Simulation In order to run Monte Carlo (MC) simulations, we define gate delay scaling factors for each gate in the circuit and write them as System Verilog macros. 
According to our variation model, we generate a new random delay factor set for each Monte Carlo simulation. In case of a plain simulation without random gate delays, the gate delay factors are set to one. This is used to verify the circuit in the nominal conditions before Monte Carlo simulations. A “golden” behavior netlist is co-simulated to validate the circuit functionality with the same random input vectors. A MC run is considered as a “pass” only when there are no setup or hold violations and no mismatches between the primary outputs of the model under simulation with those of the golden results. 171 7.5 Simulation Results We compare the proposed method with a baseline approach proposed in [25] that applies a constant hold margin to each timing critical path in the circuit. The fixed margin approach does not consider clock topology and requires, in general, multiple iterations to obtain a design-specific fixed margin that achieves a high timing yield while managing the overhead in terms of incurred area. In contrast, our proposed method adds more hold margins to paths where the lowest common ancestor of sequentially adjacent FFs is higher in the clock tree and thereby achieves superior results over all the benchmarks. We also compare the two proposed re-placement methods, i.e., placement from scratch and incremental placement. We run Monte Carlo simulations on the placed ISCAS’85 benchmarks [28]. All experiments were run on two Intel Xeon E5-2450 v2 CPUs with 128 GB of RAM. We experimented with both 2σ and 3σ variations in hold time margin with random Gaussian distributed gate delays to analyze the trade-off between area and timing yield. For the fixed margin approach, we evaluated hold margin values of 10ps and 4ps, before settling on a compromise of 7ps, optimizing both buffer area overhead and timing yield. In all our experiments, the additional white-space around logic cells and hold buffers designed to reduce local routing congestion (i.e., ensuring a routable solution is produced) and inserting hold buffers in the final step of timing closure, is set to 8× the routing track (10μm). To solve the MILP problem for clock tree synthesis, the CPLEX time limit is set to 60 minutes. Table 7.2 lists the timing yield values and the number of inserted hold buffers before and after re-placement, for the baseline, 2σ, and 3σ approaches. In these experiments, the clock frequencies are set to the delay of the longest timing path in addition to a small setup margin. In our simulation results, the average clock frequency of fixed and variation-based hold fixing approaches differ by an average of 5% across all the benchmarks. Compared with the fixed margin approach, our 3σ approach produces netlists with an average reduction of 8.4% in terms of number of hold buffers and an increase in timing yield of 6.2%. The 2σ approach achieves an average of 21.9% saving in terms of number of hold 172 buffers with a 1.7% higher timing yield. The primary source of improvement over the baseline approach comes from the application of CPPR to the clock tree topology. Note that these results assume no common clock path between PIs/POs and the logic cells. If we estimate the common clock path (CCP) length of PI/PO paths with the average CCP length of the circuit, the 3σ approach achieves an average of 10.2% saving in hold buffers over the fixed margin approach. Alternatively, if we simply exclude JTLs on PI/PO paths, the 3σ approach saves an average of 14.1% hold buffers. 
The results of the placement from scratch algorithm are shown in Table 7.3. The 3σ approach shows a 2.0% saving on the number of hold buffers with a 5.9% improvement on timing yield, when compared with the baseline approach. Additionally, the 2σ approach shows a 19.4% saving on the number of hold buffers with a 0.9% drop on yield. Finally, aggregating Tables 7.2, 7.3, and 7.4 provides a comparison between placement from scratch and incremental placement approaches. Under the assumed variations, incremental placement outperforms the placement from scratch approach, with savings on hold buffers with 22.3%, 27.5% and 24.6% under fixed, 3σ and 2σ margin approaches, respectively. Additionally, incremental placement minimizes the total power consumption associated with DC bias resistors in each SFQ cell by reducing the number of inserted hold buffers, hence static power consumption, which is responsible for most of the circuit power dissipation in standard RSFQ logic [131]. The main advantage in terms of reducing the number of hold buffers originates from minimizing the perturbation to the original placement solution and layout area. As mentioned earlier, our flow utilizes the the original timing-aware clock topology, which considers the criticality of the data paths in terms of timing slacks, after placement and legalization of the hold buffers. Therefore, less perturbations to the placement solution and the fixed clock topology lead to fewer changes in the location of clock splitters, less modifications to the clock arrival times, hence less effort on timing closure and hold fixing. Table 7.5 presents the results including the minimum clock cycle and the layout area by applying two placement approaches under 3σ margin approach. Under the assumed variations, 173 incremental placement has an average overhead of 12.4% and 0.2% in terms of minimum clock cycle time and layout area, respectively. By distributing the hold buffers such that the logic and hold buffers are somewhat uniformly distributed among all the rows, incremental placement minimizes the impact on layout area. The reason behind the degradation of clock cycle times is that incremental placement sometimes creates setup critical paths by adding to the wire delay of paths with inserted hold buffers, whereas placement from scratch approach, aimed at minimizing the total wirelength without row restrictions, manages to minimize more of the wirelengths and hence reduces the long wires on some paths which improves the clock cycle time. Table 7.2: Number of hold buffers (JTLs) and timing Yield (%) for the fixed margin (7ps) and the proposed variation-based 3σ and 2σ hold margin with incremental placement. 
fixed margin (7ps) 3σ 2σ Design Std Dev Yield(%) # JTLs plc # JTLs replc Yield(%) # JTLs plc # JTLs replc Yield+(%) # JTLs-(%) Yield(%) # JTLs plc # JTL replc Yield+(%) # JTLs-(%) c432 0.0808 99.1 1508 1564 99.7 1012 1420 0.6 9.2 95.4 1012 1185 -3.7 24.2 c499 0.0808 95.9 564 584 99.7 426 531 4.0 9.1 99.3 426 470 3.5 19.5 c880 0.0808 90.4 1031 1151 99.8 757 1018 10.4 11.6 95.4 757 871 5.5 24.3 c1355 0.0809 99.1 577 591 100.0 431 525 0.9 11.2 97.8 431 470 -1.3 20.5 c1908 0.0808 97.5 1085 1150 99.7 739 985 2.3 14.3 98.3 739 830 0.8 27.8 c2670 0.0831 93.8 2831 3570 99.8 2513 3573 6.4 -0.1 96.1 2513 3196 2.5 10.5 c3540 0.0829 93.7 2055 2294 99.7 1416 2190 6.4 4.5 93.5 1416 1770 -0.2 22.8 c5315 0.0829 90.2 4199 5036 99.1 3055 4747 9.9 5.7 91.4 3055 3939 1.3 21.8 c6288 0.0829 92.5 4697 6039 99.2 3592 5404 7.2 10.5 93.0 3592 4685 0.5 22.4 c7552 0.0829 86.8 2601 2990 98.9 1794 2745 13.9 8.2 94.2 1794 2252 8.5 24.7 AVG. 0.0819 93.9 2115 2497 99.6 1574 2313.8 6.2 8.4 95.4 1574 1967 1.7 21.9 Table 7.3: Number of hold buffers (JTLs) and timing yield (%) for the fixed margin (7ps) and the proposed variation-based 3σ and 2σ hold margin with placement from scratch. fixed margin (7ps) 3σ 2σ Design Std Dev Yield(%) # JTLs plc # JTLs replc Yield(%) # JTLs plc # JTLs replc Yield+(%) # JTLs-(%) Yield(%) # JTLs plc # JTL replc Yield+(%) # JTLs-(%) c432 0.0809 90.7 1508 1935 98.8 1012 1685 8.9 12.9 95.6 1012 1385 5.4 28.4 c499 0.0810 99.2 564 698 99.9 426 731 0.7 -4.7 95.9 426 601 -3.3 13.9 c880 0.0808 97.9 1031 1408 99.5 757 1394 1.6 1.0 92.6 757 1132 -5.4 19.6 c1355 0.0810 96.9 577 695 99.5 431 678 2.7 2.4 95.8 431 573 -1.1 17.6 c1908 0.0808 91.3 1085 1343 99.9 739 1274 9.4 5.1 92.5 739 1032 1.3 23.2 c2670 0.0819 95.7 2831 4170 99.5 2513 4155 4.0 0.4 95.1 2513 3615 -0.6 13.3 c3540 0.0829 94.3 2055 2922 99.3 1416 3105 5.3 -6.3 94.6 1416 2417 0.3 17.3 c5315 0.0828 90.8 4199 8507 97.8 3055 8621 7.7 -1.3 89.3 3055 7143 -1.7 16.0 c6288 0.0829 85.9 4697 7731 98.9 3592 7591 15.1 1.8 87.4 3592 6333 1.7 18.1 c7552 0.0829 94.7 2601 5115 98.3 1794 4658 3.8 8.9 89.1 1794 3742 -5.9 26.8 AVG. 0.0818 93.7 2115 3452 99.1 1574 3389 5.9 2.0 92.8 1574 2797 -0.9 19.4 Table 7.4: Number of hold buffers (JTLs) for placement from scratch vs incremental placement approaches. fixed margin (7ps) 3σ 2σ Design # logic gates+splitters # JTLs scratch # JTLs incre. Saving (%) # JTLs scratch # JTLs incre. Saving (%) # JTLs scratch # JTLs incre. Saving (%) c432 3760 1935 1564 19.2 1685 1420 15.7 1385 1185 14.4 c499 1916 698 584 16.3 731 531 27.4 601 470 21.8 c880 3585 1408 1151 18.3 1394 1018 27.0 1132 871 23.1 c1355 1916 695 591 15.0 678 525 22.6 573 470 18.0 c1908 3596 1343 1150 14.4 1274 985 22.7 1032 830 19.6 c2670 6841 4170 3570 14.4 4201 3573 14.9 3615 3196 11.6 c3540 7831 2922 2294 21.5 3105 2190 29.5 2417 1770 26.8 c5315 15575 8507 5036 40.8 8621 4747 44.9 7143 3939 44.9 c6288 14169 7731 6039 21.9 7591 5404 28.8 6333 4685 26.0 c7552 9420 5115 2990 41.5 4658 2745 41.1 3742 2252 39.8 AVG. 6861 3452 2497 22.3 3394 2314 27.5 2797 1967 24.6 Note that if the timing uncertainties were significantly higher, more hold buffers would be needed and the perturbations to the original gate locations would be larger, increasing 174 Table 7.5: Clock cycle time (CCT) and area comparison for placement from scratch vs incremental placement (3σ). 
Placement from Scratch Incremental Placement Design Std Dev CCT (ps) Area CCT (ps) CCT Overhead Area Area Overhead c432 0.0809 116.6 3.72E+07 139.6 19.7% 3.64E+07 -2.3% c499 0.0810 115.8 1.80E+07 121.7 5.2% 1.82E+07 1.0% c880 0.0808 118.9 3.19E+07 117.4 -1.2% 3.16E+07 -1.0% c1355 0.0810 82.0 1.79E+07 92.5 12.8% 1.80E+07 0.1% c1908 0.0808 88.6 3.20E+07 115.8 30.7% 3.15E+07 -1.6% c2670 0.0819 249.3 7.17E+07 302.2 21.2% 7.24E+07 0.8% c3540 0.0829 146.3 7.11E+07 157.0 7.3% 7.26E+07 2.0% c5315 0.0828 299.5 1.44E+08 324.5 8.4% 1.44E+08 0.4% c6288 0.0829 204.5 1.30E+08 248.9 21.7% 1.32E+08 1.6% c7552 0.0829 306.3 9.81E+07 299.9 -2.1% 9.95E+07 1.5% AVG. 0.0818 172.8 6.52E+07 192.0 12.4% 6.57E+07 0.2% the overheads of incremental placement. For example, consider circuit c880 with variation increased from 0.080 to 0.134. Then, incremental placement overheads increase to 0.4% in area and 8% in minimum clock cycle time. Table 7.6: Runtime for the fixed margin (7ps) and the proposed variation-based 3σ and 2σ hold margin with incremental placement. Run-time (sec) Design fixed margin (7ps) 3σ 2σ c432 101 92 93 c499 57 56 55 c880 95 87 86 c1355 59 56 57 c1908 90 87 86 c2670 206 199 199 c3540 245 235 237 c5315 618 594 593 c6288 516 484 489 c7552 383 363 361 AVG. 237 225 226 Finally, to quantify the scalability of our proposed hold time fixing algorithm, Table 7.6 summarizes the run-time of our timing closure flow beginning after clock tree synthesis and including incremental placement. Our variation-aware approach takes similar time as the fixed margin approach and even for benchmarks with a large gate count takes less than 10 minutes, which verifies the scalability of the proposed algorithm. For running CPLEX, the time limit is set to 60 mins. 175 7.6 Summary In this chapter we presented a variation-aware hold time fixing methodology for SFQ circuits. Aiming at reducing the total number of inserted hold buffers, we consider the the worst-case scenario in terms of gate delay variations given timing uncertainties and apply the common path pessimism removal technique in the clock tree to eliminate the common clock path from considerations. By tuning the algorithmic parameters, the flow allows a trade-off between timing yield and area of the circuit. Furthermore, we present two placement methodologies to place the hold buffers incrementally or from scratch which can be guided by the degree of timing uncertainties expected and creates a trade-off between the number of inserted buffers, the layout area, and maximum clock frequency. 176 Chapter 8 Margin and Yield Calculation 8.1 Overview SFQ cells exhibit high sensitivity to variations in JJ characteristics, inductor values, and biasing current levels. Sources of variation can be independent (truly random variations) or correlated (systematic variations). Process and circuit variations tend to be large in current fabrication processes and design practices [29]. These variations can cause an SFQ logic gate to malfunction. For instance, if the biasing current is too high, the gate will generate an output pulse every cycle whereas if it is too low, it may never generate any output pulse independent of whether an input pulse comes in or not. Above reasons increases the sensitivity of the circuit to variations of the component parameters and consequently complicate the design of a single gate [29]. 
Additionally, in contrast to CMOS designs which primarily use standard cell libraries, SFQ logic requires custom design of cell structures and parameters for specific applications. Critical margin calculation, yield and yield roll-off analysis using Monte-Carlo (MC) simulations have been extensively used to estimate the robustness of a logic cell [152]. Parametric yield of a cell can be defined as the percentage of the correctly functioning cell instances. In practice, each parameter tends to exhibit a probability distribution function (PDF) for its values. To calculate the parametric yield of a cell, Monte-Carlo (MC) simulations are performed where the cell parameter values are randomly chosen according to their 177 respective PDFs. Although the calculated yield based on MC simulations can estimate the robustness quite accurately, MC simulations are computationally expensive. Furthermore, using MC-based yield estimation throughout the cell optimization process is inefficient as a large number of simulations should be performed at each iteration of the optimization process and after each change in the parameter values. Therefore, it may only be used at the final stage of the optimization process to quantify the robustness of the optimized cell. A simpler method of robustness evaluation called critical margin (CM) calculation has been extensively used in the literature [29][12][31]. In this method, a binary search over a predefined range of values for each parameter is performed while all other parameters are fixed at their nominal values. Every time a parameter is changed, a simulation is performed to test whether the cell functions correctly. Parameter margins are calculated using upper and lower bound values for each parameter. The critical margin is calculated as the smallest margin among all parameters. This method only considers the alteration of each parameter while other parameters are fixed at their nominal values. Consequently, it does not capture the dependence of parameters values on each other. It also over-constraints the design space of the cell parameters as it enforces the worst-case margin of one parameter on all other cell parameters. Hence, critical margin calculation fails to evaluate the robustness of a cell accurately [31]. The focus of this chapter is to propose new margin calculation methods for SFQ cells with a large number of parameters. Specifically, the goal of the proposed methods is to calculate a set of margins for which the cell yield will be nearly one if all parameters lie within specified margins. The key contributions of this chapter can be summarized as follows. • We present novel margin calculation algorithms to estimate a set of margins for all parameters in a logic cell such that if all parameters are spread inside these margins, yield is nearly one. • The proposed algorithm works irrespective of the cell structure and topology. 178 • We extend the proposed margin calculation algorithm by clustering parameters into hyper-parameters and reducing the parameter value search space. • We consider mixed positive and negative changes to variability sources and efficiently find corners of a feasible parameter region for SFQ logic cells. If all parameter values lie within the convex hull of the reported corners, the yield tends to be near one. • We present a machine learning based approach to model the parameteric yield of logic cells. 
This model is used to estimate the Monte Carlo (MC) based yield in a significantly smaller amount of time than simulating every MC sample.

• The proposed algorithms can be used to evaluate the robustness of a cell throughout the optimization process efficiently, using a small number of simulations.

The rest of this chapter is organized as follows. Prior work is outlined in Section 8.2. The proposed algorithms for margin calculation, along with simulation results, are detailed in Section 8.3. The chapter is concluded in Section 8.4.

8.2 Preliminaries

Margins are typically calculated using the line search (LS) method. This method performs a binary search over the range of values for each parameter, while the other parameters are fixed at their nominal values [31]. The assumption is that if the cell works when the value of a parameter p is equal to a and when it is equal to b, then the cell will work for any value in the range [a, b] [153]. Although this assumption is reasonable for a circuit whose performance depends on a single parameter, it fails when there is more than one parameter. More precisely, even if the circuit functions correctly when parameters p and q are set to any value in [a, b] and to c, respectively, and it functions correctly when p and q are set to a and to any value in [c, d], respectively, there is no guarantee that the circuit will function correctly when parameters p and q are set to values in the ranges [a, b] and [c, d], respectively. This scenario illustrates the main source of difficulty in precise margin calculation. Since this method calculates margins for each parameter independently of the variations of the other parameters, we refer to it as Single Parameter Change Margin Calculation, or SPCMC for short. Notice that, as a result of the single parameter change (SPC) assumption, the calculated parameter margins tend to be optimistic (overestimating the actual margins). This is partly addressed by the concept of the critical margin, which sets all positive (negative) parameter margins to the least positive (negative) margin. Consider a nominal value V_i and margins [NM_i, PM_i] for parameter i calculated by the SPCMC method. With n parameters, a margin polyhedron may be constructed with its two all-negative and all-positive corners located at ⟨V_1 × (1 − |NM_1|), ..., V_n × (1 − |NM_n|)⟩ and ⟨V_1 × (1 + PM_1), ..., V_n × (1 + PM_n)⟩, respectively. With the critical margin definition, the aforesaid margin cuboid instead has its two extreme corners located at ⟨V_1 × (1 − |NCM|), ..., V_n × (1 − |NCM|)⟩ and ⟨V_1 × (1 + PCM), ..., V_n × (1 + PCM)⟩, respectively.

Figure 8.1: Circuit diagram for the AND2 cell.

The circuit diagram for an AND2 (2-input AND) cell is shown in Fig. 8.1. It has 20 parameters, including inductance values (L1–L8), junctions (J1–J9), and biasing currents (I1–I3). The margins calculated for all parameters of the AND2 cell using the SPCMC method are depicted in Fig. 8.2. As can be observed, the value of parameter J8 can deviate from −72% to +25% of its nominal value while the other parameters are fixed. The binary search for each value is performed using an initial bound of ±100%. In the case of the AND2 gate, the critical margin is equal to 25%.
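For reference, a minimal C++ sketch of this single-parameter line search is given below; cellPasses stands in for an analog simulation of the cell checked against its HDL model, and its body here is only a placeholder with an arbitrary pass criterion.

#include <cstdio>
#include <vector>

// Placeholder pass/fail check. In practice this would simulate the cell with
// the given parameter values and compare its response against the HDL model;
// the threshold rule below is fake and only keeps the sketch self-contained.
bool cellPasses(const std::vector<double>& params) {
    double sum = 0.0;
    for (double p : params) sum += p;
    return sum < 12.0;                       // arbitrary stand-in criterion
}

// Binary search for the upper margin of parameter 'idx', expressed as a
// fraction of its nominal value, while all other parameters stay at their
// nominal values; 'hi' is the initial search bound (1.0 means +100%).
double upperMargin(std::vector<double> params, int idx,
                   double hi = 1.0, double tol = 0.01) {
    const double nominal = params[idx];
    double lo = 0.0;                         // largest deviation known to pass
    while (hi - lo > tol) {
        double mid = 0.5 * (lo + hi);
        params[idx] = nominal * (1.0 + mid);
        if (cellPasses(params)) lo = mid; else hi = mid;
    }
    return lo;                               // e.g., 0.25 corresponds to +25%
}

int main() {
    std::vector<double> nominal{2.5, 3.0, 4.0};   // hypothetical parameters
    std::printf("upper margin of parameter 0: +%.0f%%\n",
                100.0 * upperMargin(nominal, 0));
    return 0;
}

The lower margin is obtained symmetrically by searching in the negative direction, and the critical margin reported for the cell is the smallest of the resulting per-parameter margins.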
We can calculate the yield of a cell by generating MC samples based on the respective PDF of process parameters. Monte Carlo yield results are often shown for increasing standard deviation of all parameters (parameter spread) on a yield roll-off curve. In such a curve, different standard deviations are considered for process paramaters, and MC simulations are performed to estimate the yield for those parameter spread values. However, this method assumes that all parameters have the same standard deviation value which is not realistic. Additionally, for each parameter spread value a round of MC simulations with more than 1000 simulations must be performed, which is computationally expensive [152]. 8.3 Proposed Methods In this section we propose algorithms for accurate margin calculation. Based on the proposed margin and yield definitions, the idea is to calculate a set of margins for all parameters in 181 the design, such that calculated yield is one if all parameters lie within these margins. Our proposed solution is more accurate than SPCMC method as it considers the inter-dependence of variables on each other. Here, three algorithms are introduced for accurate margin calculation. 1) individual pa- rameter update (IPU), 2) simultaneous parameter update (SPU) 3) Hybrid Margin Calculation (HMC) Inputs to all of these algorithms are the parameters in the cell design and an HDL model describing the correct operation of the cell. Details of the proposed approaches are explained in the following subsections. 8.3.0.1 Individual Parameter Update (IPU) In this algorithm, a set of simulations are performed to calculate the margin for each parameter. Psuedo code for IPU algorithm is given in Algorithm 5. The process starts by creating variables (X j ) representing the value of each component in the cell design. All variables are initialized to their corresponding nominal value (line 0). In each iteration, upper margin for one parameter (U i X j ) is updated using SPCMC method, while other parameters are fixed (line 3). This is achieved by UpdateByLS function which calculates the upper bound for each parameter given input parameters. Next, the mean value for X j , namely M i X j , is calculated as the average of upper bound value (U i X j ) and previous value (X i−1 j ) of the corresponding parameter (line 4). A simulation is performed while variable j is changed to M i X j and other parameters are fixed. If the simulation for this set of parameters fails, M i X j value is updated until a passing set is achieved (lines 5-9). Once a mean value M i X j is found such that cell functions correctly for given set of variables, X i j value is updated to its corresponding mean value M i X j (line 10). The algorithm continues by calculating the upper bound for next parameter while some of the parameters are updated to their mean values (X i k ,∀k = 1...j−1) and others are fixed at their previous values (X i−1 t ,∀t =j...p) (cf., Alg. 5 line 3). 182 Parameter update and cell simulation continues until stopping criteria is satisfied (lines 12-14), that is, once the absolute difference between previous value (X i−1 j ) and current value (X i j ) for all parameters is less than a predefined threshold value. Output of IPU algorithm is a set of upper bound margins for all parameters. Lower bound margins can be calculated by replacing lower bound values (L i j ) by upper bounds in the Alg. 5. 
If there are n parameters in the cell design, the possible sets of values for all parameters form two n-dimensional hyper-rectangles (one in the all-positive direction and one in the all-negative direction). If all parameters lie within these hyper-rectangles, the cell functions as specified in the HDL model.

Assuming a cell contains only three parameters (X, Y, and Z), the IPU algorithm works as follows. First, the upper bound for X is calculated while Y = Y_0 and Z = Z_0. The mean value for X is calculated as the average of X_0 and U^0_X. If the cell works with the set of parameters (X = M^1_X, Y_0, Z_0), then X_1 is updated to M^1_X. Otherwise, M^1_X is updated repeatedly until a passing set of parameters is found. Next, the upper bound for Y is calculated while X = X_1 and Z = Z_0. Similarly, Y is updated to its mean value Y = M^1_Y (lines 5-9). Finally, the upper bound for Z is calculated with X = X_1 and Y = Y_1. The algorithm continues updating parameters X, Y, and Z until none of them can be updated any further, or until the difference between the current and previous values of all parameters drops below a predefined threshold.

Algorithm 5 Individual Parameter Update (IPU)
0: Initialize X^0_j for j = 1...p; j = 1, i = 1, stop = false
1: while (!stop) do
2:   for each j in 1...p
3:     U^i_{X_j} = UpdateByLS(X^i_k, X^{i−1}_t, ∀k = 1...j−1, t = j...p)
4:     M^i_{X_j} = (U^i_{X_j} + X^{i−1}_j) / 2
5:     if (!pass(X^i_k, M^i_{X_j}, X^{i−1}_t, ∀k = 1...j−1, t = j+1...p))
6:       while (!pass(X^i_k, M^i_{X_j}, X^{i−1}_t, ∀k = 1...j−1, t = j+1...p)) do
7:         M^i_{X_j} = (M^i_{X_j} + X^{i−1}_j) / 2
8:       end while
9:     end if
10:    X^i_j = M^i_{X_j}
11:  end for
12:  if (|X^i_j − X^{i−1}_j| < threshold, ∀j = 1...p)
13:    stop = true
14:  end if
15:  i = i + 1
16: end while

8.3.0.2 Simultaneous Parameter Update (SPU)

Pseudo-code for this algorithm is given in Algorithm 6. It differs from Alg. 5 in that it updates all parameters to their corresponding mean values simultaneously rather than individually. Initially, all parameters are set to their nominal values and SPCMC is performed to calculate the upper bound of each parameter. At each iteration i, the mean value (M^i_{X_j}) of every parameter is calculated as the average of its upper bound (U^{i−1}_{X_j}) and its previous value (X^{i−1}_j) (cf. Alg. 6, lines 2-4). If the cell does not function correctly given these mean values, all mean values are updated until a set of passing points is found. This is achieved by averaging the mean value and the previous value of each parameter (lines 5-9). Once a passing set is found, all parameters are updated to their mean values (lines 10-12). Additionally, the upper bounds of all parameters are updated using the SPCMC method (line 16; the UpdAllByLS function updates all upper bounds). As in Alg. 5, the stopping criterion is met once the absolute difference between the previous and current values of all parameters is below the predefined threshold.

Assuming a cell with three parameters (X, Y, and Z), the SPU algorithm works as follows. Initially, the upper bounds (U^0_X, U^0_Y, and U^0_Z) are calculated using the SPCMC method. Mean values are calculated as the averages of the upper bounds and the initial values (X_0, Y_0, and Z_0). Until the cell functions correctly, the mean values are updated by averaging the mean values and the previous values (e.g., M^1_X = (M^1_X + X_0)/2). Once the cell passes the test, all parameters are updated to their mean values, and the upper bound of each parameter is recomputed using the SPCMC approach. The final X_j values denote a set of margins representing the upper bounds of all parameters.
Similar to Alg. 5, lower bound margins can be calculated by replacing the upper bounds with lower bounds (L^i_{X_j}). The upper and lower bound margins of each parameter define an interval, and consequently two hyper-rectangles are formed as the Cartesian products of all intervals and the nominal center. If all parameters lie within these hyper-rectangles, the cell should function correctly.

Alg. 6 uses fewer simulations to calculate the margins than Alg. 5. The reason is that Alg. 6 updates all parameters at once, rather than individually as in Alg. 5; it therefore moves faster toward the boundaries of the hyper-rectangle of passing points. On the other hand, the resulting margins may be more conservative than those of Alg. 5, because it searches for a whole set of parameter values for which the cell functions correctly, as opposed to searching for a single parameter value by changing one variable at a time as in Alg. 5. Intuitively, Alg. 5 takes small steps toward the margins (the boundaries of the hyper-rectangle), whereas Alg. 6 takes larger steps toward those boundaries.

Algorithm 6 Simultaneous Parameter Update (SPU)
0: Initialize X^0_j, U^0_j for j = 1...p; j = 1, i = 1, stop = false
1: while (!stop) do
2:   for each j in 1...p
3:     M^i_{X_j} = (U^{i−1}_{X_j} + X^{i−1}_j) / 2
4:   end for
5:   if (!pass(M^i_{X_j}, ∀j = 1...p))
6:     while (!pass(M^i_{X_j}, ∀j = 1...p)) do
7:       M^i_{X_j} = (M^i_{X_j} + X^{i−1}_j) / 2
8:     end while
9:   end if
10:  for each j in 1...p
11:    X^i_j = M^i_{X_j}
12:  end for
13:  if (|X^i_j − X^{i−1}_j| < threshold, ∀j = 1...p)
14:    stop = true
15:  end if
16:  U^i_{X_j} = UpdAllByLS(X^i_j, ∀j = 1...p)
17:  i = i + 1
18: end while

Consequently, a hybrid approach is presented to calculate a larger set of margins than that of Alg. 6 while using fewer simulations than Alg. 5.

8.3.0.3 Hybrid Margin Calculation (HMC)

The hybrid margin calculation (HMC) approach starts by using Alg. 6 to calculate an initial set of margins. To reduce the total number of simulations while running Alg. 6, the upper bounds (U_{X_j}) are calculated only once and not in every iteration (i.e., line 16 of Alg. 6 is skipped). We denote these margins as a passing set, namely P. Furthermore, a set of mean values for which the cell fails to function correctly and that is farthest away from P is recorded while running Alg. 6; this set of points is called F. It is worth mentioning that since the initial upper bound margins are calculated using the SPCMC method, there is no guarantee that the set of upper bound values is a passing set; hence, this set can be a candidate for F (cf. Section 8.2). Also, an empty vector B is initialized to keep track of the passing sets generated throughout the HMC method (Alg. 7, line 0).

First, we look for passing sets by changing only one parameter of set F at a time, as follows. The mean value of each parameter is calculated as the average of its corresponding P_{X_j} and F_{X_j} values (line 2). If the cell functions correctly using the set of parameters (F_{X_k}, M^i_{X_j}, ∀k = 1...j−1, j+1...p), this set is added to vector B (lines 3-5). After iterating over all parameters, we check vector B. There are two possible cases. 1) Vector B is not empty, so at least one passing set has been found. In this case, the function findOptimalSet(B) finds the passing set for which the sum of all margins is maximized, and this optimal set B_opt is the output of the algorithm (lines 7-9). 2) Vector B is empty, so no passing set has been found yet. In this case, all parameters are set to their corresponding values in F_{X_j}.
At each iteration, we then update one parameter to its mean value (M^i_{X_j}) and check whether the circuit works given the updated parameter. If (X^i_k, X^{i−1}_t, ∀k = 1...j, t = j+1...p) defines a passing set, the algorithm terminates and this set is returned (lines 14-17). Otherwise, we continue with the next parameter. If all parameters have been updated to their mean values and no passing set has been found, we update set F by changing all of its values to their corresponding mean values (lines 19-21). The stopping criterion is met once the absolute differences between all P_{X_j} and F_{X_j} values drop below the predefined threshold. In this situation, set P is returned, which essentially means that the margins calculated using Alg. 6 are the largest possible margins and cannot be improved any further.

Figure 8.3: Margins for different parameters of the AND2 gate calculated using the HMC and SPCMC methods.

The margins calculated for the AND2 gate (2-input AND gate) using Alg. 7 are depicted in Fig. 8.3. As can be observed, there is a significant difference between the SPCMC method and the proposed method, especially in the lower bound margins of several parameters (e.g., L8 and I2).

Algorithm 7 Hybrid Margin Calculation (HMC)
0: Calculate P_{X_j}, F_{X_j} for j = 1...p; j = 1, i = 1, stop = false, B = {}
1: for each j in 1...p
2:   M^i_{X_j} = (F_{X_j} + P_{X_j}) / 2
3:   if (pass(F_{X_k}, M^i_{X_j}, ∀k = 1...j−1, j+1...p))
4:     add {F_{X_k}, M^i_{X_j}, ∀k = 1...j−1, j+1...p} to B
5:   end if
6: end for
7: if (!B.empty())
8:   B_opt = findOptimalSet(B)
9:   return B_opt
10: end if
11: Initialize X_j to F_{X_j} for j = 1...p
12: while (!stop) do
13:   for each j in 1...p
14:     M^i_{X_j} = (F_{X_j} + P_{X_j}) / 2
15:     X^i_j = M^i_{X_j}
16:     if (pass(X^i_k, X^{i−1}_t, ∀k = 1...j, t = j+1...p))
17:       return {X^i_k, X^{i−1}_t, ∀k = 1...j, t = j+1...p}
18:     end if
19:     for each j in 1...p
20:       F_{X_j} = M^i_{X_j}
21:     end for
22:   end for
23:   if (|F_{X_j} − P_{X_j}| < threshold, ∀j = 1...p)
24:     stop = true
25:     return {P_{X_j}, ∀j = 1...p}
26:   end if
27:   i = i + 1
28: end while

To sum up, the HMC method calculates the margins in two phases. In the first phase, a multi-dimensional binary search is performed between the nominal center ⟨V_1, ..., V_n⟩ and the all-positive corner ⟨V_1 × (1 + PM_1), ..., V_n × (1 + PM_n)⟩ as calculated by the SPCMC method. The search procedure changes all parameter values at the same time, in proportion to their respective search windows. For example, if the all-positive corner fails to pass, the next point checked for passing or failing is ⟨V_1 × (1 + PM_1/2), ..., V_n × (1 + PM_n/2)⟩. If this point passes, the next point checked is ⟨V_1 × (1 + 3PM_1/4), ..., V_n × (1 + 3PM_n/4)⟩, and so on. The search is stopped after a fixed number of steps or after a desired level of resolution is achieved. Suppose the search terminates at a new passing corner ⟨V_1 × (1 + α × PM_1), ..., V_n × (1 + α × PM_n)⟩, where 0 ≤ α ≤ 1. In the second phase, the HMC method performs an asymmetric local neighborhood search around this new passing point, attempting to raise each parameter margin as much as possible. This process, which is order dependent, yields the final passing corner ⟨V_1 × (1 + α_1 × PM_1), ..., V_n × (1 + α_n × PM_n)⟩, where α_i ≥ α for all i. The HMC method performs a similar process to correct the all-negative corner ⟨V_1 × (1 − |NM_1|), ..., V_n × (1 − |NM_n|)⟩, ending at ⟨V_1 × (1 − β_1 × |NM_1|), ..., V_n × (1 − β_n × |NM_n|)⟩. Because the HMC method accounts for the dependence among parameter values, the yield inside the resulting margin polyhedron is high.
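The first-phase search above can be summarized as a binary search on a single scale factor α that moves all parameters together from the nominal center toward the all-positive SPCMC corner. The following is a minimal Python sketch of that step only, assuming a passes() oracle over full parameter vectors and SPCMC positive margins as inputs; names and the fixed step count are illustrative, and the second-phase per-parameter neighborhood search is not shown.

```python
import numpy as np

def hmc_phase1_scaling(v_nom, pm, passes, steps=10):
    """Phase-1 HMC sketch: bisection on alpha so that all parameters move
    together along  v_i * (1 + alpha * PM_i)  from the nominal center.

    v_nom  : nominal parameter values (1-D array)
    pm     : positive SPCMC margins as fractions (1-D array)
    passes : simulation oracle; True if the cell works for a parameter vector
    Returns the largest passing alpha found (0 <= alpha <= 1).
    """
    v_nom, pm = np.asarray(v_nom, float), np.asarray(pm, float)
    lo, hi, best = 0.0, 1.0, 0.0        # alpha = 0 is the nominal center (assumed passing)
    for _ in range(steps):              # fixed number of bisection steps (resolution control)
        mid = 0.5 * (lo + hi)
        point = v_nom * (1.0 + mid * pm)
        if passes(point):               # passing point: push the lower bound up
            best, lo = mid, mid
        else:                           # failing point: pull the upper bound down
            hi = mid
    return best
```

The second phase would then perturb each α_i individually around the common α returned here, as described in the text.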
However, the HMC method only considers the two all-positive and all-negative parameter subspaces (i.e., it fails to consider cases where some parameter values are larger than their nominal values while others are smaller). Assume a cell has n parameters. Due to process variations, each of them can take a value larger or smaller than its nominal value; therefore, there are 2^n parameter subspaces. The HMC method only considers 2 of these possible subspaces, so the margins it calculates tend to be inexact.

To evaluate the effectiveness of the proposed approach and compare it with the conventional SPCMC method, we performed yield analysis using MC simulations. For this purpose, we calculated the margins as well as the critical margin for several logic gates using both the proposed HMC and the SPCMC methods. We used PSCAN2 for SPCMC-based margin calculation and circuit simulation [154]. For the yield analysis, we assumed different σ values for the process parameters. A total of 20,000 MC simulations were performed to calculate the yield for each gate and each margin calculation method. Table 8.1 shows that the HMC method calculates margins more accurately than the SPCMC method.

Table 8.1: Results of yield calculation for different margin calculation algorithms and various SFQ cells. OR2 and OR3 represent 2-input and 3-input OR gates.

            Yield (%), σ=0.1    Yield (%), σ=0.2    Yield (%), σ=0.3    # Simulations
Cell        HMC     SPCMC       HMC     SPCMC       HMC     SPCMC       HMC     SPCMC
INV         100     100         99.7    99.4        97.2    95.9        382     280
AND2        89.2    75          69.1    52.5        58.4    45.6        334     364
OR2         99.9    93.8        95.5    75          86.7    61.5        334     280
AND3        98.4    73          83.3    49          67.4    41          430     336
OR3         99.8    92          98.0    73          91.5    62          398     322
Average     97.5    86.8        89.1    69.8        80.3    61.2        375     289

Using the HMC method, the average yield increases by 10.7%, 19.3%, and 19.1% compared with the SPCMC method, assuming standard deviations of 0.1, 0.2, and 0.3 for the process parameters, respectively. The results show that predictions based on the LS method are highly optimistic because it ignores the dependencies among parameters. The average number of simulations for the HMC method increases by less than 20% compared with the LS method. Although the HMC method improves the yield inside the reported margin polyhedron, the yield values are still low for large standard deviations (i.e., σ = 0.2, 0.3). In the following section, we present an efficient method to find a margin polyhedron such that, if all parameters lie within this polyhedron, the percentage of false-positives in this region is below a given threshold, say 5%.

8.3.1 Feasible Parameter Region Calculation

The HMC method still suffers from two weaknesses: 1) the number of circuit parameters can be large (e.g., tens of parameter values), and 2) it performs the search only in the all-positive and all-negative parameter subspaces. To address the first weakness, we rely on the notion of a hyper-parameter and cast a set of cell parameters as being derived from this hyper-parameter by a simple multiplicative factor. Considering three hyper-parameters corresponding to inductances, JJ critical currents, and biasing currents, we can write the original vector of parameter values ⟨v_1, ..., v_n⟩ in terms of ⟨u_1, u_2, u_3⟩, where, as a result of a simple ordering of the cell parameters and without loss of generality, v_1 = γ_{11} × u_1, v_2 = γ_{12} × u_1, ..., and v_j = γ_{21} × u_2, v_{j+1} = γ_{22} × u_2, ..., and v_k = γ_{31} × u_3, ..., v_n = γ_{3,n−k+1} × u_3.
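As a small illustration of this mapping, the sketch below expands three hyper-parameter values into a full cell-parameter vector through fixed multiplicative factors. The grouping sizes and the γ values shown are hypothetical (loosely modeled on an AND2-like cell) and are not taken from the dissertation's cell library.

```python
import numpy as np

def expand_hyper_params(u, gamma_groups):
    """Expand hyper-parameters into cell parameters: v = gamma * u per group (sketch).

    u            : [u_L, u_J, u_I] scaling factors for inductances, JJ critical
                   currents, and bias currents.
    gamma_groups : list of three arrays of multiplicative factors, one array per
                   hyper-parameter group, in cell-parameter order.
    """
    return np.concatenate([g * u_i for g, u_i in zip(gamma_groups, u)])

# Hypothetical grouping: 8 inductances, 9 junction critical currents, 3 bias currents.
gammas = [np.full(8, 2.0e-12), np.full(9, 1.0e-4), np.full(3, 1.5e-4)]
v_nominal = expand_hyper_params([1.0, 1.0, 1.0], gammas)   # nominal point: all u_i = 1
```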
Note that the above parameter reduction is equivalent to assuming that the underlying source of variability affecting cell parameters v_1, ..., v_{j−1} is one and the same (systematic variations due to process manufacturing tend to affect all inductances in the same way), and similarly that the circuit parameters v_j, ..., v_{k−1} are affected by the same variability source (again, the manufacturing process introduces variability in the JJ critical currents that affects all JJs in the cell in the same way), and so on. Then, when applying the HMC method, we simply perform the binary search and the subsequent local neighborhood search in the three-dimensional space of u_1, u_2, u_3, and the large-dimensionality problem is solved. From this point on, for the sake of simplifying the description, we switch to the hyper-parameter space (although everything that follows also applies to the original parameter space, albeit with the curse of high dimensionality).

The second problem can be addressed by exploring all eight parameter value subspaces (i.e., octants) one at a time. Consider, without loss of generality, an octant where, starting from the nominal values ⟨U_1, U_2, U_3⟩, which denote the center of the full search space (i.e., the nominal center), hyper-parameters u_1 and u_2 change only in the positive direction (they can only increase from their nominal values U_1 and U_2), whereas hyper-parameter u_3 changes in the negative direction (it can only decrease with respect to its nominal value U_3). We call this octant Oct(+,+,−). In this case we conduct multi-dimensional searches in the space of ⟨u_1, u_2, u_3⟩, where u_1 to u_3 are bounded by the ranges [U_1, U_1 × (1 + α_1 × PM_1)], [U_2, U_2 × (1 + α_2 × PM_2)], and [U_3 × (1 − β_3 × |NM_3|), U_3], respectively. The hyper-parameter upper and lower bounds (e.g., PM_1, PM_2, and NM_3) are calculated using the SPCMC method. For the definitions of PM_1, PM_2, NM_3, and α_1, α_2, β_3, the reader may refer to Section 8.2.

Our margin calculation algorithm, Multiple Hyper-parameter Change Margin Calculation (MHCMC), performs multi-dimensional binary searches to obtain the extreme (most distant) corner point in each octant that is a PPV vector for the cell under consideration. Consider, without loss of generality, octant Oct(+,+,−). First, we perform a three-dimensional search in the space of ⟨u_1, u_2, u_3⟩ between the nominal center ⟨U_1, U_2, U_3⟩ and the vector ⟨U_1 × (1 + PM_1), U_2 × (1 + PM_2), U_3 × (1 − NM_3)⟩. If this is an FPV vector, we check the next vector (i.e., ⟨U_1 × (1 + PM_1/2), U_2 × (1 + PM_2/2), U_3 × (1 − NM_3/2)⟩), and so on. Once a desired level of resolution is achieved, this three-dimensional binary search is over and we record the most distant corner point in this octant that is a PPV vector. Additionally, we perform three two-dimensional binary searches in this octant. Each time, we choose two of the hyper-parameters for the binary search and fix the third hyper-parameter at its nominal value. Then, we perform a two-dimensional binary search over the ranges of these two hyper-parameters and find the most distant corner point from the center. In the above example, one of the two-dimensional searches is performed as follows. We fix hyper-parameter u_2 and do a two-dimensional binary search between the vectors ⟨U_1, U_2, U_3⟩ and ⟨U_1 × (1 + PM_1), U_2, U_3 × (1 − NM_3)⟩. The two-dimensional binary search is repeated for u_1 and u_2 while u_3 is fixed at U_3, and for u_2 and u_3 while u_1 is fixed at U_1. As a result, four corner points are found in each octant.
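The sketch below illustrates the full three-dimensional corner search over all octants, assuming a passes() oracle over hyper-parameter vectors and SPCMC-derived positive/negative margins. It is a simplified illustration, not the dissertation's implementation: only the main center-to-corner bisection is shown, and the three additional two-dimensional searches per octant (with one hyper-parameter fixed at its nominal value) are omitted.

```python
import itertools
import numpy as np

def octant_corners(u_nom, pos_m, neg_m, passes, steps=8):
    """MHCMC-style corner search sketch over all 2^d octants.

    u_nom : nominal hyper-parameter values, e.g. [U_L, U_J, U_I]
    pos_m : positive margins PM_i (fractions); neg_m: negative margins |NM_i|
    passes: oracle returning True if the cell works for a hyper-parameter vector
    Returns the most distant passing point found along each octant diagonal.
    """
    u_nom = np.asarray(u_nom, float)
    pos_m, neg_m = np.asarray(pos_m, float), np.asarray(neg_m, float)
    corners = []
    for signs in itertools.product([+1, -1], repeat=len(u_nom)):     # all octants
        signs = np.array(signs)
        # Extreme corner of this octant: +PM_i or -|NM_i| per dimension.
        extreme = u_nom * (1 + np.where(signs > 0, pos_m, -neg_m))
        # Bisection on a scale factor t in [0, 1] along the center-to-corner segment.
        lo, hi, best = 0.0, 1.0, u_nom.copy()
        for _ in range(steps):
            mid = 0.5 * (lo + hi)
            point = u_nom + mid * (extreme - u_nom)
            if passes(point):
                best, lo = point, mid
            else:
                hi = mid
        corners.append(best)
    return corners
```

The collected corner points would then be wrapped by a convex hull (e.g., scipy.spatial.ConvexHull) to approximate the feasible parameter region described next.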
The MHCMC algorithm repeats the above procedure for all of the octants. The feasible parameter region of each cell is then estimated as the convex hull of all the corner points found in the different octants. Figures 8.4 - 8.6 depict the corner points (shown as black spheres) and their corresponding convex hulls (the boundaries of the FPR, shown in red) for three SFQ cells (INV, AND2, and OR2). U_I, U_J, and U_L are the hyper-parameters corresponding to biasing currents, JJ critical currents, and inductances, respectively. As shown, some of the corner points are not included in the convex hull. The convex hull of the INV cell is more symmetrical with respect to the nominal center (i.e., (1,1,1)) than those of the other two cells. Additionally, the volume of the FPR for the INV cell is larger than those of the AND2 and OR2 cells (i.e., its parameter margins are wider). Therefore, the INV cell is expected to be more robust to variations than the other two cells. In the next section, we evaluate the parametric yield inside the margin polyhedron reported by the MHCMC method.

Figure 8.4: Feasible parameter region calculated using the MHCMC algorithm for the inverter cell. Corner points are shown as black spheres. U_I, U_J, and U_L are hyper-parameters corresponding to biasing currents, JJ critical currents, and inductances, respectively.

Figure 8.5: Feasible parameter region calculated using the MHCMC algorithm for the two-input AND cell. Corner points are shown as black spheres. U_I, U_J, and U_L are hyper-parameters corresponding to biasing currents, JJ critical currents, and inductances, respectively.

Figure 8.6: Feasible parameter region calculated using the MHCMC algorithm for the two-input OR cell. Corner points are shown as black spheres. U_I, U_J, and U_L are hyper-parameters corresponding to biasing currents, JJ critical currents, and inductances, respectively.

8.3.1.1 Parametric Yield Evaluation Results

To evaluate the effectiveness of the proposed algorithm, we calculated the parametric yield for different cells using Monte-Carlo (MC) simulations, similar to [155][31]. We only simulated the MC samples that lie inside the boundaries of the reported margin polyhedron. In these simulations, it is assumed that the hyper-parameters are independent random variables with normal distributions. The mean values are assumed to be equal to the nominal values of the hyper-parameters, and three different standard deviations are considered for each hyper-parameter (σ = 0.1, 0.2, and 0.3). Sources of variability and some of the parameter distributions in the MIT Lincoln Laboratory SFQ5ee fabrication process are reported in [156][19]. The parametric yield values for the various cells and sigma values are listed in Table 8.2; we performed 20,000 MC simulations for each entry in the table. As shown, the yield inside the margin polyhedron calculated using the MHCMC method is very close to 1 for all of the cells (i.e., the percentage of false-positive samples inside the reported margin polyhedron is below 5%). We also calculated the parametric yield inside the margin polyhedrons reported by the HMC method as the baseline of our comparison. Note that, since the HMC method reports separate margins for all the parameters and we consider the hyper-parameters as the sources of variability, the boundaries of the margin polyhedron reported by the HMC method in the hyper-parameter space are calculated as the critical margin among all parameters corresponding to each hyper-parameter.
Assume u_J to be the corresponding hyper-parameter for all JJ critical currents v_1, ..., v_k. The negative and positive margins of u_J are then calculated as min_{j=1,...,k} |NM_j| and min_{j=1,...,k} PM_j, respectively.

Table 8.2: The parametric yield values, numbers of simulations, and volumes of the margin polyhedrons for various SFQ cells calculated using the MHCMC and HMC methods. OR2 and OR3 denote 2-input and 3-input OR gates.

            Yield (%), σ=0.1     Yield (%), σ=0.2     Yield (%), σ=0.3     # Simulations        Normalized Volume
Cell        HMC     MHCMC        HMC     MHCMC        HMC     MHCMC        HMC     MHCMC        HMC     MHCMC
INV         100     100          99.7    100          97.2    100          382     235          0.8     1
AND2        89.2    99           69.1    96.9         58.4    95.5         334     185          3.0     1
OR2         99.9    99.9         95.5    99.6         86.7    99.4         334     219          1.3     1
AND3        98.4    99.9         83.3    98.8         67.4    98.8         430     161          0.7     1
OR3         99.8    100          98.0    99.9         91.5    99.9         398     204          2.9     1
Average     97.5    99.9         89.1    99.1         80.3    98.7         375     200          1.23    1

As shown in Table 8.2, the yield values for the INV cell are larger than those for the other cells. One reason is that the INV is a single-input cell while the others are two- or three-input cells; the operation of multi-input cells is affected by the arrival times of their inputs, which complicates their design [30]. The other reason is that the INV cell has a more symmetrical FPR and wider margins than the other cells (i.e., the corner points of its margin polyhedron are farther from the nominal center than those of the other cells). Additionally, the yield values for the AND2 cell are lower than those of the OR2 cell. In the AND2 cell, to produce an output value of 1, both input pulses must arrive within a fixed time window, whereas in the OR2 cell, an output value of 1 is produced once a pulse arrives at any of the inputs.

As observed in Table 8.2, the average yield of the MHCMC method is larger than that of the HMC method for all sigma values and cells. In particular, the average yield of the MHCMC method exceeds that of the HMC method by 10% and 18% for sigma values of 0.2 and 0.3, respectively. This is because the HMC method ignores all but two of the possible octants, performing its search only in the all-positive and all-negative subspaces. The yield inside the margin polyhedron reported by the HMC method drops as the sigma values increase, because samples more distant from the nominal center are tested and hence more false-positive samples are found inside the HMC margin polyhedron.

The total number of simulations for the MHCMC method is about 46% less than that of the HMC method. This is because the HMC method considers each parameter of a cell individually and searches over its upper and lower bounds, while MHCMC clusters parameters and performs the search only on the hyper-parameter values. The normalized volumes of the margin polyhedrons for each cell are also listed in Table 8.2 (the volumes are normalized to the volume of the margin polyhedron calculated by the MHCMC method). The average volume of the margin polyhedrons reported by the HMC method over the 5 cells is 1.23× that of the MHCMC method. As shown, in some cases the volumes reported by MHCMC are larger than those reported by HMC. For instance, using the MHCMC method for the AND3 cell, within a margin polyhedron whose volume is 1.4× that of the HMC, the total percentage of false-positive samples is reduced from 16.7% to 1.2% for a sigma value of 0.2 and from 32.6% to 1.2% for a sigma value of 0.3.
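To make the parametric yield evaluation inside a margin polyhedron concrete, the following is a minimal sketch: draw MC samples of the hyper-parameters, keep only those inside the convex hull of the MHCMC corner points, and simulate just that subset. The Delaunay-based inside test and the passes() oracle are illustrative choices for this sketch and are not claimed to be the original tool flow.

```python
import numpy as np
from scipy.spatial import Delaunay

def yield_inside_hull(corner_points, u_nom, sigma, passes, n_samples=20000, seed=0):
    """Estimate parametric yield inside the convex hull of MHCMC corner points (sketch)."""
    rng = np.random.default_rng(seed)
    hull = Delaunay(np.asarray(corner_points, float))       # triangulation of the FPR corners
    samples = rng.normal(loc=u_nom, scale=sigma, size=(n_samples, len(u_nom)))
    inside = samples[hull.find_simplex(samples) >= 0]        # keep samples inside the polyhedron
    if len(inside) == 0:
        return float("nan")
    n_pass = sum(passes(u) for u in inside)                  # simulate only the inside samples
    return n_pass / len(inside)
```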
Although the reported yield for all the cells using the MHCMC method is near 1, one might argue that the calculated margins may be too conservative (i.e., the volumes of the margin polyhedrons are too small), which automatically leads to high yield values. In other words, although nearly all the points inside the FPR are true-positives, there may be a large number of positive points (i.e., PPV vectors) outside this margin polyhedron. To evaluate how conservative the reported FPR is, we performed stratified sampling to evaluate the yield outside the boundaries of the reported FPR. We created multiple contour bands (CBs) outside the FPR. The corner points of each contour band expand the corners of the previous region (band) by 10%; that is, the corner points of the first CB are 1.1× those of the original feasible parameter region (OFR). An example of the OFR and the first two contour bands in a two-dimensional space is shown in Fig. 8.7. Then, to calculate the parametric yield within each CB, we created MC samples based on the distribution of the hyper-parameters and only simulated the samples that lie within the boundaries of that CB. The total number of sample points in each contour band was set to 20,000. It was assumed that the hyper-parameters have independent normal distributions with mean values equal to their nominal values and standard deviations of 0.3.

Figure 8.7: An example of the original feasibility region of a cell in two dimensions (purple region), along with the first two contour bands (light blue and pink regions) around it. Each contour band expands the corners of the previous one by 10%.

The parametric yield inside each of the two CBs for all the cells is shown in Fig. 8.8. As observed, for all of the cells, the yield drops below 60% in the second CB. We did not continue the MC simulations for subsequent CBs, as the expected yield values would be small. The yield values for the other hyper-parameter distributions (i.e., σ = 0.1, 0.2) also drop significantly after the second CB. The results show that the calculated feasible parameter region is not too conservative, as the yield drops below 0.9 for some of the cells even in the first CB. At the same time, Fig. 8.8 shows that the first CB still has high yield values (e.g., above 0.8). This suggests that the corners of the feasible parameter region can still be expanded to capture more points near the boundaries of the OFR for which the cell works correctly. In the next section, we introduce a methodology to explore the points outside the OFR and to expand the OFR.

8.3.1.2 Feasible Parameter Region Expansion

In this section we present a method to expand the feasible parameter region by finding PPV vectors outside the boundaries of the original feasible parameter region (OFR) using a machine learning based approach.

Figure 8.8: The parametric yield values inside the original feasible parameter region (OFR) and the two contour bands (CBs) around it for 5 SFQ cells.

The feasible parameter region expansion (FRE) algorithm starts by creating a classifier to predict whether a cell functions correctly given a set of parameter values. First, we create training and test data using MC simulations, considering the distribution of the hyper-parameters. The attributes of these data sets are the parameter values. The labels (i.e., classes) are 0 or 1; if the cell works correctly given the input parameter values (i.e., the input values define a PPV vector), the label is 1.
Otherwise, it is 0 (the input values define an FPV vector). To classify points both inside and outside the boundaries of the OFR, we perform MC simulations in the OFR and in the first two CBs around the OFR and add those data points to the training data set. Note that we do not consider points outside the second CB, as the expected yield values are low (cf. Fig. 8.8), although everything that follows can also be applied to subsequent CBs.

We use a Random Decision Forest (RDF) algorithm to create a classifier that predicts the operation of the cells. RDF is a classification method that trains multiple decision trees (DTs) on random samples of the training set. For each point in the test data set, the RDF classifier outputs the class that is the mode of the outputs of the individual decision trees [157]. We train the classifier on the training set to distinguish between data points that are PPV vectors and those that are FPV vectors. This model is then used to build the expanded feasible parameter region (EFR) of the SFQ cells. Given a sample point in the test set that lies inside the OFR or within the first two CBs, the FRE algorithm classifies the point using the trained RDF model. Data points labeled as PPV vectors are considered to be inside the boundaries of the EFR. Using the proposed MHCMC and FRE algorithms, we can quickly estimate the parametric yield of a cell without the need for extensive MC simulations: we create a training set using a small number of simulations, train the RDF model on the training set, and apply the trained model to the MC samples (i.e., the input vectors in the test set) to predict the operation of the logic cells. The parametric yield is then calculated as the percentage of positive samples in the test data set. In the next section, we present the simulation setup and report the accuracy of the trained RDF models for different cells on unseen data (i.e., the test set).

8.3.1.3 Yield Modeling Results

The MHCMC and FRE methods are written in Python. We used the PSCAN2 package [154] for simulating the operation of the SFQ cells, and the Scikit-learn package for training the RDF model [158]. Correct operation of a cell is determined using a hardware description language (HDL) model [153]; this model describes the order in which junctions switch and the behavior of all internal nodes within the cell [29]. To generate the training and test data sets, we performed MC simulations inside the OFR and the first two CBs. In these simulations, it is assumed that the hyper-parameters are independent random variables with normal distributions, mean values equal to their nominal values, and three different sigma values (σ = 0.1, 0.2, and 0.3), similar to Section 8.3.1.1. The training data set consists of the data points generated inside the OFR (500 points) and within the first two CBs (500 points each). The sizes of the training and test sets are 1,500 and 58,500 points, respectively (i.e., for each sigma value, we performed 60,000 MC simulations). We repeated the same data generation and model training for 5 different SFQ cells. The accuracy of the trained model on the test data set for the different cells is reported in Table 8.3. The average accuracy of the trained model over the 5 SFQ cells is 97%, 96%, and 95% for sigma values of 0.1, 0.2, and 0.3, respectively. The simulation results show that the trained model can predict the operation of the cells accurately.
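The following is a minimal scikit-learn sketch of this training-and-prediction flow. The synthetic training data and the toy pass rule stand in for the HDL-checked MC simulations described above, and the classifier settings are illustrative rather than the dissertation's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Stand-ins for simulated training data: hyper-parameter vectors (u_L, u_J, u_I)
# and pass/fail labels that would normally come from HDL-checked simulations.
X_train = rng.normal(loc=1.0, scale=0.3, size=(1500, 3))
y_train = (np.abs(X_train - 1.0).max(axis=1) < 0.25).astype(int)   # toy pass rule, illustrative only

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Fresh MC samples of the hyper-parameters; no circuit simulation is needed here.
X_test = rng.normal(loc=1.0, scale=0.3, size=(58500, 3))
y_pred = clf.predict(X_test)

# Parametric yield estimate: fraction of samples predicted to be passing.
print(f"Predicted parametric yield: {y_pred.mean():.3f}")
```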
Table 8.3: The accuracy of the RDF classifier on the test data set for various SFQ cells and hyper-parameter distributions.

                 Accuracy (%)
Cell         σ=0.1    σ=0.2    σ=0.3
INV          99.1     96.4     95.7
AND2         93.9     95.7     95.1
OR2          98.4     95.8     94.8
AND3         96.2     95.6     95.8
OR3          98.8     97.3     96.8
Average (%)  97.3     96.2     95.6

8.4 Summary

This chapter presents novel methods for calculating parameter margins for single flux quantum (SFQ) logic cells in a superconducting electronic circuit. The proposed approaches calculate margin polyhedrons such that a cell will almost always function correctly if all of its parameter values lie inside the polyhedron; this is achieved with a small number of simulations. Monte-Carlo simulations validate that the average yield of the cells within this margin polyhedron is above 98%. Furthermore, this chapter presents a machine learning based algorithm that predicts whether a cell functions correctly given a set of parameter values, both inside and outside the aforementioned polyhedron. This model can be used to estimate the parametric yield accurately and without the need for expensive Monte-Carlo simulations; the results show an average accuracy of 96% for the proposed model. The proposed margin calculation and yield estimation algorithms can both be used to efficiently evaluate the robustness of a logic cell throughout the optimization process.

Chapter 9 Conclusion

Superconductive electronics (SCE) based on Josephson junction (JJ) single flux quantum (SFQ) logic cells is considered a within-reach "beyond-CMOS" technology. With switching speeds in the hundreds of GHz and energy dissipation of 10^−19 joules or less per transition for rapid SFQ (RSFQ), these superconducting SFQ-based logic circuit families can provide the speed and energy efficiency needed for future beyond-Exascale computing. In this dissertation, we presented several novel design methodologies for electronic design automation of SFQ logic circuits, considering the primary objectives, constraints, and major differences of this logic family with respect to conventional CMOS technology. In particular, we presented novel methodologies in three areas: placement, clock network synthesis, and margin and yield calculation for superconducting circuits. The major contributions of this dissertation can be enumerated as follows:

• In Chapter 3, we presented a novel timing-driven global placement algorithm targeting superconductive electronic circuits. The proposed approach minimizes the total wirelength of the circuit while targeting a specific clock frequency by imposing hard constraints on the maximum interconnect length of all the nets on the timing-critical paths of a circuit. This is achieved by utilizing the powerful framework of the alternating direction method of multipliers (ADMM) [21]. More precisely, the proposed timing-driven placement algorithm (TDP-ADMM) converts the placement problem with constraints on the path delays into an unconstrained problem by decomposing the timing-driven problem into two sub-problems, one optimizing the total wirelength and the other optimizing the circuit timing. Through an iterative process, the solutions of the two sub-problems are modified to close the gap between a wirelength-driven and a timing-driven placement. Compared to a state-of-the-art academic global placement tool, TDP-ADMM improves the worst and total negative slack for seven single flux quantum benchmark circuits by an average of 26% and 44%, respectively, with an average overhead of 1.98% in terms of total wirelength.
• In Chapter 4, we presented a novel clock tree structure called the HL-tree clock network, which is a combination of the H-tree (a zero-skew clocking method) and the L-tree (a linear clock propagation mechanism). In an HL-tree, an H-tree is employed to propagate the clock to cell groups, and within each cell group, a linear path composed of splitters provides the clock to all cells in that group. Cells in a group are then horizontally abutted, which helps reduce the total layout area. We then presented a clock tree-aware placement approach that simultaneously minimizes the total wirelength of the signal nets and the area overhead of the clock routing. The presented placement algorithm, recognizing the need for the said hybrid clock tree realization, generates placement solutions that intrinsically match the structure of HL-tree clock networks by first grouping cells into a set of "super-cells", building a modified netlist capturing the dependencies among these super-cells, and taking advantage of a force-directed placement engine to find a global placement of these cells. Simulation results showed that the proposed HL-tree compatible placement methodology, using the HL-tree clock structure, improves the average layout area for 13 benchmarks by 7.1% when compared with a global placement accompanied by an H-tree clock network.

• In Chapter 5, we presented a methodology for the physical synthesis of clock networks, i.e., the placement of clock splitters and the generation of balanced clock tree topologies, aimed at minimizing the nominal clock skew (i.e., the maximum difference in the arrival times of the clock signal at two different clock sinks) and the routing resources consumed by the clock network, while maximizing the clock frequency of SFQ circuits. This was achieved by employing an algorithm for fully-balanced clock tree topology construction together with a minimum-skew clock tree placement and legalization algorithm deploying a mixed integer linear programming (MILP) formulation to carry out clock tree construction, splitter insertion, and skew minimization under given placement blockages while considering splitter and interconnect delays. The proposed clock tree topology generation algorithm guarantees that the maximum difference between the number of splitters from the clock source to any pair of sinks is zero. The effectiveness of the proposed clock synthesis algorithm was demonstrated using multiple SFQ circuits; the qCTS algorithm improved the average clock skew by 70% and the total and worst negative hold slack values by 80% and 60%, respectively, when compared with a well-known approach utilizing the deferred merge embedding (DME) algorithm. Furthermore, we extended the qCTS algorithm to perform the physical synthesis of imbalanced clock tree structures as well as asynchronous clock networks, indicating its capability to handle a variety of clocking methodologies.

• In Chapter 6, we presented a timing-aware clock tree topology generation algorithm. The qTopGen algorithm considers the timing information in the data path and the worst-case timing slacks to any clock sink, and it generates clock topologies while accounting for wire and splitter cell delays, the total wirelength of the clock tree, and process variations and timing uncertainties. More precisely, we presented a balanced binary clock tree topology (CTT) generation algorithm that employs a level-by-level construction methodology in a bottom-up manner.
While targeting near zero-skew clock signal propagation, the resulting clock trees are height-balanced, i.e., there is an equal number of splitters from the root of the tree to any of the leaf nodes (note that in some cases a single output of a splitter cell is used, i.e., the splitter simply functions as a delay element). At each level of the tree, qTopGen solves an integer linear programming (ILP) problem to determine which nodes in the clock tree should be paired up (i.e., become siblings by being assigned the same parent node). The objective function of this ILP formulation is the weighted sum of the total wirelength and the total negative slack of the clock tree. By minimizing this objective function, we simultaneously minimized the routing cost of the clock tree and optimized the assignment of sink nodes to appropriate branches of the clock tree to increase the efficacy of the common path pessimism removal (CPPR) technique [23], which in turn helped control the adverse and uncertain effects of process-induced variations on the worst-case clock skews. The statistical timing analysis results for ten benchmark circuits indicated that qTopGen improves the total wirelength and the total negative hold slack by 4.2% and 64.6%, respectively, on average, compared with the wirelength-driven balanced topology generation approach presented in Chapter 5.

• In Chapter 7, we presented a physical design methodology for timing closure in SFQ circuits, in particular for resolving hold time violations using the hold buffer insertion technique. The presented timing variation-aware approach improves the total layout area and minimizes the performance overheads by applying common path pessimism removal (CPPR) to remove the pessimism associated with the common clock paths to pairs of sequentially adjacent gates [26, 27]. Furthermore, we presented an incremental placement algorithm to place the added buffers while minimizing the perturbation of the original placement solution, to further preserve the layout area and minimize the overheads. In particular, we developed the first timing variation-aware hold time fixing approach for SFQ circuits, which considers both local and global timing uncertainties and worst-case scenarios in terms of hold slacks, and effectively employs the common path pessimism removal technique to reduce the number of inserted hold buffers on each timing path. We then presented an incremental placement methodology for hold buffers to generate high-quality solutions in terms of placement metrics, such as layout area and maximum clock frequency, and evaluated the presented approach using dynamic timing analysis with a grid-based placement-aware variation model [25] on multiple ISCAS'85 benchmark circuits, while verifying the functionality of the circuits. Compared with a methodology that uses fixed constant margins for fixing all timing paths [25], our variation-aware approach reduced the average number of hold buffers by 8.4% with a 6.2% increase in timing yield, and by 21.9% with a 1.7% increase in timing yield. Our methodology also enables a trade-off between timing yield and layout area by tuning algorithmic parameters.
The presented algorithms works irrespective of the cell structure and topology and could be used to estimate the yield of a logic cell with multiple parameters having different process variations efficiently without the need for costly MC simulations. We then extended the margin calculation algorithm by clustering parameters into hyper-parameters and reducing the parameter value search space, considered mixed positive and negative changes to variability sources and efficiently calculated corners of a feasible parameter region for each SFQ logic cells. If all parameter values lie within the convex hull of the reported corners, the parametric yield tends to be near one. The presented approach calculate margin polyhedrons in which a cell will almost always function correctly if all the parameter values lie inside this polyhedron, using only a small number of simulations. Monte-Carlo simulations validated that the average yield of the cells within this margin polyhedron was over 98%. Finally, we presented a machine learning based approach to model the parameteric yield of logic cells. This model is used to estimate Monte-Carlo (MC) based yield in a significantly smaller amount of time compared with doing simulation for each MC sample and can be used to evaluate the robustness of a cell throughout the optimization process efficiently, using a small number of simulations. 206 References 1. Koomey, J. Worldwide electricity used in data centers. Environ. Res. Lett. 3, 034008– 1–034008–8 (2008). 2. Duzer, T. V. & Turner, C. W. Principle of Superconducting Circuits (Elsevier, New York, 1981). 3. Likharev, K. K. & Semenov, V. K. RSFQ logic/memory family: A new Josephson- junction technology for sub-terahertz-clock-frequency digital systems. IEEE Transaction on Applied Superconductivity 1, 3–28 (1991). 4. Chen, W., Rylyakov, A. V., Patel, V., Lukens, J. E. & Likharev, K. K. Rapid single flux quantum T-flip flop operating up to 770 GHz. IEEE Transaction on Applied Superconductivity 9, 3212–3215 (1999). 5. Polonsky, S. Delay insensitive RSFQ circuits with zero static power dissipation. IEEE Transaction on Applied Superconductivity 9, 3535–3538 (1999). 6. Silver, A. H. & Herr, Q. P. A new concept for ultra-low power and ultra-high clock rate circuits. IEEE Transaction on Applied Superconductivity 11, 333–336 (2001). 7. Oberg, O. T., Herr, Q. P., Ioannidis, A. G. & Herr, A. Y. Integrated power divider for superconducting digital circuits. IEEE Transaction on Applied Superconductivity 21, 571–574 (2011). 8. Y. Yamanashi, T. N. & Yoshikawa, N. Study of LR-loading technique for low-power single flux quantum circuits. IEEE Transaction on Applied Superconductivity 17, 150– 153 (2007). 9. Eaton, L. R. & Johnson, M. W. Superconducting constant current source 2009. 10. D. E. Kirichenko, A. F. K. & Sarwana, S. No static power dissipation biasing of RSFQ circuits. IEEE Transaction on Applied Superconductivity 21, 776–779 (2011). 11. Tanaka, M., Ito, M., Kitayama, A., Kouketsu, T. & Fujimaki, A. 18-GHz, 4.0-aJ/bit operation of ultra-low-energy rapid single-flux-quantum shift registers. Japan Journal of Applied Physics 51, 053102–1—053102–4 (2012). 12. Mukhanov, A. Energy-efficient single flux quantum technology. IEEE Transaction on Applied Superconductivity 21, 760–769 (2011). 13. D. S. Holmes, A. L. R. & Manheimer, M. A. Energy-Efficient Superconducting Comput- ing, Power Budgets and Requirements. IEEE Transaction on Applied Superconductivity 23 (2013). 207 14. Gaj, K., Herr, Q. 
P., Adler, V., Karniewski, A., Friedman, E. G. & Feldman, M. J. Tools or the computer-aided design of multigigahertz superconducting digital circuits. IEEE Transaction on Applied Superconductivity 9 (1999). 15. Fourie, C. J. & Volkmann, M. H. Status of Superconductor Electronic Circuit Design Software. IEEE Transaction on Applied Superconductivity 23 (2013). 16. Holmes, D. S., Ripple, A. L. & Manheimer, M. A. Energy-Efficient Superconduct- ing Computing—Power Budgets and Requirements. IEEE Transactions on Applied Superconductivity 23, 1701610–1701610. doi:10.1109/TASC.2013.2244634 (2013). 17. Shahsavani, S. N., Lin, T. R., Shafaei, A., Fourie, C. J. & Pedram, M. An Integrated Row-Based Cell Placement and Interconnect Synthesis Tool for Large SFQ Logic Circuits. IEEE Transactions on Applied Superconductivity 27, 1–8. doi:10.1109/TASC. 2017.2675889 (2017). 18. Webpage for FreePDK45. 19. Tolpygo, S. K., Bolkhovsky, V., Weir, T. J., Johnson, L. M., Gouker, M. A. & Oliver, W. D. Fabrication Process and Properties of Fully-Planarized Deep-Submicron Nb/Al Josephson Junctions for VLSI Circuits. IEEE Transactions on Applied Superconductivity 25, 1–12. doi:10.1109/TASC.2014.2374836 (2015). 20. Lee, D.-J. & Markov, I. L. Obstacle-Aware Clock-Tree Shaping During Placement. Trans. Comp.-Aided Des. Integ. Cir. Sys. 31, 205–216 (2012). 21. Boyd, S., Parikh, N., Chu, E., Peleato, B. & Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning 3, 1–122 (2011). 22. Han, K., Kahng, A. B. & Li, J. Optimal Generalized H-Tree Topology and Buffering for High-Performance and Low-Power Clock Distribution. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 1–1 (2018). 23. Garg, V. Common path pessimism removal: An industry perspective: Special Session: Common Path Pessimism Removal in 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD) (2014), 592–595. doi:10.1109/ICCAD.2014.7001412. 24. Zhang, B. & Pedram, M. qSTA: A Static Timing Analysis Tool for Superconducting Single-Flux-Quantum Circuits. IEEE Trans. Appl. Supercond., 30, 1–9 (2020). 25. Tadros, R. N. & Beerel, P. A. Optimizing (HC) 2 LC, A Robust Clock Distribution Network For SFQ Circuits. IEEE Trans. Appl. Supercond., 30, 1–11 (2020). 26. Zejda, J. & Frain, P. General framework for removal of clock network pessimism in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des. (2002), 632–639. 27. Garg, V. Common path pessimism removal: An industry perspective: Special Session: Common Path Pessimism Removal in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des. (2014), 592–595. 28. Hansen, M. C., Yalcin, H. & Hayes, J. P. Unveiling the ISCAS-85 Benchmarks: A Case Study in Reverse Engineering. IEEE Des. Test 16 (1999). 208 29. Gaj, K., Herr, Q. P., Adler, V., Krasniewski, A., Friedman, E. G. & Feldman, M. J. Tools for the computer-aided design of multigigahertz superconducting digital circuits. IEEE Transactions on Applied Superconductivity 9, 18–38 (1999). 30. Katam, N., Shafaei, A. & Pedram, M. Design of multiple fanout clock distribution network for rapid single flux quantum technology in 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC) (2017), 384–389. 31. Hamilton, C. A. & Gilbert, K. C. Margins and yield in single flux quantum logic. IEEE Transactions on Applied Superconductivity 1, 157–163 (1991). 32. Zheng, L. High-speed Rapid-single-flux-quantum Multiplexer and Demultiplexer Design and Testing tech. 
rep. (CALIFORNIA UNIV BERKELEY GRADUATE DIV, 2007). 33. Bunyk, P., Leung, M., Spargo, J. & Dorojevets, M. FLUX-1 RSFQ microprocessor: Physical design and test results. IEEE transactions on applied superconductivity 13, 433–436 (2003). 34. Alpert, C. J., Mehta, D. P. & Sapatnekar, S. S. Handbook of algorithms for physical design automation (Auerbach Publications, 2008). 35. Alpert, C. J., Caldwell, A. E., Kahng, A. B. & Markov, I. L. Hypergraph Partitioning with Fixed Vertices [VLSI CAD]. Trans. Comp.-Aided Des. Integ. Cir. Sys. 19, 267–272 (2006). 36. Tang, M. & Yao, X. A Memetic Algorithm for VLSI Floorplanning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 37, 62–69 (2007). 37. Markov, I. L., Hu, J. & Kim, M. C. Progress and Challenges in VLSI Placement Research. Proceedings of the IEEE 103, 1985–2003. doi:10.1109/JPROC.2015.2478963 (2015). 38. Kahng, A. B. New game, new goal posts: A recent history of timing closure in 2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC) (2015), 1–6. 39. Sherwani, N. A. Algorithms for VLSI physical design automation (Springer Science & Business Media, 2012). 40. Electronic Design Automation: Synthesis, Verification, and Test (eds Wang, L.-T., Chang, Y.-W. & Cheng, K.-T. () (Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2009). 41. Kim, M.-C., Hu, J., Lee, D.-J. & Markov, I. L. A SimPLR Method for Routability-driven Placement in Proceedings of the International Conference on Computer-Aided Design (San Jose, California, 2011), 67–73. 42. Wang, C.-K., Huang, C.-C., Liu, S. S.-Y., Chin, C.-Y., Hu, S.-T., Wu, W.-C., et al. Closing the Gap Between Global and Detailed Placement: Techniques for Improving Routability in Proceedings of the 2015 Symposium on International Symposium on Physical Design (Monterey, California, USA, 2015), 149–156. 43. Pan, D. Z., Halpin, B. & Ren, H. Timing-Driven Placement in Handbook of Algorithms for Physical Design Automation (2008). 209 44. Garey, M. R., Johnson, D. S. & Stockmeyer, L. Some Simplified NP-complete Problems in Proceedings of the Sixth Annual ACM Symposium on Theory of Computing (Seattle, Washington, USA, 1974), 47–63. 45. Guibas, L. J. & Stolfi, J. On computing all north-east nearest neighbors in the L1 metric. Information Processing Letters 17, 219–223 (1983). 46. Garey, M. R. Computers and Intractability: A Guide to the Theory of NP-completeness, Freeman. Fundamental (1997). 47. Spindler, P., Schlichtmann, U. & Johannes, F. M. Kraftwerk2—A Fast Force-Directed Quadratic Placement Approach Using an Accurate Net Model. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 27, 1398–1411. doi:10. 1109/TCAD.2008.925783 (2008). 48. Sun, W.-J. & Sechen, C. Efficient and effective placement for very large circuits. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 14, 349–359 (1995). 49. Caldwell, A. E., Kahng, A. B. & Markov, I. L. Can recursive bisection alone produce routable placements? in Proceedings of the 37th Annual Design Automation Conference (2000), 477–482. 50. M.-C. Kim D.-J. Lee, I. L. M. SimPL: An Effective Placement Algorithm in Int’l. Conf. on Computer-Aided Design (ICCAD) (2010), 649–656. 51. Chan, T. F., Cong, J., Shinnerl, J. R., Sze, K. & Xie, M. mPL6: Enhanced Multilevel Mixed-size Placement in Proceedings of the 2006 International Symposium on Physical Design (ACM, San Jose, California, USA, 2006), 212–214. 52. Kim, M.-C. & Markov, I. L. 
Abstract
Josephson junction (JJ)-based superconducting logic families have been recognized as a potential “beyond-CMOS” technology owing to their extremely low energy dissipation and ultra-fast switching speed. In particular, single flux quantum (SFQ) technology has emerged as an extremely promising technology, with a demonstrated speed of 370 GHz for simple digital circuits and a switching energy per bit on the order of 10⁻¹⁹ J at 4.2 K.

This dissertation presents algorithms and methodologies for electronic design automation of superconducting circuits, especially algorithms for the physical design and optimization of the SFQ logic family. In particular, novel algorithms for automated global and timing-driven placement, clock network design, topology generation and synthesis, timing closure, and margin and yield calculation are introduced.

First, a timing-driven placement methodology is presented that optimizes the critical path delays as well as the total wirelength by leveraging the alternating direction method of multipliers (ADMM) mathematical framework. The proposed algorithm models placement as an optimization problem with constraints on the maximum wirelength-induced delay of timing-critical paths and employs ADMM to decompose the problem into two sub-problems: one minimizing the total wirelength of the circuit and the other minimizing the delay of its timing-critical paths. Through an iterative process, a placement solution is generated that simultaneously minimizes the total wirelength and satisfies the setup time constraints.

Next, two clock tree structures for SFQ circuits are presented: (i) an H-tree clock network, which maximizes the operating frequency of a circuit by delivering a near zero-skew clock signal to all sequential elements, and (ii) an HL-tree clock network, which combines H-tree and linear clock networks to distribute the clock signal globally to groups of cells and locally within each cell group. These two clock structures offer a trade-off between circuit performance and total chip area. Two placement solutions are then presented to generate placements compatible with the aforementioned clock structures: the first employs a global placement algorithm to minimize the total wirelength of a circuit, while the second creates an HL-tree-compatible placement using a novel cell clustering approach.

Subsequently, novel linear-programming-based methodologies for generating clock topologies and for the physical design of clock networks are presented. The proposed methodology considers a combination of objectives, such as the maximum clock skew and timing slacks, together with process- and timing-induced uncertainties, aiming to optimize the power, performance, and area of superconducting circuits.

Moreover, a novel timing closure algorithm is presented that takes advantage of the balanced nature of clock trees in SFQ circuits, a common path pessimism removal (CPPR) methodology, and a hold buffer insertion technique to increase the timing yield and reduce the power and energy consumption of SFQ circuits.

Finally, two novel margin calculation methods are introduced. These methods calculate a set of parameter margins for each SFQ logic cell such that, if all parameter values lie within the boundaries of the calculated margins, the parametric yield is near one. The proposed multiple hyper-parameter change margin calculation (MHCMC) method improves the state of the art by accounting for global sources of variation, clustering cell parameters into hyper-parameters, and considering the co-dependency of these hyper-parameters when calculating a feasible parameter region.
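The ADMM-based decomposition summarized above alternates between a wirelength sub-problem and a timing sub-problem that are tied together through a consensus constraint. The following is a minimal one-dimensional sketch of that alternation, not the dissertation's actual formulation: the quadratic net model, the box-shaped "timing region" standing in for path-delay constraints, and the choices of rho and iteration count are all simplifying assumptions made purely for illustration.

# Minimal consensus-ADMM sketch of a wirelength/timing placement split
# (toy 1-D model; all modeling choices below are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)

n_cells, n_nets = 8, 12
# Quadratic wirelength proxy: WL(x) = ||A x - b||^2, where each row of A
# couples two cells; a few rows get nonzero targets in b, standing in for
# nets that also connect to fixed pad locations.
A = np.zeros((n_nets, n_cells))
b = np.zeros(n_nets)
for k in range(n_nets):
    i, j = rng.choice(n_cells, size=2, replace=False)
    A[k, i], A[k, j] = 1.0, -1.0
b[:3] = rng.uniform(0.0, 10.0, size=3)

# Assumed "timing region": box constraints pulling two critical cells toward
# target coordinates, a stand-in for maximum path-delay constraints.
lo = np.full(n_cells, -np.inf)
hi = np.full(n_cells, np.inf)
lo[0], hi[0] = 2.0, 3.0
lo[5], hi[5] = 7.0, 8.0

rho = 1.0
x = np.zeros(n_cells)   # wirelength-subproblem copy of the coordinates
z = np.zeros(n_cells)   # timing-subproblem copy
u = np.zeros(n_cells)   # scaled dual variable enforcing the consensus x = z

H = 2.0 * A.T @ A + rho * np.eye(n_cells)   # constant matrix of the x-update
for _ in range(100):
    # x-update: argmin_x ||A x - b||^2 + (rho/2) ||x - z + u||^2
    x = np.linalg.solve(H, 2.0 * A.T @ b + rho * (z - u))
    # z-update: project x + u onto the timing-feasible box
    z = np.clip(x + u, lo, hi)
    # dual update driving the two copies toward agreement
    u += x - z

print("placed coordinates:", np.round(z, 2))
print("wirelength proxy  :", round(float(np.sum((A @ z - b) ** 2)), 3))

In this toy form the wirelength step has a closed-form linear solve and the timing step is a simple projection; in a realistic placer each step would itself be a large optimization, but the alternation and dual update follow the same pattern.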
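On the margin calculation side, the sketch below shows only the classic single-parameter margin search by bisection against a pass/fail simulation oracle; the simulate() function is a made-up stand-in for a circuit simulator, and the MHCMC features described above (hyper-parameter clustering, global variations, hyper-parameter co-dependency) are deliberately not reproduced here.

# One-dimensional upper-margin search by bisection (illustrative only;
# simulate() is a hypothetical pass/fail model, not a real simulator call).
from typing import Callable, Dict

def simulate(params: Dict[str, float]) -> bool:
    # Hypothetical behavior: the cell "works" while the normalized bias
    # current stays inside an asymmetric window around its nominal value.
    return 0.72 <= params["bias"] <= 1.31

def upper_margin(params: Dict[str, float], name: str,
                 oracle: Callable[[Dict[str, float]], bool],
                 step: float = 1.0, tol: float = 1e-3) -> float:
    """Largest relative increase of `name` (others at nominal) that still passes."""
    nominal = params[name]
    lo, hi = 0.0, step
    # Grow the bracket until the circuit first fails above nominal
    # (assumes the parameter does eventually cause a failure).
    while oracle({**params, name: nominal * (1.0 + hi)}):
        lo, hi = hi, 2.0 * hi
    # Bisect between the last passing and the first failing points.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if oracle({**params, name: nominal * (1.0 + mid)}):
            lo = mid
        else:
            hi = mid
    return lo

nominal = {"bias": 1.0}
print(f"+{100 * upper_margin(nominal, 'bias', simulate):.1f}% upper bias margin")

The lower margin is found symmetrically by decreasing the parameter; sweeping one parameter at a time is the baseline that the co-dependent, hyper-parameter-based approach described in the abstract is meant to improve upon.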