Clocking solutions for SFQ circuits
CLOCKING SOLUTIONS FOR SFQ CIRCUITS

by Ramy N. Tadros

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

May 2019

Copyright 2019 Ramy N. Tadros

Our integrity sells for so little, but it is all we really have. It is the very last inch of us. And within that inch, we are free.
- Valerie

Acknowledgment

First, I would like to express my immense gratitude to my PhD advisor and mentor, Prof. Peter Beerel; this thesis would have never existed within the realms of reality without his patience and guidance. I would also like to thank my defense committee: Prof. Massoud Pedram, and Prof. Leana Golubchik. In addition to my qualification committee: Prof. Sandeep Gupta, and Prof. Murali Annavaram.

I would like to show gratitude to the USC ColdFlux team. Starting from our PI, Prof. Massoud Pedram, and the faculty: Prof. Peter Beerel, Prof. Sandeep Gupta, Prof. Murali Annavaram, and Prof. Shahin Nazarian. And my colleagues: Naveen Katam, Ting-Ru Lin, Soheil Shahsavani, Fangzhou Wang, Bo Zhang, Arash Fayyazi, Ghasem Pasandi, Haipeng Zha, and Souvik Kundu. Special gratitude to my colleagues Naveen Katam and Soheil Shahsavani for their codes, and thanks to Arash Fayyazi for collaborating in the SV modeling part. Thanks to Dylan Hand for his IT help with Sierra server. Thanks to Olivia Chen (Yokohama National University) for her help in the AQFP part.

Very special thanks to Diane Demetras, our mother leader in the ECE department. Special thanks to Annie Hua for her time and help during those five years. Also thanks to the rest of the staff: In particular, Shane Goodo, Estela Lopez, David Ho, Cathy Huang, Benjamin Paul, and Kathy Kassar. Memorable thanks to Tim Boston, may his soul rest in peace.
I would also like to thank my lab mates (A-Z): Huimei Cheng, Sourya Dey, Dylan Hand, Moises Herrera, Weizhe Hua, Fei Huang, Souvik Kundu, Arnab Sanyal, Matheus Trevisan, Qili Wang, Yang Zhang, and Mutian Zhu. And my office mates (A-Z): Valeriu Balaban, Yannick Bliesner, Gaurav Gupta, Hanieh Hashemi, Naveen Katam, Panagiotis Kyrakis, Rebecca Lee, Ting-Ru Lin, Pezhman Mamdouh, Martin Martinez, Fernando Monteiro, Nirakar Poudel, Thanos Rampokos, Alireza Shafaei, Jayson Sia, Yilda Valle, Filipe Vital, Yanzhi Wang, Luhao Wang, Kwame Wright, Yuankun Xue, Jizhe Zhang, and Ridha Znaidi.

I would like to show gratitude to my family: Marie-Thérèse Darous, Nagy Tadros, Caroline Tadros, Youssef Darous, Vivianne Zarifa, Antoinette Tadros, Malak Fahim, and Fady Azmy. And my friends (A-Z): Ahmed Abdalbaaki, Marianne Abdelmalek, Abdelrahman H. Ahmed, Ahmed AlAskalany, Ahmed Aldash, Andrew Alfonse, Ragui Amir, Abanoub Asaad, Mostafa Ayesh, Samir Barsoum, Joseph Boshra, Arsani Boshra, Mina Botros, Ahmed Elbaradei, Mohamed Elgeweily, Amr Elnakeeb, Hoda Elsafty, Mariam Enany, Remon Ezzat, Ahmed Ezzat, Yahya Ezzeldine, Benoit Fouad, Mina Gad, Sameh Gadallah, Hassan Ghozlan, Andrew Guirguis, Peter Guirguis, Sherief Hammam, Mohamed Hassan, Hesham Hosney, David Ibrahim, Samer Idres, Mohamed Khalifa, Mimi Li, Kirollos Lipton, Ademir Machado, Samer Makar, Remon Makram, Youssef Makram, Ramy Mankarios, John Meshreki, Shady Milad, Mina Monir, Moheb Morris, Fady Mosaad, Heba Mostafa, Hanna Moussa, Peter Nabih, Peter Nashaat, Maged Nashat, Tamer Nashed, Remon Nasry, Nader Nassif, Coralie Poulard, Rewis Raafat, Kiro Raouf, Karim Rawy, Rodolfo Reta, David Robert, Sam Saad, Moataz Salah, Simone Samir, Sameh Sarwat, Adly Sawires, George Sedra, Ayman Selmy, Fady Sobhy, Ahmed Soliman, Zhanerke Temirgaliyeva, Youhanna William, Amir Youssef, and Victor Zakhary. Special acknowledgment to Alexa Davenport.
Memorable mention to my deceased friends Simone Samir, Michael Sabry, Joseph Mohsen, and Mark Antoine; may their souls rest in peace.

Table of Contents

Acknowledgment
List of Figures
List of Tables
Abstract
Related Publications
1 Introduction
  1.1 SFQ Technology
  1.2 Clocking in SFQ
  1.3 Thesis Contributions
  1.4 Thesis Organization
2 Background
  2.1 SCE Technology
    2.1.1 Fundamentals
    2.1.2 SFQ vs CMOS
    2.1.3 Timing Uncertainty in SFQ
  2.2 Timing Fundamentals
    2.2.1 Synchronization
    2.2.2 Timing Violations
    2.2.3 Clock Distribution Networks (CDN)
  2.3 Clocking in SFQ
    2.3.1 Natural Self-Timing
    2.3.2 SFQ clocking techniques
    2.3.3 Previous SFQ Chips Clocking
3 Hybrid Clover-Leaves Clocking (HCLC)
  3.1 Clocking of Loops
  3.2 The Proposed Clocking Scheme: HCLC
    3.2.1 The Architecture
    3.2.2 Timing Analysis
    3.2.3 Extending and Scaling
  3.3 Experimental Results
    3.3.1 Simulation Platform
    3.3.2 Comparisons
  3.4 Conclusion
4 Asynchronous Clock Distribution Networks (ACDN)
  4.1 Introduction
  4.2 Marked Graph Background
  4.3 Problem Formulation
  4.4 Definitions And Nomenclature
    4.4.1 Basic Definitions
    4.4.2 Paths
    4.4.3 Performance
  4.5 ACDN Theoretical Foundation
  4.6 Theory
    4.6.1 Execution Model
    4.6.2 Cycle Time
    4.6.3 Firing Period
    4.6.4 Uncertainty Condition
  4.7 Comparison to Previous Work
    4.7.1 Circuits and Systems Literature
    4.7.2 Theoretical Literature
    4.7.3 Insight From [53, 54]
  4.8 Discussion
  4.9 Conclusion
5 Hierarchical Chains of Homogeneous Clover-Leaves Clocking ((HC)²LC)
  5.1 Introduction
  5.2 Proposed Architecture
    5.2.1 Hierarchical Chains
    5.2.2 Bottom Level
    5.2.3 Top Loop
  5.3 Analysis
    5.3.1 ACDN Theory
    5.3.2 Cycle Time And Clock Skew
    5.3.3 Area Overhead
  5.4 Optimizing (HC)²LC
    5.4.1 Optimizing Insertion Delay
    5.4.2 Gate Assignment
  5.5 Discussion
  5.6 Conclusion
6 Two-Phase Clocking (2PC)
  6.1 The Clocking Technique
  6.2 Analysis
    6.2.1 Timing Violations
    6.2.2 100% Timing Yield
    6.2.3 Performance Limits
  6.3 Discussion
7 Tools Platform
  7.1 Overview
  7.2 CDN Construction
  7.3 HDL Modeling
    7.3.1 Motivation
    7.3.2 Literature Review
    7.3.3 Proposed Models
  7.4 Dynamic Simulation
  7.5 Variations Model
8 Experimental Results
  8.1 Monte Carlo Analysis
  8.2 (HC)²LC Overhead
  8.3 Evaluating the Optimizations
9 Summary, Future Work, and Conclusions
  9.1 Summary
  9.2 Future Work
  9.3 Conclusions
A ACDN proofs
  A.1 Execution Model
  A.2 Cycle Time
  A.3 Firing Period
  A.4 Uncertainty Condition
B SystemVerilog Models
  B.1 SystemVerilog
  B.2 SFQ
    B.2.1 Proposed SFQ Interface
    B.2.2 Our Vision
    B.2.3 Our Models
  B.3 AQFP
    B.3.1 Proposed AQFP Interfaces
    B.3.2 Our Models
  B.4 SFQ/AQFP Interface
References

List of Figures

2.1 Josephson Junction (JJ).
2.2 Passive transmission line (PTL) [4] where two overdamped junctions are connected by a matched microstrip line.
L is the line length, c is the speed of transmission, which approaches the speed of light, and I_b is the dc bias current; the microstrip impedance and the time of flight of the pulse are also marked.
2.3 Josephson transmission line (JTL) [4] where junctions are connected in parallel by superconducting strips of a relatively low inductance and dc-current biased to their pre-critical state (I_b < I_0).
2.4 Illustration of the SFQ basic convention and representation of bits using an elementary cell of the SFQ circuits [4].
2.5 The schematic of a simple interferometer loop with a clock input which behaves as a D flip-flop circuit [10].
2.6 Some SFQ basic cells symbols.
2.7 The schematic of a buffer/inverter AQFP gate. When I_x is activated, an SFQ is stored in either the left loop ('0') or the right loop ('1') [33].
2.8 An illustration of the 4-phase clocking for AQFP circuits [37].
2.9 An example of the three-level signaling of AQFP for a sequence of three inverters; excitation 1 is for the first inverter and so on, respectively [33].
2.10 Two sequentially adjacent gates in a synchronous system. The subscript x marks the clock input of the xth gate. Data and clock path parameters are depicted in red and blue, respectively. INT stands for the interconnect between the gates; REG and COMB stand for register and combinational logic, respectively.
2.11 An abstract illustration of a clocking grid [59].
2.12 An abstract illustration of an H-tree [38].
2.13 An abstract illustration of a clocking serpentine [59].
2.14 An abstract illustration of clocking spines with serpentine routing [38].
2.15 A typical CDN architecture for high performance IC design [58].
2.16 A reproduction of the abstraction of the operating region [10] of a circuit composed of two sequentially adjacent cells as a function of the clock period and the clock skew.
2.17 Natural flow clocking [10].
2.18 The concept of the data-driven self-timed (DDST) scheme [18].
2.19 The general structure of the pulse-driven dual-rail logic (PDDRL) scheme [65].
2.20 The pipeline structure of the delay insensitive technique of [11].
2.21 The functional diagram of RSFQ-AT [20].
2.22 Block diagram of the FLUX-1 microprocessor [76].
2.23 Block diagram of the microprocessor system of [19].
2.24 Block diagram of the TIPPY microprocessor of [17].
2.25 Block diagram of the micro-architecture of the CORE1 microprocessor [77].
2.26 Structural diagram of the 16-bit sparse tree adder [79].
2.27 Block diagram of the butterfly processing unit for signed numbers calculation fabricated in [80].
3.1 A generic diagram for an N-bit CSR.
3.2 A 64-bit CSR clocked using mixed-CSR clocking divided into 4 stages. This architecture was proposed by [14].
3.3 The proposed HCLC for a 32-gate CSR with M=2.
3.4 (a-e) The bars are the yield (left y-axis) at different non-idealities standard deviations, while the line is the average cycle time (right y-axis).
The figures are for different pipeline lengths, and the percentage next to a design name is the clock cycle added margin w.r.t. the setup constraint. (f) The yield with non-idealities 3σ = 100% divided by the cycle time versus different pipeline lengths.
4.1 Basic SFQ cells and their MG models.
4.2 An example of a timed MG.
4.3 An example of an MG with two sources and more than one token per circuit. T_sys = delay(C)/M(C) = 6/2 = 3.
4.4 An example of an MG whose source does not belong to a critical circuit. T_sys = max over all circuits C of delay(C)/M(C) = 3/1 = 3.
4.5 The MG model of a simple hierarchy of the (HC)²LC clocking technique in Chapter 5. LX stands for the Xth level.
4.6 An example of an MG that violates the sufficient constraints but still satisfies (4.1) and (4.19). T_sys = max over all circuits C of delay(C)/M(C) = 8/2 = 4.
5.1 The Hierarchical Chain's Link (HCL).
5.2 A chain of C HCLs.
5.3 A homogeneous clover with N leaves and L gates per leaf.
5.4 The top loop structure.
5.5 Abstract diagram of a 16-clover (HC)²LC network constructed by minimizing the hierarchy depth, with four links per chain (C=4).
5.6 Abstract diagram of a 16-clover (HC)²LC network constructed by minimizing the insertion delay, with four links per chain (C=4).
5.7 An illustrative example of Algorithm 3 in action. gX is the gate with ID X, and its logic level is shown between brackets.
6.1 Two-phase clocking (2PC).
6.2 Abstraction of the operating region of 2PC.
7.1 Overview of the qHC2LC flow.
7.2 qHC2LC: Constructing the CDN.
7.3 qHC2LC: Dynamic co-simulation.
7.4 An illustration of the grid-based variations model [60]. The first coordinate is the hierarchy level (top starts from 0), and the second is an ID within the level.
8.1 The MC (1000 runs) results over the ISCAS'85 benchmark circuits. The y-axis is the yield improvement of (HC)²LC over zero-skew trees at the same cycle time, T_sys. We show the results at different values of T_sys. The x-axis is the amount of variations applied on the circuit, represented by the standard deviation of the gate delays, σ_vars. We vary σ_vars by varying L as previously mentioned. The value shown at each data point is the yield (%) of the (HC)²LC circuit at this point. The results were obtained using 1000 runs as further discussed in Section 8.2.
8.2 The values of T_ins measured in terms of ov for the ISCAS'85 benchmark circuits. prev (red colors) is (HC)²LC constructed without any optimizations, while opt (green colors) is (HC)²LC optimized as in Section 5.4. The experiment was done for two (HC)²LC configurations using C/N/L of 4/4/4 and 6/4/8. For every configuration and benchmark, the bars show the minimum, the average, and the maximum values from top to bottom.
B.1 Example of 4-phase AQFP inter-rows clocking [37].
B.2 Circuit schematic of the SFQ/AQFP interface [35].
B.3 Circuit schematic of the AQFP/SFQ interface [34].

List of Tables

8.1 Some detailed results from the MC analysis of (HC)²LC across ISCAS'85 benchmark circuits.
8.2 Comparison between the unoptimized (HC)²LC constructed as in [98], and the optimized construction following the work of Section 5.4; across ISCAS'85 benchmark circuits.

Abstract

Living on the verge of the IoT era, the entire world is excited about the potential of mining the monumental amounts of data that would become available in the near future. However, this abundance of data requires supercomputers faster and more powerful than ever that do not require a neighboring nuclear plant for power! Single Flux Quantum (SFQ) technology has the potential to meet the booming demands for lower power consumption and higher operation speeds in the electronics industry and future exascale supercomputing systems. Nevertheless, the promised benefits of three orders of magnitude lower power at an order of magnitude higher performance have yet to be attained. In particular, ultra-high-speed clocking of large scale SFQ circuits in the presence of unprecedented levels of timing uncertainties represents a tough obstacle for the technology to advance. In this thesis, we propose an innovative self-adaptive clocking technique which is designed to be resilient in such uncertain environments. Our proposed hierarchical chains of homogeneous clover-leaves clocking, (HC)²LC, inherits its robustness from spatially correlated cell delays and from the timing robustness of the SFQ traditional counter-flow clocking. Our simulations show that averaging over the ISCAS'85 benchmark circuits, at the same speed, and with only an area overhead of 9.00%, (HC)²LC achieves 52.3% and 211.8% yield improvement over zero-skew trees at low and medium ranges of the standard deviation of gate delays, respectively.

Related Publications

[Runner-up of the SNF contest] R. N. Tadros and P. A. Beerel, "A Robust and Tree-Free Hybrid Clocking Technique for RSFQ Circuits - CSR Application," International Superconductive Electronics Conference (ISEC), Sorrento, Italy, May 2017.

R. N. Tadros and P. A. Beerel, "A Robust and Self-Adaptive Clocking Technique for RSFQ Circuits - The Architecture," Circuits and Systems (ISCAS), 2018 IEEE International Symposium on, Florence, Italy, May 2018.

R. N. Tadros and P. A. Beerel, "A Robust and Self-Adaptive Clocking Technique for SFQ Circuits," in IEEE Transactions on Applied Superconductivity, vol. 28, no. 7, pp. 1-11, Oct 2018.

[Under Journal Review] R. N. Tadros and P. A. Beerel, "A Theoretical Foundation for Timing Synchronous Systems Using Asynchronous Structures."

[To Be Finalized] R. N. Tadros and P. A. Beerel, "Using Asynchronous Clock Distribution Networks for Timing SFQ Circuits," a chapter to be included in a book titled "Asynchronous Circuit Applications", by IET (The Institution of Engineering and Technology); the editors are Jia Di and Scott Smith.

[Under Journal Review] R. N. Tadros and P. A. Beerel, "Optimizing (HC)²LC, A Robust Clock Distribution Network for SFQ Circuits."

[Under Journal Review] R. N. Tadros, A. Fayyazi, M. Pedram, and P. A. Beerel, "SystemVerilog Modeling of SFQ and AQFP Circuits."

Chapter 1
Introduction

Whilst the present has big data and supercomputers, the future must analyze even bigger data and therefore requires even more powerful computers. Nevertheless, the power consumption of a modern exascale computing platform is in the range of dozens of megawatts [1]. Even under the most optimistic assumptions, future supercomputers would require an amount of power similar to what is generated by a small power plant [2]. Hence the need for low power and high performance processors.

1.1 SFQ Technology

Meeting these booming demands for lower power consumption and higher operation speeds is challenging because the VLSI industry is approaching the physical limit of semiconductor scaling. Many in the VLSI community are seeking the future of More-of-Moore in beyond-CMOS devices [3].
With a theoretical potential of three orders of magnitude lower power (in the case of non-resistive bias networks [2]) at an order of magnitude higher frequency, superconducting electronics, and Single-Flux-Quantum (SFQ) in particular, has been presented as the definite near future since the late 1980s [4], despite the cryocooling overhead [2]. Nevertheless, these promises have yet to be practically met for complex designs such as a microprocessor. Among many suspects, variability and scalability have been long term obstacles for the technology to advance, compete, and replace silicon CMOS. These issues have been compounded by the absence of an established tool flow, a main trump card of CMOS digital design. First, regarding scalability, CAD tools and their development and evolution throughout the decades of technology advancements have enabled CMOS technology to thrive. The superconducting electronics (SCE) community lacks such an established flow [5-7]. Second, regarding variability, SFQ has a high level of timing uncertainties [5, 8, 9]. In such an uncertain environment, the operation at high frequencies is extremely challenging. This forced a 1 THz device to function at a disastrous 20 GHz clock frequency [8]. None of the known clocking options is satisfactory: zero-skew clock trees are not robust enough [4, 10], radical asynchronous solutions are too expensive [11, 12], and hybrid approaches are too custom to be generalized [13].

1.2 Clocking in SFQ

In particular, clocking becomes more complex when the data flow of the circuit has loops. Such algorithmic loops are inherently present in any state machine, processor, or architecture with feedback. The challenges associated with these loops were discussed in detail for a small circular shift register (CSR) in [14, 15].
Several larger SFQ processors used variants of the globally asynchronous locally synchronous approach (GALS) [13, 16], TIPPY [17] used the data-driven self-timed technique of [18], and the processor in [19] used asynchronous techniques [20]. All of these processors relied on massive manual and custom optimizations, and none of them used a completely synchronous technique. This suggests that clocking algorithmic loops is still, as stated in 1996 [10], an unresolved problem. In particular, the problem of clocking in SFQ is more challenging than in CMOS for the following reasons.

• Due to the peculiar characteristics of SFQ, every logic gate requires a clock to operate. This results in deep gate-level pipelines, which is challenging because: a) there are many more clock sinks than in traditional CMOS circuits, which aggravates the clock distribution network (CDN) design issues; b) the setup and hold times become a larger fraction of the cycle time, which magnifies the significance of clock skew values, makes the circuit more prone to timing violations, and hence further motivates the work in this thesis; and c) pipeline starving [21, 22] occurs, where large portions of the pipeline are empty of data, which reflects a loss of resources.

• As the literature shows, high frequency design of CDNs is not straightforward even in CMOS due to various electromagnetic effects. An ultra-high frequency operation in SFQ requires a reliable ultra-high speed CDN with controlled values of clock skew and jitter. Typically, this is done in CMOS using hierarchical CDNs with grids (meshes), spines, and adaptive de-skewing methods, all of which are either not applicable to SFQ or have never been implemented in it.

• SFQ technology exhibits an unprecedented level of timing uncertainties due to global process variations, local mismatches, RLC parasitics, bias distribution network, flux trapping, thermal fluctuations, quantum fluctuations, local resistors heating, jitter accumulation, and state-dependency operations.
Those uncertainties exacerbate the CDN design challenges.

• The absence of an established commercial and/or academic design flow [5, 6]. CAD tools and their development and evolution throughout the decades of technology advancements are what made CMOS thrive. The SCE community desperately lacks such an established flow. This forced designers either to design their circuits blindly, without reliable evaluation, or to spend an unreasonable amount of time on custom and manual optimizations, sometimes performing several chip re-spins to obtain working parts.

1.3 Thesis Contributions

This thesis studies the clocking problem in SFQ circuits, and discusses and provides a feasible solution aimed at mitigating the timing stumbling block. Its contributions are summarized as follows.

First, we propose the Hybrid Clover-Leaves Clocking (HCLC), which is an asynchronous clock distribution network (ACDN) for the clocking of a circular shift register (CSR), the simplest form of an algorithmic loop that can be used to study timing [14]. For a 32-gate CSR, the proposed technique [23] achieves up to a 93% yield improvement at the same cycle time compared to zero-skew clock trees. Nevertheless, HCLC is ill-suited for pipelines more generic and complex than a basic CSR loop. In particular, its cycle time is O(N_gates), which is highly impractical for large scale VLSI designs.

Second, this thesis's main contribution is the proposal of a robust and self-adaptive clocking technique for generic and complex pipelines, the cycle time of which is independent of the total number of gates. The proposed hierarchical chains of homogeneous clover-leaves clocking ((HC)²LC) inherits its robustness from: 1) the spatial correlation of various sources of variations [8, 9, 24-27], and 2) the timing robustness of traditional counter-flow clocking [4]. We theoretically prove the timing properties of (HC)²LC using the ACDN theory that we establish as well.
Our simulations show that averaging over the ISCAS'85 benchmark circuits, at the same speed, and with only an area overhead of 9.00%, (HC)²LC achieves 52.3% and 211.8% yield improvement over zero-skew trees at low and medium ranges of the standard deviation of gate delays, respectively.

Third, we optimize (HC)²LC to achieve lower power and area overheads in addition to higher variation robustness. This is done by optimizing the algorithms used to build the CDN. In particular, we present an algorithm that optimizes the insertion delay of (HC)²LC. Averaging over the ISCAS'85 benchmark circuits, the maximum T_ins is thus reduced by 32.3%, and the T_ins range is reduced by 75.1%. This also results in an area reduction of 48.44% and a yield improvement of 77.31% over the unoptimized version. Moreover, we present two algorithms: a placement-aware gates-to-clovers assignment algorithm, and a logic-level-aware gates-to-clock-sinks assignment algorithm. This optimized mapping exploits the spatial correlation of uncertainties and the intrinsic robustness of counter-flow clocking, which results in a 5.29% area reduction and a 48.97% yield improvement on average over the ISCAS'85 benchmark circuits, compared to gates assigned arbitrarily. In total, averaging over the benchmark circuits, our optimizations to (HC)²LC achieve a 51.28% area reduction and a 166.74% yield improvement when compared to the unoptimized (HC)²LC.

Fourth, we present the theoretical foundation for asynchronous clock distribution networks (ACDN), where an asynchronous circuit is used to generate the periodic signals necessary for the timing of a fully-synchronous system. This approach takes advantage of the natural adaptiveness of asynchronous circuits to delay uncertainties while still providing the performance advantages of synchronous clocking. In particular, we state the sufficient conditions for an asynchronous system to provide strictly periodic and well-determined signals for a large number of clock sinks.
We model these systems using timed Marked Graphs (MGs) from Petri Net (PN) theory [28] and prove the following. In a live and safe MG with strictly a single token per directed circuit, if the graph has a single source that belongs to a critical directed circuit, then all transitions (or clock sinks) will fire periodically and with well-determined clock time occurrences. This structure can thus provide the timing of all clock sinks required in a synchronous system in the absence of a clock source. Additionally, we prove that by probing only a single transition and adjusting a single programmable delay line, we are able to force the source to belong to a critical circuit even when its exact critical value is not known, thus forcing a well-defined synchronous periodicity. This is particularly useful in systems with high levels of timing uncertainty.

Fifth, we affirm that the previously proposed two-phase clocking (2PC) [10] should be considered a serious contender for solving the clocking problem. In particular, such clocking represents another dimension where extreme performance overhead meets relaxed complexity and area overheads. In this thesis, we formalize this approach, quantify the overheads associated with it, and theorize its performance limits.

Sixth, we present SystemVerilog models for SFQ and AQFP gates, and for the SFQ/AQFP interface circuits. Our models are compatible with SDF (standard delay format) and with existing tools; they offer an easy debugging platform; they are modular, suitable for generalization and reproduction, elegant, and conceptual; and they can encompass many circuit parameters, environmental effects, and advanced SCE phenomena.

Seventh, we build a tools platform, qHC2LC, to build and evaluate the different SFQ clocking strategies.
In its current form, qHC2LC takes an RTL circuit as the input, synthesizes it, constructs the clock network (zero-skew tree or (HC)$^2$LC), generates a test bench, and performs a co-simulation evaluation of design performance, functionality, timing violations, and area overheads. We run qHC2LC successfully over the ISCAS'85 combinational benchmark circuits [29].

Eighth, we propose a grid-based, placement-aware variation model for spatially correlated timing uncertainties. We also incorporate a placement-aware and spatial-correlation-aware Monte Carlo (MC) analysis into qHC2LC in order to compare and evaluate yield.

1.4 Thesis Organization

The thesis is organized as follows. Chapter 2 gives background on 1) SCE technology fundamentals, the main differences from CMOS, and its various sources of uncertainty; 2) timing fundamentals such as synchronization basics, timing violations, and clock distribution networks; and 3) various clocking techniques in SFQ, with a review of some implementations in the literature. The proposed HCLC for CSR applications is presented in Chapter 3. The ACDN theoretical work is in Chapter 4. Chapter 5 discusses our main contribution, (HC)$^2$LC. After that, two-phase clocking is discussed in Chapter 6. Chapter 7 explains our qHC2LC tool, our proposed variation model, and our proposed SV models. Chapter 8 shows the experimental results. Finally, Chapter 9 summarizes the thesis's contributions, discusses possible future work, and states our conclusions.

Chapter 2

Background

In this chapter, we provide the background necessary for understanding the concepts and terms used throughout this thesis. We start with a brief introduction to superconducting electronics (SCE), and Single Flux Quantum (SFQ) in particular. We then shed light on its high level of timing uncertainty.
For the second part of this chapter, we recount the fundamentals of timing and synchronization in VLSI with a focus on clock distribution networks.

2.1 SCE Technology

Superconductivity [30, 31] is the zero electrical resistance and the expulsion of magnetic field in certain materials when cooled below a characteristic critical temperature. It is not the idealization of perfect conductivity based on classical physics, but rather a quantum mechanical phenomenon.

2.1.1 Fundamentals

JJs and SFQ

The main basis of superconducting electronics is the Josephson Junction (JJ) depicted in Figure 2.1. As illustrated in Figure 2.1a, a JJ is formed by two superconductors separated by a small discontinuity. This discontinuity can take the form of a constriction, a point contact, or a thin insulating tunnel barrier [31]. The junction's electrical symbol is shown in Figure 2.1b. The superconductor material that is commonly used is Niobium (Nb), the element with atomic number 41. The works in [30-32] explain the quantum phenomenon of superconductivity and the tunneling properties of JJs.

The IV characteristics of a single junction are abstractly reproduced in Figure 2.1d. First, the off-state (or `0' state) is when the current flowing through the junction is less than a critical value, $I_0$, which is the Josephson supercurrent or critical current. During this state, the current can flow with no voltage drop, i.e., the JJ acts as a short circuit. That state is commonly referred to as the superconducting state. However, this naming was debated after the discovery of tunneling in JJs, since both states occur literally in the superconducting phase [32]. Based on the Ginzburg-Landau interpretation, the off-state is where a Cooper pair forms a pair-tunneling state. The on-state (or `1' state), on the other hand, is a single-particle tunneling state. During this state, the junction possesses a non-linear resistance value as shown in the figure.
The value of $V_{gap}$ is in the range of a few mV, and it depends on the material used (which is Nb throughout this thesis). From Figure 2.1d, it is obvious that the JJ's IV characteristics exhibit hysteresis. This is commonly referred to as the junction being underdamped [4]. For SFQ, in order to get rid of this hysteresis, a small resistance is connected in parallel as in Figure 2.1c, resulting in the junction being overdamped. The resulting IV characteristics of the combined structure are reproduced in Figure 2.1e. It is worth mentioning that the hysteresis is resolved because, during the on-state, the small resistance steers the current away from the JJ, permitting it to return to its off-state without the need for a complete removal of current.

In the first paper of the IEEE Transactions on Applied Superconductivity, Likharev and Semenov [4] summarized the ground properties of Single Flux Quantum (SFQ) technology, where overdamped JJs are connected in such a way that binary information is represented by short quantized pulses, called fluxons (see Figure 2.1f) or Single Flux Quantum (SFQ) pulses, instead of dc voltage levels. An overdamped JJ could leap, producing a fluxon, in a whopping 0.5 ps back in the 1980s!

Figure 2.1: Josephson Junction (JJ). (a) Schematic of a JJ. (b) JJ symbol. (c) Overdamped JJ. (d) IV characteristics of an underdamped JJ. (e) IV characteristics of an overdamped JJ. (f) An SFQ pulse and a simplified graphical representation of it [10].

Signaling

In SFQ, the transmission of SFQ pulses across the junctions can occur using either passive transmission lines (PTL) or their active counterpart, Josephson transmission lines (JTL). Figure 2.2 shows a PTL where two overdamped junctions are connected by a matched microstrip line. The transmission of pulses approaches the speed of light. On the other hand, Figure 2.3 shows a JTL where the junctions are joined by low-inductance strips. Note that the bias current has to be set slightly lower than the critical current value of the junction. This is the basis of fluxon transmission in all SFQ gates.

Figure 2.2: Passive transmission line (PTL) [4], where two overdamped junctions are connected by a matched microstrip line. The line is characterized by its length $L$, its microstrip impedance, and the time of flight of the pulse; $c$ is the speed of transmission, which approaches the speed of light, and $I_b$ is the dc bias current. (a) Schematic. (b) Simulation.

Figure 2.3: Josephson transmission line (JTL) [4], where junctions are connected in parallel by superconducting strips of a relatively low inductance and dc-current biased to their pre-critical state ($I_b < I_0$). (a) Schematic. (b) Simulation.

Those unique basics alter our conventional understanding of the representation of bits. Likharev and Semenov [4] thus formalize the SFQ basic convention as follows:

Arrival of the SFQ pulse to a terminal $S_i$ during the current clock period has a meaning of the binary `1' value of the signal $S_i$, while absence of the pulse during this period is understood as the binary `0' value of this signal.

Figure 2.4: Illustration of the SFQ basic convention and representation of bits using an elementary cell of the SFQ circuits [4]. (a) The general scheme. (b) Signal consequence.

This concept is illustrated in Figure 2.4. From a certain point of view, this means that the propagation delay of any gate in the case of a logic `0' is always zero, and the rise-fall delay mismatch is always of infinite value. This is a good opportunity to point out that blindly treating SFQ integrated circuits the same way designers conventionally treat CMOS circuits will never be the optimum solution.

Basic SFQ Cells

A peculiar characteristic of logic cells in SFQ is that they are all synchronous in the sense that they need an evaluative clock signal.
They all share the structure shown in Figure 2.5, which is an inductive storage loop including a comparator [10]. This structure is known as the interferometer loop [4]. A persistent current [4] flows indefinitely in the loop in a certain direction based on the initial bias values (counter-clockwise in the figure). This specific direction makes $J_1$ critical, i.e., state `0'. If an SFQ pulse occurs at the input, $J_1$ leaps to the on-state, thus reversing the direction of the persistent current and making $J_2$ critical instead, i.e., state `1'. Either current direction would circulate forever without loss, until the state of the loop is evaluated by an SFQ pulse on the CLK input. This evaluation restores the initial state `0' of the interferometer. Every logic circuit is constituted of one or more interferometers. Note that SFQ circuits need a current biasing network. Traditionally, this bias network was resistive, which is the case for Rapid SFQ or Resistive SFQ, known as RSFQ.

Figure 2.5: The schematic of a simple interferometer loop with a clock input, which behaves as a D-flip-flop circuit [10].

On the other hand, there are other cells that perform non-logic functions. Those cells are asynchronous by nature [4]. Figure 2.6 shows the symbols of the basic asynchronous SFQ cells used throughout this thesis. First, the splitter: one of the differences between SFQ and CMOS is that the fan-out of SFQ cells is strictly one, which follows directly from the fact that signals are interpreted as a single SFQ pulse. Figure 2.6a introduces the splitter cell, which solves the fan-out issue and uses current amplification to generate two separate pulses copying the pulse on its input port as soon as it occurs. Second, Figure 2.6b shows the confluence buffer, which generates an output pulse whenever a pulse is detected at either of its input ports. If the two input pulses occur within a certain setup time, only one pulse is generated.
That is what distinguishes it from a combinational OR gate. Third, the coincidence junction (or C-junction), depicted in Figure 2.6c, is similar in functionality to the Muller C-element [10]. An output pulse is spawned if and only if both input ports sense an incoming pulse. Its functionality follows a finite state machine (FSM) as described in [4]: if two pulses arrive sequentially on one input while the other input has received none, the second pulse does not change the state and has essentially no effect. Its primed counterpart is illustrated in Figure 2.6d, where a tweak in the bias current distribution circuit results in the junction being initialized in a different state of the FSM. This means that the first pulse on $in_2$ results in an output pulse, and the cell behaves from then on as an unprimed C-junction.

Figure 2.6: Some SFQ basic cells symbols. (a) Splitter. (b) Confluence buffer. (c) Coincidence junction. (d) Primed coincidence junction.

AQFP

Another SCE logic family is the adiabatic quantum-flux-parametron (AQFP), which possesses a theoretical potential of six orders of magnitude reduction in power [33] when compared to state-of-the-art semiconductor circuits, but at a limited performance. Hence our interest in this logic family: SFQ offers low power at high performance, while AQFP offers ultra-low power at a lower performance. Thus, hybrid SFQ-AQFP systems [34, 35] seem to be a good option when targeting both low power and high performance, which is demanded by every application nowadays. For parts of the chip that are not critical to performance, AQFP could be used to save on the overall SoC (System-on-Chip) power budget. Moreover, SFQ PTLs could be used as a long-interconnect option between AQFP blocks [34, 35], since long interconnects are indeed a challenge for AQFP [33].

AQFP logic is an energy-efficient superconductor logic with zero static power and very small dynamic power due to adiabatic switching operations [36].
AQFP circuits use the adiabatic parametron to reduce the bit energy of operation to the order of the thermal energy [33]. Their operation is based on the polarity of AC currents resulting from the coupling of inductors connected to a SQUID (superconducting quantum interference device) comparator.

Figure 2.7: The schematic of a buffer/inverter AQFP gate. When $I_x$ is activated, an SFQ is stored in either the left loop (`0') or the right loop (`1') [33].

An example of an AQFP gate is shown in Figure 2.7. The parametron is a two-junction SQUID ($J_1$, $J_2$, $L_1$, and $L_2$ in the figure). When the input, $I_{in}$, is applied, an SFQ is stored in either loop based on its direction and the polarity of the mutual inductive coupling within the loops. Then, when the excitation current, $I_x$, is applied to the SQUID, the resultant $I_{out}$ direction is determined by the loop in which the SFQ is stored. This excitation current is perceived as a clocking signal from an abstract point of view.

AQFP circuits thus possess two peculiar characteristics. First, each gate requires an AC clocking signal, or excitation signal. The polarity of this signal determines the exact moment at which the gate evaluates. Depending on the number of phases of the different input AC excitation currents, and on the polarity of the additional DC shifting currents, the circuit possesses a certain number of phases. State-of-the-art AQFP designs [37] use four phases, as shown in Figure 2.8. They use two AC signals with a 90-degree shift, and they use a DC shift to produce the other two.

Figure 2.8: An illustration of the 4-phase clocking for AQFP circuits [37].

Second, the representation of bits uses a three-level convention [33], but with current values. Figure 2.9 shows an example of the signals for a sequence of three inverters. A logic `0' is represented by the current value going to the lowest level and then, after a certain period of time, returning to the middle level.
Similarly, a logic `1' is represented by the current value going to the highest level and then, after a certain period of time, returning to the middle level. When a gate evaluates, it samples the input current values. Those values have to be at either the highest level for `1' or the lowest level for `0'. Then, after the gate propagation delay, the output current should produce the correct logic value correspondingly.

Figure 2.9: An example of the three-level signaling of AQFP for a sequence of three inverters; excitation 1 is for the first inverter and so on, respectively [33].

2.1.2 SFQ vs CMOS

There are plenty of characteristics that distinguish SFQ from conventional CMOS integrated circuits, some of which are more crucial than others. We hereby describe some of them.

1. Representation of bits: As mentioned in this section, SFQ circuits follow the SFQ basic convention, while CMOS circuits are potential-level based.

2. Gate-level deep pipelines: Intrinsically, every logic cell requires a clock or a synchronization signal to operate. This modifies the structure designers traditionally use to abstract a sequential netlist, namely two flip-flops (FF) separated by a cloud of combinational logic. In SFQ, there is no combinational cloud and there are only clocked gates. This results in gate-level deep pipelines, which is challenging because a) of pipeline starving [21, 22], where the bigger part of the pipeline is empty of data, which reflects a loss of resources, and b) there are many more clock sinks than in traditional CMOS circuits, which aggravates the CDN design issues and hence further motivates this thesis.

3. Interconnects: Simple connecting wires do not exist in the conventional sense. As mentioned above, interconnects in SFQ ICs are either a passive transmission line (PTL), which has to be matched, or an active transmission line (JTL) [4]. In particular, passive components are better for long distances, while active components are used for short ones [10].
This is the opposite of CMOS.

4. Power Consumption: The power consumption in RSFQ circuits is independent of the operation, in contrast to CMOS [38]. To elaborate, in CMOS, the static power is conventionally negligible compared to the dynamic power consumption. Meanwhile, in RSFQ, the dynamic power is around 13 nW/gate compared to 800 nW/gate of static power according to [2]. This 60X ratio is due to the resistive dc current bias network, whereas the logic circuits are superconducting, i.e., they have zero resistance by definition. This makes the power directly proportional to the area in traditional RSFQ. Recently, non-resistive dc biasing techniques were introduced. ERSFQ [2] uses an inductive bias network, which is data dependent and requires recurring re-balancing [39, 40], and eSFQ [2] uses a passive network that injects the bias current via clocked interferometers into an inductive network. Both ERSFQ and eSFQ have zero static power consumption. Needless to say, the future of SFQ, if any, relies massively on the development of those dc biasing techniques. Only by achieving an amazing 200X less power than CMOS (taking the cryo-cooling into consideration) at higher speeds could SCE potentially lure the VLSI industry to supersede their Silicon deity for exascale computing [2].

5. Inverting: For almost a century, the logic circuits world absorbed the idea that all circuits are of inverting property due to the complementary `C' in CMOS. Nonetheless, because of the intrinsic properties of SFQ circuits, an inverting gate design is not straightforward. Since the logic `0' is represented by the SFQ convention as nothing happening during a time period, an inverter gate is supposed to react by producing a pulse when nothing has happened, which is counter-intuitive. It is worth mentioning that commercial and non-commercial tools alike treat and optimize circuits assuming that the base gates are inverting.
Not optimizing SFQ circuits accordingly will result in a massive waste of resources.

6. Destructive Read-out (DRO): In SFQ logic gates, the evaluation of the stored data in the interferometer using a clock pulse is a destructive process. The gates are thereby called DRO cells. CMOS registers are fundamentally different in this respect.

7. Rise-Fall Mismatch: As previously mentioned, the fall delay of all gates in SFQ is zero by definition. This particularly high rise-fall mismatch might be exploited by circuit optimizers in order to solve some of the SFQ variability issues.

8. Fan-out: The fan-out of SFQ cells is strictly one. This necessitates the use of a tree of splitters (see Figure 2.6a) to obtain a higher fan-out.

9. Cooling: For the sake of completeness, we emphasize that SFQ circuits need to be cryo-cooled. This is commonly done either by immersing the chips in compressed Helium tanks known as dewars, or by using integrated cryo-coolers. It is worth mentioning that the critical temperature of Niobium is 9.3 K, while the temperature of liquid Helium is 4 K.

2.1.3 Timing Uncertainty in SFQ

As mentioned in Chapter 1, SFQ technology exhibits a high level of timing uncertainty. We hereby recount the known sources of uncertainty. It is worth emphasizing that these do not represent a limitation of the technology or the fabrication process (which has advanced considerably over the past two decades [7, 41, 42]), but rather factors that affect the actual propagation delay of any SFQ gate and that are crucial to take into consideration during the design process.

• Global process variations: Manufacturing-induced variations have a significant impact on the physical and timing parameters in SFQ [8, 9].

• Local mismatches: Due to photolithography diffraction limits, manufacturing tolerances, and layout inaccuracies and asymmetries, local uncertainties should be considered [8, 9, 27].
• RLC parasitics: Though designers tend to consider the impedance values of lumped elements as precise, many parasitics could affect the practical values of those impedances, in addition to their mutual interactions [24].

• Bias distribution network: Mismatches in the bias current alter the operation of the gates and modify the timing parameters [25]. It is true that the use of several off-chip sources of bias current enables calibration that can reduce the effects of these variations. However, as we try to scale these circuits, given that the number of off-chip bias sources is limited, the bias distribution will likely remain a significant source of uncertainty. In particular, several researchers propose dividing large chips into several islands that are biased serially, e.g., using current recycling [43, 44], to reduce the magnitude of the needed bias current. However, this further complicates the distribution system and may therefore induce additional unknowns into the timing parameters of the various islands. Moreover, using complex passive bias distribution networks in ERSFQ or eSFQ [2, 45] exacerbates those variations, as the operation and switching of gates dynamically modifies the bias distribution [39]. Additionally, using alternating current bias as in AQFP [33, 36] introduces different types of uncertainty due to the phase shift and the intra-gate clock delay.

• Flux trapping: During cooling, devices experience a transitional, partially superconducting phase. When a `normal' region is surrounded by superconducting regions, some quanta of flux can reside and become trapped, altering the critical current of the associated JJs [46, 47].

• Fluctuations: a) Thermal: Cooling is imperfect, leading to thermal fluctuations that can cause storage and decision errors [8, 48]. b) Quantum: Variations in the transport properties of Niobium can lead to variations in the critical current of the associated JJs [8].
• Local resistors heating: Both RSFQ and ERSFQ/eSFQ use overdamped junctions, where a resistor shunts every JJ (see Section 2.1), while regular RSFQ uses resistors for the bias distribution networks as well, in addition to other resistors used in various gates or components. Those resistors cause thermal noise, which affects the timing parameters of the nearby gates [26].

• Jitter accumulation: When gates are connected in a sequence, like a sequence of JTLs or splitters, there are additional temporal variations that affect the delays of the gates the further we go downstream in the sequence [49].

• State dependency: The stored state of the load cells affects the delay of the driving cells [50].

• Splitters asymmetry: A splitter's arms are ideally assumed to have the same delay; however, surrounding factors such as biasing or loading break such an ideal assumption [50].

2.2 Timing Fundamentals

Based on the previous discussion, we know that yield failures can emerge from fabrication malfunctions, flux trapping during cooling, dc biasing, or timing violations. This thesis is solely concerned with the last of these. To that end, the remainder of this chapter provides the background on timing and clocking required for understanding this thesis, its contributions, and its challenges. In this section, we review the timing aspects that are similar to conventional CMOS, while Section 2.3 focuses on the history of clocking in SFQ in particular.

2.2.1 Synchronization

A pair of registers are sequentially adjacent if only combinational logic (no sequential elements) exists between the two registers [51]. Figure 2.10 shows two sequentially adjacent cells in a synchronous system [10]. Note that we use $\Delta$ to denote the maximum delay value and $\delta$ to denote the minimum (contamination [38]) delay value.
The difference between a CMOS system and an SFQ system, depicted in Figure 2.10b and Figure 2.10a respectively, is that in CMOS there is a combinational logic block between the register cells, while in SFQ the logic cells are themselves the registers and there is no additional logic between the cells; only the interconnect delay takes place between them. This is because all regular SFQ logic gates are clocked. The main timing conditions that characterize conventional synchronous clocking [51] are: 1) steady-state periodicity, where an SFQ pulse occurs on the clock input every time period $T_{sys}$, and 2) a well-determined skew value, where the difference in the arrival time of the clock between any two clock sinks is well-defined. Those conditions can be formalized for any clock sink, $Gate_n$, as follows [52]

$$\tau_n^{(i)} = T_{up} + T_{skew_n} + i\,T_{sys}, \quad \forall i \ge i_{up} \tag{2.1}$$

where $\tau_n^{(i)}$ is the time of the $i$-th occurrence of the clock signal at $Gate_n$. Here, $T_{up}$ and $i_{up}$ are the warming-up period and occurrence index, respectively, which are constants that represent the transient phase of the clocking system before it enters the periodic steady state [52-54]; those concepts are revisited and described in more detail in Section 4.3. $T_{skew_n}$ is the known skew at $Gate_n$ with respect to a common reference.

Figure 2.10: Two sequentially adjacent gates in a synchronous system. (a) In SFQ system. (b) In CMOS system. The clock input of the $x$-th gate is labeled with the subscript $x$. Data and clock path parameters are depicted in red and blue, respectively. INT stands for the interconnect between the gates; REG and COMB stand for register and combinational logic, respectively.

In the case of a system with feedback, which is the case for most sequential systems, the clock skew is said to be conserved [51].
In other words, the clock skew in a feedback path between any two gates, $G_u$ and $G_v$, is related to the clock skew of the forward path by the following relationship

$$T_{skew_{uv}} = -T_{skew_{vu}} \tag{2.2}$$

2.2.2 Timing Violations

The previous concepts are crucial for the system to avoid timing violations and are usually checked through some form of static timing analysis (STA) [38]. In particular, in order to respect the setup time on an arbitrary data path from a $Gate_u$ to a $Gate_v$, as depicted in Figure 2.10a, the system should satisfy the setup constraint as follows [10]

$$T_{sys} \ge \Delta_{DATA} + T_{setup} + T_{skew}, \tag{2.3}$$

where $\Delta_{DATA} = \Delta_{CQ} + \Delta_{INT}$ is the maximum delay from the clock pulse at $Gate_u$ to the data pulse arrival at $in_v$, $T_{setup}$ is the setup time of $Gate_v$, and $T_{skew}$ is the clock skew between the two gates. Secondly, in order to respect the hold time on that data path, the system should satisfy the hold constraint as follows [10]

$$T_{skew} + \delta_{DATA} \ge T_{hold}. \tag{2.4}$$

Here, $\delta_{DATA} = \delta_{CQ} + \delta_{INT}$ is the minimum delay that can elapse between a clock pulse at $Gate_u$ and the arrival of its associated data pulse at $in_v$, and $T_{hold}$ is the hold time of $Gate_v$.

The value of $T_{skew}$ is known as the systematic clock skew, which is the difference in the arrival time of the clocking signal at various clock sinks [38]. However, when we take the previously discussed uncertainties into consideration, this value cannot be considered fixed. Instead, it should be treated as a random variable. Additionally, other high-frequency environmental variations such as noise, or any irregularities in the clock source, result in clock jitter [38]. Similar to clock skew, clock jitter tightens the timing constraints and forces the system design to accommodate larger margins, and hence limits performance [10, 38, 51].

2.2.3 Clock Distribution Networks (CDN)

As defined in [51], the clock distribution network (CDN) is the network designed to generate the clock signal waveform and deliver it to each register for synchronization.
Conventionally, the common CDN structure is based on equipotential clocking, where the entire network is considered a surface that must be brought to a specific voltage at each half of the clock cycle. Even in CMOS, the design of a CDN is not a trivial task, especially at high frequencies. This is due to the following:

• Reverse scaling [55], where interconnects do not scale at the same rate as devices. When the channel length decreases, the gate delay decreases, but the interconnect delay increases. This keeps the insertion delay, which is the delay from the clock source to the clock sinks, at the same value while the clock cycle decreases. The work of [56] proposed the use of coarser than ordinary lines for the global clock distribution to combat reverse scaling.

• At high frequencies, the clock waveform strays away from the ideal step response due to parasitics, and the Elmore delay [57], which is commonly used to model interconnect delay for design, becomes less accurate [51].

• At high frequencies, delay uncertainty increases due to process and environmental variations, as well as voltage gradients, capacitive and inductive coupling, and load impedance variations. On-chip inductance comes into play, and lossy transmission line behavior starts to manifest in the CDN [51, 58].

• The insertion delay has to be minimized because the probability density functions (pdf) of the skew values depend on the absolute value of the delay. A larger insertion delay would result in more skew uncertainty [59].

We hereby cite the most common CDN techniques used in CMOS VLSI.

1. Grids: A clock grid is a mesh of horizontal and vertical wires driven from the middle or edges. The mesh is fine enough to deliver the clock to points near every clocked element [38]. A grid-based network is shown in Figure 2.11. It results in transmission line effects if a return path is not carefully taken care of, as in a reference plate.
Meanwhile, a grid provides better skew against variations and unbalanced loads. Some of the disadvantages are a significant insertion delay, the waste of a large amount of metal resources, and higher power consumption [38, 59].

Figure 2.11: An abstract illustration of a clocking grid [59].

2. H-trees: An H-tree is a fractal structure built by drawing an H shape, then recursively drawing H shapes on each of the vertices [38], as shown in Figure 2.12. Nominally, it results in zero skew, and it is more efficient than grids because it uses fewer wires. Nevertheless, variations and non-uniform loads complicate the use of H-trees, and they are more prone to random skew and jitter.

Figure 2.12: An abstract illustration of an H-tree [38].

3. Serpentines: They use length-matched point-to-point wires, as shown in Figure 2.13. Nominally, this results in zero skew and is simple to design. However, it results in huge capacitance and wire congestion. Also, it is prone to intra-die variations and inductive self-coupling [59].

Figure 2.13: An abstract illustration of a clocking serpentine [59].

4. Spines: As with the grid, the clock buffers are located in a few rows across the chip. However, instead of driving a single clock grid across the entire die, the spines drive length-matched serpentine wires to each small group of clocked elements [38]. An illustration is shown in Figure 2.14. Like trees, spines have the disadvantage of possibly large local skews between nearby elements driven by different serpentines.

Figure 2.14: An abstract illustration of clocking spines with serpentine routing [38].

5. Ad-Hoc: The clock is routed haphazardly with some attempt to equalize wire and buffer delays to achieve zero systematic skew [38]. This is what most of the available design tools do; however, it does not suit high frequencies. Moreover, one must choose a very conservative skew budget if the ad-hoc method is used.

We herewith mention some additional notes on CMOS CDNs.
• Commonly, designers insert hold buffers on fast paths to fight against local skew. However, global skew may cause any path traversing the boundary to be subject to higher skew [59].

• Power dissipation in the local clock buffers is larger than the power dissipated in the global distribution [59].

• As previously mentioned, the total delay of the distribution has not scaled with the chip cycle time. As a rule of thumb, the faster the distribution, the more robust it will generally be [59].

• Most clock skew sources do not follow Gaussian distributions, so taking the root sum square of the sources is inappropriate. Monte Carlo is a better approach for skews [38].

• Intra-die process variations exhibit spatial correlation, meaning that devices close to each other are more likely to have similar characteristics than those spaced far apart [60].

• At GHz frequencies, RLC models have to be considered, and thus EM solvers are used to analyze the CDN, in addition to shielding and return-path analysis [58].

• One of the commonly used techniques in CDN design is adaptive digital de-skewing [61], where programmable delay lines are connected in a feedback loop so that the systematic skew is modified online and automatically based on the actual delay values.

• Another important technique that is broadly adopted to improve performance is to steal time from short paths and give it as additional slack to critical paths. This is done by using a value other than zero skew in some local parts of the circuit (see equations (2.3) and (2.4)). Many different terms are used to describe this technique, such as double-clocking, cycle stealing, and useful clock skew [51].

• In modern systems that use dynamic voltage and frequency scaling (DVFS), CDN design is more complicated because all interconnect models are frequency dependent [58].

In modern high-performance IC designs, most CDNs include a hybrid hierarchy of various techniques and architectures [58].
Figure 2.15 shows a typical architecture of a CDN with local, regional, and global clock distributions. First, each sink of the regional clock distribution is a local clock buffer that feeds a local clock distribution. The local topology is directly integrated with routing, where the local clocks are typically routed before the signal nets. Ad-hoc, locally balanced buffer trees are commonly used to reduce local skew. Second, regional clocks are routed on intermediate or global metal lines. Spines or grids are used, or sometimes trees if there is a limited power budget and performance can be traded off. Very large, custom-designed buffers with decoupling capacitors are used instead of standard cells, with a high level of redundancy to eliminate systematic skew. Third, symmetrical trees are used for the global distribution to reduce die-to-die variations. Most of the wires should be shielded, and this is the level where de-skewing is incorporated, whether static or on-line active de-skewing.

Figure 2.15: A typical CDN architecture for high performance IC design [58].

On the other hand, as previously discussed, SFQ chips have some peculiar characteristics that unfortunately impede this three-level strategy. First, it is not clear how to perform de-skewing in SFQ. In particular, the mechanisms needed to monitor the clock skew and adjust programmable delay lines accordingly have not been well studied. Second, due to the quantum-based transmission, grids seem physically incompatible with SFQ pulses. Third, all logic gates are clocked [4], forcing the pipeline to be gate-level deep (see Section 2.1.2), which results in a far higher number of clock sinks than is typical for CMOS chips. Those factors show that ultra-high frequency clock distribution in SFQ is a problem of a different level of complexity than in CMOS. The following section discusses the previous efforts made towards solving this problem.
2.3 Clocking in SFQ

This section briefly surveys the previously proposed solutions for clocking and CDNs in SFQ. It starts by explaining natural flow clocking, then cites a few different attempts to replace the conventional zero-skew trees, and finally goes through some previous SFQ processors and other circuit designs, focusing on how the clocking was done in each of them.

2.3.1 Natural Self-Timing

There exist two naturally self-timed clocking techniques that characterize SFQ signaling. We will refer to them in this thesis as flow clocking. In order to conceptually perceive those techniques, let us start by rewriting the timing violation equations (2.3) and (2.4) from Section 2.2.2 as follows

$$T_{sys} \ge \Delta_{DATA} + T_{setup} + T_{skew}, \tag{2.5}$$

$$T_{skew} + \Delta_{DATA} \ge T_{hold}. \tag{2.6}$$

The authors of [10] have brilliantly abstracted those equations into a graph, reproduced in Figure 2.16. The horizontal axis is the skew value $T_{skew_{uv}}$ from a gate $G_u$ to a gate $G_v$ which are sequentially adjacent, like the ones in Figure 2.10a. The vertical axis is the clock cycle time $T_{sys}$. Between any two sequentially adjacent gates, there will be a single operating point at the intersection of the actual $T_{skew}$ and $T_{sys}$ values. $T_{min}$ is the theoretical minimum limit of the cycle time for any synchronous system:

$$T_{min} = T_{hold} + T_{setup} \tag{2.7}$$

Figure 2.16: A reproduction of the abstraction of the operating region [10] of a circuit composed of two sequentially adjacent cells as a function of the clock period and the clock skew.

The above equation is a pipelining fundamental and is intuitively the minimum value that could satisfy both the setup and the hold constraints. This is the darker-red region underneath $T_{min}$; no operating point can be located there. $skew_0$ is the minimum skew value that can be used before violating the hold constraint in (2.6):

$$skew_0 = -\Delta_{DATA} + T_{hold} \tag{2.8}$$

Any operating point to the left of this vertical barrier results in a hold violation, as shown.
Note that if $T_{skew} < skew_0$, no value of $T_{sys}$ can make the system function properly, which reflects how hold violations directly impact yield and not performance. On the other hand, any operating point to the right of the tilted line results in a setup time violation. At that point, increasing $T_{sys}$ solves the issue, trading performance for functionality. $T_0$ is the minimum cycle time that can be used assuming zero skew in (2.5):

$$T_0 = \Delta_{DATA} + T_{setup} \tag{2.9}$$

Obviously, zero-skew clocking is any operating point lying on the vertical axis. Note that the setup time limit is a unity-slope line drawn from (2.5). It can also be seen that zero clock skew is in no respect advantageous or special compared to other values of clock skew [10].

Natural flow clocking is a straight-line clocking scheme in which the magnitude of the clock skew equals the propagation delay through the clock path between two adjacent cells. This value equals the delay of a single splitter plus the delay of an interconnecting JTL [10]. First, counter-flow clocking is shown in Figure 2.17a, where the clock is linearly connected in the opposite direction to the data flow, resulting in an always positive clock skew. This means that, characteristically, counter-flow clocking is a robust design strategy since it gives more slack to the hold constraint. However, it strains the setup constraint more, resulting in lower performance. On the other hand, in concurrent-flow clocking, depicted in Figure 2.17b, the clock is linearly connected in the same direction as the data flow, resulting in an always negative clock skew. This means that, characteristically, concurrent-flow clocking can achieve the highest performance, which is limited only by the intrinsic speed of the gates rather than by the clock network. Nevertheless, the hold constraint gets more restricted, resulting in a less robust architecture.

Figure 2.17: Natural flow clocking [10]. (a) Counter-flow clocking. (b) Concurrent-flow clocking.
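The operating region described above can be expressed as a small classifier. The following is a minimal Python sketch, assuming the setup and hold constraints take the form of (2.5) and (2.6); the function name, the argument names, and the numbers in the usage note are hypothetical and serve only as an illustration.

```python
def classify_operating_point(t_skew, t_sys, d_data, t_setup, t_hold):
    """Classify one (t_skew, t_sys) point for two sequentially adjacent gates.

    All arguments share one time unit (e.g. ps). d_data is the data-path
    delay between the two gates.
    """
    # Minimum skew that still satisfies the hold constraint, as in (2.8).
    skew_0 = t_hold - d_data
    if t_skew < skew_0:
        # Left of the vertical barrier: no value of t_sys can fix this.
        return "hold violation"
    if t_sys < d_data + t_setup + t_skew:
        # Beyond the unity-slope line of (2.5): a larger t_sys fixes this.
        return "setup violation"
    return "ok"
```

For instance, with the hypothetical values d_data = 8, t_setup = 4, and t_hold = 3, a zero-skew point needs t_sys of at least 12, while the point (t_skew = -5, t_sys = 7) sits exactly at the corner where the cycle time equals t_hold + t_setup, in line with (2.7).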
Despite how natural and simple those clocking strategies are, they are straight-line clocking schemes. Therefore, in the case of a feedback loop, where the summation of skew over any loop has to be zero following the conservation of skew discussed in Section 2.2.1, those strategies cannot be solely employed. Hybrid structures are discussed in Chapter 3.

2.3.2 SFQ clocking techniques

As previously mentioned, designing a CDN for CMOS clocking at high frequency is very challenging. The problem is tougher for SFQ: it exhibits a high level of timing uncertainty (see Section 2.1.3), its CDN has a far higher number of clock sinks due to gate-level pipelining, and it lacks de-skewing buffers and grids. Moreover, the exclusive use of zero-skew clock trees on a very large scale seems impractical. Section 2.3.3 reviews the previously designed processors in SFQ, and shows that none of those processors used a fully synchronous CDN. The reasons why a zero-skew clock tree cannot be used on a very large scale SFQ chip can be summarized as follows.
• The uncertainty of physical design due to fabrication, in addition to parameter fluctuations [25, 62], is inconsistent with a zero-skew scheme that requires exhaustive knowledge of every parameter in every branch.
• A large-scale inductive clock network carrying a very high speed wave is susceptible to various electromagnetic effects, especially since no grid or adaptive de-skewing buffers [58] can be naturally implemented in SFQ.
• On-chip temperature fluctuations intensify clock jitter and skew [48, 62].
• Flux trapping [46] affects the current bias distribution network [63], and hence alters any designed tree symmetry.
We hereby review some of the clocking and circuit techniques that were proposed to provide more robust solutions than zero-skew trees, more practical solutions than natural self-timed flow clocking, and faster solutions than asynchronous handshaking [22], which was established to be a slow option in SFQ [4].

It is worth mentioning that the SFQ community tends to use the concept of margins as a measure of robustness. A circuit with high margin values for the various circuit parameters (such as the inductance of the interconnects, the junction areas, the dc bias current intensities, etc.) is one in which those parameters can vary by those margin values while the circuit still functions properly. Lower margins indicate a very low yield because of variability.

The authors in [18] proposed the data-driven self-timed (DDST) scheme, whose basis is shown in Figure 2.18. They use single-ended logic followed by a dual-rail D-FF. The local timing signal is generated from the complementary input data. The complementary output data are regenerated by the D-FF, which carries the timing information to the next block. Dual rail provides more robustness but it is expensive; DDST works around that by using dual rail only in the D-FF parts. Nevertheless, a strong timing assumption has to be made to satisfy the setup constraint of the D-FF, which means that a delay line is needed whose delay is matched to the amount of logic used. This resulted in narrow margins. Additionally, they faced issues in interfacing this technique to the other conventions. In [64], the authors tried to embed a form of vector handshaking into their design as an attempt to make it faster. Unfortunately, they ended up with a design with even lower margins.

Figure 2.18: The concept of the data-driven self-timed (DDST) scheme [18].

The pulse-driven dual-rail logic (PDDRL) was proposed in [65, 66] and its general structure is depicted in Figure 2.19. It uses dual-rail logic gates with input
completion sensing circuitry similar to the one used in DDST. The internal timing signal T is generated after all inputs have arrived and is responsible for resetting the logic gate so that it is ready for the following set of inputs. The main advantage is that this should achieve average-case delay without any timing assumptions except the isochronous fork [22], which implies that the input signals arrive before T. Despite the very good margins achieved by such designs, they are slower than simple SFQ timed logic gates, and the area overhead is at least 4x (recall that power is linearly proportional to area in SFQ). Moreover, asynchronous structures have testing limitations.

Figure 2.19: The general structure of the pulse-driven dual-rail logic (PDDRL) scheme [65].

The authors of [11, 67] have proposed a delay insensitive (DI) [22] approach depending only on dual-rail circuitry. They designed a set of primitive DI cells and a special join cell that transforms any two dual-rail inputs to the four truth table outputs. Then, depending on how the truth table outputs are wired, one can implement any desired logic. They then used output completion sensing circuitry to implement a handshaking-based pipeline that performs faster than conventional handshaking. This pipeline structure is shown in Figure 2.20. The advantages of this technique are the absence of timing constraints, since it is DI, and the simplicity and modularity of the design. The resulting robustness was significantly higher than that of previous efforts; however, the complexity of dual rail and the expensive implementation in terms of area and power were not appealing to the SFQ community, especially since it was slower than synchronous SFQ because of the completion detection parts.

Figure 2.20: The pipeline structure of the delay insensitive technique of [11].
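The idea of a join cell followed by output wiring can be illustrated at a purely logical level. The following Python sketch is an abstraction under stated assumptions and not the circuit of [11, 67]: it ignores pulse timing and completion sensing, represents a dual-rail signal as a hypothetical (true_rail, false_rail) pair of booleans, and all function names are made up for illustration.

```python
def join(a, b):
    """Map two dual-rail inputs to the four one-hot truth-table outputs.

    Each input is a (true_rail, false_rail) pair; exactly one rail is set
    for a valid data token.
    """
    a_t, a_f = a
    b_t, b_f = b
    return {
        (0, 0): a_f and b_f,
        (0, 1): a_f and b_t,
        (1, 0): a_t and b_f,
        (1, 1): a_t and b_t,
    }


def wire(rows, true_rows):
    """Merge truth-table outputs into one dual-rail output.

    Rows listed in true_rows drive the true rail; all others drive the
    false rail.
    """
    t = any(v for row, v in rows.items() if row in true_rows)
    f = any(v for row, v in rows.items() if row not in true_rows)
    return (t, f)


ONE, ZERO = (True, False), (False, True)


def dual_rail_and(a, b):
    return wire(join(a, b), {(1, 1)})


def dual_rail_xor(a, b):
    return wire(join(a, b), {(0, 1), (1, 0)})
```

In this abstraction, no output rail is set until both inputs carry a valid token, which mirrors the delay insensitive behavior described above; choosing a different set of true rows yields a different logic function from the same join.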
The Boolean RSFQ or BSFQ work of [68, 69] seemed promising at first glance because it went out of the box, trying to use DC/SFQ [4] circuits to implement a technique that lies between the level-based convention (as in conventional CMOS) and the basic SFQ convention. Every level signal is converted to two SFQ signals: a set and a reset. At positive edges, a pulse is generated on the set signal, while at negative edges a pulse is generated on the reset signal. Nevertheless, the authors failed to integrate the circuits together, as any bigger design resulted in extremely low margins. Moreover, their architecture was synchronous in nature, so it did not solve the clocking problem in the first place.

The authors of [70-72] have proposed the RSFQ asynchronous timing, or RSFQ-AT, clocking technique. The concept of RSFQ-AT is illustrated in Figure 2.21. Every data signal has a clock signal associated with it in a source-synchronous manner. The delay value D is for local timing constraints and has to match the delay of the logic gates (similar in concept to asynchronous bundled-data channels [22]). Their claimed advantages are high modularity, a good interface with any other convention, the possibility of embedding its design within semiconductor CAD tools, and average-case performance. The disadvantages are that the local timing constraint results in bad margins (similar to DDST), and the limitation in testing that applies to any asynchronous structure.

Figure 2.21: The functional diagram of RSFQ-AT [20].

The work in [73, 74] proposed reducing the conventional synchronization overhead by using a hybrid wave-pipelining approach, in which the clock frequency is increased by allowing multiple data waves to exist in any stage of the pipeline.
The two major design challenges that face wave-pipelining are 1) preventing collision of unrelated data waves, and 2) balancing (equalizing) delay paths in order to reduce differences between the longest and shortest delays through the combinational logic. The authors claimed that those could be solved by holding signals so that the next stage does not start operating until all the signals from the previous stage are available. In hybrid datapaths, the stage with the largest delay difference determines the clock cycle time of the datapath. They introduced a special clockless cell, C-FF, which is known as a resettable Muller-C element. We do not believe that SFQ is ready for wave-pipelining, since normal pipelines already have robustness issues in clocking.

2.3.3 Previous SFQ Chips Clocking

In this section, we review some of the processors in the SFQ literature and how the clocking was performed in them. Moreover, we go through some additional design cases and review their clocking as well.

SFQ CPUs

First, FLUX-1R [13, 75, 76] is an 8-bit microprocessor. It uses a globally asynchronous locally synchronous (GALS) architecture in which the 5000 gates are divided into 5 locally synchronous islands. The designers then used post-Niobium calibration to de-skew some buffers in order to synchronize the five different global clocks. They used a completely manual layout in which they embedded PTL transceivers into every gate, and they used a sky ground plane to reduce the crosstalk and the interaction between different return paths. The 8-bit operation was achieved using bit-stream processing with 8 single-bit ALUs, as depicted in Figure 2.22. The clock frequency was 24 GHz, achieving 1.26 GOPS (giga operations per second) in a 1.75 µm technology. To obtain better results, they had to fabricate a second design; the first is FLUX-1 and the enhanced version is FLUX-1R. The block diagram of the FLUX-1 architecture is depicted in Figure 2.22.

Figure 2.22: Block diagram of the FLUX-1 microprocessor [76].
Second, the processor in [19] used a hybrid clocking structure that combined conventional synchronous RSFQ with the RSFQ-AT technique (see Section 2.3.2), but the overall system operated asynchronously. The design was a 4-bit RISC (Reduced Instruction Set Computer); it was not fabricated, and it consisted of only 800 logic gates. Simulation results show 1.6 GOPS in a 1 µm technology. The block diagram of the microprocessor system is depicted in Figure 2.23.

Figure 2.23: Block diagram of the microprocessor system of [19].

Third, TIPPY [17] is an 8-bit microprocessor. It used the DDST technique (see Section 2.3.2), where every register block has its own local clock generator and its own local clock tree. The ALU is bit-serialized, so it has several 1-bit ALUs. The design has only 500 gates and worked at 6.25 GHz. The designers re-designed and re-fabricated it twice to achieve those results in TIPPY-3. The block diagram of the TIPPY architecture is depicted in Figure 2.24.

Figure 2.24: Block diagram of the TIPPY microprocessor of [17].

Fourth, several Japanese institutions have collaborated over the years in the COmplexity REduced (CORE) project [16, 77, 78]. The main concept of the CORE project is to reduce the circuit complexity in exchange for using high clock rate SFQ circuits. Therefore, they decided to use bit-serial processing as the most suitable way to realize such a simple architecture, which reduces the number of JJs and JTLs and decreases the difficulty of the timing design, though it increases the required clock cycles per instruction. The results mentioned here are those of their most recent effort, COREe4 [16], in 2016. It is an 8-bit microprocessor with 20 instructions. The frequency of operation is 2 GHz, while the bit-serial processing speed is 50 GHz. The processor has around 700 gates. As in CORE1 (whose architecture is shown in Figure 2.25 and is the same as the one in COREe4), they use a GALS variant where each local part has its own clock.
Some of those parts have a local clock tree, and some others use concurrent-flow clocking. The blocks use a kind of asynchronous handshaking to communicate.

Figure 2.25: Block diagram of the micro-architecture of the CORE1 microprocessor [77].

All of these processors relied on massive manual and custom optimizations, and none of them used a fully synchronous CDN. This supports our claim that clocking algorithmic loops in SFQ is still, as stated in 1996 [10], an unresolved problem.

Other designs

We hereby review a couple of other large designs in SFQ and discuss their clocking and their speeds.

First, in the work of [79], the authors fabricated a 16-bit sparse tree adder, shown in Figure 2.26, which consists of 10,000 junctions. For clocking, they used the asynchronous hybrid wave-pipelined technique previously mentioned (see the end of Section 2.3.2). They simulated the design successfully at 38.5 GHz, and the fabricated chip functioned properly in low-speed testing. However, the high-speed test at 30 GHz failed. We personally believe this is due to variations that the authors did not anticipate.

Figure 2.26: Structural diagram of the 16-bit sparse tree adder [79].

Second, in the work of [80], the authors fabricated a 64-bit butterfly processing unit. The chip had 60,000 junctions and was successfully tested up to 51.6 GHz. They used a conventional zero-skew clock tree, and the processing is bit-serialized as always. Meanwhile, to combat variability, they applied a different bias current to each block separately, which they calibrated in a low-speed test.

It is crucial to point out that the previous two designs are processing units which are linear pipelines, i.e. no algorithmic loops exist in the design. Still, our claim is intact; no one has succeeded in using solely a zero-skew tree in a large scale SFQ chip with algorithmic loops.

Figure 2.27: Block diagram of the butterfly processing unit for signed numbers calculation fabricated in [80].
Chapter 3

Hybrid Clover-Leaves Clocking (HCLC)

In this chapter, we propose the Hybrid Clover-Leaves Clocking (HCLC) [23], which is an asynchronous clock distribution network for the clocking of a circular shift register (CSR). This is a new clocking technique comprised of synchronized hybrid clock loops, whose frequency is intrinsically determined by the clock architecture. This tree-free scheme reduces the area, power, and complexities associated with traditional clock distribution networks. The timing constraints are locally defined within the hybrid clock loops of the architecture, which improves robustness to variations. As an example, for a 32-gate CSR, under a model of moderate local and global variations, our experimental results show up to 93% yield improvement at the same cycle time compared to zero-skew tree clocking.

The remainder of the chapter is organized as follows. We start with a more detailed discussion of the clocking of loops in particular. Then we explain the proposed clocking scheme, and suggest some ideas for extending and scaling it. After that, we discuss the experiments we performed to support our claims. In the end, the conclusions are discussed.

3.1 Clocking of Loops

The clocking of an algorithmic loop can be studied using the simple circuit of a CSR. Clocking a CSR is more complex than clocking a linear pipeline because of the additional timing constraint that the overall clock skew of a CSR must be zero [10, 14], which emerges from the skew theorem in [51]. Consequently, as explained in [14], the exclusive use of either concurrent-flow clocking or counter-flow clocking is not possible (see footnote 1). This leaves us with only two solutions for clocking the CSR: either the conventional CMOS clocking with a zero-skew tree [4, 10], or a mixture of concurrent-flow and counter-flow clocking [14, 15].

First, let us consider zero-skew trees. The problems with this clocking were extensively explained in Chapter 2.
Footnote 1: The use of concurrent flow is theoretically possible with a large clock margin equal to the summation of the skews over the entire loop, taking into consideration that the clocking between the first and last element would be counter-flow. Nevertheless, this is too impractical to be considered.

Figure 3.1 shows a generic illustration of the architecture used for a CSR. In a zero-skew scheme, the clocks come from a single source, which is spread out evenly using splitters and JTLs so that

$$\forall i,j: \quad t_{clk[i]} - t_{clk[j]} = 0, \tag{3.1}$$

where $t_x$ is the instant at which the event $x$ occurs. The worst setup constraint is

$$T_{sys} \ge \Delta_{CQ} + \Delta_{JTL} + \Delta_{MUX} + T_{setup}, \tag{3.2}$$

where $T_{sys}$ is the system cycle time, and $T_x$ is the $x$ characteristic of the gates, assuming they are all the same for simplicity. The worst hold constraint is

$$\Delta_{CQ} + \Delta_{JTL} \ge T_{hold}. \tag{3.3}$$

Figure 3.1: A generic diagram for an N-bit CSR.

Second, there is the mixture of concurrent-flow and counter-flow clocking, referred to here as mixed-CSR. An example of this implementation is shown in Figure 3.2. Another implementation was proposed in [14, 15], where a tunable bias line is used to mitigate a racing problem between clock and data in the concurrent clocking portion of the design. The issues with this clocking are cited below.

Figure 3.2: A 64-bit CSR clocked using mixed-CSR clocking divided into 4 stages. This architecture was proposed by [14].

• The structure of the clock distribution requires a single point of divergence and convergence that is not clearly mappable to generic pipelines.
• The global timing constraints define the clock frequency and are adversely affected by uncertainties.
• Timing jitter resulting from using a clock source aggravates setup constraints.
• The benefit of the technique relies on the ability to superimpose the clock and data flows, which breaks down for complex pipelines.

In conclusion, this hybrid clocking scheme is specific to a CSR.
In this chapter, when comparing with the mixed-CSR, we eliminate the tunable bias and instead feed the clock from the middle (at $D_{out}$) of the CSR to avoid the aforementioned race. Thus, the clocking is counter-flow for the first half of the CSR ($G_0 \ldots G_{N/2-1}$) and concurrent for the second half ($G_{N/2} \ldots G_{N-1}$), such that

$$t_{clk[i]} - t_{clk[i+1]} = T_{flow} \quad \text{for } i = 0, \ldots, \tfrac{N}{2}-2,$$
$$t_{clk[i]} - t_{clk[i+1]} = -T_{flow} \quad \text{for } i = \tfrac{N}{2}, \ldots, N-2,$$
$$t_{clk[N/2-1]} = t_{clk[N/2]} \quad \text{and} \quad t_{clk[0]} = t_{clk[N-1]}, \tag{3.4}$$

where $T_{flow}$ is the delay of a gate from its clock input to its clock output (which typically equals one splitter delay, as a splitter is commonly used here), assuming for simplicity that it is the same for all gates. Assuming $\Delta_{MUX} > T_{flow}$, the worst setup constraint is the same as the zero-skew one in (3.2). Otherwise, the worst constraint is within the counter-flow half, as follows

$$T_{sys} \ge \Delta_{CQ} + \Delta_{JTL} + T_{setup} + T_{flow}. \tag{3.5}$$

Meanwhile, the worst hold constraint is within the concurrent half, as follows

$$\Delta_{CQ} + \Delta_{JTL} \ge T_{flow} + T_{hold}. \tag{3.6}$$

3.2 The Proposed Clocking Scheme: HCLC

In this section, we introduce the Hybrid Clover-Leaves Clocking (HCLC), which is a hybrid tree-free clocking scheme. It is the first step towards our main contribution in this thesis, which is proposed in Chapter 5.

3.2.1 The Architecture

The HCLC architecture is illustrated on a 32-bit CSR in Figure 3.3. It consists of M pairs of hybrid clocked loops, or leaves, where each pair has one concurrent-flow leaf and one counter-flow leaf. Each pair of leaves has a single clock input ($Clk_{IN}$) that is split to both of them, similar to the mixed-CSR in [14]. The two output clocks of the pair are combined using a coincidence junction to provide a single feedback clock output. The feedback clocks from the M pairs are combined using a tree of coincidence junctions to produce $Clk_{FB}$. The architecture is tree-free; it does not have a clock input signal, but rather a GO pulse that starts the operation by producing the first pulse of $Clk_{IN}$.
This signal is split using a tree of splitters to feed every pair of leaves. $Clk_{IN}$ is generated by either $Clk_{FB}$ or GO using a confluence buffer. The frequency of operation is intrinsically determined by the longest clock path. The whole clover is considered counter-flow clocked relative to the input $D_{in}$, and this is achieved by the output $Clk_{OUT}$.

Figure 3.3: The proposed HCLC for a 32-gate CSR with M=2.

The key advantages of the HCLC over zero-skew H-trees are: i) it is resilient to uncertainties, because it is self-adaptive: its frequency is determined by the speed of its slowest leaf; and ii) it is more robust against setup and hold violations. This is because all the timing constraints are local and hence benefit from the spatial correlation of most of the sources of variations mentioned in Section 2.1.3. This correlation between data and clock, and how it adds to the timing robustness, was analyzed in the CMOS work of [58, 81].

Section 3.3 demonstrates the higher robustness to uncertainty. Additionally, HCLC benefits from the elimination of the clock distribution tree with all the associated design complexities, such as area and power overhead, inductive network effects, clock jitter and timing uncertainty, and global timing constraints (Section 3.2.2). Moreover, HCLC eliminates the electromagnetic problems associated with the synchronous switching of the whole chip at the same time, is more scalable, and eases the incorporation of conditional clocking and clock gating.

3.2.2 Timing Analysis

Two main points make the HCLC a very promising clocking technique: i) the frequency is self-determined locally in each leaf, which makes it resilient to uncertainties; and ii) all the timing constraints are local. First, the clover's cycle time $T_{sys}$ is

$$T_{sys} = (L_{max} + 1)\,\Delta_{sp} + \Delta_{cf} + \log_2 N \cdot \Delta_{ov} \tag{3.7}$$

where $L_{max}$ is the maximum leaf length.
In other words, it is the number of gates in the longest leaf, which equals $N_{gates}/(2M)$ for a total number of gates $N_{gates}$ with equal-length leaves. $N$ is the number of leaves per clover, which equals $2M$. $\Delta_{cf}$ is the confluence buffer delay, and $\Delta_{ov}$ is the overhead delay, which is equal to the sum of the coincidence junction delay ($\Delta_{cc}$) and the splitter delay ($\Delta_{sp}$). We assumed that $T_{flow} = \Delta_{sp}$, since a splitter is practically used in flow clocking as previously mentioned.

The worst setup constraint is the same as for the mixed-CSR in (3.2) or (3.5), based on the relation between $\Delta_{MUX}$ and $T_{flow}$, and the worst hold constraint is inside the concurrent-flow loops, as in (3.6). All of those timing constraints are determined locally within one leaf, and can be different among different leaves.

3.2.3 Extending and Scaling

The focus of this chapter is specifically on the CSR application rather than on presenting the HCLC as a general clocking scheme. The latter is left for Chapter 5, where HCLC is modified and used within a bigger clocking scheme. Nevertheless, we suggest here a number of modifications that could lead to extending this approach and scaling it to larger, more complex networks, in a different way than in Chapter 5.
• The most intuitive solution is grouping the smallest pipelines into clovers; then each clover can be grouped into a higher level clover recursively, creating a hierarchy of clovers within clovers. However, the clock frequency would depend on the total number of gates.
• Conditional conjunction between clocks from different interleaving clovers could be used to accommodate pipelines more complex than CSRs.
• Inserting an NDRO (non-DRO) gate inside the clock path of the HCLC, then setting/resetting it using a system control unit, enables clock gating.
• HCLC is not restricted to algorithmic loops or circular pipelines; the same scheme can be used for linear pipelines. Only the clock has to be constructed as shown in Figure 3.3; the data flow can be arranged differently.
• A different hybrid approach can be implemented as well. For instance, a single gate in any leaf can be replaced by a zero-skew cluster with a local H-tree, and hence the HCLC would be used only as the global part of a GALS system.
• The clovering can be used only on low level gates; the clovers could then be grouped in a different hierarchical architecture that waits for the feedback clock as a form of asynchronous acknowledgement. This is the conceptual basis of the (HC)²LC in Chapter 5.

3.3 Experimental Results

In an attempt to quantify the benefits and the feasibility of the proposed HCLC, we have built our own simulation platform to simulate the zero-skew, mixed, and HCLC schemes for CSRs only, and compared their performance and yield.

3.3.1 Simulation Platform

We used the SystemVerilog models presented in Section 7.3 to build the CSR in Figure 3.1 using the different clocking schemes. We used DFFs as in [8] for the gates, and JTLs as the data path between gates. The multiplexer used is a special design using two NDRO cells to save the selection setting, two splitters for the selection signals, and a confluence buffer for the output.

Our test bench operates as follows. First, a load pulse is generated, then a sequence of bits is fed to the CSR, then a circulate pulse is generated, and after that we wait for a number of complete circulations. The output data sequence is compared to a golden sequence, and timing violations are checked using the timing check directives of SystemVerilog incorporated into our DFF model. The test continues with different input sequences.

All the gate delays and parameters were chosen following the work in [8]. We assumed the gate delays are independent Gaussian random variables, then ran Monte Carlo with global and local variations, using $3\sigma_{global} = 20\%$ and $3\sigma_{local} = 10\%$, as in [20, 25]. Also, we assumed another factor to represent a non-ideal distribution of delays, resulting from layout asymmetries and mismatches.
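To make the variation model and the cycle-time expression (3.7) more concrete, the following Python sketch samples gate delays and propagates them through a clover. It is a minimal illustration and not the SystemVerilog platform used in this chapter. The multiplicative combination of one global factor per chip with one local factor per gate is an assumption, the slowest leaf is assumed to set the loop delay, and all function names and delay values are hypothetical.

```python
import math
import random


def hclc_cycle_time(n_gates, m_pairs, d_sp, d_cf, d_cc):
    """Nominal clover cycle time following (3.7), for equal-length leaves."""
    n_leaves = 2 * m_pairs            # N = 2M
    l_max = n_gates // n_leaves       # L_max = N_gates / (2M)
    d_ov = d_cc + d_sp                # overhead: coincidence junction + splitter
    return (l_max + 1) * d_sp + d_cf + math.log2(n_leaves) * d_ov


def sample_delay(nominal, g, sigma_local, rng):
    """One delay sample: shared global factor g times a local factor (assumed model)."""
    return nominal * (1.0 + g) * (1.0 + rng.gauss(0.0, sigma_local))


def monte_carlo_cycle_times(n_gates, m_pairs, d_sp, d_cf, d_cc,
                            sigma_global, sigma_local, trials, seed=0):
    """Sample the clover cycle time, letting the slowest leaf set the loop delay."""
    rng = random.Random(seed)
    n_leaves = 2 * m_pairs
    l_max = n_gates // n_leaves
    levels = round(math.log2(n_leaves))   # assumes n_leaves is a power of two
    results = []
    for _ in range(trials):
        g = rng.gauss(0.0, sigma_global)  # one global draw per simulated chip
        slowest = max(
            sum(sample_delay(d_sp, g, sigma_local, rng) for _ in range(l_max + 1))
            for _ in range(n_leaves)
        )
        overhead = sum(
            sample_delay(d_cc, g, sigma_local, rng)
            + sample_delay(d_sp, g, sigma_local, rng)
            for _ in range(levels)
        )
        results.append(slowest + sample_delay(d_cf, g, sigma_local, rng) + overhead)
    return results
```

With $3\sigma_{global} = 20\%$ and $3\sigma_{local} = 10\%$, one would pass sigma_global = 0.20 / 3 and sigma_local = 0.10 / 3. With both set to zero, every sample reduces to the nominal value of (3.7), which gives a quick sanity check of the sketch.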
We simulated over dierent values of standard deviation as shown in Section 3.3.2. For synchronous schemes, we assumed a uniformly distributed clock source jitter with = 5:8% as in [15]. 3.3.2 Comparisons We used our simulation platform to simulate the HCLC with dierent M val- ues, the zero-skew, and the mixed-CSR. The results are shown in Figure 3.4. For 66 synchronous architectures, we simulated using three dierent cycle times with dif- ferent margins, based on the respective setup constraints. The results show that the HCLC provides a much more robust solution than zero-skew clocking. They also show, from Figure 3.4f, that HCLC is better scalable over the zero-skew. The mixed-CSR gives the same robustness as the HCLC. However, as mentioned in Section 3.1 {in addition to other drawbacks{ this scheme is CSR-specic and is fundamentally ill-suited for a chip-scale clocking scheme. Moreover, from Figure 3.4f, HCLC achieves the same yield with faster operation than the mixed-CSR scheme. 3.4 Conclusion This chapter presents the Hybrid Clover-Leaves Clocking (HCLC) scheme. The frequency is intrinsically determined within the hybrid loops, making HCLC tree- free, eliminating the area and power overhead and routing complexities associated with clock distribution networks. All the timing constraints are locally dened within a clover, making HCLC robust. This chapter also describes a SystemVer- ilog simulation platform to prove the feasibility of the proposed technique and quantify and compare its robustness and speed to previous methods. As an exam- ple, for a 32-gates CSR, under a model of moderate local and global variations, our experimental results show up to 93% yield improvement at the same cycle time compared to zero-skew tree clocking. 67 0 50 100 Yield (%) ourWork [M=1] Zero [25%] Zero [50%] Zero [100%] Mixed [25%] Mixed [50%] Mixed [100%] 75 85 95 105 115 125 Cycle Time (ps) 3σ=25% 3σ=50% 3σ=100% (a) 8 gates. 
[Figure 3.4: yield (%) versus cycle time (ps) plots comparing our work (HCLC, various M), zero-skew, and mixed-CSR at added margins of 25%, 50%, and 100%. Panels: (a) 8 gates, (b) 16 gates, (c) 32 gates, (d) 64 gates, (e) 128 gates, (f) yield per cycle time at $3\sigma = 100\%$ versus pipeline length.]

Figure 3.4: (a-e) The bars are the yield (left y-axis) at different standard deviations of the non-idealities, while the line is the average cycle time (right y-axis). The figures are for different pipeline lengths, and the percentage next to a design name is the margin added to the clock cycle with respect to the setup constraint. (f) The yield with non-idealities of $3\sigma = 100\%$ divided by the cycle time, versus different pipeline lengths.

Nonetheless, the HCLC cycle time is $O(N_{gates})$ (see (3.7)), which is highly impractical for large-scale VLSI. (HC)$^2$LC in Chapter 5 extends HCLC beyond a CSR and uses it to build a robust clocking technique for generic pipelines, the cycle time of which is independent of the total number of gates.

Chapter 4
Asynchronous Clock Distribution Networks (ACDN)

In this chapter, we present the fundamentals of asynchronous clock distribution networks (ACDN) [52].
In particular, we discuss the timing characteristics of the timing signals arriving at the clock sinks, and introduce the constraints that an asynchronous system must satisfy in order to obtain these characteristics.

4.1 Introduction

Timing of synchronous systems is arguably one of the hardest obstacles to answering the demands for low power consumption and higher operation speeds. The power and performance cost of delivering a clocking signal to every clock sink in the chip is considerable when compared to the cost of performing the logical operations themselves. This is exacerbated by unprecedented levels of uncertainty in state-of-the-art silicon dimensions, in other beyond-CMOS technologies [3], and in SFQ in particular.

As discussed in Section 2.2.3, designing CDNs in synchronous systems has never been a trivial task [51, 58, 59]. Foreseeing the growing uncertainty in circuit parameters, the asynchronous community has offered more robust solutions that are more naturally tolerant to variability [82]. For example, many designers have resorted to globally-asynchronous locally-synchronous systems as a middle-ground compromise [83–85]. Other researchers, assuming utter uncertainty, propose the use of extremely robust delay-insensitive circuits [11, 86, 87]. The work in [88] suggested the use of a ring oscillator (RO) instead of the traditional Phase-Locked Loop (PLL) as a more variation-tolerant clocking scheme in which the RO drives a zero-skew tree to all clock sinks. Argo [85] uses asynchronous on-chip routers organized in a 2-D mesh network. These routers are interfaced to the fully-synchronous processing cores, which are mesochronously clocked, using asynchronous ripple FIFO (first-input first-output) buffers.
In other words, Argo is an example of a work that uses asynchronous structures (the routers of the NoC) to control a synchronously constrained environment (time-division multiplexed network interfaces that are synchronously interfaced to the fully-synchronous cores). In contrast, this thesis proposes the use of a hierarchy of interconnected asynchronous oscillators (see Chapter 5) to provide the clocking signals directly to the relatively large number of clock sinks present in gate-level pipelined superconducting electronics.

Targeting a different compromise between performance and robustness, this chapter provides the theoretical foundation for asynchronous CDNs. In particular, we state the sufficient conditions for an asynchronous system to provide periodic signals with well-defined clock skews for a large number of clock sinks. To illustrate the value of these conditions, this thesis presents a specific hierarchical clocking structure that satisfies these conditions with a single tunable delay (see Chapter 5). The clocking structure benefits from both the natural robustness of asynchronous structures, as they are adaptive to the worst-case delay of any local cycle, and the advantageous performance of synchronous clocking.

We model these systems using timed Marked Graphs (MGs) from Petri Net (PN) theory [28] and provide sufficient conditions for asynchronous structures to provide the clocking signals necessary for the clock sinks in a synchronous system. This is in contrast to the Argo system [85], for example, which interfaces the asynchronous circuits to synchronous cores but does not use them for clocking; instead, it uses a fully-synchronous clock source.

The remainder of this chapter is organized as follows. First, Section 4.2 provides a background on marked graph theory. After that, the problem formulation is stated in Section 4.3.
Section 4.4 recites the definitions and the nomenclature we use in this chapter, and Section 4.5 introduces the ACDN theoretical foundation. Then, the main theoretical work and its derivations are presented in Section 4.6. After that, Section 4.7 compares this work to related previous literature. Finally, Section 4.8 and Section 4.9 provide a discussion and our conclusions, respectively.

4.2 Marked Graph Background

Petri nets (PNs) are a graphical and mathematical modeling tool [28]. They consist of places (depicted as circles and denoted as p), which hold markings or tokens (depicted as black dots inside a place) indicating that the conditions leading to this place were met, and transitions (depicted as bars and denoted as t), which represent actions or events. Arcs are used as interconnections between the places and the transitions to signify relational dependence. PNs can be timed by assigning a certain execution time value to either places or transitions. When execution times are assigned to transitions, a transition is said to be enabled if all of its input places have tokens in them, and after its execution time the transition is said to fire, signaling the completion of that event. Once it has fired, one token per input place is removed and one token is added to every output place. This is known as the firing rule. Marked graphs (MGs) are a subclass of PNs where each place has exactly one input transition and exactly one output transition [28]. Section 4.4 provides a formal definition of those concepts.

Given their properties, timed MGs are well-suited to model the timing of circuits that exhibit no choice and are periodic. Any gate can be modeled as a transition with an associated execution time equal to the gate delay. For SFQ, any interconnect can be modeled as an input place connected to a transition. Figure 4.1 shows some basic asynchronous SFQ cells (see Section 2.6) and their corresponding MG models. The splitter is shown in Figure 4.1a.
A confluence buffer (CF) cannot be modeled as a MG due to its peculiar characteristics. However, the combination of the CF and a start-up GO signal, such as the one employed in Figure 3.3, can be modeled as a MG as illustrated in Figure 4.1b, assuming that this GO signal is the first event to happen in the system and that it happens only once, which is the case in this thesis. The coincidence junction (C-junction) is depicted in Figure 4.1c. Its primed counterpart is illustrated in Figure 4.1d.

Figure 4.1: Basic SFQ cells and their MG models. (a) Splitter. (b) Confluence buffer with GO signal. (c) Coincidence junction. (d) Primed coincidence junction.

4.3 Problem Formulation

In this section, we recall some of the synchronization basics discussed in Section 2.2.1, with more insight on the periodicity that is necessary for this chapter.

In a synchronous system, each clock sink is expected to receive a clocking signal once every time period known as the cycle time [59]. Moreover, the difference in clock signal arrival time between any two clock sinks, i.e. the clock skew, should be well defined [51]. Conventionally, synchronous systems are designed to keep the skew value between any two sinks constant and ideally set to zero, though nothing is special about this particular value [10]. Additionally, the system usually starts with a transitional period that we call the warming-up period [53, 54], until the timing source settles and the system enters a periodic steady state. For any CDN, the desired timing of all clock sinks can be represented as follows

$$\tau_n^{(i)} = T_{up} + T_{skew_n} + i\,T_{sys}, \quad \forall t_n \in \mathcal{T},\ \forall i \geq i_{up} \qquad (4.1)$$

where $\tau_n^{(i)}$ is the time of the $i$th occurrence of the clock signal at the $n$th clock sink $t_n$, and $\mathcal{T}$ is the set of all clock sinks. $T_{sys}$ is the system cycle time and $T_{skew_n}$ is the clock skew value at the clock sink $t_n$ with respect to a common reference. Ignoring variations and asymmetries, this skew should equal zero in zero-skew systems.
$T_{up}$ is the warming-up period, and $i_{up}$ is the number of warming-up occurrences before the system enters a periodic steady state [53, 54].

It should be emphasized that the value of $T_{skew_n}$ should be well defined for every transition. In other words, the clock path resulting in the nominal value of $T_{skew_n}$ should be known during the design phase. This guarantees safe operation without any setup or hold violations, which must be checked during design using some form of static timing analysis (STA) [38]. It should also be noted that the periodic property in (4.1) is a desirable characteristic for any synchronous system. Otherwise, the synchronous datapath has to account for a time-varying clock period, which is undesirable. Interestingly, the authors of [85] were forced to address this problem when designing an asynchronous NoC interfaced to synchronous cores.

4.4 Definitions and Nomenclature

4.4.1 Basic Definitions

For the sake of completeness and nomenclature consistency, we start this section with some basic definitions from timed Petri net and Marked Graph (MG) theory [28]. It should be noted that the work in this thesis exclusively considers circuits modeled as MGs.

Definition 1: Petri Net: The Petri Net (PN) in our manuscript is a 4-tuple, $PN = (\mathcal{P}, \mathcal{T}, \mathcal{F}, M_0)$ [28], where $\mathcal{P}$ is the finite set of places, $\mathcal{T}$ is the finite set of transitions, $\mathcal{F} \subseteq (\mathcal{P} \times \mathcal{T}) \cup (\mathcal{T} \times \mathcal{P})$ is the finite set of arcs describing the flow relation, and $M_0$ is the initial marking. In this chapter, we study timed PNs and choose to associate time delays with transitions. For every $t_n \in \mathcal{T}$, $\delta_n \geq 0$ is used to represent this value. It is worth mentioning that we primarily focus on a deterministic PN where all the delay values are deterministically defined, i.e. not stochastic, but we consider extensions in Section 4.8.

Figure 4.2: An example of a timed MG.

Definition 2: Marked Graph: A marked graph (MG) is a PN such that each place p has exactly one input transition and exactly one output transition [28].
Figure 4.2 shows an example of a timed MG. Note that a live and safe MG (see Definitions 7 and 8) is a strongly connected graph by definition.

Definition 3: Places: $p_n \in \mathcal{P}$ denotes the $n$th place, and $m_n \in M_0$ is the initial number of markings in $p_n$. In Figure 4.2, $p_1$ and $p_8$ are examples of marked places, i.e. $m_1 = m_8 = 1$, while unmarked places have $m = 0$.

Definition 4: Transitions: $t_n \in \mathcal{T}$ denotes the $n$th transition, and $\delta_n$ is the execution time or the delay associated with $t_n$. In Figure 4.2, the delays are shown in red below the associated transitions.

Definition 5: Firing Rule: A transition $t_n$ is enabled if each input place $p_{in}$ of $t_n$, where $p_{in} \in \mathcal{P} \wedge p_{in} t_n \in \mathcal{F}$, has a token. After $\delta_n$, one token is removed from each input place $p_{in}$, and one token is added to every output place $p_{out}$, where $p_{out} \in \mathcal{P} \wedge t_n p_{out} \in \mathcal{F}$ [28]. For example, in Figure 4.2, $t_3$ has one input place, $p_3$, and it has an initial marking. Therefore, at $t = 0$, $t_3$ initiates its firing and consumes the marking of $p_3$. Then, at $t = 2$, $t_3$ fires since $\delta_3 = 2$, and both $p_4$ and $p_5$ become marked with one marking each. After that, both $t_4$ and $t_5$ initiate their firing and fire at $t = 10$ and $t = 4$, respectively.

Definition 6: Transition Firing: $\tau_n^{(i)}$ is the time at which transition $t_n$ initiates its $i$th firing. Following the same example of Figure 4.2, $\tau_3^{(0)} = 0$ and $\tau_4^{(0)} = \tau_5^{(0)} = 2$.

Definition 7: Liveness: A PN is live if it is always possible to ultimately fire any transition, guaranteeing a deadlock-free operation [28]. Formally, a PN is live if $\tau_n^{(i)}$ is defined $\forall i \geq 0, \forall t_n \in \mathcal{T}$.

Definition 8: Safeness: A PN is safe if the number of tokens in each place does not exceed one [28].

Definition 9: Sources: The sources set $\mathcal{S}$ is defined such that a transition $t_s \in \mathcal{S}$ if it is enabled initially (these transitions are called roots in [53]). Note that $\mathcal{S} \subseteq \mathcal{T}$, and we set $\tau_s^{(0)} = 0$ for all $t_s \in \mathcal{S}$. In Figure 4.2, $t_1$ and $t_3$ are the only sources.
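The firing rule and the source definition above can be sketched in a few lines of Python. This is only an illustrative sketch: the graph below is a reconstruction of Figure 4.2 inferred from the worked examples in this chapter (arcs, delays, and the assumption that $p_1$, $p_3$, and $p_8$ are the marked places), not a copy of the figure, and the names `PLACES`, `DELAY`, `initiation_time`, and `sources` are hypothetical.

```python
from functools import lru_cache

# Assumed reconstruction of Figure 4.2 (inferred from the text, not from the figure).
# place -> (input transition, output transition, initial tokens)
PLACES = {
    "p1": ("t6", "t1", 1),
    "p2": ("t1", "t2", 0),
    "p3": ("t2", "t3", 1),
    "p4": ("t3", "t4", 0),
    "p5": ("t3", "t5", 0),
    "p6": ("t5", "t6", 0),
    "p7": ("t4", "t6", 0),
    "p8": ("t7", "t2", 1),
    "p9": ("t2", "t7", 0),
}
# Assumed transition delays, chosen to match the path and circuit delays quoted in the text.
DELAY = {"t1": 1, "t2": 1, "t3": 2, "t4": 8, "t5": 2, "t6": 4, "t7": 9}


@lru_cache(maxsize=None)
def initiation_time(t, i):
    """Time at which transition t initiates its i-th firing (i >= 0).

    A token in an input place is either an initial token (available at time 0)
    or was produced when the preceding transition completed an earlier firing.
    """
    latest = 0
    for src, dst, tokens in PLACES.values():
        if dst != t:
            continue
        k = i - tokens  # which firing of src produces the token used here
        if k >= 0:
            latest = max(latest, initiation_time(src, k) + DELAY[src])
    return latest


def sources():
    """Transitions that are enabled initially: every input place holds a token."""
    result = []
    for t in sorted(DELAY):
        inputs = [m for (_, dst, m) in PLACES.values() if dst == t]
        if inputs and all(m >= 1 for m in inputs):
            result.append(t)
    return result


if __name__ == "__main__":
    print(sources())
    print({t: initiation_time(t, 0) for t in sorted(DELAY)})
```

Under these assumptions, the sketch reproduces the values quoted in Definitions 5, 6, and 9: $t_3$ initiates at time 0, $t_4$ and $t_5$ initiate at time 2 and complete at times 10 and 4, and the only sources are $t_1$ and $t_3$.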
4.4.2 Paths

In this subsection we include some established definitions regarding paths, in addition to presenting the useful concepts of the token limited directed path set and the directed path length.

Definition 10: Directed Path: A directed path [89] is a sequence of alternating transitions and places. $t_1 p_1 t_2 p_2 \cdots t_N$ is a directed path from $t_1$ to $t_N$ if $t_n p_n \in \mathcal{F}$ and $p_n t_{n+1} \in \mathcal{F}$ for $1 \leq n \leq N-1$ [89, 90]. We use the symbol $\sqsubset$, as in $h_1 \sqsubset h_2$, to denote that the path $h_1$ is a section or a sub-sequence of the path $h_2$. For example, in Figure 4.2, let us denote the path $t_2 p_3 t_3$ as $h_1$, and the path $t_1 p_2 t_2 p_9 t_7 p_8 t_2 p_3 t_3$ as $h_2$. Thus, $h_1 \sqsubset h_2$.

Definition 11: Directed Path Delay: We define the delay of a directed path $h$ as the sum of the execution times of the transitions forming the path, excluding the target transition, i.e. it is the delay from the enabling of the start transition until the enabling of the end transition. This can be formalized as follows for a path $h$ from $t_{start}$ to $t_{end}$

$$\delta(h) = \left( \sum_{t_j \in h} \delta_j \right) - \delta_{end} \qquad (4.2)$$

In Figure 4.2, $\delta(h_1) = 1$ and $\delta(h_2) = 12$.

Definition 12: Directed Path Set: We define $H(t_u, t_v)$ as the set of all the possible directed paths from $t_u$ to $t_v$. This set can be infinite for a MG due to the presence of loops. For example, in Figure 4.2, $H(t_2, t_7) = \{\, t_2 p_9 t_7,\ t_2 p_9 t_7 p_8 t_2 p_9 t_7,\ \ldots,\ t_2 p_3 t_3 p_5 t_5 p_6 t_6 p_1 t_1 p_2 t_2 p_9 t_7,\ \ldots,\ t_2 p_3 t_3 p_4 t_4 p_7 t_6 p_1 t_1 p_2 t_2 p_9 t_7 p_8 t_2 p_9 t_7,\ \ldots \,\}$.

Definition 13: Directed Path Synchronic Distance: Similar in concept to the synchronic distance between two transitions in [91], we define the synchronic distance of a directed path $h \in H(t_u, t_v)$ from $t_u$ to $t_v$ as the number of times that one transition can fire without firing the other according to that path. Formally, we write it as

$$M(h) = \sum_{p_j \in h} m_j \qquad (4.3)$$

In Figure 4.2, $M(h_1) = 1$ and $M(h_2) = 2$.
Definition 14: Token Limited Directed Path Set: We define $\hat{H}(t_u, t_v; m)$ as the set of all the directed paths from a transition $t_u$ to a transition $t_v$ which have a directed path synchronic distance less than or equal to the integer $m$. This can be formalized as follows

$$\hat{H}(t_u, t_v; m) = \{ h : h \in H(t_u, t_v) \wedge M(h) \leq m \} \qquad (4.4)$$

Using the same example as before, $\hat{H}(t_2, t_7; 0) = \{ t_2 p_9 t_7 \}$, $\hat{H}(t_2, t_7; 1) = \{ t_2 p_9 t_7,\ t_2 p_9 t_7 p_8 t_2 p_9 t_7 \}$, and $\hat{H}(t_2, t_7; 2) = \{ t_2 p_9 t_7,\ t_2 p_9 t_7 p_8 t_2 p_9 t_7,\ t_2 p_9 t_7 p_8 t_2 p_9 t_7 p_8 t_2 p_9 t_7,\ t_2 p_3 t_3 p_5 t_5 p_6 t_6 p_1 t_1 p_2 t_2 p_9 t_7,\ t_2 p_3 t_3 p_4 t_4 p_7 t_6 p_1 t_1 p_2 t_2 p_9 t_7 \}$.

Definition 15: Directed Path Length: We define the length of a directed path $h \in H(t_u, t_v)$ as the number of transitions in that path excluding the end transition, $t_v$. We denote the path length as $|h|$. We also use $H_l(t_u, t_v)$ to denote the set of all the possible directed paths from $t_u$ to $t_v$ with length at most $l$. Formally, it can be written as

$$H_l(t_u, t_v) \subseteq H(t_u, t_v), \quad H_l = \{ h : h \in H(t_u, t_v) \wedge |h| \leq l \} \qquad (4.5)$$

Such a set can also be further constrained with a token bound as follows

$$\hat{H}_l(t_u, t_v; m) = \{ h : h \in H_l(t_u, t_v) \wedge M(h) \leq m \} \qquad (4.6)$$

Using the same example, $H_1(t_2, t_7) = \{ t_2 p_9 t_7 \}$, $H_3(t_2, t_7) = \{ t_2 p_9 t_7,\ t_2 p_9 t_7 p_8 t_2 p_9 t_7 \}$, and $\hat{H}_3(t_2, t_7; 0) = \{ t_2 p_9 t_7 \}$.

Definition 16: Loop: A directed path $t_1 p_1 t_2 p_2 \cdots t_N$ is a loop if $t_1 = t_N$. From Figure 4.2, the following paths are loops: $t_2 p_9 t_7 p_8 t_2$, $t_2 p_9 t_7 p_8 t_2 p_9 t_7 p_8 t_2$, $t_1 p_2 t_2 p_3 t_3 p_5 t_5 p_6 t_6 p_1 t_1$, and $t_2 p_9 t_7 p_8 t_2 p_3 t_3 p_4 t_4 p_7 t_6 p_1 t_1 p_2 t_2$.

Definition 17: Simple Directed Path: A simple directed path is a directed path from $t_1$ to $t_N$ where all the transitions are distinct [89, 90]. From Figure 4.2, the path $t_2 p_9 t_7$ is a simple directed path between $t_2$ and $t_7$, while the path $t_2 p_9 t_7 p_8 t_2 p_9 t_7$ is not.
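The path quantities of this subsection (path delay, synchronic distance, and the token and length limited path sets) can be illustrated with the following Python sketch. As before, the graph is an assumed reconstruction of Figure 4.2 inferred from the worked examples (with $p_1$, $p_3$, and $p_8$ taken as the marked places), and the helper names `path_delay`, `path_tokens`, and `enumerate_paths` are hypothetical.

```python
# Assumed reconstruction of Figure 4.2 (inferred from the text, not from the figure).
# place -> (input transition, output transition, initial tokens)
PLACES = {
    "p1": ("t6", "t1", 1),
    "p2": ("t1", "t2", 0),
    "p3": ("t2", "t3", 1),
    "p4": ("t3", "t4", 0),
    "p5": ("t3", "t5", 0),
    "p6": ("t5", "t6", 0),
    "p7": ("t4", "t6", 0),
    "p8": ("t7", "t2", 1),
    "p9": ("t2", "t7", 0),
}
DELAY = {"t1": 1, "t2": 1, "t3": 2, "t4": 8, "t5": 2, "t6": 4, "t7": 9}


def path_delay(path):
    """Sum of the transition delays along the path, excluding the end transition."""
    transitions = path[0::2]
    return sum(DELAY[t] for t in transitions[:-1])


def path_tokens(path):
    """Synchronic distance M(h): total initial tokens in the places of the path."""
    return sum(PLACES[p][2] for p in path[1::2])


def enumerate_paths(start, end, max_len, max_tokens=float("inf")):
    """All directed paths from start to end with |h| <= max_len and M(h) <= max_tokens."""
    found = []

    def extend(path, length, tokens):
        current = path[-1]
        if current == end and length > 0:
            found.append(path)
        if length == max_len:
            return
        for place, (src, dst, m) in PLACES.items():
            if src == current and tokens + m <= max_tokens:
                extend(path + [place, dst], length + 1, tokens + m)

    extend([start], 0, 0)
    return found


if __name__ == "__main__":
    h1 = ["t2", "p3", "t3"]
    h2 = ["t1", "p2", "t2", "p9", "t7", "p8", "t2", "p3", "t3"]
    print(path_delay(h1), path_tokens(h1))
    print(path_delay(h2), path_tokens(h2))
    for h in enumerate_paths("t2", "t7", max_len=3):
        print(" ".join(h))
```

With these assumed delays and markings, the helpers give $\delta(h_1) = 1$, $\delta(h_2) = 12$, $M(h_1) = 1$, and $M(h_2) = 2$, and the enumeration returns the two paths listed for $H_3(t_2, t_7)$. A length bound is always required in the sketch because the unbounded set can be infinite.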
4.4.3 Performance

In this section we include some established definitions regarding circuits and cycle time [28, 53, 54, 90, 92, 93], in addition to presenting the concept of the firing period, which conveys the intuition of a strictly periodic clock signal in synchronous systems (see Section 4.3).

Definition 18: Directed Circuit: A sequence $t_1 p_1 t_2 p_2 \cdots t_N$ is a directed circuit, $C$, if the path from $t_1$ to $p_{N-1}$ is a simple directed path, $p_{N-1} t_1 \in \mathcal{F}$, and $t_1 = t_N$ [92]. We let the set $\mathcal{C}$ denote the set of all the directed circuits in the PN. Two directed circuits that contain the same set of places and transitions are considered equivalent. We use $M(C_k)$ to denote the total number of tokens in the places of the $k$th directed circuit, $C_k$, and $\delta(C_k)$ to denote the sum of the execution times of the transitions forming $C_k$. We also use $\mathcal{C}(t_n)$ to denote the set of circuits containing the transition $t_n$. These can be formalized as follows for a circuit $C_k$ that starts and ends at transition $t_u$.

$$M(C_k) = \sum_{p_j \in C_k} m_j, \quad C_k \in \mathcal{C}$$
$$\delta(C_k) = \left( \sum_{t_j \in C_k} \delta_j \right) - \delta_u, \quad C_k \in \mathcal{C}$$
$$\mathcal{C}(t_n) = \{ C : C \in \mathcal{C} \wedge t_n \in C \} \qquad (4.7)$$

From Figure 4.2, the set $\mathcal{C} = \{C_1, C_2, C_3\}$, where $C_1 = t_2 p_9 t_7 p_8 t_2$, $C_2 = t_1 p_2 t_2 p_3 t_3 p_5 t_5 p_6 t_6 p_1 t_1$, and $C_3 = t_1 p_2 t_2 p_3 t_3 p_4 t_4 p_7 t_6 p_1 t_1$. For those circuits, $M(C_1) = 1$, $M(C_2) = M(C_3) = 2$, $\delta(C_1) = 10$, $\delta(C_2) = 10$, and $\delta(C_3) = 16$. Also, $\mathcal{C}(t_1) = \{C_2, C_3\}$, $\mathcal{C}(t_2) = \mathcal{C}$, and $\mathcal{C}(t_7) = \{C_1\}$.

Definition 19: PN Cycle Time: A PN has a PN cycle time, which is defined in [28] as the time to complete a firing sequence leading back to the starting marking after firing each transition at least once. There is a crucial distinction that must be made here, since the term cycle time is used differently across many papers [28, 53, 54, 90, 92, 93]. We use the complete term PN cycle time for the property related to the initial marking and the system periodicity [54], as previously defined.
We use cycle time for the other definition, which is oriented towards the periodicity of transition firing.

Definition 20: Cycle Time: We define the cycle time as the mean time to complete a firing sequence leading to the firing of each transition exactly once. It can be formalized as follows [92].

$$\hat{T}_n = \lim_{i \to \infty} \frac{\tau_n^{(i)}}{i}, \quad \forall t_n \in \mathcal{T} \qquad (4.8)$$

Definition 21: Firing Period: We define the firing period of a transition as the exact cycle time, i.e. the time separation between the $i$th transition firing and the $(i+1)$th firing.

$$\Delta_n^{(i)} = \tau_n^{(i+1)} - \tau_n^{(i)}, \quad \forall t_n \in \mathcal{T},\ \forall i \geq 0 \qquad (4.9)$$

4.5 ACDN Theoretical Foundation

In this thesis, we provide the theoretical foundation for asynchronous clock distribution networks (ACDN). It should be noted that we choose timed MGs to model the CDNs, as it is natural for clock delay elements to be modeled as transitions with a specific time delay and no choice. Moreover, clock delay elements do not have memories that would allow more than one token per place, which guarantees safeness. We use deterministic time delays for simplicity and in order to establish the main timing characteristics for a theoretical foundation. Practically, the time delays are random variables due to jitter and other uncertainties, and the timing characteristics possess a stochastic aspect as well. This is discussed in Section 4.8.

Definition 22: ACDNs are systems which can be modeled as timed MGs whose transition firings follow the property in (4.1). ACDNs can thus be used as a CDN for a synchronous system.

In particular, we provide sufficient conditions for an asynchronous system to be an ACDN. Namely, if its MG model
• does not have always-enabled transitions, i.e. has no external inputs,
• is live and safe,
• has a single token per directed circuit, and
• has a single source, and this source belongs to a critical circuit (see Definition 26),
then the system is an ACDN.
That is, all the transitions in this system maintain the property in (4.1) and can thus be used to provide the clocking signals for a synchronous system without the need for an external input clock. Specifically, the first condition excludes systems with externally generated clock signals. The second ensures the periodicity of (4.1). The third ensures that the clock period does not change from one firing occurrence to another, i.e. the system periodicity equals one [54] (see Section 4.7.3). The last condition guarantees that the value of $T_{up}$ equals zero and that the skew value is well defined for every clock sink. Section 4.8 discusses these statements in more detail and connects them to the theorems presented in this chapter.

In an attempt to clarify the intuition of our work and to further motivate the significance of those constraints, Figure 4.3 shows a MG example which violates two of the constraints: it has a directed circuit with more than one token and it has more than one source. The table next to the diagram shows the first six transition firings. The transition firings cannot be described using (4.1) because the firing period oscillates. Focusing on $t_1$ for instance, substituting into (4.1) from the table, we get $\tau_1^{(i)} = 3i$. However, this only works for even values of $i$, because its period oscillates between 2 and 4. Such period oscillation is an undesirable property for a clocking signal of a synchronous system, as previously discussed in Section 4.3.

Figure 4.3: An example for a MG with two sources and more than one token per circuit. $T_{sys} = \delta(C)/M(C) = 6/2 = 3$.

4.6 Theory

In this section we provide the mathematical derivations that prove the theoretical foundation for ACDNs modeled as MGs. We prove that the previously mentioned conditions are sufficient for an asynchronous system to guarantee the property in (4.1), enabling their use as a CDN for synchronous circuits. The detailed proofs for our theorems are included in Appendix A.
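The period oscillation described in Section 4.5 can be reproduced with a small numerical sketch. The ring below is a hypothetical example and not the actual graph of Figure 4.3: it is a single directed circuit of three transitions with an assumed total delay of 6 and two initial tokens (hence two sources), with the individual delays chosen only so that the mean period is $6/2 = 3$. The names `PLACES`, `DELAY`, and `initiation_time` are illustrative.

```python
from functools import lru_cache

# Hypothetical three-transition ring with two tokens (not taken from Figure 4.3).
# place -> (input transition, output transition, initial tokens)
PLACES = {
    "p1": ("t3", "t1", 1),
    "p2": ("t1", "t2", 1),
    "p3": ("t2", "t3", 0),
}
# Assumed delays; they sum to 6 so that delta(C) / M(C) = 6 / 2 = 3.
DELAY = {"t1": 2, "t2": 1, "t3": 3}


@lru_cache(maxsize=None)
def initiation_time(t, i):
    """Time at which transition t initiates its i-th firing (i >= 0)."""
    latest = 0
    for src, dst, tokens in PLACES.values():
        if dst != t:
            continue
        k = i - tokens  # firing of src that produces the token consumed here
        if k >= 0:
            latest = max(latest, initiation_time(src, k) + DELAY[src])
    return latest


# Firing times of t1 and the separation between successive firings.
times = [initiation_time("t1", i) for i in range(6)]
periods = [b - a for a, b in zip(times, times[1:])]

if __name__ == "__main__":
    print(times)    # [0, 4, 6, 10, 12, 16]
    print(periods)  # [4, 2, 4, 2, 4]
```

In this sketch the firing times of $t_1$ equal $3i$ only for even $i$, and the firing period alternates between 4 and 2 even though the mean period is 3, which is the behavior the single-token-per-circuit and single-source constraints are meant to rule out.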
4.6.1 Execution Model

In this section, we apply the firing rule in order to formalize the description of any transition firing. In other words, for any transition $t_n$, we introduce an equation to formalize any transition occurrence $\tau_n^{(i)}$ for any arbitrary PN.

Definition 23: Execution Model: In a PN, $\forall t_v \in \mathcal{T}, \forall i \geq 0$

$$\tau_v^{(i)} = \max_{t_u \in \mathcal{T}} \ \max_{h \in \hat{H}_1(t_u, t_v; i)} \left( \tau_u^{(i - M(h))} + \delta_u \right) \qquad (4.10)$$

As in [53], we define the execution of the PN as the consistent assignment of time values to the transition firings. Using our previous definitions, equation (4.10) is a direct translation of equation (1) in [53], which is a direct application of the firing rule of PNs [28]. This formalization follows the same conceptual understanding as the process graph unfolding [53] and the unrolling of repetitive event-rule systems [54] (see Section 4.7.3).

In simpler terms, equation (4.10) means that the time of execution of the $i$th occurrence of the transition $t_v$ depends on the firing of the transitions which have one place separating them from $t_v$. That is, the $i$th occurrence of $t_v$ has to come after the $(i - M(h))$th occurrence of the latest transition to fire among those transitions. Since the path $h$ here is limited to unit length, $M(h)$ equals either zero or one, depending on the initial marking of the place that separates $t_u$ and $t_v$ in $h$.

Theorem 1: In a PN, $\forall t_v \in \mathcal{T}, \forall i \geq 0$

$$\tau_v^{(i)} = \max_{t_u \in \mathcal{T}} \ \max_{h \in \hat{H}(t_u, t_v; i)} \left( \tau_u^{(i - M(h))} + \delta(h) \right) \qquad (4.11)$$

That is, the execution model can be generalized for any path with an unlimited number of tokens. The proof is in Appendix A.1.

Definition 24: For the $i$th firing of any transition $t_v \in \mathcal{T}$, $\tau_v^{(i)}$, we denote any path that belongs to the arg max of the inner maximum of the right-hand side of (4.11) as a critical path for the firing $\tau_v^{(i)}$.

4.6.2 Cycle Time

In this section we describe the periodic behavior of the system under the constraint of having a single token per directed circuit.

Definition 25: MG-L is a live and safe MG [28].
Lemma 1: In a MG, the number of tokens in a directed circuit remains the same after any firing sequence.

Same as Theorem 1 in [90].

Lemma 2: In a MG-L,

$$\hat{T}_u = T_{sys}, \quad \forall t_u \in \mathcal{T} \qquad (4.12)$$

Same as Theorem 2 in [90]: all transitions in a live and safe MG have the same cycle time, $T_{sys}$.

Lemma 3: In a MG-L,

$$T_{sys} \geq \max_{C_k \in \mathcal{C}} \frac{\delta(C_k)}{M(C_k)} = \delta_{MAX} \qquad (4.13)$$

The lower bound on $T_{sys}$ is well established to be the maximum, over every circuit in the net, of the circuit delay divided by its number of tokens [28, 53, 54]. In the example of Figure 4.2, $\delta_{MAX} = \max(10, 10/2, 16/2) = 10$.

Definition 26: We denote any circuit whose $\delta/M$ value equals $\delta_{MAX}$ as a critical circuit.

Definition 27: MG-T is a MG with exactly one token in every directed circuit. Formally, a MG is MG-T if

$$M(C) = 1, \quad \forall C \in \mathcal{C} \qquad (4.14)$$

Theorem 2: In a MG-LT, $\forall t_u \in \mathcal{T}$, if $\mathcal{C}(t_u) \cap \arg\max_{C_k \in \mathcal{C}} \frac{\delta(C_k)}{M(C_k)} \neq \emptyset$, then

$$\tau_u^{(i)} = \tau_u^{(i-m)} + m\,\delta_{MAX}, \quad \forall i \geq m \geq 1 \qquad (4.15)$$

That is, given a MG-LT, if a transition $t_u$ is part of a critical circuit, then the $i$th firing of $t_u$ occurs $m$ times $\delta_{MAX}$ after its $(i-m)$th firing. The proof is in Appendix A.2.

Violating Example: As discussed in Section 4.5, in an attempt to clarify the intuition of Theorem 2, Figure 4.3 shows a MG example which violates the T constraint: it has a directed circuit with more than one token. It has only one directed circuit, and hence every transition is part of the critical circuit. The transition firings cannot be described using (4.15) because $\delta_{MAX} = 3$, but as illustrated on the right side of Figure 4.3, the time period between successive firings of all transitions in the MG ($t_1$, $t_2$, and $t_3$) oscillates between 2 and 4.

Corollary 1: In a MG-LT, $\forall t_s \in \mathcal{S}$, if $\mathcal{C}(t_s) \cap \arg\max_{C_k \in \mathcal{C}} \frac{\delta(C_k)}{M(C_k)} \neq \emptyset$, then

$$\Delta_s^{(i)} = \delta_{MAX}, \quad \forall i \geq 0, \quad \text{and thus} \quad \hat{T}_n = T_{sys} = \delta_{MAX}, \quad \forall t_n \in \mathcal{T} \qquad (4.16)$$

Proof: It follows directly from Corollary 4 (see Appendix A.2) and Lemma 2.
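The bound of Lemma 3 can be checked numerically for the example of Figure 4.2 by enumerating the directed circuits and taking the largest delay-to-token ratio. The sketch below assumes the same reconstruction of Figure 4.2 used for illustration in this chapter (arcs, delays, and markings inferred from the worked examples, with $p_1$, $p_3$, and $p_8$ marked); the function names `directed_circuits` and `circuit_stats` are hypothetical.

```python
from fractions import Fraction

# Assumed reconstruction of Figure 4.2 (inferred from the text, not from the figure).
# place -> (input transition, output transition, initial tokens)
PLACES = {
    "p1": ("t6", "t1", 1),
    "p2": ("t1", "t2", 0),
    "p3": ("t2", "t3", 1),
    "p4": ("t3", "t4", 0),
    "p5": ("t3", "t5", 0),
    "p6": ("t5", "t6", 0),
    "p7": ("t4", "t6", 0),
    "p8": ("t7", "t2", 1),
    "p9": ("t2", "t7", 0),
}
DELAY = {"t1": 1, "t2": 1, "t3": 2, "t4": 8, "t5": 2, "t6": 4, "t7": 9}


def directed_circuits():
    """Enumerate each directed circuit once, as the list of places it traverses.

    Every circuit is reported only from its lexicographically smallest
    transition, so rotations of the same circuit are not duplicated.
    """
    circuits = []

    def walk(start, current, visited, places):
        for place, (src, dst, _) in PLACES.items():
            if src != current:
                continue
            if dst == start:
                circuits.append(places + [place])
            elif dst not in visited and dst > start:
                walk(start, dst, visited | {dst}, places + [place])

    for t in sorted(DELAY):
        walk(t, t, {t}, [])
    return circuits


def circuit_stats(circuit):
    """Return (delta(C), M(C)) for a circuit given as a list of places."""
    delay = sum(DELAY[PLACES[p][0]] for p in circuit)  # each transition counted once
    tokens = sum(PLACES[p][2] for p in circuit)
    return delay, tokens


stats = [circuit_stats(c) for c in directed_circuits()]
delta_max = max(Fraction(d, m) for d, m in stats)

if __name__ == "__main__":
    print(sorted(stats))
    print(delta_max)
```

Under these assumptions the three circuits have (delay, tokens) pairs of (10, 1), (10, 2), and (16, 2), which matches the values quoted for $C_1$, $C_2$, and $C_3$ and gives a maximum ratio of 10. The brute-force enumeration is only meant for small illustrative graphs.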
4.6.3 Firing Period

In this section, we formalize the theorem that, under the constraints stated in Section 4.5, the asynchronous system satisfies the timing property in (4.1) and can thus be used as a CDN for a synchronous system.

Definition 28: MG-S is a MG with a single source. Formally, a MG is MG-S if

$$|\mathcal{S}| = 1, \quad \text{s.t.} \ \mathcal{S} = \{t_s\} \qquad (4.17)$$

Definition 29: MG-SC is a MG-S whose source belongs to a critical circuit. Formally, a MG-S is MG-SC if

$$\mathcal{C}(t_s) \cap \arg\max_{C_k \in \mathcal{C}} \frac{\delta(C_k)}{M(C_k)} \neq \emptyset \qquad (4.18)$$

Theorem 3: In a MG-LTSC, $\forall t_n \in \mathcal{T}, \forall i \geq 0$,

$$\tau_n^{(i)} = \max_{h \in \hat{H}(t_s, t_n; 0)} \delta(h) + i\,T_{sys} \qquad (4.19)$$

That is, given a MG-LTSC, the $i$th firing of any transition $t_n$ is periodic with a period of $T_{sys}$ and has a constant delay shift equal to the delay of the longest zero-token path from the source to $t_n$. The proof is in Appendix A.3.

Corollary 2: In a MG-LTSC,

$$\Delta_u^{(i)} = T_{sys}, \quad \forall i \geq 0,\ \forall t_u \in \mathcal{T} \qquad (4.20)$$

Proof: It follows directly from Theorem 3.

Violating Examples:

1. The example of Figure 4.3, which is discussed after Theorem 2, violates both the T and the S constraints: it has a directed circuit with more than one token and it has more than one source. As for Theorem 2, since the firing period oscillates, the transition firings cannot be described using (4.19). It should be observed here that the difference between Theorems 2 and 3 is that the former defines the periodicity (the $i\,T_{sys}$ part of (4.1)), whereas the latter, which possesses the same periodicity, defines the constant delay shift ($T_{skew_n}$ in (4.1)) that is specific to every transition in the graph.

2. In an attempt to further clarify the intuition of Theorem 3, Figure 4.4 shows a MG example which violates the SC constraint: its source does not belong to a critical circuit. The source belongs to only one directed circuit, which has $\delta/M = 2$, while $T_{sys} = 3$. This causes a warming-up period before the system reaches a periodic steady state. However, this is not the main issue, since most clocking systems have warming-up periods.
The crucial point is that the MG does not satisfy (4.1) or (4.19) because the source is not the main controlling point and does not control the entire net. In other words, the constant delay shift in the periodic equation ($T_{skew_n}$ in (4.1)) is not well defined. In Theorem 3, the SC property forces the $T_{skew_n}$ value to be the delay of the zero-token path from the source. On the other hand, when SC is violated as in the example of Figure 4.4, this skew value is instead determined by the critical circuit, which makes it impossible to express the firing of every transition using (4.1) with a well-defined notion of $T_{skew_n}$. We hereby substitute into (4.19) using the example to further elaborate on that point. The transition firings would be written as follows

$$\tau_1^{(i)} = \max_{h \in \hat{H}(t_1, t_1; 0)} \delta(h) + 3i = 3i \neq 1 + 3(i-1), \quad \forall i \geq 2$$
$$\tau_2^{(i)} = \delta(t_1 p_2 t_2) + 3i = 1 + 3i \neq 3i, \quad \forall i \geq 1$$
$$\tau_3^{(i)} = \delta(t_1 p_2 t_2 p_3 t_3) + 3i = 2 + 3i, \quad \forall i \geq 0$$
$$\tau_4^{(i)} = \delta(t_1 p_2 t_2 p_3 t_3 p_6 t_4) + 3i = 3 + 3i, \quad \forall i \geq 0 \qquad (4.21)$$

We notice that the equations are wrong for $t_1$ and $t_2$ (the correct formula is written after the $\neq$ symbol). Both $t_3$ and $t_4$ are correct because they do belong to a critical circuit; therefore the source only has the power of initializing their first firing, and after that they control their own firings as in (4.15) (see Theorem 2).

Figure 4.4: An example for a MG whose source does not belong to a critical circuit. $T_{sys} = \max_{C \in \mathcal{C}} \delta(C)/M(C) = 3/1 = 3$.

4.6.4 Uncertainty Condition

In this section, we assume a more uncertain case, where we cannot know the value of $\delta_{MAX}$ a priori. Thus, the fourth constraint (see Section 4.5) is violated. We show that by probing only a single transition, such a system can be readjusted to be used as an ACDN, as discussed in Section 4.8. It is worth mentioning that this is the main motivation behind the work done in this thesis: the case of uncertainty in systems with a high level of variability. This section represents the most practical case.
Corollary 3: In a MG-LTS, $\forall t_u \in \mathcal{T}$, if $\mathcal{C}(t_u) \cap \arg\max_{C_k \in \mathcal{C}} \frac{\delta(C_k)}{M(C_k)} \neq \emptyset$, then

$$\Delta_u^{(i)} = \delta_{MAX}, \quad \forall i \geq 0, \quad \text{and thus} \quad \hat{T}_u = T_{sys} = \delta_{MAX} \qquad (4.22)$$

Proof: It follows directly from Lemma 9 (see Appendix A.4) and Lemma 2.

Theorem 4: In a MG-LTS, if $\mathcal{C}(t_s) \cap \arg\max_{C_k \in \mathcal{C}} \frac{\delta(C_k)}{M(C_k)} = \emptyset$, then $\exists i^\star$, $\exists t_n \in C_n \in \mathcal{C}(t_s)$, and $\exists t_u \in \mathcal{T}$, $t_u \neq t_s$, such that

$$\tau_s^{(i^\star)} + \max_{h \in \hat{H}(t_s, t_n; 0)} \left( \delta(h) \right) < \tau_u^{(i^\star - M(h_{u \to n}))} + \delta(h_{u \to n}) \qquad (4.23)$$

That is, given a MG-LTS, if $t_s$ does not belong to a critical circuit, then there exists a certain occurrence $i^\star$ at which the sum of the $i^\star$th firing time of $t_s$ and the maximum token-less path delay between $t_s$ and $t_n$, where $t_n$ is a transition that shares a directed circuit with $t_s$, is less than the sum of the $(i^\star - M(h_{u \to n}))$th firing time of $t_u$, where $t_u$ is a transition different from $t_s$, and the delay of a certain path $h_{u \to n}$ from $t_u$ to $t_n$. The proof is in Appendix A.4.

In simpler words, if a MG satisfies the conditions mentioned above, there exists a period of time (i.e., after $i^\star$ occurrences) after which it is sufficient to probe the inputs of a single transition, $t_n$, in order to verify that the source transition $t_s$ does not belong to a critical circuit, and thus fix this situation by increasing the delay of a circuit containing $t_s$. Otherwise, the critical circuit will be determined by some other transition $t_u$, which does not share a circuit with $t_s$.

It is worth mentioning that in the case of stochastic delay values (see Definition 1 and Section 4.8), the conditions above will still hold, and the sequence of probing and increasing the delay will force $t_s$ to eventually be in a critical circuit.

Clarifying Example: In an attempt to clarify the intuition of Theorem 4, Figure 4.4 shows a MG example which violates the SC constraint, as discussed in Section 4.6.3. $t_n$ in the theorem has to be $t_2$ because it is the only transition that shares a circuit with $t_1$, the source.
$t_u$ can be either $t_3$ or $t_4$. In the case of $t_3$, for example, the least-tokens path between $t_3$ and $t_2$ is the path $t_3 p_4 t_2$, and the longest zero-tokens path between $t_1$ and $t_2$ is $t_1 p_2 t_2$. Following this example, we can substitute those into (4.23) as follows

$$\tau_1^{(i^\star)} + 1 < \tau_3^{(i^\star - 1)} + 1 \tag{4.24}$$

From the table with the transition firings, we can see that the previous inequality is valid for $i^\star \geq 2$.

4.7 Comparison to Previous Work

This section situates our contribution within the relevant previous literature.

4.7.1 Circuits and Systems Literature

First, any fully synchronous clocked system using an external clock source, such as a crystal oscillator in the case of PLL clocked systems, cannot be considered as an ACDN. Second, delay insensitive circuits [86], including the ones which use dual-rail logic [11, 87], though they are indeed asynchronous, do not have any synchronous clock sinks because the logic is intrinsically embedded into the datapath. The property of (4.1) is not guaranteed and they are generally not ACDNs. Third, systems that apply a GALS approach [83–85, 94, 95], being either mesochronous or plesiochronous, and whether they use pausable clock interfaces, FIFO interfaces, or typical synchronizers, all require one or more external clock inputs. Additionally, due to metastability issues, they cannot be modeled using deterministic delays [82]. Thus, they are also outside the scope of ACDNs.

We hereby mention a couple of systems which do fit into the ACDN category. The first is the work in [88], where the authors proposed the use of a RO instead of a PLL. With the clock being generated on-chip, they showed that the clock variations will be correlated to the datapath variations. Then they used a conventional clock tree to deliver the RO's output to the clock sinks.
Regarding the second system, which is our (HC)$^2$LC in Chapter 5, it benefits from every property of an ACDN. Figure 4.5 illustrates the MG model of a three-level hierarchy of (HC)$^2$LC using small numbers of links in every chain (two links in L2 and three links in L1). The red loop is the top loop containing the single source $t_s$, which is shaded; a programmable delay line should be inserted into this loop. The red transition is the transition that could be probed (see Section 4.8 and Theorem 4). The blue transitions could be considered as the synchronous clock sinks, though they are actually clovers of many gates similar to the structure of HCLC in Section 3.2. This graph is an example of a MG that satisfies the design constraints of Section 4.5 and can thus be used as an ACDN (see Section 5.3.1 for more details).

Figure 4.5: The MG model of a simple hierarchy of the (HC)$^2$LC clocking technique in Chapter 5. LX stands for the Xth level.

4.7.2 Theoretical Literature

First, the authors in [90, 92] have reached similar results to Theorem 3 in this thesis. However, they have not used the same conditions we specified. We henceforth show that they have under-constrained their theorems. In particular, both Theorem 3 in [90] and Theorem 1 in [92] state that transitions of timed MGs satisfy

$$\tau_n^{(i)} = \tau_n^{(0)} + i\,T_{sys}, \quad \forall i \geq 0 \tag{4.25}$$

which is identical to the result of our Theorem 3 but without the theorem's constraints. Figure 4.3 and Figure 4.4 show two counter-examples that support our claim. The tables next to the diagrams show the first six firings of each transition. Figure 4.3 shows a MG that violates both the single source and the single token per directed circuit constraints. Similar to the discussion in Section 4.5, the period of the transition firings of the MG in Figure 4.3 oscillates, and therefore their behavior cannot be described using (4.25).
Regarding the example in Figure 4.4, which is discussed in Section 4.6.4, that MG has indeed a single token per circuit and a single source. However, the source does not belong to the critical circuit. This causes, first, a warm-up period before the system reaches a periodic steady state, which already violates (4.25), and second, a constant delay shift that varies from one transition to another based on how they are connected to the critical circuit. This is discussed in Section 4.6.4, and it is more complicated than what the simple theorem of (4.25) can describe.

On the other hand, the works in [53] and [54] (see Section 4.7.3) have well understood the non-trivial nature of asynchronous structures, especially regarding their periodicity. Using their respective mathematical approaches, they have deciphered the periodicity problem and provided methods and algorithms for finding the timing parameters and describing the timing behavior of asynchronous structures while making the fewest assumptions. Nevertheless, since they were analyzing arbitrary MGs, their contributions were generic and theoretical, and they did not delve into conditions that guarantee more restrictive forms of periodicity. As electrical engineers, we focus in this thesis on providing sufficient constraints for an asynchronous system that guarantee the property in (4.1) and hence the possibility of using it for timing synchronous circuits. Our mathematical derivations leverage the intuition of the practical concepts which motivated the math in the first place, thus facilitating the connection between theory and practice. A more detailed insight regarding these papers is presented in the following section.
In particular, the key contribution of this chapter is that we state sufficient conditions that force the system's periodicity to strictly equal one as in (4.1), the warm-up period $T_{up}$ to equal zero, and, more crucially, the skew value of every transition to be strictly defined with respect to a certain reference. Instead of analyzing arbitrary networks and building algorithms to describe the behavior of a given graph as in [53, 54], our work provides design constraints for asynchronous structures that can be used as CDNs for synchronous systems. In other words, our contribution is in graph design rather than graph analysis.

4.7.3 Insight From [53, 54]

We hereby provide some insight from the works in [53, 54] that is worth mentioning since it is tightly related to the work in this chapter.

The work of [53]

They modeled concurrent systems using a process graph where the vertices represent the events and the edges represent the rule templates (i.e., the dependencies and relations between events). They restricted their analysis to strongly connected graphs (note that live and safe MGs are strongly connected by definition [28]). In order to describe the periodicity and repetition of events, they used an occurrence index (similar to $(i)$ in this chapter) and they unfolded the process graph to obtain an infinite directed graph that describes every iteration in an explicit manner. They used the concept of an execution model (see Section 4.6.1) to represent the consistent assignment of time values to event occurrences. The main contribution of the work in [53] was to develop an algorithm that can be used to find the maximum time separation between two events in a finite portion of the unfolded process graph, given that each event is bounded by a minimum delay value and a maximum one.
For the sake of this chapter's scope, we recite the following conclusion from the work of [53]: For a strongly connected process graph, there exist integers $k_I$ and $\varepsilon_I$ such that for all $k \geq k_I + \beta$, and any event (vertex) $v$ in the graph

$$m_v(k + \varepsilon_I) - m_v(k) = r\,\varepsilon_I \tag{4.26}$$

where $m_v(k)$ is the longest time path (using the lower delay bounds) between the $k$th occurrence of the event $v$ and the $\beta$th occurrence of another event $u$, and $r$ is the maximum ratio cycle (the counterpart of $\lambda_{MAX}$ in this chapter). They defined $k_I$ as the number of unfoldings of the process graph backwards relative to the $\beta$th occurrence of the event $u$, and $\varepsilon_I$ as the occurrence period of this repetition. Those integers, $k_I$ and $\varepsilon_I$, are values specific to a particular process graph.

In simpler words, this means that after a certain warm-up or uncertainty period, which ends after a certain number of occurrences, the time separation of events (TSE) between two events reaches a periodic steady state. It is worth mentioning that for a MG-LTSC (see Theorem 3), our results prove that those integers, $k_I$ and $\varepsilon_I$, are equal to zero and one respectively.

The work of [54]

They modeled the systems using an event-rule (ER) paradigm where $E$ is a set of events and $R$ is a set of rules that define the dependencies and the timing constraints. They defined a repetitive event-rule (RER) system where the events repeat; accordingly, they defined an occurrence index $i$ (similar to this thesis) and an occurrence index offset which they associate with a rule. This offset is the counterpart of the number of markings (tokens) that separate two transitions in a MG. The main contribution of the work in [54] was to develop a set of algorithms and rules that can be used to compute the exact timing behavior of more general classes of systems that combine asynchronous and synchronous structures.
For the sake of this thesis, we recite the following conclusion from the work of [54]: In any RER system (note that a MG-L is an RER system by definition), there exist an integer $M$ and a bound $k^\star$ such that the system's periodicity can be described for any event $e$, for $n \geq k^\star$, as follows

$$\hat{t}(\langle e, n + M \rangle) - \hat{t}(\langle e, n \rangle) = M p^\star \tag{4.27}$$

where $\hat{t}(\langle e, n \rangle)$ is the exact time of the $n$th occurrence of the event $e$, $p^\star$ is the maximum cycle (the counterpart of $\lambda_{MAX}$ in this chapter), and $M$ is the system's periodicity. This is similar to the conclusion reached by the work of [53] in the previous section. It is worth mentioning that for a MG-LTSC (see Theorem 3), our results prove that those integers, $k^\star$ and $M$, are equal to zero and one respectively.

Connection to ACDN

As previously mentioned in Section 4.3, a periodic MG-L has a warm-up period before the system enters a periodic steady state [53, 54]. This can be formalized as the existence of integers $m^\star$ and $i^\star$ that satisfy the following

$$\tau_u^{(i + m^\star)} - \tau_u^{(i)} = m^\star \lambda_{MAX}, \quad \forall i \geq i^\star, \ \forall t_u \in T \tag{4.28}$$

where $m^\star$ is the occurrence period of repetition, i.e., the system periodicity, and $i^\star$ is the occurrence warm-up bound before the periodic steady state. This formalization is similar to the ones introduced by [53, 54]. It should be noted that both $m^\star$ and $i^\star$ are characteristic of a graph rather than of a specific transition. If we applied the formulations of Appendix B.2 in [96] or algorithm-1 in [54] to a MG-LT, both would lead to the same conclusion of $m^\star = 1$, which supports our conclusions in Theorem 3. In particular, if we apply this to $t_s$ under the conditions of Theorem 4, we can write

$$\tau_s^{(i+1)} - \tau_s^{(i)} = \lambda_{MAX}, \quad \forall i \geq i^\star \tag{4.29}$$

This means that (4.23) is applicable for any $i \geq i^\star$.

The authors of [54] have determined the value of the upper bound of $i^\star$ in the worst case scenario (see Theorem 3 in [54]). For a MG-LT and $m^\star = 1$, their upper bound becomes

$$i^\star \leq \left(4 + \frac{3\,\lambda_{MAX}}{\lambda_{MAX} - \lambda_{2^{nd}MAX}}\right) |P| \tag{4.30}$$

where $\lambda_{2^{nd}MAX}$ is the second largest circuit (see Lemma 3).
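To give a feel for how loose the worst-case bound of (4.30) can be, the following minimal Python sketch evaluates it and compares it with the warm-up observed on a firing sequence. The function names are illustrative, and the firing sequence is a hypothetical example rather than data from the thesis; the values 3, 2, and 6 are the ones the text reports for the MG of Figure 4.4.

```python
def warmup_upper_bound(lam_max, lam_2nd, num_places):
    """Worst-case bound on i* from (4.30), for m* = 1."""
    if lam_max <= lam_2nd:
        raise ValueError("lam_max must exceed lam_2nd")
    return (4 + 3 * lam_max / (lam_max - lam_2nd)) * num_places


def observed_warmup(firings, period):
    """Smallest i* such that consecutive firings differ by `period` from i* on."""
    i_star = 0
    for i in range(len(firings) - 1):
        if firings[i + 1] - firings[i] != period:
            i_star = i + 1
    return i_star


# Values reported in the text for Figure 4.4: lam_max = 3, lam_2nd = 2, |P| = 6.
print(warmup_upper_bound(3, 2, 6))

# Hypothetical firing sequence that becomes periodic (period 3) after two firings.
print(observed_warmup([0, 2, 4, 7, 10, 13], 3))
```

The observed warm-up is only meaningful over the window of firings supplied; a longer window could in principle reveal a later irregularity.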
In contrast to this worst-case bound, this work shows that for MG-LTSC systems, $i^\star = 0$, and $\tau_n^{(0)}$ is well determined for any transition $t_n$.

Clarifying Examples:

1. The graph of Figure 4.3 is an example of a net with a system periodicity larger than one. In particular, $m^\star = 2$ for that MG; that is, the transition firings of the net in Figure 4.3 can be described following (4.28) as follows

$$\tau_u^{(i+2)} - \tau_u^{(i)} = 2\lambda_{MAX}, \quad \forall i \geq 0, \ \forall t_u \in T \tag{4.31}$$

2. Following the discussion in Section 4.6.4 about the example of Figure 4.4, we can see that $\lambda_{MAX} = 3$, $\lambda_{2^{nd}MAX} = 2$, and $|P| = 6$. Therefore, we substitute into (4.30) to get the limit of $i^\star$ as follows

$$i^\star \leq \left(4 + \frac{3 \cdot 3}{3 - 2}\right) \cdot 6 = 78 \tag{4.32}$$

It is noticeable that this is a conservative upper bound, since following the transition firings we can deduce that $i^\star \leq 2$.

4.8 Discussion

When we compare Theorem 3 and Corollary 3 to (4.1), we find that every transition does indeed fire periodically every $T_{sys}$ with no warm-up phase (both $T_{up}$ and $i_{up}$ equal zero). Regarding $T_{skew_n}$, it is well defined and equals the initial delay between the source and $t_n$. This means that if the MG model of an asynchronous system is a MG-LTSC, it can be used as an ACDN for a synchronous system. The four constraints of Section 4.5 can be interpreted as follows:

1. The absence of an always-enabled transition [28] excludes systems with externally generated clock sources.

2. A live and safe MG guarantees the firing periodicity.

3. Only one token per directed circuit guarantees a fixed clock period, i.e., every transition fires with period $T_{sys}$ and $m^\star = 1$.

4. A single source that belongs to a critical circuit guarantees the a priori knowledge of $T_{up}$ and the clock skew $T_{skew_n}$, which provides the timing information needed for a CDN that is robust against setup and hold violations [38].

Those constraints can thus be considered as design guidelines for an asynchronous system to be used as a CDN for a synchronous system.
Nevertheless, as shown in Section 4.6.4, if the fourth constraint is only partially satisfied (only MG-LTS), then there exists a transition which shares a circuit with the source such that probing it enables finding the value of $T_{sys}$. This is one of the foundations of the feasibility of our main contribution in Chapter 5. By probing the inputs of this transition, a programmable delay line inserted into one of the circuits with $t_s$ can be adjusted until the system becomes a MG-LTSC. From then on, the system will behave as an ACDN.

An important point that is worth mentioning is our choice of using deterministic delay values for the sake of simplicity. Practically, and particularly in the case of uncertainties, all delay values would be stochastic rather than deterministic because of variations, design asymmetries, etc. In such a case, every delay value shall be represented by a range with a lower and an upper bound (similar to the model in [53]). Under such a model, it will be possible to study time-varying clock skews and timing jitter.

A final point that needs to be discussed is that throughout this chapter, we emphasize that we are providing sufficient conditions to satisfy (4.1). We hereby discuss the necessary conditions. First, we show an example that violates the constraints and still satisfies (4.1), to prove that our conditions are not necessary.

Figure 4.6: An example for a MG that violates the sufficient constraints but still satisfies (4.1) and (4.19). $T_{sys} = \max_{C \in \mathcal{C}} \frac{\delta(C)}{M(C)} = \frac{8}{2} = 4$.

In the example of Figure 4.6, the net violates both the T and the S constraints; that is, not every circuit has a single token and there is more than one source. However, as we can see from the table of the transition firings, all transitions can be written following (4.1) or Theorem 3 considering $t_1$ as the source for $t_1$, and $t_2$ as the source for the rest.
We write the equations for $\forall i \geq 0$ to elaborate

$$\begin{aligned}
\tau_1^{(i)} &= \max_{h \in \hat{H}(t_1,t_1,0)} \delta(h) + 4i = 4i \\
\tau_2^{(i)} &= \max_{h \in \hat{H}(t_2,t_2,0)} \delta(h) + 4i = 4i \\
\tau_3^{(i)} &= \delta(t_2 p_3 t_3) + 4i = 2 + 4i \\
\tau_4^{(i)} &= \delta(t_2 p_3 t_3 p_4 t_4) + 4i = 4 + 4i \\
\tau_5^{(i)} &= \delta(t_2 p_3 t_3 p_4 t_4 p_6 t_5) + 4i = 5 + 4i
\end{aligned} \tag{4.33}$$

On the other hand, the example of Figure 4.3 violates the same two constraints but causes oscillation in the cycle time as discussed in Section 4.5. That is why we believe that one of the necessary conditions is that every path $h$ from any transition to a source has to satisfy $\delta(h) - T_{sys} M(h) \leq 0$. This inequality is inspired by the lemma that the work in [90] used to prove their Theorem 3. It has to be mentioned that they proved this inequality for every circuit, but they incorrectly applied it to every path, which led to an ill-constrained theorem as we showed in Section 4.7.2. In particular, it is trivial to verify that this inequality is violated by the path $\{t_1 p_2 t_2 p_3 t_3\}$ in the MG depicted in Figure 4.3. The key intuition behind our result is that by forcing every directed circuit to have one token and by having a single source, this inequality is indeed satisfied for all paths in the MG.

In summary, we believe that the necessary conditions to satisfy Theorem 3 are too complex to be useful for designing ACDNs. However, by introducing more conservative conditions, which are sufficient, we simplify the design requirements and make it practical for circuit designers to verify whether their asynchronous structures can indeed be used to provide the clocking signals for a synchronous system.

4.9 Conclusion

This chapter provides the theoretical foundation for asynchronous clock distribution networks. Particularly, we show sufficient constraints for a system with asynchronous structures in order to provide strictly periodic and well-determined signals for a large number of clock sinks, which is necessary for the timing of
We use timed Marked Graphs (MGs) to prove that for a live and safe MG with single token per directed circuit, if there exists a single source, and this source belongs to one of the critical directed circuits, then all tran- sitions (or clock sinks) will re periodically and with well-determined clock time occurrences. This allows the timing of all clock sinks in a synchronous manner in the absence of a clock source. Additionally, we prove that by only probing a single transition and adjusting a single programmable delay line, we are able to force the source to belong to a critical circuit, and thus forcing a well dened synchronous periodicity. This is particularly useful in systems with high levels of manufacturing uncertainty. 110 Chapter 5 Hierarchical Chains of Homogeneous Clover-Leaves Clocking ((HC) 2 LC) In this chapter, we present our main contribution which is a new robust clocking technique for SFQ circuits. 5.1 Introduction In Chapters 1, 2, and 3, we explained in details the motivation of this thesis and how the problem of clocking in SFQ is more challenging than CMOS because of 1) the gates-level deep pipelines which cause pipeline starving and results in more clock sinks than CMOS. 2) The common techniques of clocking high frequency CMOS chips are inapplicable to SFQ. 3) The high level of timing uncertainty. and 4) the absence of an established design tool ow. In Chapter 3, we suggested the use of an ACDN in order to provide the clocking of a CSR. For a 32-gate CSR, the proposed technique achieves up to 111 a 93% yield improvement at the same cycle time compared to zero-skew clock tree. Nevertheless, the HCLC work did not solve how to extend the proposed clocking structure to more generic and complex pipelines than a basic CSR loop. In particular, its cycle time was O(N gates ) which is highly impractical for large scale VLSI designs. 
This chapter extends HCLC and addresses its impracticality. We propose a robust and self-adaptive clocking technique for generic and complex pipelines, the cycle time of which is independent of the total number of gates. The proposed hierarchical chains of homogeneous clover-leaves clocking [97] inherits its robustness from: 1) the spatial correlation of various sources of variations [8, 9, 24–27] (see Section 2.1.3), and 2) the timing robustness of traditional counter-flow clocking [4] (see Section 2.3.1). We prove the timing properties of the clocking scheme using MGs [28] (see Section 4.2) and the fundamentals of ACDNs [52] (see Chapter 4). We designed a tool (see Chapter 7) and used it to implement (HC)$^2$LC on the ISCAS'85 benchmark circuits [29]. Our simulations show that, averaging over the benchmark circuits and at the same cycle time, (HC)$^2$LC achieves 52.3% and 211.8% yield improvements over zero-skew trees at low and medium ranges of gate delays, respectively, with an area overhead of only 9.00%. Those results demonstrate our robustness superiority over zero-skew clock trees, with yet a large room for improvement as discussed in Section 9.2.

The remainder of this chapter is organized as follows. Section 5.2 introduces the architecture of our proposed clocking technique. Then, the proposed technique is analyzed in Section 5.3. After that, the proposed optimizations are discussed in Section 5.4. Finally, Section 5.5 provides a discussion and Section 5.6 presents our conclusions.

5.2 Proposed Architecture

We hereby introduce the Hierarchical Chains of Homogeneous Clover-Leaves Clocking, the (HC)$^2$LC [98]. It is worth mentioning that it is inspired by the combination of our HCLC in [23] and a strategy similar to the RSFQ-AT technique of [19], where they associated a specific clock signal with each data signal to achieve a source-synchronous inter-gate communication without the use of an expensive dual-rail implementation (see Section 2.3.2).
5.2.1 Hierarchical Chains

First, we introduce the architectural foundation, which is the Hierarchical Chains' Link (HCL), illustrated in Figure 5.1a. An HCL has a single input and a single output. Assuming a periodic signal on $CLK_{in}$ with period $T_{in}$, the output signal on $CLK_{out}$ shall have a period of

$$T_{HCL} = \max\left(T_{in},\ \delta_{ov} + T_{core}\right) \tag{5.1}$$

where $\delta_{ov}$ is the HCL overhead delay (see (3.7)) and $T_{core}$ is the propagation delay of the black-boxed core. Note that since this core has one input and one output, it can also be an HCL. Using the HCL as the building block of a hierarchical architecture is based on this property.

Figure 5.1: The Hierarchical Chain's Link (HCL). (a) The architecture. (b) The MG model.

Now we connect those links together to form a chain of HCLs as shown in Figure 5.2. Once again, note that a chain of HCLs can be considered as an HCL. Similarly, assuming an existing $T_{in}$, the output period can be written as follows

$$T_{chain} = \max\left(T_{in},\ (C + 1)\,\delta_{ov},\ \max_i T_{HCL_i}\right) \tag{5.2}$$

where $C$ is the number of links in the chain, and $\max_i$ is over the $i$th link in the chain.

Figure 5.2: A chain of $C$ HCLs.

Assume that we have a large number of cores, each with some delay $T_{core_i}$. Linking each core to the HCL circuitry of Figure 5.1a, chaining them, and building the hierarchy upwards using the HCL chain of Figure 5.2, we end up with a hierarchical chain that has a top HCL. Hence, with the top HCL's $CLK_{in}$ having a period of $T_{in}$, the period of the signal at the input of every $j$th bottom level core, $T_{core_j}$, shall be

$$T_{core_j} = \max\left(T_{in},\ (C_{max} + 1)\,\delta_{ov},\ \delta_{ov} + \max_i \delta_{core_i}\right) \tag{5.3}$$

where $C_{max}$ is the maximum number of links in an HCL chain.

5.2.2 Bottom Level

This section explains how the logic cells are clocked at the bottom level of the hierarchy. In a similar way to the hybrid clover discussed in Section 3.2, we introduce the homogeneous counterpart with only counter-flow, as shown in Figure 5.3, to be the bottom level.
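As a quick numeric illustration of the period relations (5.1) to (5.3) of Section 5.2.1, the following minimal Python sketch transcribes them directly. The function names and all delay values are hypothetical and in arbitrary time units; they are not taken from the thesis.

```python
def hcl_period(t_in, d_ov, t_core):
    """(5.1): output period of a single HCL."""
    return max(t_in, d_ov + t_core)


def chain_period(t_in, d_ov, link_periods):
    """(5.2): output period of a chain whose links have the given periods."""
    c = len(link_periods)  # C, the number of links in the chain
    return max(t_in, (c + 1) * d_ov, max(link_periods))


def bottom_core_period(t_in, d_ov, c_max, core_delays):
    """(5.3): period seen at the input of every bottom-level core."""
    return max(t_in, (c_max + 1) * d_ov, d_ov + max(core_delays))


# Hypothetical values: input period 5, overhead 2, three cores.
T_IN, D_OV = 5, 2
core_delays = [3, 8, 4]
links = [hcl_period(T_IN, D_OV, d) for d in core_delays]
print(links)
print(chain_period(T_IN, D_OV, links))
print(bottom_core_period(T_IN, D_OV, 3, core_delays))
```

In this example the slowest core, not the input period or the chain length, sets the period, which is the situation the max in each relation is meant to capture.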
The gates per clover are divided into $L$ gates per leaf, where the clover has $N$ leaves. First, $CLK_{in}$ is fed to a splitter tree to obtain one input per leaf. Within one leaf, we use the SFQ counter-flow clocking [4, 10] (see Section 2.3.1), where a sequence of $L$ splitters is used to clock $L$ gates sequentially in a direction opposite to the data flow, hence the name counter-flow. This naturally provides robustness to hold constraints at the cost of increased setup constraints. After that, we use a C-junction tree to collect the leaf outputs, producing a single $CLK_{out}$ that signals the clocking of all the gates.

Figure 5.3: A homogeneous clover with $N$ leaves and $L$ gates per leaf.

It is worth mentioning that the homogeneity of the flow (i.e., all leaves use counter-flow) is not expected to be perfect for an arbitrary pipeline, i.e., the data path cannot always be consistently matched to counter-flow clocking. For instance, a connection can exist across two distinct clovers or HCLs, and the clocks associated with those connections would thus not follow the counter-flow pattern. However, there will exist a certain gate-to-clock-sink assignment that optimizes this homogeneity; this is discussed in Section 5.4.2. Based on the diagram of Figure 5.3, $T_{core}$, or the clover delay $\delta_{clover}$, which is the delay from $CLK_{in}$ to $CLK_{out}$, can be written as

$$\delta_{clover} = \log_2 N \cdot \delta_{ov} + L_{max}\,\delta_{sp} \tag{5.4}$$

where $N$ is the number of leaves per clover, and $L_{max}$ is the maximum number of gates per leaf belonging to that clover.

5.2.3 Top Loop

This section explains how the highest hierarchy HCL, which is the highest hierarchy chain of HCLs, or the top loop for short, is managed. Figure 5.4a illustrates the block diagram of the loop. (HC)$^2$LC needs a single GO pulse to initiate the clocking system. This signal is coupled to the loop using a confluence buffer (see Figure 4.1b). After the initial firing, when no more pulses are generated on the GO port, the buffer shall function as a mere JTL.
This results in an oscillating loop, which explains the omission of a clock source. Notice that a primed C-junction and a splitter are used to couple the rest of the hierarchy, similar to the HCL structure. The output signal $CLK_{out}$ could be used to probe the system from off-chip for either testing or communication, and it shall have a period of

$$T_{top} = \max\left(T_{L1},\ T_{L2}\right) = \max\left(\delta_{ov} + \delta_{sp} + T_{hrcl},\ \delta_{cf} + 2\delta_{sp} + \delta_{ov} + \delta_{ctrl}\right) \tag{5.5}$$

where $T_{L1}$, $T_{L2}$, and $T_{hrcl}$ are respectively the periods of the loop L1, the loop L2, and the highest level HCL, which is depicted as the hierarchy box in Figure 5.4a.

The circuit design of the shaded structure with variable control delay is left for future work, in addition to the loop stability analysis in the case of allowing the programmability to either increase or decrease the delay, both of which are discussed in Section 9.2. Meanwhile, a behavioral model is used to verify the functionality of the proposed technique. It acts as a programmable delay line with delay $\delta_{ctrl}$. Its value depends on two inputs: its own output, and $CLK_{FB}$, which is the $CLK_{out}$ of the highest level HCL. This is designed to ensure that the delay across the loop L1 is always longer than that across L2. In the resulting structure, this property would guarantee that the critical path of each clock path emerges from the C-junction in the top loop of Figure 5.4a; this is discussed in detail in Section 5.3.1 and Section 5.5.

Figure 5.4: The top loop structure. (a) The architecture. (b) The MG model.

5.3 Analysis

This section analyzes the theoretical claims made about the behavior of the (HC)$^2$LC architecture, and discusses its cycle time, clock skew, and area overhead.

5.3.1 ACDN Theory

This section connects the structure of the (HC)$^2$LC to the ACDN timing theory in Chapter 4. In particular, we show that Theorems 3 and 4 (see (4.19) and (4.23)) can be applied to our proposed clocking technique.
First, the (HC)$^2$LC circuitry can be seen as a CDN which can be modeled as a timed MG as discussed in Section 4.2. Specifically, the MG models of the HCL structure and the top loop are illustrated in Figure 5.1b and Figure 5.4b. Second, if we prove that (HC)$^2$LC satisfies the constraints of those theorems, then it can be used as an ACDN and hence used for the timing of a synchronous SFQ chip. The constraints are indeed satisfied, as we argue below.

1. Live and safe MG: Based on Theorem 6 in [28], which states that the number of tokens in a directed circuit does not change with firing, we can deduce that every HCL can be abstractly modeled as one place and one transition in series. Also, the bottom level (which does not contain any loops and thus has no tokens) can be abstractly modeled as one place and one transition, i.e., like a single gate with a single delay value. If we do this recursively and climb up the hierarchy, the top loop model becomes an alternating sequence of three places and three transitions that has a single token in it. It is intuitive that such a simple loop is a live and safe MG.

2. A single token per circuit: Similarly, every HCL has one single-token directed circuit, and since no nodes are allowed to be repeated in a circuit by definition, we can abstract the HCL model and climb up the hierarchy recursively to show that no directed circuits can exist with more or less than one token.

3. A single source: The bottom level has no initial tokens, and the HCL structure does not contain a source. Therefore, the source created by the GO signal (see Figure 4.1b) is the only source, and the top loop is the only directed circuit that contains a source.

4. Does the source belong to a critical circuit? Assuming a high level of uncertainty (which is the case in SFQ, as discussed in Section 2.1.3), the answer to this constraint can be either positive or negative. In case it is true and the source does indeed belong to a critical circuit, i.e.
the top loop is the slowest loop, then Theorem 3 (see (4.19)) can be applied and the clocking technique works immediately upon start-up. On the other hand, if the top loop is not critical, then Theorem 4 (see (4.23)) can be applied. In that case, let $t_n$ in the theorem be the primed C-junction in Figure 5.4a. Thus, by probing its inputs, after $i_{up}$ firings, the $i$th pulse on the unprimed input will occur before the $(i-1)$th pulse on the primed input, which means that the top loop needs to get slower. Then, as it slows down progressively, the inequality will flip at some point, and the top loop will become critical. From this point onward, the circuit will behave following (4.19), which means it will satisfy the synchronous property of (4.1) and it can henceforth be used as an ACDN.

5.3.2 Cycle Time And Clock Skew

Based on the discussions in Section 2.2 and Section A.3, all hierarchical chains share the same cycle time, $T_{sys}$. Substituting from (5.1)–(5.5), $T_{sys}$ can be defined as

$$T_{sys} = \max \left\{ \begin{array}{l}
(\log_2 N_{max} + 1)\,\delta_{ov} + L_{max}\,\delta_{sp} \\
(C_{max} + 1)\,\delta_{ov} \\
(C_{top} + 1)\,\delta_{ov} + \delta_{sp} \\
\delta_{cf} + 2\delta_{sp} + \delta_{ov} + \delta_{ctrl}
\end{array} \right\} \tag{5.6}$$

where $C_{max}$ is the largest chain length and $C_{top}$ is the length of the highest level HCL (see the shaded box in Figure 5.4a).

However, this is not enough to fully determine the timing of the chip and perform STA. The value of $T_{skew_n}$ in (4.1), which is found to be the value $\max_{h \in \hat{H}(t_s,t_n,0)} \delta(h)$ in (4.19) for the MG model, has to be well defined. In (HC)$^2$LC, there exists only one zero-token path from $t_s$ to any transition $t_n$. Since we define the skew for all sinks from a point which acts as the clock source, the clock skew value for each sink is identical to the insertion delay, $T_{ins}$, by definition (see Section 2.2.3). We thereby define that skew value, or the insertion delay, for any clock sink, i.e.
gate $n$, as two components

$$T_{skew_n} = T_{ins_n} = \delta_{hrcl}(n) + \delta_{local}(n) \tag{5.7}$$

where $\delta_{hrcl}(n)$ is the hierarchical delay, which returns the delay of the path $h_{s \to clv(n)}$ from $t_s$ to the input of the clover (bottom level) that contains the gate $n$. This clover is the core of one of the lowest level HCLs (see Figure 5.1a). $\delta_{local}(n)$, on the other hand, is the delay of the local distribution within the clover (see Figure 5.3), i.e., the delay from the clover's input to the output of the splitter feeding this clock sink $n$.

First, we further decompose the hierarchical delay. For $H$ hierarchy levels, where the 0th level is the lowest level HCL and the top loop is the $(H+1)$th level, and with $R(h,n)$ returning the rank (or order) of the HCL's core at the $h$th level that belongs to the path $h_{s \to clv(n)}$,

$$\delta_{hrcl}(n) = \delta_{cf} + \delta_{sp} + \delta_{ctrl} + \delta_{ov}\left(H + 1 + \sum_{h=0}^{H} R(h,n)\right) \tag{5.8}$$

Second, we define the local delay within a clover. If $N_{sp}(n)$ is the number of splitters from the input of the clover containing the gate $n$ to the input of the specific leaf containing it, and $L_{sp}(n)$ is the number of splitters within that leaf leading to the gate $n$, then

$$\delta_{local}(n) = \delta_{sp}\left(N_{sp}(n) + L_{sp}(n)\right) \tag{5.9}$$

Note that $N_{sp}(n)$ would equal $\log_2 N$ in case $N$ is a power of 2.

5.3.3 Area Overhead

As Section 8.1 shows, a circuit with (HC)$^2$LC has a larger area than a circuit constructed using a zero-skew clock tree. The main reason for this area overhead is that (HC)$^2$LC has a larger relative clock skew between sinks than a zero-skew tree, which means that the network requires more timing fixes to finalize the construction of the CDN. After calculating the insertion delay, $T_{ins}$, at every gate, we check every logical connection for potential setup and hold violations. Note that the clock skew at the connection between two gates is the difference in $T_{ins}$ of these gates.
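The cycle time of (5.6) and the insertion delay of (5.7) to (5.9) can be evaluated with a few lines of code. The following is a minimal Python sketch under stated assumptions: the function and parameter names are illustrative, all numeric values are hypothetical and in arbitrary time units, and the ranks $R(h,n)$ are simply supplied as a list because the thesis text here does not say whether ranks start at zero or one.

```python
import math


def system_period(n_max, l_max, c_max, c_top, d_ov, d_sp, d_cf, d_ctrl):
    """(5.6): the largest of the four candidate loop periods."""
    return max(
        (math.log2(n_max) + 1) * d_ov + l_max * d_sp,  # bottom-level clover
        (c_max + 1) * d_ov,                            # longest HCL chain
        (c_top + 1) * d_ov + d_sp,                     # highest-level chain
        d_cf + 2 * d_sp + d_ov + d_ctrl,               # programmable top loop
    )


def hierarchical_delay(ranks, d_cf, d_sp, d_ctrl, d_ov):
    """(5.8): ranks[h] holds R(h, n) for h = 0..H."""
    big_h = len(ranks) - 1
    return d_cf + d_sp + d_ctrl + d_ov * (big_h + 1 + sum(ranks))


def local_delay(n_sp, l_sp, d_sp):
    """(5.9): splitters crossed inside the clover, times the splitter delay."""
    return d_sp * (n_sp + l_sp)


def insertion_delay(ranks, n_sp, l_sp, d_cf, d_sp, d_ctrl, d_ov):
    """(5.7): hierarchical part plus local part."""
    return (hierarchical_delay(ranks, d_cf, d_sp, d_ctrl, d_ov)
            + local_delay(n_sp, l_sp, d_sp))


# Hypothetical delays: overhead 2, splitter 1, confluence buffer 1, control 4.
print(system_period(n_max=4, l_max=3, c_max=3, c_top=2,
                    d_ov=2, d_sp=1, d_cf=1, d_ctrl=4))
t_ins_a = insertion_delay([1, 0], 2, 3, d_cf=1, d_sp=1, d_ctrl=4, d_ov=2)
t_ins_b = insertion_delay([0, 0], 2, 1, d_cf=1, d_sp=1, d_ctrl=4, d_ov=2)
print(t_ins_a, t_ins_b, t_ins_a - t_ins_b)  # last value: skew between the gates
```

The difference printed last is the clock skew seen by a data connection between the two hypothetical gates, which is the quantity the setup and hold checks operate on.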
In case of a setup violation, we cut that connection and insert locally-flow-clocked DFFs accordingly to distribute the large clock skew caused by the wide range of $T_{ins}$. For every data connection from a gate clocked at time $\tau_i$ to one clocked at $\tau_j$, the number of inserted flops equals

$$N_{FF} = \max \left( 0, \left\lceil \frac{\tau_i - \tau_j + \tau_{DATA} + T_{setup} - T_{sys}}{T_{sys} - \tau_{FF} - T_{setup}} \right\rceil \right) \tag{5.10}$$

where $\tau_{FF}$ is the maximum C-Q delay of a flop. This forces the network to be re-levelized, which introduces an even larger number of DFFs to ensure path balancing [99] (see Section 7.2). We also check for potential hold violations and insert JTLs [4] acting as hold buffers to fix them as follows:

$$N_{jtl} = \max \left( 0, \left\lceil \frac{T_{hold} - \tau_i + \tau_j - \tau_{DATA}}{\tau_{jtl}} \right\rceil \right) \tag{5.11}$$

where $N_{jtl}$ is the number of JTLs, and $\tau_{jtl}$ is the delay of one JTL.

In Section 5.4, we show that by constructing the network to optimize the insertion delay as proposed in Section 5.4.1, and by using placement-aware and logic-level-aware assignment algorithms as proposed in Section 5.4.2, we reduce the occurrences of logic connections with large clock skew, and thus reduce those overheads, as the results in Section 8.3 show.

5.4 Optimizing (HC)²LC

In this section, we propose algorithms [100] that optimize the construction of the (HC)²LC network and the assignment (or mapping) of the gates to the clock sinks constituting the CDN. As the results in Section 8.3 show, averaging over the ISCAS'85 benchmark circuits, the optimizations in this section result in an area reduction of 51.28% and a yield improvement of 166.74% compared to the unoptimized (HC)²LC of our work in [97, 98].

5.4.1 Optimizing Insertion Delay

As explained in [59], larger values of $T_{ins}$ statistically worsen the yield. The range of $T_{ins}$ is defined as the difference between the minimum and maximum $T_{ins}$. A wide range of $T_{ins}$ strains the setup/hold equations; it not only causes the large area overheads previously discussed, but also impacts the robustness against those violations.
This is because of the insertion of additional circuits, and thus more sources of uncertainty. Both of those are quantified in Section 8.3. In summary, both the absolute values and the relative values (range of $T_{ins}$) of the insertion delays should be minimized to reduce the area overhead and improve the yield.

Our initial work in [98] designed the network structure in the most intuitive way, which is to minimize the hierarchy depth, $H$. This resulted in a balanced structure, as shown in the abstract diagram of the (HC)²LC network in Figure 5.5. The figure denotes $T_{ins}$ at the entry point of an HCL or clover in relative units of overhead, $\tau_{ov}$; going from an HCL to the next one in a chain thus adds one unit of skew. This way of construction resulted in large values of $T_{ins}$, as shown in Section 8.3. From the figure, the minimum $T_{ins}$ is three, and the maximum is nine.

$C$, the number of links per chain (see Section 5.2), is only a maximum value that should not be exceeded to satisfy performance requirements (see (5.6)); some chains may have a smaller number of links. This makes the (HC)²LC network structure a design problem, where designers are free to choose which chains should have $C$ links or fewer. Note that both equations (5.6) and (5.7) are valid for any structure.

In this section, we propose an algorithm to construct the structure of the hierarchy that minimizes the hierarchy part of $T_{ins}$, $\tau_{hrcl}$ in (5.7), which is the insertion delay up to the entry point of the clovers. Moreover, the network constructed using the proposed algorithm bounds the range of $\tau_{hrcl}$ to at most one unit (the optimum being either zero or one). This leads to a significant reduction in area overhead, and to a yield improvement (see Section 8.3).

In particular, we suggest a counter-intuitive approach where we minimize $\tau_{hrcl}$, and thus obtain an unbalanced structure. Figure 5.6 shows the same network constructed using our proposed algorithm.
Note that the minimum $T_{ins}$ is six, and the maximum is seven. Also, note that the graph structure is unbalanced, and the hierarchy depth is thus seven.

Figure 5.5: Abstract diagram of a 16-clover (HC)²LC network constructed by minimizing the hierarchy depth, with four links per chain (C = 4).

Figure 5.6: Abstract diagram of a 16-clover (HC)²LC network constructed by minimizing the insertion delay, with four links per chain (C = 4).

Our proposed algorithm is summarized in Algorithm 1. The parameters C, N, and L are the configuration of (HC)²LC (see Figure 5.2 and Figure 5.3); they are the number of links per chain, the number of leaves per clover, and the number of gates per leaf, respectively. For a certain number of gates, nGates, and certain C, N, and L values, the algorithm constructs a linked list of HCLs with topLink as the root, which represents the HCL with the hierarchy box in Figure 5.4a. The algorithm initializes a dictionary, availDict, with the available slots, where the keys are the $T_{ins}$ values. An available slot is defined as an open connection into which a clover or an HCL can be inserted. Then, the algorithm greedily opens either zero or one additional slot per iteration, until the number of open slots equals the number of clovers that need to be connected. Given the nature of the design space, and due to the optimal substructure of the network [101], our greedy approach yields a structure that minimizes $T_{ins}$ at the entry points of the nClovers clovers. Moreover, our algorithm limits the maximum difference in $T_{ins}$ between two clovers to at most one unit. The time complexity is O(nClovers²) with the constraint that C > 1; otherwise the algorithm would not terminate, as the problem would not possess a valid solution for nClovers > 1, and it is an impractical design choice anyway.
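The greedy slot opening described above can be sketched in a few lines of Python. This is a simplified abstraction and not the implementation used in this work: it tracks only the entry-point $T_{ins}$ of each open slot (in units of the overhead delay) together with its position in its chain, and it does not build the linked list of HCLs. The slot model, the function names, and the balanced reference structure are assumptions made for illustration.

```python
import heapq
from itertools import product


def greedy_entry_costs(n_clovers, C):
    """Entry-point T_ins (in overhead units) of each clover slot, greedy version."""
    if n_clovers > 1 and C <= 1:
        raise ValueError("C must exceed 1, otherwise no slot count ever grows")
    slots = [(2, 1)]  # (T_ins, position in chain); the top link opens a slot at 2
    while len(slots) < n_clovers:
        t, pos = heapq.heappop(slots)            # cheapest open slot
        heapq.heappush(slots, (t + 1, 1))        # first link of a new child chain
        if pos < C:                              # room left in the current chain
            heapq.heappush(slots, (t + 1, pos + 1))
    return sorted(t for t, _ in slots)


def balanced_entry_costs(levels, C):
    """Reference: full chains of C links on every level, entry delay 1 at the top."""
    return sorted(1 + levels + sum(r) for r in product(range(C), repeat=levels))


greedy = greedy_entry_costs(16, 4)
balanced = balanced_entry_costs(2, 4)
```

For 16 clovers and C = 4, this abstraction reproduces the ranges quoted for Figure 5.5 and Figure 5.6: the balanced structure spans three to nine units, while the greedy one spans six to seven. Because the loop always expands the cheapest slot and every new slot is one unit more expensive, the open slots never differ by more than one unit.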
To validate the proposed algorithm, and to analyze how this unbalanced network benefits both the overhead and the yield, we report experimental results in Section 8.3. Averaging over the ISCAS'85 benchmark circuits, the maximum $T_{ins}$ is reduced by 32.3%, and the $T_{ins}$ range is reduced by 75.1%. Moreover, on average, optimizing $T_{ins}$ results in an area reduction of 48.44% and a yield improvement of 77.31% compared to (HC)²LC with unoptimized $T_{ins}$.

Algorithm 1 Construct network while minimizing $T_{ins}$
Input: nGates, N, L, C
Output: The root of the HCLs linked-list, topLink
  gatesPerClv = N × L
  nClovers = nGates / gatesPerClv
  topLink ← createLink()  ▷ ins=1
  availDict[2] ← topLink
  nSlots = 1
  while nSlots < nClovers do
    Tins = min(availDict)
    thisLink ← availDict[Tins].pop()
    thisLink.child ← createLink()  ▷ ins=Tins+1
    availDict[Tins+1] ← thisLink.child
    if # of links in thisLink.chain < C then
      availDict[Tins+1] ← thisLink
      nSlots++
    end if
  end while

5.4.2 Gate Assignment

As discussed in Sections 5.2.2 and 5.3.3, one of the challenges of implementing (HC)²LC is to find a method for assigning gates to clock sinks that maximizes the exploitation of the advantages (HC)²LC promises. Gate assignment can be divided into two questions or problems:
A) The hierarchy assignment: gates-to-clovers, clovers-to-HCLs, and the hierarchy of the HCLs.
B) Intra-clover assignment: gates-to-leaves, and the ordering of the gates within one leaf.

Such a mapping problem, like many problems in VLSI CAD, is NP-hard, so finding the absolute optimum is impractical [102, 103]. In this section, we propose a placement-aware algorithm to perform the hierarchy assignment.
Given that placement algorithms target minimizing the total routing wire length [102, 104], gates which are in proximity are more likely to be sequentially adjacent [51], and thus (HC)²LC could benefit more from its reliance on the spatial correlation of sources of uncertainty, since this reduces the number of inter-clover and inter-chain logical connections. It is worth mentioning that this relaxes the timing constraints, resulting in less area overhead for the timing fixes (see Sections 5.3.3 and 7.2, and [97, 98]). Moreover, in this section, we propose a logic-level-aware intra-clover assignment algorithm to order and assign the gates within the clover chosen by the first algorithm. This helps (HC)²LC benefit from the intrinsic timing robustness of counter-flow clocking.

Placement-Aware Partitioning

The hierarchy assignment algorithm is summarized in Algorithm 2. The abstract outline of the algorithm is as follows:
1. Perform placement to obtain the gate coordinates: line 1.
2. Use Algorithm 1 to build the linked list of HCLs, and accordingly assign every HCL a number of descendant gates: lines 2-3.
3. Transform the gates into a complete graph where the weight of each edge connecting two gates is equal to the negative of the squared distance between these two gates: line 4.
4. Do a top-bottom fixed-size partitioning: lines 5-17. Note that in line 5, the root is everyone's parent.

The top-bottom partitioning algorithm mainly uses the Kernighan-Lin (KL) partitioning algorithm [105]. First, we use the recursive coordinate bisection algorithm (RCB) [106] to generate initial fixed-size partitions whose specific sizes are defined by the output of Algorithm 1. Then, we use multi-way KL passes [102, 103, 107] to obtain a better solution.
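To make the initial partitioning step concrete, here is a minimal Python sketch of a recursive coordinate bisection that produces partitions of prescribed sizes, together with the edge weight from step 3. It is an illustration under stated assumptions rather than the code used in this work: the tuple layout `(name, x, y)`, the way the size list is halved, and the function names are hypothetical, and the KL refinement passes are not included.

```python
def edge_weight(p, q):
    """Negative squared Euclidean distance between two placed gates."""
    return -((p[1] - q[1]) ** 2 + (p[2] - q[2]) ** 2)


def rcb(points, sizes):
    """Split placed gates into len(sizes) groups of the requested sizes.

    points: list of (name, x, y); sizes: group sizes summing to len(points).
    Each step cuts along the axis with the larger spread.
    """
    if sum(sizes) != len(points):
        raise ValueError("partition sizes must add up to the number of gates")
    if len(sizes) == 1:
        return [[name for name, _, _ in points]]
    xs = [p[1] for p in points]
    ys = [p[2] for p in points]
    axis = 1 if max(xs) - min(xs) >= max(ys) - min(ys) else 2
    ordered = sorted(points, key=lambda p: p[axis])
    half = len(sizes) // 2
    cut = sum(sizes[:half])  # gates that belong to the first half of the groups
    return rcb(ordered[:cut], sizes[:half]) + rcb(ordered[cut:], sizes[half:])


placed = [("a", 0.0, 0.0), ("b", 1.0, 1.0), ("c", 10.0, 0.0), ("d", 11.0, 1.0)]
even_split = rcb(placed, [2, 2])
uneven_split = rcb(placed, [1, 3])
```

In this toy placement the x spread dominates, so the cut is vertical and the two nearby pairs end up together. A KL pass would then try to swap gates between groups to increase the total weight kept inside each group.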
For our complete graph, the time complexity of one two-way KL pass is O(nGates³), but we use diagonal swapping [102, 108] (in particular, the algorithm of Figure 3 in [108]) to improve it to O(nGates² log(nGates)). Moreover, we use multi-threading while executing the recursive calls in line 13 to decrease the runtime. It is worth mentioning that KL does not yield an optimal solution, but it performs reasonably well in a very limited number of passes [102, 103].

Algorithm 2 Placement-aware hierarchy assignment
Input: netlist, N, L, C
Output: Adding gates to the HCLs linked-list
 1: allGates ← placement(netlist)
 2: topLink ← constructNtk(allGates, C, N, L)  ▷ Algorithm 1
 3: Compute number of descendants for each link
 4: Generate gates graph
 5: topLink.descendantGates ← allGates
 6: procedure doPartitioning(thisLink)
 7:   generate initial partitioning using RCB
 8:   multi-partition KL algorithm
 9:   for children of thisLink do
10:     if thisChild is a clover then
11:       link gates to their direct parent clover
12:     else
13:       doPartitioning(thisChild)
14:     end if
15:   end for
16: end procedure
17: doPartitioning(topLink)

Intra-clover Assignment

The intra-clover assignment algorithm is summarized in Algorithm 3; its main objective is to favor counter-flow connections whenever possible. For each clover, its gates are sorted in descending order of their logic level, and each leaf is initialized as a stack, the bottom of which is the input of that leaf. This is done to increase the likelihood of counter-flow clocking connections within each leaf. Figure 5.7 shows an illustrative example to visualize the three priorities of the algorithm. The first priority is a counter-flow connection, and is used in the steps of Figure 5.7d and Figure 5.7f. However, it is not always possible to find a counter-flow connection for an arbitrary pipeline, as discussed in Section 5.2.2. The second priority is used to avoid missing a chance of a counter-flow connection when possible.
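The three choices made for each gate (a counter-flow connection first, then a leaf whose top gate can no longer get a counter-flow connection, then the shortest leaf) can be sketched in Python as follows. This is a hedged illustration for a single clover, not the code used in this work: the data layout (`levels`, `drives`), the function name, and the toy netlist are assumptions, and "sequentially adjacent" is modeled here simply as "the gate drives the gate currently on top of the leaf".

```python
def assign_clover(gates, levels, drives, n_leaves, leaf_size):
    """Order the gates of one clover into leaf stacks (index 0 = leaf input).

    gates: gate names; levels: name -> logic level;
    drives: name -> set of gate names it feeds (hypothetical netlist view).
    """
    order = sorted(gates, key=lambda g: levels[g], reverse=True)
    leaves = [[] for _ in range(n_leaves)]
    for g in order:
        open_leaves = [s for s in leaves if len(s) < leaf_size]
        if not open_leaves:
            raise ValueError("more gates than clock sinks in this clover")
        # First priority: g feeds the gate on top of a leaf (counter-flow).
        target = next(
            (s for s in open_leaves if s and s[-1] in drives.get(g, ())), None
        )
        # Second priority: the top gate is more than one level above g, so it
        # can no longer receive a counter-flow neighbor.
        if target is None:
            target = next(
                (s for s in open_leaves if s and levels[g] + 1 < levels[s[-1]]),
                None,
            )
        # Default choice: the shortest leaf.
        if target is None:
            target = min(open_leaves, key=len)
        target.append(g)
    return leaves


levels = {"a": 3, "b": 2, "c": 2, "d": 1, "e": 0}
drives = {"b": {"a"}, "c": {"a"}, "d": {"c"}}
stacks = assign_clover(["a", "b", "c", "d", "e"], levels, drives, 2, 3)
```

In this toy case, `b` and `d` are placed by the first priority, `c` falls back to the default choice, and `e` is placed by the second priority on top of `b`.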
When the difference in level in a leaf is already larger than one, and since the gates are sorted by their level, the gate at the stack top of that leaf has no chance of a counter-flow connection. Therefore, placing a gate there is better than placing it on top of another leaf where a future counter-flow connection might be available. This is illustrated in Figure 5.7e, where choosing leaf#1 for g7 allowed the counter-flow connection in leaf#0 in the following step of Figure 5.7f. The default choice is to choose the shortest leaf, hoping for a future long sequence of counter-flow gates. The time complexity of the algorithm is O(nGates).

To validate the proposed algorithms, and to analyze how these two assignment algorithms benefit both the overhead and the yield, we report experimental results in Section 8.3. Averaging over the ISCAS'85 benchmark circuits, (HC)²LC optimized using Algorithm 2 and Algorithm 3 achieves a 5.29% area reduction and a 48.97% yield improvement compared to arbitrarily assigned gates.

Algorithm 3 Intra-clover assignment
Input: netlist, N, L, C
Output: Gates added to the HCLs linked-list
  for all clovers do
    gates ← thisClover.children
    gates.sort()  ▷ descending by logic level
    Initialize a stack for each leaf; its bottom is its input.
    while gates not empty do
      thisGate ← gates.pop()  ▷ Highest level
      isGateAssigned = False
      ▷ First Priority:
      for non-full leaves AND not isGateAssigned do
        if thisGate seq. adj. thisLeaf.top then
          thisLeaf.push(thisGate)
          isGateAssigned = True
          break
        end if
      end for
      ▷ Second Priority:
      for non-full leaves AND not isGateAssigned do
        if thisGate.lvl+1 < thisLeaf.top.lvl then
          thisLeaf.push(thisGate)
          isGateAssigned = True
          break
        end if
      end for
      ▷ Default Choice:
      if not isGateAssigned then
        thisLeaf ← findShortestLeafStack()
        thisLeaf.push(thisGate)
      end if
    end while
  end for

Figure 5.7: An illustrative example of Algorithm 3 in action. gX is the gate with ID X, and its logic level is shown between brackets. (a) Remaining gates connections. (b) Initial state of leaves stacks. (c) g4 to leaf#0 by default. (d) g9 to leaf#0 by first priority. (e) g7 to leaf#1 by second priority. (f) g3 to leaf#0 by first priority.

5.5 Discussion

The main motivation behind our work is the low yield reported in many SFQ papers [8, 9, 13, 18, 19, 27, 40, 45, 79, 80, 109, 110]. Though some chip failures are due to fabrication and/or other issues, many failures were reported to be functional. We believe that, due to the unanticipated levels of uncertainty and other effects, timing violations and clock distribution may be the root of many of those failures. Consequently, we hypothesize that the conventional CMOS zero-skew clock trees are not the best fit for clocking large-scale digital SFQ chips (also see Section 2.3.2).

Our proposed clocking scheme, the (HC)²LC, provides a higher level of tolerance to variations because of two key properties:

1. Self-adaptive: The top loop self-adapts to the worst-case delays of any lower-level loop. Moreover, the top loop increases its own intrinsic delay accordingly so that the clock skew at every clock sink can be exactly defined (see Section 5.3.1), i.e., the relative time separation between 1) the clock edge reaching an arbitrary gate and 2) the reference net (CLK_in in Figure 5.4a) can be determined independently of which lower-level loop has the worst-case delay. Together, this provides a stable timing reference that can be used to ensure setup and hold times are met across the entire circuit. This is in sharp contrast to synchronous clock trees, where fixing a setup/hold delay on one portion of the clock path (using tunable delay lines as in [61, 111]) can cause setup/hold problems in other sections of the tree.

2.
Spatial correlation of variations in a counter-flow scheme: Here we emphasize that the high resilience towards hold violations comes from the use of counter-flow clocking at the bottom level and in ranking all the HCLs and clovers. Additionally, we assert that in such a system with mostly local timing constraints, the spatial correlation between neighboring clock and data circuitry makes the proposed architecture more robust against setup and hold violations than zero-skew clocking. This is because of the built-in correlation between the data flow and the clock flow, and between the hold/setup buffers/flops and the clock skew path that would have potentially caused a violation. This concept was analyzed for CMOS logic in [81], where the data and clock paths were used together to improve the correlation among clock sinks and reduce the sensitivity of clock skew [58].

In summary, given a large level of uncertainty, the main advantage of our approach is that the clock follows the circuit's variability and makes fewer assumptions about the cell delays than zero-skew clock trees. Our clock path, in its own peculiar way, self-adapts to the data path and forces the whole distribution network to follow.

5.6 Conclusion

This chapter presents a new clocking strategy called hierarchical chains of homogeneous clover-leaves, a robust and self-adapting clocking technique for generic and complex SFQ gate-level pipelined circuits. The proposed technique inherits its robustness from counter-flow clocking and the spatial correlation of the various sources of variation. The scalability of the scheme stems from the fact that the input clock signal of a block skips to the neighboring block without waiting for its own block to acknowledge the clocking of all its components.
This skipping mechanism allows the construction of a hierarchy of clocked blocks that adapts to the slowest block but achieves a cycle time that is independent of the hierarchy depth, making the cycle time independent of the total number of clocked gates. We prove the timing properties of the clocking scheme using the fundamentals of asynchronous clock distribution networks. We perform a complete and fair comparison between zero-skew clock trees and (HC)²LC implemented on ISCAS'85 [29] benchmark circuits. The simulations show that, averaging over the benchmark circuits, at the same cycle time, and with an area overhead of only 9.00%, (HC)²LC achieves 52.3% and 211.8% yield improvement over zero-skew trees at low and medium ranges of gate delays, respectively.

Chapter 6 Two-Phase Clocking (2PC)

In this chapter we explain the previously proposed two-phase clocking (2PC) [10] and argue why it is a serious contender towards solving the clocking problem. Particularly, such clocking represents another dimension where extreme performance overhead meets relaxed complexity and area overheads. In this chapter, we formalize this approach, quantify the overheads associated with it, and theorize its performance limits.

The remainder of this chapter is organized as follows. First, the 2PC architecture is explained in Section 6.1. Then, Section 6.2 discusses the timing properties and the limits of this technique. Finally, Section 6.3 provides a discussion.

6.1 The Clocking Technique

The use of a two-phase clock in VLSI is not a new idea [112, 113]. Nevertheless, the way it was proposed to be used for SFQ in [10] was innovative. Assuming we have two clock phases $\phi_1$ and $\phi_2$ as in Figure 6.1b, Figure 6.1a shows how a pipeline should be clocked using the 2PC.

Figure 6.1: Two-phase clocking (2PC). (a) A pipeline clocked using 2PC. (b) Timing diagram of the 2PC.

Here, $T$ is the cycle time per phase, and both phases have the same $T$.
This can easily be implemented by using the same clock source to obtain the two phases [10]. $\delta$ is the time shift of $\phi_2$ with respect to $\phi_1$, and we assume that $\delta$ does not change from cycle to cycle except for clock jitter (see Section 2.2.1). $\sigma_s$ is the standard deviation of the clock skew value of any clock edge. The main benefit of using such a technique is that it is 100% violation-free (see Section 2.2.2). The main drawback is that it is expected to be significantly slower than zero-skew clock trees or (HC)²LC. Both of those points are discussed in the following section.

6.2 Analysis

We hereby analyze the 2PC architecture to deduce its timing characteristics.

6.2.1 Timing Violations

In Section 2.2.2, we discussed the setup and hold constraints which characterize the timing of any pipeline. In 2PC, we have two setup constraints. The first is for the datapath from a red cell to a blue cell (see Figure 6.1):

$$\delta \geq \tau_{CQ} + T_{setup} + 2\sigma_s \tag{6.1}$$

The second is for the datapath from a blue cell to a red cell:

$$(T - \delta) \geq \tau_{CQ} + T_{setup} + 2\sigma_s \tag{6.2}$$

where $\tau_{CQ}$ is the maximum gate delay and $T_{setup}$ is the gate setup time, assuming all gates share the same values for simplicity. Note the factor of 2 next to $\sigma_s$: it assumes the worst case in each scenario, where the clock edges of both $\phi_1$ and $\phi_2$ are skewed in the directions that worsen the setup constraint. Combining the previous two inequalities, we get

$$T \geq 2\left(\tau_{CQ} + T_{setup} + 2\sigma_s\right) \tag{6.3}$$

Regarding hold constraints, 2PC has two as well. The first is for the datapath from a red cell to a blue cell:

$$\delta \geq -\tau_{CQ} + T_{hold} + 2\sigma_s \tag{6.4}$$

And the second is

$$(T - \delta) \geq -\tau_{CQ} + T_{hold} + 2\sigma_s \tag{6.5}$$

where $\tau_{CQ}$ is here the minimum gate delay (contamination delay [38]).
Combining the previous two inequalities, we get

$$T \geq 2\left(-\tau_{CQ} + T_{hold} + 2\sigma_s\right) \tag{6.6}$$

Ignoring the difference between the maximum and minimum gate delays and just assuming a nominal gate delay value, the previous inequalities can be combined to write the following:

$$\delta \geq \tfrac{1}{2}\left(T_{setup} + T_{hold} + 4\sigma_s\right) \tag{6.7}$$

$$T \geq T_{setup} + T_{hold} + 4\sigma_s \tag{6.8}$$

Figure 6.2: Abstraction of the operating region of 2PC.

6.2.2 100% Timing Yield

The work in [10] claimed that using 2PC as explained in Section 6.1 would result in a 100% timing yield, i.e. zero timing violations. This is ignoring the performance overhead. In other words, there exists a clock frequency at which the pipeline functions with zero setup and hold violations.

Similar to the clock abstraction diagram in Figure 2.16, Figure 6.2 shows the relation between $T$ and $\delta$. $T_0$ is the minimum cycle time that respects the setup constraint in the gate-level pipeline assuming zero skew, which is equal to $\tau_{CQ} + T_{setup}$. From (6.3), $T$ has to be larger than $2T_0$, ignoring $\sigma_s$ for simplicity. This is the darker-red region underneath $2T_0$; no operating point can be located there.

From (6.1), (6.4), and (6.7), the value of $\delta$ has to satisfy all of them. We denote the maximum among $T_0$, $-\tau_{CQ} + T_{hold}$, and $\tfrac{1}{2}(T_{setup} + T_{hold})$ as $skew_0$. Any operating point to the left of this vertical barrier results in a hold violation, as shown. From (6.2), we know that $T$ has to be larger than $\delta + T_0$, hence the straight line with unity slope. Any operating point underneath this line results in a setup violation, as shown.

The key attribute of 2PC is the 100% timing yield. This results from the fact that $\delta$ should be adjusted to equal $T/2$. This is the red line shown with a slope of two. Using 2PC, the operating point has to be located on this red line. Since this slope is higher than the setup constraint slope, for any $T > 2T_0$, there will always exist a clock frequency at which the circuit would function free of timing violations.
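The operating-region argument can be checked numerically with a small Python sketch. It encodes the four constraints (6.1), (6.2), (6.4), and (6.5) under the assumption that the hold constraints carry a negative gate-delay term, which is the reading consistent with (6.7) and (6.8). The function names and all numeric values are hypothetical.

```python
def two_phase_ok(T, delta, tau_cq_max, tau_cq_min, t_setup, t_hold, sigma_s=0.0):
    """True if the operating point (T, delta) meets all four 2PC constraints."""
    setup_req = tau_cq_max + t_setup + 2 * sigma_s   # right side of (6.1), (6.2)
    hold_req = -tau_cq_min + t_hold + 2 * sigma_s    # right side of (6.4), (6.5)
    return (
        delta >= setup_req          # phase 1 to phase 2, setup
        and T - delta >= setup_req  # phase 2 to phase 1, setup
        and delta >= hold_req       # phase 1 to phase 2, hold
        and T - delta >= hold_req   # phase 2 to phase 1, hold
    )


def min_cycle_time(tau_cq_max, tau_cq_min, t_setup, t_hold, sigma_s=0.0):
    """Smallest T that works when delta is set to T / 2."""
    setup_req = tau_cq_max + t_setup + 2 * sigma_s
    hold_req = -tau_cq_min + t_hold + 2 * sigma_s
    return 2 * max(setup_req, hold_req)


params = dict(tau_cq_max=10.0, tau_cq_min=10.0, t_setup=3.0, t_hold=4.0, sigma_s=1.0)
t_min = min_cycle_time(**params)
works_at_min = two_phase_ok(t_min, t_min / 2, **params)
fails_below = two_phase_ok(t_min - 2, (t_min - 2) / 2, **params)
```

With these made-up numbers the setup requirement dominates, so the smallest workable cycle time is twice that requirement, and any larger `T` with `delta = T / 2` also passes. Keeping `T` large but leaving `delta` too small fails, which is why the shift has to track the cycle time.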
Given a high level of uncertainty, this seems like a perfect solution. As soon as the chip starts, one keeps increasing $T$ and adjusting $\delta$ accordingly until one finds a working operating point. Nonetheless, the degradation of performance is its main drawback, as discussed in the following section.

6.2.3 Performance Limits

It is true that 2PC gives a 100% violation-free operation as claimed by [10]. However, we disagree with the timing analysis in [10] and with the claim that 2PC and synchronous clocking share the same fundamental minimum cycle time, which is $T_{min} = T_{setup} + T_{hold}$ according to (2.7) and (6.8).

The hypothetical limit of $T_{setup} + T_{hold}$ can be theoretically applied to synchronous systems using cycle stealing by the omni-usage of a negative skew value (same as in SFQ concurrent-flow clocking). However, in 2PC, if the clock skew is adjusted to that particular negative value from $\phi_1$ to $\phi_2$, for instance, the constraint would be violated in the other direction, from $\phi_2$ to $\phi_1$. Hence the disagreement with [10].

Theoretical Limit

According to (6.3) and as illustrated in Figure 6.2, we have the following constraint (ignoring $\sigma_s$ for simplicity):

$$T \geq 2\left(\tau_{CQ} + T_{setup}\right) \tag{6.9}$$

This is typically a significantly larger bound than the one claimed by [10]. Compared to zero-skew clocking, which is itself slower than concurrent-flow clocking (see Section 2.3.1), this limit is twice as slow. Therefore, we conclude that the theoretical performance limit of 2PC is a 100% speed overhead.

Practical Limit

Practically, given a high level of uncertainty, which is the case in SFQ and the main motivation behind the use of a technique such as 2PC, the limit will be even higher than the theoretical one. We assume that each clock phase has its own independent splitter tree with independent delay values whose standard deviation is $\sigma_{SP}$. We also assume that there exists one interconnect between every two splitters, with independent delay values whose standard deviation is $\sigma_{INT}$.
We further assume that $\tau_{CQ}$ encompasses the interconnect delay between gates. Then we can write the practical limit as follows:

$$T \geq 2\left[\tau_{CQ} + T_{setup} + 2 \log_2\!\left(\frac{N_{gates}}{2}\right)\left(\sigma_{SP} + \sigma_{INT}\right)\right] \tag{6.10}$$

The outer factor of 2 comes from the theoretical limit. Note that $\tau_{CQ} + T_{setup}$ is $T_0$ in Figure 6.2. The second term is multiplied by 2 because we assumed the two splitter trees are independent, so there exists one term per tree. The number of gates $N_{gates}$ is divided by two because each half of the circuit is clocked by a different phase. The log is taken to base 2 because the fan-out of the splitters is 2.

6.3 Discussion

In this chapter, we discussed the previously proposed two-phase clocking (2PC) [10]. In this clocking architecture, the gates are clocked alternately using two different clock phases. The key advantage of this technique is that it is free of timing violations. Using a single clock source, the second phase might be obtained by a programmable delay line delaying the first phase. Post fabrication, we can increase the clock cycle time $T$ and adjust the delay of the delay line until we reach a functioning operating point. Despite this huge advantage over any other clocking technique in SFQ, there is a theoretical limit of 100% speed degradation. Practically, the speed is expected to be slower than this value.

Chapter 7 Tools Platform

Quantifying the advantages of (HC)²LC over zero-skew clock trees is crucial to persuade the SCE community to reconsider their support for orthodox CMOS synchronization techniques. qHC2LC is our answer to this objective. First, we present an overview of the tool, then the CDN construction is described. After that, we present our HDL gate models for SFQ. Then, we explain the dynamic co-simulation part, and in the end we explain our proposed variation models and Monte Carlo (MC) platform.

7.1 Overview

Figure 7.1 shows a high-level overview of the qHC2LC flow. The designed tool takes an RTL netlist as an input.
Then, we use and modify the Berkeley open-source synthesis tool, ABC [114], to synthesize the netlist and build the clock network using (HC)²LC or a zero-skew tree. The output of this step needs processing to be translated into SystemVerilog (SV) format [115] using our proposed SFQ cell library [116] (see Section 7.3 and Appendix B); this is the purpose of the SV translation intermediate step. After that, we generate automatic test benches to perform dynamic co-simulations using Cadence NCsim. Finally, the report contains the following: the cycle time, timing verification reports, the JJ count to quantify the area, and MC results (see Section 7.5).

Figure 7.1: Overview of the qHC2LC flow.

7.2 CDN Construction

Figure 7.2 shows the detailed description of the CDN construction flow. First, we use ABC [114] to synthesize the input netlist in bench format [29], in addition to the ABC functions proposed by [99, 117] to perform the path balancing needed for SFQ deep pipelines and to insert splitters to fix the limited fan-out of SFQ. Then, placement is done using the qPALACE tool proposed by [104, 118]. Our in-house Python scripts process the placement information. In the case of (HC)²LC, Algorithms 1 and 2 are executed, while in the case of zero-skew clock trees, we process the CDN placed by qPALACE instead. After that, we construct the CDN using qHC2LC functions within ABC. Note that in the case of (HC)²LC, Algorithm 3 is executed during that step. Then, as detailed in Section 5.3.3, timing fixes are added within ABC.

Figure 7.2: qHC2LC: Constructing the CDN.

The output of ABC is a purely combinational bench netlist with a very limited set of cells, and some additional mapping files. Another set of in-house Python scripts uses those files to rebuild the circuit graph, and uses the SFQ SV gate models of Section 7.3 to write the complete netlist in SV needed for the dynamic simulation.
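As a rough illustration of the kind of processing such scripts perform, the Python sketch below parses a tiny netlist in the ISCAS bench format and counts how many data splitters each net would need. It is not one of the in-house scripts: the toy netlist, the function names, and the assumption that every net can drive a single sink (so that a net with k sinks needs k - 1 two-output splitters) are illustrative only, and clock splitters are not counted.

```python
import re
from collections import Counter

BENCH = """
# toy netlist in ISCAS bench format (hypothetical example)
INPUT(a)
INPUT(b)
OUTPUT(z)
n1 = NAND(a, b)
n2 = NOT(n1)
z = AND(n1, n2, a)
"""


def parse_bench(text):
    """Return (inputs, outputs, gates) with gates[name] = (type, fan-in list)."""
    inputs, outputs, gates = [], [], {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        port = re.fullmatch(r"(INPUT|OUTPUT)\((\w+)\)", line)
        if port:
            (inputs if port.group(1) == "INPUT" else outputs).append(port.group(2))
            continue
        gate = re.fullmatch(r"(\w+)\s*=\s*(\w+)\((.*)\)", line)
        if not gate:
            raise ValueError(f"unrecognized bench line: {line!r}")
        fanin = [name.strip() for name in gate.group(3).split(",")]
        gates[gate.group(1)] = (gate.group(2), fanin)
    return inputs, outputs, gates


def splitters_needed(outputs, gates):
    """Nets with more than one sink, mapped to the number of splitters needed."""
    fanout = Counter(src for _, fanin in gates.values() for src in fanin)
    fanout.update(outputs)  # a primary output is one more sink of its net
    return {net: n - 1 for net, n in fanout.items() if n > 1}


inputs, outputs, gates = parse_bench(BENCH)
extra = splitters_needed(outputs, gates)
```

In the toy netlist, `a` and `n1` each feed two gates, so each needs one splitter, while every other net has a single sink.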
7.3 HDL Modeling

As discussed in Chapters 1 and 2, the SCE community is yet to deliver a processor that lives up to the potential of SCE. One of the main challenges is the lack of a complete and validated design tool flow [5, 6, 119]. One could argue that one of the reasons semiconductor technology is at its current level is a decades-long, improved, and established tool flow. This is the main motivation behind the IARPA SuperTools project [120, 121], and thus the primary motivation of the work in this thesis.

As part of the digital design flow, logic gates have to be modeled in a hardware description language (HDL). Circuit simulators, such as JSIM [122] or PSCAN2 [123], give a more exact simulation, but as circuit netlists become larger, the computational intractability makes their use impractical; hence the need for behavioral and functional simulations at the logic-gate level [38]. First, we discuss the motivation to design new HDL models for SCE, then we provide a survey of the previous work in HDL modeling in SCE, and after that we discuss our proposed models. The complete SV models are provided in Appendix B.

7.3.1 Motivation

Why Not CMOS Models?

The specific motivation of this section is that SFQ and AQFP gates are quite different in functionality and behavior from their CMOS counterparts. As previously detailed in Chapter 2, CMOS is based on voltage-level logic, whereas SFQ circuits are based on the transmission of fluxons and AQFP circuits are based on the polarity of AC currents resulting from the coupling of inductors connected to a SQUID.

It is a challenge to model SFQ gates in an HDL, as the inter-gate communication is pulse-based rather than level-based, while Verilog standards and tools evolved over the past few decades aiming to serve level-based logic. Also, the destructive read-out (DRO) nature of SFQ clocked gates (see Section 2.1.2) is not common to HDL models. Moreover, wire delays in SFQ, i.e.
PTL and JTL delays, are comparable to the gate delays [8], which makes modeling them a necessity rather than a mere accuracy objective. Furthermore, some circuit parameters and environmental effects could change the gates' behavior. For example, the bias network operation affects the gate delays [39], and the current states of the drivers and the loads also affect the gates' behavior [50]. Additionally, some advanced SCE phenomena [4, 8] might need to be modeled for testing purposes, to be able to predict the behavior of the circuit in such cases: for instance, pulses disappearing for various faulty reasons, the appearance of extra pulses, pulse accumulation, or jitter accumulation [26].

Regarding AQFP, given the unique operation discussed in Section 2.1.1, it is no wonder that such behavior is quite difficult to model using a standard HDL. First, it is a three-level logic, even if the current is modeled as a voltage level. Second, the current values keep returning to an idle middle level. Sampling that idle level should result in an unknown behavior. Third, there is not a single asynchronous gate. Every gate requires the AC excitation current, including the fan-out splitters. Fourth, each gate has an input excitation AC current that propagates through the gate to the next one. The model should be able to detect the polarity of such a current, and delay it by the correct amount before handing it over to the next gate. Fifth, clock routing has some boundary problems because AQFP circuits are limited to a certain number of gates in one row. Sixth, AQFP circuits use different numbers of phases based on how many AC excitation currents are provided off-chip, and how many directions of DC shifts are applied to change the polarity or relative phases of those signals.

In conclusion, gate models that are particularly tailored towards SFQ and AQFP have to be developed.

Why Design New Models?
The previous efforts of the community, which are detailed in Section 7.3.2, had issues describing the timing characteristics of the cells and detecting timing violations in a clear manner. Also, most of them were too complicated to be generalized or modularized. Moreover, the models in the literature are not compatible with SDF (standard delay format) back-annotation, which is standard in the CMOS tools flow [124].

In summary, most of the literature on HDL modeling in SCE is not natural to SFQ or AQFP and does not conform to the well-established standards of the CMOS design tools flow. This is important because adapting the already existing tools to SCE, instead of re-inventing the wheel, is the shortest path to achieving SCE's objectives. Being natural to SCE and conforming to the CMOS flow seem to be two contradictory requirements. Nevertheless, in this thesis, we present SystemVerilog (SV) logic gate models for both SFQ and AQFP circuits, and for the interface circuitry between them. Our models inherit the strengths of the SystemVerilog standard [115] to provide models that are both natural to SCE and fit well within established semiconductor design tools.

7.3.2 Literature Review

The first attempt at modeling SCE in an HDL was made for RSFQ gates in [125], where the gates were modeled structurally using typical CMOS gates. The inability of such structural models to accurately capture the various SFQ effects, as discussed in Section 7.3.1, motivated the works that followed. For AQFP, much less research has been done on HDL gate modeling. For SFQ, meanwhile, the existing models can be classified into two main philosophies [98]:

• The first team believes that despite the major differences between CMOS and SFQ [4, 10], the abstract behavioral timing description of CMOS logic gates in a cell-based approach can be transferred to SFQ.
This abstraction was used for the modeling and analysis of SFQ digital circuits and logic simulations in [4, 6, 8–10, 18, 23, 26, 125–129]. The essence of this type of modeling is that the ballistic transfer of flux quanta between SFQ gates can be regarded abstractly as the occurrence of events. The model solely targets the causality and time separation of those energy-transfer events. Every logic gate is characterized by a setup time, a hold time, and a propagation C-Q delay that has a maximum value and a contamination value [38]; all of them are to be treated as random values because of the various sources of uncertainty. This thesis belongs to team one.

• On the other hand, the second team [130–133] uses different timing models for the logic cells: they model each gate by an FSM and use a circuit-level simulator, such as JSIM [122], to extract the FSM representations and timing characteristics of the cells.

In the remainder of this section, we discuss some of the models presented in the literature.

The work in [128, 134] designed Verilog models for SFQ circuits. The models were simple and abstract, modeling SFQ events as tiny voltage pulses. However, they had difficulties describing the timing characteristics and timing violations, which rendered their timing models hard to understand and not straightforward to reproduce. Also, they are incompatible with the standard CMOS flow and SDF back-annotation.

The work in [135] designed Verilog models for AQFP circuits. Those models are too detailed and specific, which makes them difficult to generalize or modify. Their SDF compatibility is incomplete and does not cover most of the existing timing arcs. Also, the models have issues with interpreting the clock moment at which the gates evaluate, which is indeed quite difficult to model, as discussed in Section 7.3.1.

The work in [131, 132] designed Verilog models for SFQ circuits based on the FSM method.
This resulted in models that are quite different from the way CMOS gates are structured. Every logic gate type has its own distinct description of the timing arcs. The timing checks are described behaviorally using the authors' own definitions of violations. Their models do not have setup and hold time violations as in CMOS, but rather forbidden timing regions for particular arcs that differ from one gate to another. They perform extraction at the level of a single gate rather than a netlist, because using a circuit simulator on a large netlist is intractable. In their current form, their models are incompatible with SDF back-annotation and the standard CMOS flow. Nevertheless, we agree with their claim that this way of modeling is very accurate and exact; it is fine-tuned and tailored to the gates being modeled. However, we disagree that this is the way to go. Modeling only the worst cases of all the timing arcs, together with C-Q delays and hold and setup violations, does cover all timing arcs. We see their approach as an over-fitted, high-order solution to a simpler problem.

Similar to the work of [128, 134], the work in [136] designed Verilog models for SFQ circuits using an abstract and simple method. However, they also overcomplicated the description of the timing characteristics, and their models are difficult to reproduce and to generalize.

Analogous to the work of [131, 132], the work in [130] suggested using an FSM and a circuit simulator to extract the models of SFQ circuits in VHDL [137], and the work of [133] used an FSM and a circuit simulator to extract the models of AQFP circuits in Verilog. However, their models still describe the timing violations explicitly and behaviorally, are not compatible with the standard flows, and are extracted at the level of a single gate.

7.3.3 Proposed Models

This section presents our proposed SystemVerilog logic gate models for both SFQ and AQFP circuits, and for the interface circuitry between them.
Our models inherit the strengths of the SV standard [115] to provide models that are both natural to SCE and fit well within established CMOS design tools. The advantages of our models can be summarized as follows:

1. They are compatible with the SDF standard for annotating the timing characteristics of the gates, and they use the built-in Verilog directives for checking timing violations. This makes them friendlier to use with existing CMOS tools.

2. SV interfaces give the designer a simple platform for debugging a design, because debugging information, such as a counter or other state-holding statistics related to a logic connection, can be added to the interface without affecting the behavior of the design in any way.

3. The use of SV interfaces enhances modularity and simplifies the design of circuits and components that interface different logic families, such as SFQ and AQFP. This also facilitates the generality of the models.

4. SV interfaces can encapsulate wire delays. In SCE, wire delays are relatively more significant than in CMOS [8]. For example, PTL delays are linearly proportional to their length. One can simply annotate the interface with the post-layout PTL length, and thus model the wire delays and incorporate them into the main platform. This cannot be done in Verilog.

5. Our gate models are elegant and written in a simple and conceptual manner, which makes it easier for any designer to add to the library and easier to reproduce them for different logic families such as RQL [138] or newly suggested ones such as DSFQ [139].

6. More circuit parameters, such as the effects of the bias distribution networks, could easily be added to the library. Moreover, probing the surrounding environment could also be included, for example by adding the effects of the states of the drivers and loads on the gate behavior. This could be done by adding functions to the SV interfaces, which is considerably simpler than doing it in Verilog.
7. Our models provide functional modeling that follows the exact behavior of the gates without excessive abstraction. This low level of abstraction increases the credibility of behavioral simulation results.

8. Using SV interfaces makes it easier to build modular, reusable verification components and test benches, and facilitates their integration with standardized verification methodologies including UVM, which provides faster verification, higher functional coverage, portability, reusability, and scalability for SFQ and AQFP circuit verification [140].

9. If the analysis of some advanced phenomena is required, those could be integrated into the dynamic simulation platform by modifying the corresponding SV interfaces. Some examples were discussed in Section 7.3.1, such as disappearing pulses, generation of extra pulses, pulse accumulation, or jitter accumulation.

Appendix B has a detailed description of the models. A brief background on SystemVerilog is provided in Section B.1. The SFQ models are then presented in Section B.2, along with our vision for SV and some prospective models. After that, the AQFP models are proposed in Section B.3, in addition to our solution to the boundary and corner cases. Finally, the models for the interface circuitry between SFQ and AQFP are proposed in Section B.4.

7.4 Dynamic Simulation

Figure 7.3 shows the detailed description of the dynamic co-simulation flow. netlist.sv is the output netlist from the CDN construction step. netlist.v is the original RTL description, which is usually found in Verilog format; it is required by the co-simulation to verify the functionality of the constructed SV netlist. sfqTech.def is the definition file with the timing characteristics of the logic gates. In the experimental results of this thesis (see Chapter 8), we use the values from the library designed by the work in [132].
We use the SV descriptions of the gate library designed in Section 7.3 and Appendix B, and the SV interfaces that describe the SFQ signaling from the sfqLibrary.sv file [116]. We designed a generic test bench template, tbTemplate.sv, which needs to be customized and configured to generate a set of random input combinations, inputs.mem, and the actual test bench file, tb.sv. This test bench file is simulated using Cadence NCsim with the SV option. Note that timing violations are checked using the built-in SV directives, which are used in the SV library of Section 7.3. The output is then analyzed in Python to generate a final report containing the results of functional verification, timing violation checks, cycle time, and JJ count. The MC.def file is explained in the next section.

Figure 7.3: qHC2LC: Dynamic co-simulation.

7.5 Variations Model

As discussed in Chapter 5, the robustness claims of (HC)²LC need to be quantified by yield measurements. In this section, we explain how we incorporate a Monte Carlo analysis flow into qHC2LC.

To capture the spatial correlation of timing uncertainties [98], our variation distributions are placement aware. Inspired by the established advanced on-chip variations (AOCV) [141], and following the studies of the spatial correlation of variations in [60, 142, 143], we use a grid-based model [143] in which the hierarchy levels are statistically independent, identical to the model proposed by [60] and abstractly illustrated in Figure 7.4. The variations model is as follows: the entire circuit is considered as one grid unit to which we apply global variations. Then, level by level, we divide the grid into finer grids, where an independent Gaussian random variable is added as a delay component to represent the on-chip timing uncertainties.
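The hierarchical sampling just described can be sketched in Python, the language this flow already uses for result analysis. This is only an illustrative sketch, not the qHC2LC implementation: the function name `sample_delay_factors`, the placement of gates on a unit square, the split of every tile into four at each level, and the per-level `means` and `sigmas` inputs are all assumptions made for the example.

```python
import random


def sample_delay_factors(gates, means, sigmas, seed=0):
    """Draw one Monte Carlo sample of per-gate delay scaling factors.

    gates  : dict mapping a gate name to its (x, y) placement in [0, 1) x [0, 1)
    means  : per-level Gaussian means, level 0 (whole chip) first
    sigmas : per-level Gaussian standard deviations, same order
    All names and the 2x2 subdivision per level are hypothetical.
    """
    rng = random.Random(seed)
    tile_draws = {}  # (level, column, row) -> value shared by every gate in that tile
    factors = {}
    for name, (x, y) in gates.items():
        total = 0.0
        for level, (mean, sigma) in enumerate(zip(means, sigmas)):
            cells = 2 ** level  # level 0 is one tile, level 1 is 2x2, and so on
            key = (level, int(x * cells), int(y * cells))
            if key not in tile_draws:
                # Each tile gets its own independent Gaussian draw.
                tile_draws[key] = rng.gauss(mean, sigma)
            total += tile_draws[key]
        factors[name] = total
    return factors


if __name__ == "__main__":
    placement = {"g0": (0.10, 0.10), "g1": (0.12, 0.11), "g2": (0.90, 0.90)}
    sample = sample_delay_factors(placement, means=[0.5, 0.3, 0.2],
                                  sigmas=[0.07, 0.03, 0.02], seed=1)
    for gate, factor in sample.items():
        print(gate, round(factor, 4))
```

In this sketch, gates that fall in the same finest tile share every draw and therefore receive the same factor, while distant gates share only the global draw; this is how the grid hierarchy expresses spatial correlation. The per-level means are chosen to sum to one so that the factor is centered on one.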
The normalized delay of the $i$-th gate can thus be written for $n$ grid levels as follows:

$$\delta_i = N\left(\mu_0, (\alpha_0 \sigma_G)^2\right) + N\left(\mu_1, (\alpha_1 \sigma_L)^2\right) + \cdots + N\left(\mu_n, (\alpha_n \sigma_L)^2\right) \tag{7.1}$$

where $\sigma_G$ and $\sigma_L$ are the standard deviations of the global and local (i.e. per-level) uncertainties, respectively. $\alpha_l$ is a deviation scaling factor for the tile at the $l$-th level, and it is a function of the number of gates in that tile. Higher levels have wider distributions, and thus the value of $\alpha_l$ decreases as we move down the hierarchy. We start with $\alpha_0 = 1$ for the global variations. $\mu_l$ is another hierarchy factor that scales the mean value. Naturally, higher levels contribute more to the uncertainties, and as we move down, the local uncertainties have smaller effects. To ensure that the normalized $\delta_i$ can be used as a scaling factor of the gate delays, the following constraint on the $\mu_l$ values has to be satisfied: $\sum_l \mu_l = 1$.

To incorporate MC analysis into qHC2LC, the dynamic simulation step needs to be run separately for each sample of variations. To limit the changes to the NCsim run only, we apply to each gate delay a scaling factor defined as a Verilog macro, and we collect the macro definitions in a single file. This file, MC.def, is the only change to the flow of Figure 7.3. When no MC analysis is performed, all the scaling factors are set to one. In our yield measurements, a single MC run is considered a pass only if there is a perfect functional match with the output of the co-simulated behavioral netlist.v and there are no reported setup or hold violations.

Figure 7.4: An illustration of the grid-based variations model [60]. The first coordinate is the hierarchy level (the top level is 0), and the second is an ID within the level.

Chapter 8 Experimental Results

In order to evaluate the advantages of (HC)²LC and prove the claims made in this manuscript, we use our qHC2LC tool described in Chapter 7 to do the following: 1) We run a Monte Carlo analysis to compare the yield of (HC)²LC versus the conventional zero-skew clock trees.
2) We measure the JJ count of the circuits to evaluate the area overhead of (HC)²LC. 3) In order to quantify the benefits of the optimization algorithms presented in Section 5.4, we compare $T_{ins}$, the JJ count, and the yield of (HC)²LC designed using the algorithms presented in this manuscript versus the unoptimized (HC)²LC.

8.1 Monte Carlo Analysis

Using the MC analysis flow described in Section 7.5, we compare the yield of (HC)²LC to that of zero-skew clock trees by building those CDNs for the ISCAS'85 benchmark circuits [29]. For each benchmark, we use a clock margin for the zero-skew tree, and another timing margin that we use in the equations during the timing fixes. For the zero-skew tree, we assume a uniformly distributed clock source jitter with $\sigma = 5.8\%$ as in [15]. For all splitters, we assume a normally distributed mismatch between the arms with $\sigma = 3.3\%$, as concluded from the work of [50]. Our variations model assumes normally distributed global variations with $3\sigma_G = 20\%$ as in [20, 25]. Regarding local variations, as explained in Section 7.5, we compute the yield across different values of $\sigma_L$, $\alpha_l$, and $\mu_l$.

Figure 8.1 shows the yield improvement of (HC)²LC over zero-skew trees at the same cycle time, $T_{sys}$, for the benchmark circuits. The results are shown at different values of $T_{sys}$ and at different amounts of applied variations. For the sake of meaningful comparisons, we show the points starting from where the yield of (HC)²LC drops below 100% down to a value around 20%. Table 8.1 highlights the yield improvement at two different ranges of variations. The results show that, averaging over the benchmark circuits and at the same cycle time, (HC)²LC achieves 52.3% and 211.8% yield improvement over zero-skew trees at $\sigma_{vars}$ ranges of 0.04-0.06 and 0.09-0.11, respectively, with an area overhead of only 9.00%, as detailed in the next section. Note that the improvement is significantly higher at lower yield values, i.e.
at higher variations, because of the division by smaller values. When the amount of variations is very small, both designs have yields close to 100%. These results validate our analytical and theoretical analysis in Chapter 5 and meet our predictions of superior robustness over zero-skew clock trees.

Figure 8.1: The MC (1000 runs) results over the ISCAS'85 benchmark circuits. Panels: (a) c432, (b) c499, (c) c880, (d) c1355, (e) c1908, (f) c2670, (g) c3540, (h) c5315, (i) c6288, (j) c7552. The y-axis is the yield improvement of (HC)²LC over zero-skew trees at the same cycle time, $T_{sys}$. We show the results at different values of $T_{sys}$. The x-axis is the amount of variations applied to the circuit, represented by the standard deviation of the gate delays, $\sigma_{vars}$. We vary $\sigma_{vars}$ by varying $\sigma_L$ as previously mentioned. The value shown at each data point is the yield (%) of the (HC)²LC circuit at that point. The results were obtained using 1000 runs, as further discussed in Section 8.2.

8.2 (HC)²LC Overhead

Table 8.1 shows the following for each benchmark during the MC runs: the average number of JJs for the (HC)²LC version (JJ count); the average area overhead compared to the zero-skew clock tree (Area Ovh.); and the percentage yield improvement of (HC)²LC over the zero-skew clock tree (see Figure 8.1) at $\sigma_{vars}$ ranges of 0.04-0.06 and 0.09-0.11 (Yield Impr.).

Another point worth mentioning is the computational overhead of running those MC simulations. One thousand runs may not seem enough for an accurate evaluation; however, as shown in the Run Time column of Table 8.1, those runs are computationally very expensive. We used a powerful local server, in addition to Amazon Web Services (AWS) [144], to perform those simulations. In total, the pure simulation time is 21.6 days, which makes it impractical to run, for example, 10,000 runs.
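To put the 1000-run budget in perspective, the short Python sketch below totals the Run Time column of Table 8.1 and estimates the statistical uncertainty of a yield measured from a given number of pass/fail runs. It assumes that the MC runs are independent Bernoulli trials, so that the standard error of the measured yield p over N runs is sqrt(p(1 - p)/N). That textbook assumption and the helper names `total_days` and `yield_std_error` are introduced here for illustration and are not taken from the thesis.

```python
import math

# Run Time column of Table 8.1 in hours, ordered c432 ... c7552.
RUN_TIME_HOURS = [34.1, 11.2, 18.9, 16.3, 29.0, 55.7, 70.0, 114.4, 102.4, 66.0]


def total_days(hours):
    """Total pure simulation time in days."""
    return sum(hours) / 24.0


def yield_std_error(yield_fraction, runs):
    """Binomial standard error of a yield estimated from independent pass/fail runs."""
    return math.sqrt(yield_fraction * (1.0 - yield_fraction) / runs)


if __name__ == "__main__":
    print(f"total simulation time: {total_days(RUN_TIME_HOURS):.1f} days")
    for runs in (1000, 10000):
        error = 100.0 * yield_std_error(0.5, runs)
        print(f"{runs} runs: standard error of a 50% yield = {error:.2f} points")
```

Under that assumption, a yield near 50% measured from 1000 runs carries a standard error of roughly 1.6 percentage points, and ten times as many runs would shrink it only by a factor of about 3.2 while multiplying the simulation time by ten.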
In other words, the values reported in this manuscript are not enough to be considered an accurate quantification of the robustness of (HC)²LC. However, the consistency of the results over all the benchmark circuits is enough to validate the claims made in this manuscript.

It is crucial to mention that, as the results show, the current form of (HC)²LC is still not as robust as desired. This thesis is a mere beginning to a new class of CDNs. Plenty of research is yet to be done in order to obtain a fully practical and robust version of (HC)²LC; this is discussed in detail in Section 9.2. The work in this thesis, and this section in particular, shows with enough proof that (HC)²LC is significantly better than zero-skew clock trees.

Table 8.1: Some detailed results from the MC analysis of (HC)²LC across ISCAS'85 benchmark circuits. The two Yield Impr. columns correspond to $0.04 < \sigma_{vars} < 0.06$ and $0.09 < \sigma_{vars} < 0.11$.

Circuit   JJ count (JJs)   Area Ovh. (%)   Yield Impr. (%), 0.04-0.06   Yield Impr. (%), 0.09-0.11   Run Time (hours)
c432      83,749           10.28           33.84                        240.78                       34.1
c499      40,787           7.37            42.33                        86.41                        11.2
c880      76,696           15.18           22.01                        136.23                       18.9
c1355     41,735           9.60            15.17                        139.03                       16.3
c1908     72,224           8.59            22.93                        215.88                       29.0
c2670     131,195          11.55           85.78                        285.70                       55.7
c3540     160,122          4.35            28.32                        107.28                       70.0
c5315     332,091          7.11            57.96                        245.14                       114.4
c6288     265,636          7.93            124.77                       398.33                       102.4
c7552     238,609          8.03            90.15                        262.84                       66.0
Avg.      -                9.00            52.33                        211.76                       -

Table 8.2: Comparison between the unoptimized (HC)²LC constructed as in [98], and the optimized construction following the work of Section 5.4; across ISCAS'85 benchmark circuits.
          Arbitrary min(H) [98]        Arbitrary min($T_{ins}$)     Placement aware min($T_{ins}$)
Circuit   JJ count (JJs)   Yield (%)   JJ count (JJs)   Yield (%)   JJ count (JJs)   Yield (%)
c432      141,576          14.6        91,331           23.0        83,556           52.0
c499      75,513           31.8        45,447           57.0        39,958           63.0
c880      144,849          11.8        79,239           28.6        72,990           48.2
c1355     79,971           33.2        42,887           45.0        40,250           63.0
c1908     199,818          30.2        90,882           35.4        78,065           50.4
c2670     221,179          21.8        147,617          41.4        148,866          73.0
c3540     445,941          60.5        189,717          78.4        183,367          89.3
c5315     643,167          39.3        385,787          66.8        380,143          79.7
c6288     2,216,327        28.1        300,527          56.1        298,616          80.3
c7552     501,646          21.3        272,121          53.7        273,283          80.3

Figure 8.2: The values of $T_{ins}$ measured in terms of ov for the ISCAS'85 benchmark circuits. prev (red colors) is (HC)²LC constructed without any optimizations, while opt (green colors) is (HC)²LC optimized as in Section 5.4. The experiment was done for two (HC)²LC configurations using C/N/L of 4/4/4 and 6/4/8. For every configuration and benchmark, the bars show the minimum, the average, and the maximum values from top to bottom.

8.3 Evaluating the Optimizations

In order to quantify the benefits of the algorithms proposed in Section 5.4 in particular, we compare $T_{ins}$, the JJ count, and the yield of the optimized (HC)²LC to the unoptimized (HC)²LC as in [98].

Figure 8.2 shows the values of $T_{ins}$ for the ISCAS'85 benchmark circuits with (HC)²LC using both network construction methods: minimum hierarchy depth as in [98], and the proposed optimized $T_{ins}$ in Section 5.4.1. The experiment was done for two (HC)²LC configurations using C/N/L (see Figure 5.2 and Figure 5.3) of 4/4/4 and 6/4/8. Averaging over all benchmarks and configurations, the minimum, average, and maximum $T_{ins}$ per design are equal to 5.3, 12.7, and 19.4 ov, respectively, for the unoptimized (HC)²LC. Optimizing $T_{ins}$, on the other hand, results in 9.6, 11.4, and 13.1 ov. On average, the maximum $T_{ins}$ is reduced by 32.3%, and the $T_{ins}$ range is reduced by 75.1%.
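As a quick sanity check of these reductions, the following Python sketch recomputes them from the rounded averages quoted in the paragraph above. The helper names are hypothetical, and the recomputation is an approximation: it works on the averaged values rather than on the per-design data, which is not listed here.

```python
# Averages over all benchmarks and configurations, as quoted in the text (in ov).
UNOPTIMIZED = {"min": 5.3, "avg": 12.7, "max": 19.4}
OPTIMIZED = {"min": 9.6, "avg": 11.4, "max": 13.1}


def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


def spread(stats):
    """Range of the insertion delay: maximum minus minimum."""
    return stats["max"] - stats["min"]


max_reduction = percent_reduction(UNOPTIMIZED["max"], OPTIMIZED["max"])
range_reduction = percent_reduction(spread(UNOPTIMIZED), spread(OPTIMIZED))

if __name__ == "__main__":
    print(f"maximum T_ins reduction: {max_reduction:.1f} %")
    print(f"T_ins range reduction:   {range_reduction:.1f} %")
```

Recomputed this way, the reductions come out at about 32.5% and 75.2%, close to the reported 32.3% and 75.1%. The small gap presumably comes from rounding of the quoted averages or from averaging the per-design reductions instead of reducing the averages.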
Table 8.2 summarizes the JJ count and yield comparison between three designs:

1. Unoptimized (HC)²LC [98] constructed by minimizing the hierarchy depth and using an arbitrary assignment: Arbitrary min(H) (see Section 5.4.1).

2. (HC)²LC constructed by optimizing $T_{ins}$ as proposed in Section 5.4.1, but with an arbitrary assignment as in [98]: Arbitrary min($T_{ins}$).

3. (HC)²LC constructed by optimizing $T_{ins}$ and using the placement-aware assignment and the algorithms presented in Section 5.4.2: Placement aware min($T_{ins}$).

We ran 1000 MC runs for each design at each benchmark, the first five benchmarks at $T_{sys}$ = 80 ps and the remaining five at $T_{sys}$ = 110 ps, using $\sigma_{vars} \approx 0.1$. On average, optimizing $T_{ins}$ results in an area reduction of 48.44% and a yield improvement of 77.31% over the work in [98]. Using the placement-aware assignment results in an area reduction of 5.29% and a yield improvement of 48.97% compared to only optimizing $T_{ins}$. Overall, our proposed optimizations in Section 5.4 result in an area reduction of 51.28% and a yield improvement of 166.74% on average when compared to the unoptimized (HC)²LC.

Chapter 9 Summary, Future Work, and Conclusions

This chapter concludes the thesis. First, we provide a summary of the achieved contributions and the work discussed in the thesis. Second, we discuss the possible future work. Third, we give the author's conclusion about the thesis problem, which is clocking in large scale SFQ circuits.

9.1 Summary

SFQ technology has the potential to meet the booming demands for lower power consumption and higher operation speeds in the electronics industry and future exascale supercomputing systems. Nevertheless, the promised benefits of three orders of magnitude lower power at an order of magnitude higher performance have yet to be attained. In particular, ultra-high-speed clocking of large scale SFQ circuits in the presence of unprecedented levels of timing uncertainty represents a tough obstacle for the technology to advance.
We argue that the traditional zero-skew clock trees are not reliable for large scale SFQ designs. Instead, we propose an innovative self-adaptive clocking technique which is designed to be resilient in such uncertain environments. This thesis introduces the architecture of the clocking technique, theorizes and proves the claimed timing properties and characteristics of the scheme, and builds a simulation tools flow to show its superiority over the conventional clock trees. We hereby recount a summary of the achieved contributions.

• We propose the Hybrid Clover-Leaves Clocking (HCLC), which is an ACDN to be used for the clocking of a CSR. For a 32-gate CSR, the proposed technique achieves up to a 93% yield improvement at the same cycle time compared to zero-skew clock trees. Nevertheless, HCLC is ill-suited for pipelines that are more generic and complex than a basic CSR loop. In particular, its cycle time is $O(N_{gates})$, which is highly impractical for large scale VLSI designs.

• This thesis's main contribution is the proposal of a robust and self-adaptive clocking technique for generic and complex pipelines, the cycle time of which is independent of the total number of gates. The proposed hierarchical chains of homogeneous clover-leaves clocking ((HC)²LC) inherits its robustness from: 1) the spatial correlation of various sources of variations [8, 9, 24–27], and 2) the timing robustness of traditional counter-flow clocking [4]. We theoretically prove the timing properties of (HC)²LC using the ACDN theory that we establish as well. Our simulations show that averaging over the ISCAS'85 benchmark circuits, at the same speed, and with an area overhead of only 9.00%, (HC)²LC achieves 52.3% and 211.8% yield improvement over zero-skew trees at low and medium ranges of $\sigma$ of gate delays, respectively.
• We optimize (HC)²LC to achieve lower power and area overheads in addition to higher variation robustness. We present an algorithm that optimizes the insertion delay of (HC)²LC. Averaging over the ISCAS'85 benchmark circuits, the maximum $T_{ins}$ is thus reduced by 32.3%, and the $T_{ins}$ range is reduced by 75.1%. This also results in an area reduction of 48.44% and a yield improvement of 77.31% over the unoptimized version. Moreover, we present two algorithms: a placement-aware gates-to-clovers assignment algorithm, and a logic-level-aware gates-to-clock-sinks assignment algorithm. This optimized mapping exploits the spatial correlation of uncertainties and the intrinsic robustness of counter-flow clocking, which results in 5.29% area reduction and 48.97% yield improvement on average over the ISCAS'85 benchmark circuits, compared to gates assigned arbitrarily. In total, averaging over the benchmark circuits, our optimizations to (HC)²LC achieve 51.28% area reduction and 166.74% yield improvement when compared to the unoptimized (HC)²LC.

• We present the theoretical foundation for asynchronous clock distribution networks (ACDN), where an asynchronous circuit is used to generate the periodic signals necessary for the timing of a fully-synchronous system. This approach takes advantage of the natural adaptiveness of asynchronous circuits to delay uncertainties while still providing the performance advantages of synchronous clocking. In particular, we state the sufficient conditions for an asynchronous system to provide strictly and well-determined periodic signals for a large number of clock sinks. We model these systems using timed Marked Graphs (MGs) from Petri Net (PN) theory [28] and prove the following.
In a live and safe MG with strictly a single token per directed circuit, if the graph has a single source that belongs to a critical directed circuit, then all transitions (or clock sinks) will fire periodically and with well-determined clock time occurrences. This structure can thus provide the timing of all clock sinks required in a synchronous system in the absence of a clock source. Additionally, we prove that by probing only a single transition and adjusting a single programmable delay line, we are able to force the source to belong to a critical circuit even when its exact critical value is not known, and thus force a well-defined synchronous periodicity. This is particularly useful in systems with high levels of timing uncertainty.

• We affirm that the previously proposed two-phase clocking (2PC) [10] shall be considered a serious contender towards solving the clocking problem. Particularly, such clocking represents another dimension where extreme performance overhead meets relaxed complexity and area overheads. In this thesis, we formalize this approach, quantify the overheads associated with it, and theorize its performance limits.

• We present SystemVerilog models for SFQ and AQFP gates, and the SFQ/AQFP interface circuits. Our models are compatible with SDF (standard delay format) and friendly to the existing tools; they offer an easy debugging platform; they are modular and suited for generalization and reproduction; they are elegant and conceptual; and they can encompass many circuit parameters, environmental effects, and advanced SCE phenomena.

• We build a tools platform, qHC2LC, to construct and evaluate the different SFQ clocking strategies. In its current form, qHC2LC takes an RTL circuit as the input, synthesizes it, constructs the clock network (zero-skew tree or (HC)²LC), generates a test bench, and performs a co-simulation evaluation of design performance, functionality, timing violations, and area overheads.
We run qHC2LC successfully over the ISCAS'85 combinational benchmark circuits [29].

• We propose a grid-based placement-aware variation model to model the spatially correlated timing uncertainties. Also, we incorporate a placement-aware and spatial-correlation-aware Monte Carlo (MC) analysis into qHC2LC in order to compare and evaluate yield.

9.2 Future Work

We hereby suggest points that need further investigation in future works.

• Traditional designers do believe strongly in synchronous clocking. They have a hard time accepting the possibility of using an asynchronous technique. Asynchronous design is indeed radically different from the conventional synchronous concepts; however, it always seems more natural in SFQ. Since the introduction of SFQ in the late 1980s, researchers have admitted that asynchronous communication is natural to SFQ, as it fits its basic convention of bit representation. Nevertheless, they have failed to realize those techniques without drastic area and power overheads of 400% or 500%. Although we believe that (HC)²LC will be a valid contender among clocking options in SFQ, we have to admit that the yield advantages are not that significant. This thesis is a mere beginning of exploring a new realm of clocking techniques for SFQ. Modifications to (HC)²LC, or a new clocking scheme inspired by it, shall be investigated for better robustness. We believe that researchers have not sufficiently studied the timing uncertainty in SFQ, that their designs do not consider it adequately, and that they gave up on asynchronous techniques. We affirm that investigating this perspective should always be in sight of SFQ research.

• (HC)²LC can be further optimized. This thesis made a reasonable attempt in that domain, but we believe there is still significant room for improvement.
This particularly concerns the assignment problem discussed in Section 5.4.2, which consists of assigning the gates within a clover, gates to clovers, clovers to chains, and chains to chains hierarchically. We are currently investigating modifying the qPALACE tool [104, 118] to do that. In particular, we are investigating the use of the HL-tree version of qPALACE and using those L sections as our leaves. We expect this to achieve better results than the algorithms of Section 5.4.2.

• Optimizing the (HC)²LC assignment can also be approached from a different point of view, which is minimizing the number or weight of cut edges between partitions [145–147]. As discussed in Chapter 5, the inter-clover and inter-chain logical connections are the ones straining the timing equations. Therefore, one style of optimizing (HC)²LC could be based on minimizing the occurrence of such connections or cuts.

• Another idea for optimizing the (HC)²LC assignment is more involved than the previous two points, as it is basically their combination. It is well established that the placement problem is to minimize the total wiring length. Therefore, we think that if we combine the placement and the assignment into a single objective function, minimizing both the total wiring length and the inter-clover and inter-chain connections, we shall obtain the best results. However, this seems to be almost a full PhD problem.

• In its current form, qHC2LC is an evaluation flow rather than an implementation flow. In order to convince the SCE community that they should investigate new clocking techniques, we need a working chip with (HC)²LC. As a first step, we should combine our evaluation flow with the SPORT lab implementation flow. We have already started investigating this, and we started discussing the steps needed to do that.
In particular, we need to adapt our flow and combine it with the synthesis and retiming flow, qSYN [148, 149], the placement flow, qPALACE [104, 118], the static timing analysis, qSTA [150], and the routing flow, qGDR [151].

• As discussed in Section 7.3.1, wire delays in SFQ, i.e., PTL and JTL delays, are comparable to the gate delays. The simulations in this thesis do not take this into account. An important next step would be to update the qHC2LC flow to include wire delays.

• Although the Monte Carlo analysis within qHC2LC does show the superiority of (HC)²LC over zero-skew clock trees, it uses exhaustive dynamic simulations and is not enough to accurately quantify the yield improvement when comparing different algorithms in the future. Towards that end, we think of using a statistical static timing analysis (qSSTA) tool to quantify the yield improvements. We have already started investigating a collaboration with the authors of [150] in order to perform those evaluations.

• Our (HC)²LC relies on the existence of a programmable delay line in a control loop. The future implementation steps should include: (1) the circuit design of an efficient programmable delay line, and (2) the stability analysis of the top loop when we allow the delay to decrease as well as increase. In this thesis, we used an only-increasing, behaviorally described delay line.

• Additionally, the use of variable custom assignments of the hierarchical parameters to further reduce the overheads and/or improve the yield should be investigated to optimize (HC)²LC. In particular, using longer chains at the low hierarchy levels and shorter ones at the high levels would waste fewer resources when we take the length of interconnects into consideration.

• SFQ implementation flows should be upgraded to include two-phase clock trees. Thereby, we will be able to compare the yield among all the available clocking solutions.
As we argued throughout this thesis, we expect our (HC)²LC to be superior in performance, with a reasonable yield, when compared to two-phase clocking.

• The qHC2LC flow and the SPORT flow need to be upgraded to implement sequential structured Verilog netlists, and thus be able to run over the ISCAS'89 sequential benchmark circuits [152].

• Upgrading the flows to integrate behavioral Verilog netlists, using a different and more advanced synthesis tool than ABC [114], is an advised step as well. The flow would then be able to take more advanced and sophisticated RTL input netlists.

• Papers by authors with an electromagnetics background discuss the inductive return path as an important factor that has to be taken into consideration for high-speed signals. We believe that this was never studied with enough focus for clock trees. A return-path-aware clock-tree design is a point worth looking at.

• In this thesis, we dismissed the use of a grid or a mesh because of the physical incompatibility of applying a potential surface to a quantum-based technology. However, we did not discuss the possibility of inventing a different approach that aims for the same goals as a grid network but is adapted to a quantum-based technology. This topic was never studied, and since grids are fundamental in high-speed CMOS clock trees, which run at a few gigahertz, we expect such an approach to be crucial if it were successfully implemented for a clock network of a few tens of gigahertz in SFQ.

• As discussed in Section 2.3, several works studied the use of a GALS approach, where several different clock domains exist on the same chip. Embedding such an idea in our proposed work and studying the circuitry needed for inter-domain clock crossing would be an addition to the work in this thesis. Moreover, an idea we had is a hybrid approach where two (HC)²LC structures are implemented alongside each other in a two-phase fashion.
Since (HC)²LC is robust due to the spatial correlation of uncertainties, the clock skew constraint would be relaxed, and thus the performance overhead of 2PC (which depends on a practical limit of clock skew, as discussed in Section 6.2) could presumably be reduced.

• AQFP circuits (see Section 2.1.1) offer ultra-low-power operation. However, this technology faces a serious challenge when it comes to arbitrary and complex pipelines. In particular, successful designs were very regular, and future work is required in order to apply this technology to non-regular and more generic pipelines. The feasibility of a timing-violation-free AQFP circuit, as in 2PC, was never studied. Adapting a technique such as (HC)²LC to AQFP was never studied either.

9.3 Conclusions

One of the biggest obstacles to SFQ advancement is the design of a high-speed clock distribution network in a very uncertain environment. With gate-level pipelines, a significantly higher number of clock sinks, and an order of magnitude higher frequency target, CDN design in SFQ is a bigger problem than it is in conventional CMOS. Moreover, due to the peculiar characteristics of this superconducting technology, we do not believe that conventional zero-skew balanced trees are the answer. Instead, we propose the hierarchical chains of homogeneous clover-leaves clocking as a robust clocking technique that inherits its robustness from: 1) the spatial correlation of various sources of variations, and 2) the timing robustness of traditional counter-flow clocking. Our simulations show that, averaging over the ISCAS'85 benchmark circuits, at the same speed, and with an area overhead of only 9.00%, (HC)²LC achieves 52.3% and 211.8% yield improvement over zero-skew trees at low and medium ranges of gate delays, respectively.
This thesis introduces an innovative perspective on the SFQ clocking problem, and encourages the SCE community to seek the reliability of a concrete and realistic solution rather than the false promises of a simple approach that is yet to meet its expectations after decades of attempts.

Appendix A ACDN proofs

In this appendix, we provide the mathematical derivations that prove the theoretical foundation for ACDNs modeled as MGs (see Chapter 4). In particular, we present the key lemmas and theorems with their proofs.

A.1 Execution Model

In this section, we provide the proof of Theorem 1. We start by applying the firing rule in order to formalize the description of any transition firing.

Definition 23: Execution Model: In a PN, $\forall t_v \in T,\ \forall i \ge 0$,
$$\tau_v^{(i)} = \max_{t_u \in T}\ \max_{h \in \hat{H}_1(t_u, t_v, i)} \left( \tau_u^{(i - M(h))} + \delta_u \right) \tag{A.1}$$

As in [53], we define the execution of the PN as the consistent assignment of time values to the transition firings. Using our previous definitions, equation (A.1) is a direct translation of equation (1) in [53], which is a direct application of the firing rule of PNs [28]. This formalization follows the same conceptual understanding as the process graph unfolding [53] and the unrolling of repetitive event-rule systems [54].

Theorem 1 In a PN, $\forall t_v \in T,\ \forall i \ge 0$,
$$\tau_v^{(i)} = \max_{t_u \in T}\ \max_{h \in \hat{H}(t_u, t_v, i)} \left( \tau_u^{(i - M(h))} + \delta(h) \right) \tag{A.2}$$

Proof: From the execution model definition in (4.10), we use induction to extend this definition from the paths with a path length of one to the set of paths $\hat{H}$ with unbounded length.

(a) The basis step is the definition of (4.10) itself, for a maximum length $l = 1$.

(b) Let us assume the following is true for a maximum length $L$: $\forall t_v \in T,\ \forall i \ge 0$,
$$\tau_v^{(i)} = \max_{t_u \in T}\ \max_{h \in \hat{H}_L(t_u, t_v, i)} \left( \tau_u^{(i - M(h))} + \delta(h) \right) \tag{A.3}$$

(c) Let the arg max of those maximum sets, i.e., the transition at the beginning of the critical path and the critical path itself, be $t_x$ and $h_x$, respectively.
Applying the basis step to the transition $t_x$, we get, $\forall i \ge 0$,
$$\tau_x^{(i - M(h_x))} = \max_{t_w \in T}\ \max_{h \in \hat{H}_1(t_w, t_x, i - M(h_x))} \left( \tau_w^{(i - M(h_x) - M(h))} + \delta_w \right) \tag{A.4}$$

(d) Combining both (A.3) and (A.4), since $\hat{H}_1(t_w, t_x, i - M(h_x)) \cup \hat{H}_L(t_x, t_v, i) = \hat{H}_{L+1}(t_w, t_v, i)$ by definition, we directly get, $\forall t_v \in T,\ \forall i \ge 0$,
$$\tau_v^{(i)} = \max_{t_w \in T}\ \max_{h \in \hat{H}_{L+1}(t_w, t_v, i)} \left( \tau_w^{(i - M(h))} + \delta(h) \right) \tag{A.5}$$

Q.E.D

A.2 Cycle Time

In this section, we provide the proofs for the key lemmas needed for Theorem 2 and the proof for the theorem itself.

Lemma 4: In a MG-LT,
$$\hat{H}(t_u, t_u, 1) \subseteq \mathcal{C}, \quad \forall t_u \in T \tag{A.6}$$
That is, given a MG-LT, any loop that contains only one token has to be a directed circuit.

Proof:

• Since it is a live and safe MG, there is no loop with zero tokens and $\hat{H}(t_u, t_u, 0) = \emptyset$. Therefore, $\hat{H}(t_u, t_u, 1)$ is the set of all single-token loops that start and end with $t_u$.

• Intuitively, a loop has to be formed by the repetition and/or concatenation of directed circuits. Given that every directed circuit has exactly one token, and that the loops in question are single-token, all those loops have to be directed circuits.

Q.E.D

Lemma 5: In a MG-LT, $\forall t_u, t_v \in T$,
$$\text{if } \exists h_{u \to v} \in H(t_u, t_v) \wedge M(h_{u \to v}) = m > 1, \text{ then } \exists t_w \in T \text{ s.t. } \exists h_{w \to v} \in H(t_w, t_v) \wedge M(h_{w \to v}) = m - 1 \wedge h_{w \to v} \sqsubset h_{u \to v} \tag{A.7}$$
That is, given a MG-LT, if there exists a path that contains more than one token between any two transitions $t_u$ and $t_v$, then there exists a path from a transition $t_w$ to $t_v$ that contains $m - 1$ tokens such that this path is a part of the path from $t_u$ to $t_v$.

Proof:

• In a MG-LT, it is trivial to deduce that $m_n \le 1,\ \forall m_n \in M_0$.

• Given the condition in (A.7), going in order through the elements constituting the path $h_{u \to v}$, let us focus on the first place $p_w$ whose $m_w > 0$; its value has to be exactly 1. Since this is a MG, there exists only one transition $t_w$ such that $(p_w, t_w) \in F$. By definition, since $M(h_{u \to v}) = m$ and $p_w, t_w \in h_{u \to v}$, the path $h_{w \to v}$ is a part of $h_{u \to v}$ and $M(h_{w \to v}) = m - 1$.
Q.E.D

Lemma 6: In a MG-LT, $\forall t_u \in T$, if $\mathcal{C}(t_u) \cap \arg\max_{C_k \in \mathcal{C}} \frac{\delta(C_k)}{M(C_k)} \neq \emptyset$, then
$$\forall i \ge 1, \quad \max_{h \in \hat{H}(t_u, t_u, i)} \left( \tau_u^{(i - M(h))} + \delta(h) \right) = \tau_u^{(i-1)} + \delta_{MAX} \tag{A.8}$$
That is, given a MG-LT, for any transition $t_u$ that belongs to a critical circuit, the maximum dependency of the $i$th occurrence of $t_u$ on a previous transition, through a loop, cannot be higher than the dependency on the $(i-1)$th occurrence along the path that is one of the critical circuits to which $t_u$ belongs.

Proof: We prove this Lemma by contradiction.

• First, let us assume that the arg max of the left hand side of (A.8) cannot be a path with a single token. If $h_x$ with $M(h_x) = m_x > 1$ does belong to that arg max, then our assumption can be written as
$$\tau_u^{(i-1)} + \max_{h \in \hat{H}(t_u, t_u, 1)} \delta(h) < \tau_u^{(i - m_x)} + \delta(h_x) \tag{A.9}$$

• From Lemma 4, every $h \in \hat{H}(t_u, t_u, 1)$ is a directed circuit. Additionally, from the conditions in (A.8), we can then write
$$\max_{h \in \hat{H}(t_u, t_u, 1)} \delta(h) = \max_{C_u \in \mathcal{C}(t_u)} \delta(C_u) = \delta_{MAX} \tag{A.10}$$
and (A.9) becomes
$$\tau_u^{(i-1)} + \delta_{MAX} < \tau_u^{(i - m_x)} + \delta(h_x) \tag{A.11}$$

• From the firing rule, for the $i$th firing of $t_u$, along the path going through the critical circuit $m_x$ times, we can write
$$\tau_u^{(i)} \ge \tau_u^{(i - m_x)} + m_x \delta_{MAX} \tag{A.12}$$
Combining the previous two inequalities, we get
$$\tau_u^{(i-1)} + \delta_{MAX} < \tau_u^{(i)} - m_x \delta_{MAX} + \delta(h_x) \tag{A.13}$$

• The path $h_x$ is a loop by definition, and as stated in the proof of Lemma 4, a loop has to be formed by the repetition and/or concatenation of directed circuits. Also, since $M(h_x) = m_x$, the loop $h_x$ has to be formed of $m_x$ circuits, and since $\delta_{MAX}$ is an upper bound for the circuit delays, we can write
$$\delta(h_x) \le m_x \delta_{MAX} \tag{A.14}$$
Combining the previous two inequalities, we get
$$\tau_u^{(i-1)} + \delta_{MAX} < \tau_u^{(i)} - m_x \delta_{MAX} + m_x \delta_{MAX} \tag{A.15}$$

• This leads to
$$\tau_u^{(i)} > \tau_u^{(i-1)} + \delta_{MAX} \tag{A.16}$$
From the understanding of Theorem 1 and Lemma 4, and since $\delta_{MAX}$ is at least as long as any other circuit, the previous inequality cannot be correct for any single-token loop, leading to a contradiction.
Therefore, a path with a single token indeed has to belong to the arg max of the left hand side of (A.8). Then, from (A.9) and (A.10), we obtain the right hand side of (A.8).

Q.E.D

Theorem 2 In a MG-LT, $\forall t_u \in T$, if $\mathcal{C}(t_u) \cap \arg\max_{C_k \in \mathcal{C}} \frac{\delta(C_k)}{M(C_k)} \neq \emptyset$, then
$$\tau_u^{(i)} = \tau_u^{(i - m)} + m\,\delta_{MAX}, \quad \forall i \ge m \ge 1 \tag{A.17}$$

Proof:

(a) For any $t_u$ that takes part in a critical circuit as formalized in (A.17), substituting from Theorem 1, we get, $\forall i \ge 0$,
$$\tau_u^{(i)} = \max_{t_v \in T}\ \max_{h \in \hat{H}(t_v, t_u, i)} \left( \tau_v^{(i - M(h))} + \delta(h) \right) \tag{A.18}$$

(b) We prove that the arg max of the outer maximum in (A.18) is $t_u$ by contradiction.

(c) First, let us assume that it is indeed $t_u$.

• Substituting from Lemma 6, we can thus write
$$\tau_u^{(i)} = \max_{h \in \hat{H}(t_u, t_u, i)} \left( \tau_u^{(i - M(h))} + \delta(h) \right) = \tau_u^{(i-1)} + \delta_{MAX}, \quad \forall i \ge 1 \tag{A.19}$$

• Applying (A.19) recursively, we can directly deduce that, for an arbitrary integer $m$,
$$\tau_u^{(i)} = \tau_u^{(i - m)} + m\,\delta_{MAX}, \quad \forall i \ge m \ge 1 \tag{A.20}$$
And thus we prove (A.17).

(d) Now we prove that the arg max of the outer maximum in (A.18) is $t_u$ by contradiction.

• For $t_u$ not to be the arg max of the outer maximum set in (A.18), this would mean that $\exists t_x \in T$, $t_x \neq t_u$, and it is an arg max of the outer maximum. Substituting from (A.18) and (A.19), this means that
$$\tau_x^{(i - M(h_{x \to u}))} + \delta(h_{x \to u}) > \tau_u^{(i-1)} + \delta_{MAX} \tag{A.21}$$
where $h_{x \to u} \in \hat{H}(t_x, t_u, i)$ and it is one of the arg max of the inner maximum in (A.18).

• This $M(h_{x \to u})$ can have any arbitrary value $m$. However, if we apply Lemma 5 recursively, we find that there exists a transition $t_y$ such that $h_{y \to u} \sqsubset h_{x \to u}$ and that $M(h_{y \to u}) = 1$.
• Therefore, for the inequality (A.21) to be true, the following is necessary:
$$\tau_y^{(i-1)} + \delta(h_{y \to u}) > \tau_u^{(i-1)} + \delta_{MAX} \tag{A.22}$$

• From the firing rule across the path $h_{y \to u}$, we can directly write
$$\tau_u^{(i-1)} \ge \tau_y^{(i-2)} + \delta(h_{y \to u}) \tag{A.23}$$

• Combining (A.22) and (A.23), we get
$$\tau_y^{(i-1)} > \tau_y^{(i-2)} + \delta_{MAX} \tag{A.24}$$
From the understanding of Theorem 1 and Lemma 4, and since $\delta_{MAX}$ is at least as long as any other circuit, the previous inequality cannot be correct for any single-token loop, leading to a contradiction.

Q.E.D

Corollary 4: In a MG-LT, $\forall t_s \in S$, if $\mathcal{C}(t_s) \cap \arg\max_{C_k \in \mathcal{C}} \frac{\delta(C_k)}{M(C_k)} \neq \emptyset$, then
$$\tau_s^{(i)} = i\,\delta_{MAX}, \quad \forall i \ge 0 \tag{A.25}$$

Proof: We prove this Corollary by induction on the $i$th firing.

(a) In Section 4.4, we chose to assign $\tau_s^{(0)} = 0$ for any $t_s \in S$. This is the induction basis step.

(b) We assume the following is true for the $k$th firing of $t_s$:
$$\tau_s^{(k)} = k\,\delta_{MAX} \tag{A.26}$$

(c) For the sources which take part in a critical circuit, as conditioned in (A.25), we can apply Theorem 2 with $m = 1$ to the $(k+1)$th firing and we get
$$\tau_s^{(k+1)} = \tau_s^{(k)} + \delta_{MAX}, \quad \forall i \ge 1 \tag{A.27}$$

(d) Substituting from (A.26), we get
$$\tau_s^{(k+1)} = k\,\delta_{MAX} + \delta_{MAX} = (k+1)\,\delta_{MAX}, \quad \forall i \ge 1 \tag{A.28}$$

Q.E.D

Corollary 1: In a MG-LT, $\forall t_s \in S$, if $\mathcal{C}(t_s) \cap \arg\max_{C_k \in \mathcal{C}} \frac{\delta(C_k)}{M(C_k)} \neq \emptyset$, then the period of the $i$th firing of $t_s$ equals $\delta_{MAX}$, $\forall i \ge 0$, and thus
$$\hat{T}_n = T_{sys} = \delta_{MAX}, \quad \forall t_n \in T \tag{A.29}$$

Proof: It follows directly from Corollary 4 and Lemma 2.

A.3 Firing Period

In this section, we provide the proofs for the key lemmas and Theorem 3.

Lemma 7: In a MG-LS,
$$\left| \hat{H}(t_s, t_u, 0) \right| \ge 1, \quad \forall t_u \in T \tag{A.30}$$

Proof: It follows directly from the definition of the root of a process graph in [53]. It is intuitive that every transition firing has to be spawned initially from a source. Therefore, in the case of a single source, there has to exist at least one token-free path from $t_s$ to every transition.
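As an illustration of what Theorem 2 and Corollary 4 state, the following minimal Python sketch applies the firing rule to a small marked graph and compares the resulting firing times with the largest delay-to-token ratio over its directed circuits. The three-transition graph, its delays, and the names `DELAY`, `PLACES`, `firing_time`, and `max_cycle_ratio` are hypothetical and are not taken from the thesis; the sketch also assumes that only transitions carry delay and that initially present tokens are available at time zero.

```python
from functools import lru_cache

# Hypothetical toy marked graph (not taken from the thesis).
# DELAY maps each transition to its delay; each place is written as
# (producer, consumer, initial_tokens).
DELAY = {"a": 2.0, "b": 3.0, "c": 1.0}
PLACES = [("a", "b", 0), ("b", "a", 1), ("a", "c", 0), ("c", "a", 1)]


@lru_cache(maxsize=None)
def firing_time(t, i):
    """Time of the i-th firing (i >= 0) of transition t under the firing rule."""
    latest = 0.0  # assumption: initially present tokens are available at time 0
    for producer, consumer, tokens in PLACES:
        if consumer == t and i - tokens >= 0:
            latest = max(latest, firing_time(producer, i - tokens) + DELAY[producer])
    return latest


def max_cycle_ratio():
    """Largest (circuit delay / circuit tokens) over all simple directed circuits."""
    best = 0.0

    def walk(start, node, delay, tokens, seen):
        nonlocal best
        for producer, consumer, m in PLACES:
            if producer != node:
                continue
            if consumer == start:  # the circuit closes here
                best = max(best, (delay + DELAY[node]) / (tokens + m))
            elif consumer not in seen:
                walk(start, consumer, delay + DELAY[node], tokens + m, seen | {consumer})

    for t in DELAY:
        walk(t, t, 0.0, 0, {t})
    return best


if __name__ == "__main__":
    print("max cycle ratio:", max_cycle_ratio())
    for i in range(4):
        print(i, firing_time("a", i), firing_time("b", i), firing_time("c", i))
```

In this toy graph, transition `a` is the only initially enabled transition and lies on the slowest circuit, so its successive firings are spaced by exactly the maximum circuit delay, while `b` and `c` fire a fixed token-free path delay later in every iteration.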
Lemma 8: In a MG-LTSC, $\forall t_n \in T,\ \forall i \ge 0$,
$$\max_{h \in \hat{H}(t_s, t_n, i)} \left( \tau_s^{(i - M(h))} + \delta(h) \right) = i\,T_{sys} + \max_{h \in \hat{H}(t_s, t_n, 0)} \delta(h) \tag{A.31}$$
That is, given a MG-LTSC, the delay of any critical path, with a maximum of $i$ tokens from the source $t_s$ to any transition $t_n$, has to equal $i$ times $T_{sys}$ in addition to the maximum token-less path delay between $t_s$ and $t_n$.

Proof: We prove this Lemma by contradiction.

• Let us assume that (A.31) is not correct, such that $M(h) = 0$ is not an arg max of the maximum on the left hand side. Basically, we are assuming that there exists a path $h_m$ whose $M(h_m) = m > 0$ that is longer than the path $h_0$ with $M(h_0) = 0$, where $\delta(h_0) = \max_{h \in \hat{H}(t_s, t_n, 0)} \delta(h)$. Note that $h_0$ has to exist according to Lemma 7. Formally, our assumption is
$$\tau_s^{(i - m)} + \delta(h_m) > \tau_s^{(i)} + \delta(h_0) \tag{A.32}$$

• Substituting from Corollary 4 and Corollary 1, we get
$$\delta(h_m) - m\,T_{sys} > \delta(h_0) \tag{A.33}$$

• Since there are no circuits with more than one token, and since a MG-L is strongly connected by definition [28], every single-token section that belongs to the path $h_m$ has to be either in a distinct circuit, or this section of the path is actually a whole circuit.

• Every section/whole of those circuits has to be at most as long as its circuit, the upper bound being the value $T_{sys}$.

• Therefore, the value $m\,T_{sys}$ has to be larger than the summation of all those single-token sections in the path. In the extreme case of all these sections being whole circuits, the left hand side of (A.33) would only equal the token-free part.

• The right hand side of (A.33) is the maximum of such token-free paths by definition, which leads to a contradiction.

• Thus, one of the arg max of the maximum in (A.31) has to be a path with $M(h) = 0$, which results in
$$\max_{h \in \hat{H}(t_s, t_n, k)} \left( \tau_s^{(k - M(h))} + \delta(h) \right) = \tau_s^{(k)} + \max_{h \in \hat{H}(t_s, t_n, 0)} \delta(h) \tag{A.34}$$

• Then, substituting from Corollary 4, we obtain (A.31).
Q.E.D

Theorem 3 In a MG-LTSC, $\forall t_n \in T,\ \forall i \ge 0$,
$$\tau_n^{(i)} = \max_{h \in \hat{H}(t_s, t_n, 0)} \delta(h) + i\,T_{sys} \tag{A.35}$$

Proof: We prove this Theorem by induction on the $i$th firing.

(a) At $i = 0$, we apply Theorem 1 to get
$$\tau_n^{(0)} = \max_{t_u \in T}\ \max_{h \in \hat{H}(t_u, t_n, 0)} \left( \tau_u^{(0)} + \delta(h) \right) \tag{A.36}$$

(b) Since the only initially enabled transitions are sources by definition, the first firing of $t_n$ has to be spawned initially from a source. From Lemma 7, such a path exists and we can write
$$\tau_n^{(0)} = \max_{h \in \hat{H}(t_s, t_n, 0)} \left( \tau_s^{(0)} + \delta(h) \right) \tag{A.37}$$
But $\tau_s^{(0)} = 0$ by definition, and hence we get the induction basis step as follows:
$$\tau_n^{(0)} = \max_{h \in \hat{H}(t_s, t_n, 0)} \delta(h) \tag{A.38}$$

(c) Now let us assume the following is true at the $k$th firing:
$$\tau_n^{(k)} = \max_{h \in \hat{H}(t_s, t_n, 0)} \delta(h) + k\,T_{sys} \tag{A.39}$$

(d) At the $(k+1)$th firing, we use Theorem 1 to write
$$\tau_n^{(k+1)} = \max_{t_u \in T}\ \max_{h \in \hat{H}(t_u, t_n, k+1)} \left( \tau_u^{(k+1-M(h))} + \delta(h) \right) \tag{A.40}$$

(e) Similar to the proof of Theorem 2, we prove that the arg max of the outer maximum is $t_s$ by contradiction.

(f) First, let us assume that it is indeed $t_s$. Then we can write
$$\tau_n^{(k+1)} = \max_{h \in \hat{H}(t_s, t_n, k+1)} \left( \tau_s^{(k+1-M(h))} + \delta(h) \right) \tag{A.41}$$
Then, substituting from Lemma 8, we get
$$\tau_n^{(k+1)} = \max_{h \in \hat{H}(t_s, t_n, 0)} \delta(h) + (k+1)\,T_{sys} \tag{A.42}$$
which would complete the induction on the $i$th firing.

(g) Now we prove the assumption by contradiction.

• For $t_s$ not to be the arg max of the maximum in (A.40), this would mean that $\exists t_x \in T$, $t_x \neq t_s$, and it is an arg max of the maximum such that
$$\tau_x^{(k+1-M(h_x))} + \delta(h_x) > \max_{h \in \hat{H}(t_s, t_n, k+1)} \left( \tau_s^{(k+1-M(h))} + \delta(h) \right) \tag{A.43}$$
where $h_x \in H(t_x, t_n, k+1)$ and it is one of the arg max of the maximum.

• Substituting from Lemma 8, the inequality becomes
$$\tau_x^{(k+1-M(h_x))} + \delta(h_x) > \max_{h \in \hat{H}(t_s, t_n, 0)} \delta(h) + (k+1)\,T_{sys} \tag{A.44}$$
And let $\delta_0$ denote the value of $\max_{h \in \hat{H}(t_s, t_n, 0)} \delta(h)$ for simplicity.
• Substituting from Corollary 4, the inequality becomes
$$\tau_x^{(k+1-M(h_x))} + \delta(h_x) > \delta_0 + T_{sys} + \tau_s^{(k)} \tag{A.45}$$

• From the induction hypothesis in (A.39), at the $k$th firing, the path initiated from $\tau_s^{(k)}$ is at least as long as any other path, including the path $h_x$. Therefore, also substituting from Corollary 4, the induction hypothesis implies that
$$\tau_s^{(k)} + \delta_0 \ge \tau_x^{(k-M(h_x))} + \delta(h_x) \tag{A.46}$$

• Combining the two previous inequalities, we get
$$\tau_x^{(k+1-M(h_x))} > \tau_x^{(k-M(h_x))} + T_{sys} \tag{A.47}$$

• From the understanding of Theorem 1 and Lemma 4, and since $T_{sys}$ is at least as long as any other circuit, the previous inequality cannot be correct for any single-token loop, leading to a contradiction.

Q.E.D

A.4 Uncertainty Condition

In this section, we provide the proofs for a key lemma and Theorem 4.

Lemma 9: In a MG-LTS, $\forall t_u \in T$, if $\mathcal{C}(t_u) \cap \arg\max_{C_k \in \mathcal{C}} \frac{\delta(C_k)}{M(C_k)} \neq \emptyset$, then
$$\tau_u^{(i)} = \max_{h \in \hat{H}(t_s, t_u, 0)} \delta(h) + i\,\delta_{MAX}, \quad \forall i \ge 0 \tag{A.48}$$

Proof: We prove this Lemma by induction on the $i$th firing.

(a) Since the only initially enabled transitions are sources by definition, the first firing of $t_u$ has to be spawned initially from a source. From Lemma 7, such a path exists and we can write
$$\tau_u^{(0)} = \max_{h \in \hat{H}(t_s, t_u, 0)} \left( \tau_s^{(0)} + \delta(h) \right) \tag{A.49}$$
But $\tau_s^{(0)} = 0$ by definition, and thus we get the induction basis step.

(b) Now let us assume the following is true at the $k$th firing for an arbitrary integer $k$:
$$\tau_u^{(k)} = \max_{h \in \hat{H}(t_s, t_u, 0)} \delta(h) + k\,\delta_{MAX} \tag{A.50}$$

(c) Applying Theorem 2 at the $(k+1)$th firing with $m = 1$, we get
$$\tau_u^{(k+1)} = \tau_u^{(k)} + \delta_{MAX} \tag{A.51}$$

(d) Substituting from (A.50) into (A.51), we get
$$\tau_u^{(k+1)} = \max_{h \in \hat{H}(t_s, t_u, 0)} \delta(h) + (k+1)\,\delta_{MAX} \tag{A.52}$$

Q.E.D

Theorem 4 In a MG-LTS, if $\mathcal{C}(t_s) \cap \arg\max_{C_k \in \mathcal{C}} \frac{\delta(C_k)}{M(C_k)} = \emptyset$, then $\exists i^\star$, $\exists t_n \in C_n \in \mathcal{C}(t_s)$, and $\exists t_u \in T$, $t_u \neq t_s$, such that
$$\tau_s^{(i^\star)} + \max_{h \in \hat{H}(t_s, t_n, 0)} \delta(h) < \tau_u^{(i^\star - M(h_{u \to n}))} + \delta(h_{u \to n}) \tag{A.53}$$

Proof:

(a) Applying Corollary 3 to $t_s$, we know that
$$\hat{T}_s = \delta_{MAX} \tag{A.54}$$

(b) Applying Theorem 1 to the firing of $t_s$, $\forall i \ge 0$, we get
$$\tau_s^{(i)} = \max_{t_u \in T}\ \max_{h \in \hat{H}(t_u, t_s, i)} \left( \tau_u^{(i - M(h))} + \delta(h) \right) \tag{A.55}$$
We hereby prove by contradiction that the arg max of the outer maximum of (A.55) cannot be $t_s$ for every $i$.

• We first assume that the arg max is indeed $t_s$ for $\forall i \ge 0$. Therefore,
$$\tau_s^{(i)} = \max_{h \in \hat{H}(t_s, t_s, i)} \left( \tau_s^{(i - M(h))} + \delta(h) \right) \tag{A.56}$$

• Let $h_x$, with $M(h_x) = m_x$, belong to the arg max of the above equation. Therefore, we can write that
$$\tau_s^{(i)} = \tau_s^{(i - m_x)} + \delta(h_x), \quad \forall i \ge 0 \tag{A.57}$$

• This $h_x$ is a loop by definition, and as stated in the proof of Lemma 4, a loop has to be formed by the repetition and/or concatenation of directed circuits. Also, since $M(h_x) = m_x$, the loop $h_x$ has to be formed of $m_x$ circuits. Since $\delta_{MAX}$ is an upper bound for the circuit delays, we can write that
$$\delta(h_x) < m_x\,\delta_{MAX} \tag{A.58}$$
Note that the value $\delta(h_x)$ is strictly less than the right hand side because $h_x$ contains at least one directed circuit with $t_s$ in it, and from the conditions in (A.53), $t_s$ does not take part in any critical circuit.

• From the definition of the cycle time in (4.8) and from (A.54), we can write
$$\hat{T}_s = \lim_{i \to \infty} \frac{\tau_s^{(i)}}{i} = \delta_{MAX} \tag{A.59}$$

• The equations (A.57), (A.58), and (A.59) cannot all be correct for every $i$, leading to a contradiction.

(c) Therefore, a value of $i$ has to exist where the arg max of the outer maximum of (A.55) cannot be $t_s$. And hence $\exists i^\star$ and $\exists t_u \in T$, $t_u \neq t_s$, and $\exists h_{u \to s} \in \hat{H}(t_u, t_s, i^\star)$ such that
$$\max_{h \in \hat{H}(t_s, t_s, i^\star)} \left( \tau_s^{(i^\star - M(h))} + \delta(h) \right) < \tau_u^{(i^\star - M(h_{u \to s}))} + \delta(h_{u \to s}) \tag{A.60}$$

(d) Going through the elements constituting the path $h_{u \to s}$, a transition $t_n$ has to exist that shares a circuit with $t_s$ (formally, $t_n \in C_n \in \mathcal{C}(t_s)$) because the path ends at $t_s$.

(e) Since the path $h_{u \to s}$ is more critical to $\tau_s^{(i)}$ than $\tau_s^{(i - M(h_x))}$, as we proved, it has to be more critical to every transition's firing in it than the paths from $\tau_s^{(i)}$.
Otherwise, the other path would become the most critical to $t_s$, which we already proved to be a contradiction. Therefore, the part $h_{u \to n} \sqsubset h_{u \to s}$ is more critical to $\tau_n^{(i)}$ than any path from $\tau_s^{(i)}$. This can be formalized as follows:
$$\delta(h_{u \to n}) + \tau_u^{(i^\star - M(h_{u \to n}))} > \max_{h \in \hat{H}(t_s, t_n, i^\star)} \left( \tau_s^{(i^\star - M(h))} + \delta(h) \right) \tag{A.61}$$

(f) Applying Lemma 7 to the inequality above, we can directly deduce (A.53).

Q.E.D

Appendix B SystemVerilog Models

This appendix provides a complete description of our proposed SV models introduced in Section 7.3. First, a brief background about SystemVerilog is provided. Second, the SFQ models are presented, along with our vision for SV and some prospective models. Third, the AQFP models are proposed, in addition to our solution to the boundary and corner cases. Fourth, we propose the models for the interface circuitry between SFQ and AQFP.

B.1 SystemVerilog

SV is an extension to Verilog, and thus it is a superset of Verilog [115]. It is considered the evolution of Verilog, and it adds many new features and some extensions to already existing features [153].

One of the main extensions is the introduction of interfaces, which are basically the feature that allows the bundling of ports. Interfaces are instantiated in the design, and can have parameters, functions, and tasks. Those interfaces can be used as the interconnections between modules, and hence can be used to encapsulate all the communication between different blocks within a single interface.

Another essential extension is the data type logic, which is an enhanced version of the data type reg in Verilog. First, it is explicitly clear that logic is distinguished from a hardware register. Second, driving the logic data type is not limited to behavioral RTL code as with reg; it can also be driven by a gate or a module.

SV also introduces a variety of object-oriented-programming-like data types in order to improve the run time and memory utilization of HDL simulators.
It also includes semaphores, event extensions, module classes, assertions, and others. Moreover, SV supports enumerated data types, and extends the classical fork-join to include join_none and join_any.

B.2 SFQ

This section presents our proposed models for SFQ circuits. As discussed in Section 7.3.1, modeling SFQ in HDL is challenging due to the pulse-based communications, the DRO nature of gates, wire delays, the effect of bias networks, and the surrounding environment.

As discussed in Section 7.3.1, our SFQ models describe the SFQ pulses as voltage pulses that convey the energy transfer event. The models are based on creating an SV interface for SFQ signals. This interface has tasks for sending and receiving fluxons, and is equipped with DRO. For timing characteristics, we use specify blocks in order to be compatible with SDF back-annotation. Timing checks are performed using the built-in Verilog directives.

First, the interface is shown. Then, some prospective models are proposed under certain modifications we suggest for the Verilog standards as our vision. After that, we show our models of some auxiliary modules that allow our current models to work around this limitation and satisfy the claims we make in this manuscript about the advantages of the proposed models.

B.2.1 Proposed SFQ Interface

As shown in SV. 1, the parameter pw represents the width of the SFQ pulse and should be assigned a tiny value. The value of pw does not have any effect as long as it is smaller than the smallest delay value. The tasks send and receive are executed by the module that is assigned to the transmitter (tx) direction and by the one assigned to the receiver (rx) direction, respectively. The isReceived function is the one responsible for the DRO operation, and the isStored function is used for the SFQ/AQFP interface (see Section B.4).

SV. 1: SFQ: The interface

    interface SFQ;
        parameter real pw = `pw; //sfq pulse width
        logic data = 0;
        logic sent = 0;
        modport tx (output sent, data, import send);
        modport rx (input data, output sent, import receive, isReceived, isStored);
        task send ();
            begin
                data <= 1'b1;
                sent <= 1'b1;
                data <= #pw 1'b0;
            end
        endtask
        task receive ();
            @(posedge data);
        endtask
        function isReceived;
            input dump;
            isReceived = sent;
            sent <= 1'b0;
        endfunction
        function isStored;
            input dump;
            isStored = sent;
        endfunction
    endinterface : SFQ

B.2.2 Our Vision

In this subsection, we show our prospective models. As an example, SV. 2 and SV. 3 show the AND and the splitter gates, respectively. The gates always wait for the inputs and send the outputs accordingly. Timing checks are done explicitly using the Verilog timing directives within a specify block, to be overwritten during SDF back-annotation. The gate delay should be the worst-case propagation delay; the same holds for the setup and hold times.

SV. 2: SFQ: The prospective 2-input AND gate

    module and2 (SFQ clkin, in1, in2, out);
        always begin
            clkin.receive();
            if (in1.isReceived(1'b1) & in2.isReceived(1'b1))
                out.send();
        end
        specify
            specparam delay = `tgate, hold = `thold, setup = `tsetup;
            (clkin.data => out.data) = delay;
            $setuphold(posedge clkin.data, posedge in1.data, setup, hold);
            $setuphold(posedge clkin.data, posedge in2.data, setup, hold);
        endspecify
    endmodule //and2

SV. 3: SFQ: The prospective splitter gate

    module split (SFQ in, out1, out2);
        always begin
            in.receive();
            fork
                out1.send();
                out2.send();
            join
        end
        specify
            specparam delay1 = `tsplit, delay2 = `tsplit;
            (in.data => out1.data) = delay1;
            (in.data => out2.data) = delay2;
        endspecify
    endmodule //split

Unfortunately, Verilog specify blocks do not support hierarchical references or modports. Such support would have simplified our models significantly, and our vision is to add it to the Verilog and SystemVerilog standards.
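To make the destructive-readout behavior of the interface concrete, the following minimal Python sketch mimics only the bookkeeping of the `sent` flag: `send` stores a pulse, `isReceived` reports it and clears it, and `isStored` reports it without clearing. It is an illustrative analogue under simplifying assumptions, not part of the proposed SV models: all timing (pulse width, gate delay, setup and hold checks) is ignored, and the names `SFQWire` and `and2_on_clock` are hypothetical.

```python
class SFQWire:
    """Hypothetical stand-in for one SFQ interface instance (timing ignored)."""

    def __init__(self):
        self.sent = False  # True while a pulse is stored at the receiver

    def send(self):
        """Transmitter side: a pulse arrives and is stored."""
        self.sent = True

    def is_received(self):
        """Destructive readout: report the stored pulse and clear it."""
        stored, self.sent = self.sent, False
        return stored

    def is_stored(self):
        """Non-destructive peek at the stored pulse."""
        return self.sent


def and2_on_clock(in1, in2, out):
    """One clock pulse arriving at a clocked 2-input AND gate."""
    a = in1.is_received()  # read both inputs first so that both are cleared,
    b = in2.is_received()  # mirroring the non-short-circuit `&` of the SV model
    if a and b:
        out.send()


if __name__ == "__main__":
    in1, in2, out = SFQWire(), SFQWire(), SFQWire()
    in1.send()
    in2.send()
    and2_on_clock(in1, in2, out)
    print("output pulse after first clock:", out.is_received())
    and2_on_clock(in1, in2, out)  # inputs were consumed by the first clock
    print("output pulse after second clock:", out.is_received())
```

Reading both inputs before combining them matters: with a short-circuit `and`, a missing pulse on the first input would leave a stored pulse on the second input uncleared for the next clock cycle.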
B.2.3 Our Models

Due to this limitation, specify blocks can only accept typical port definitions, and thus we propose the models in this subsection using a few helper modules in order to work around this limitation. SV. 4 and SV. 5 show the AND and splitter gates as examples, along with an example of instantiating the interfaces and the gates in SV. 6. The helper modules are shown in SV. 7, SV. 8, and SV. 9.

SV. 4: SFQ: 2-input AND gate

    module SFQand2 (SFQ clkin, in1, in2, out);
        SFQpropagationDelay gPD ();
        always begin
            clkin.receive();
            if (in1.isReceived(1'b1) & in2.isReceived(1'b1)) begin
                gPD.tPD();
                out.send();
            end
        end
        SFQtimingcheck2 TC (clkin.data, in1.data, in2.data);
    endmodule //SFQand2

SV. 5: SFQ: Splitter gate

    module SFQsplit (SFQ in, out1, out2);
        SFQpropagationDelay gPD1 ();
        SFQpropagationDelay gPD2 ();
        always begin
            in.receive();
            fork
                begin
                    gPD1.tPD();
                    out1.send();
                end
                begin
                    gPD2.tPD();
                    out2.send();
                end
            join
        end
    endmodule //SFQsplit

SV. 6: SFQ: An instantiation example

    SFQ clk(), sigIn1(), sigIn2(), sigMid(), sigOut1(), sigOut2();
    SFQand2 g0 (clk.rx, sigIn1.rx, sigIn2.rx, sigMid.tx);
    SFQsplit g1 (sigMid.rx, sigOut1.tx, sigOut2.tx);

SV. 7: SFQ: Helper module: SFQpropagationDelay

    module SFQpropagationDelay;
        logic in, out, state;
        SFQgateDelay g0 (out, in);
        initial begin
            state = 1'b0;
            in = 1'b0;
        end
        task tPD ();
            in = !state;
            wait (out == in);
            state = in; // remember the level so that the next call toggles again
        endtask
    endmodule //SFQpropagationDelay

SV. 8: SFQ: Helper module: SFQgateDelay

    module SFQgateDelay (out, in);
        input in;
        output out;
        buf g0 (out, in);
        specify
            specparam delay = `tgate;
            (in => out) = delay;
        endspecify
    endmodule //SFQgateDelay

SV. 9: SFQ: Helper module: 2-input timing check

    module SFQtimingcheck2 (clk, data1, data2);
        input clk, data1, data2;
        specify
            specparam hold = `thold, setup = `tsetup;
            $setuphold(posedge clk, posedge data1, setup, hold);
            $setuphold(posedge clk, posedge data2, setup, hold);
        endspecify
    endmodule //SFQtimingcheck2

The module SFQgateDelay is required to contain the specify blocks with conventional ports. Then, we use the module SFQpropagationDelay to instantiate SFQgateDelay, and use its task to translate the delay from the specify block into our behavioral description of the gates, as in SV. 4 and SV. 5. The module SFQtimingcheck2 is used to perform the timing checks and detect setup and hold violations. The number at the end of the module name is the number of inputs; other flavors exist for different numbers of inputs.

From an SDF compatibility point of view, SFQgateDelay is required to support both the pin-to-pin delay modeling style (i.e., IOPATH, input-to-output path delays across a cell) and the interconnect delay modeling style (i.e., INTERCONNECT, interconnect delays specified on a point-to-point source/load basis [124]). This is done by leveraging the Verilog timing directives within a specify block, to be overwritten during SDF back-annotation. In order to model the interconnect delay, the value of the INTERCONNECT delay is added to the IOPATH delay of the gate that is assigned to the transmitter direction. We use an in-house Python script to translate the SDF file generated by the static timing analysis (STA) tool into a format compatible with the proposed gate modeling.

B.3 AQFP

This section presents our proposed models for AQFP circuits. As discussed in Section 7.3.1, modeling AQFP in HDL is challenging because every gate requires an AC clocking signal, and the polarity of this signal triggers the gates' outputs. Also, the representation of bits uses a three-level convention of current values.
Moreover, the excitation current propagates through each gate, in addition to the boundaries issue with routing those signals. 213 As discussed in Section 7.3.2, our AQFP models represents the worst cases of every delay event, and denes the prohibited timing region in the same fashion of setup and hold times. We model the current values as voltage values, but within a four-valued enumerated data type we dene as logicAQFP (see SV. 10) for the four states at which a gate could exist: logic `1' (q1 ), logic `0' (q0 ), idle (we call it qZ following the AQFP literature [133, 135]), and unknown (qX ). We use the dirAQFP data type to model the polarity of the AC and DC excitation currents used for inter-gates routing, and the phaseAQFP data type to represent the phase of operation of each gate [33, 135]; this adds a debugging feature that is missing from the models of [133, 135]. SV. 10: AQFP: Enumerated data types typedef enum logic[1:0] {q0 = 2 b11, qZ =2 b00, qX=2 b10, q1=2 b01} logicAQFP; typedef enum {noDir, inToOut, outToIn} dirAQFP; typedef enum {noPhase, phase1, phase2, phase3, phase4} phaseAQFP; B.3.1 Proposed AQFP Interfaces Our models have two AQFP interfaces: one for the data path, ioAQFP, (see SV. 11) and the other for the clock currents, clkAQFP, (see SV. 12). The ioAQFP has the tasks needed for the data propagation, while clkAQFP has the states of the clock path that are used to model the polarity of the excitation currents. The 214 xio and dcio have the states of the AC and DC components of the excitation current,respectively. SV. 11: AQFP: Data path interface interface ioAQFP; logicAQFP data = qZ; modport tx (output data, import send); modport rx (input data, import sample); task send; input logicAQFP val; if ( (val == q0) (val == q1) ) data <= val; else data <= qX; endtask task resetSend; data <= qZ; endtask task sample; output logicAQFP val; val = data; endtask endinterface : ioAQFP SV. 
12: AQFP: Clock path interface
    interface clkAQFP;
        logic xio = 1'b0;
        logic dcio = 1'b0;
    endinterface : clkAQFP

B.3.2 Our Models

As discussed in Section B.2.3, our models require a few helper modules in order to work around the limitations of specify blocks. SV. 13, SV. 14, and SV. 15 show a buffer, a majority gate, and a splitter as examples of our gates, and SV. 16 shows an example of instantiating the interfaces and the gates. Note that the majority gate is an essential building block for AQFP logic [36]. The helper modules are shown in SV. 17, SV. 18, SV. 19, SV. 20, and SV. 21.

SV. 13: AQFP: Buffer gate
    module AQFPbuf1 (interface clkin, clkout, in, out);
        logicAQFP val;
        AQFPclockPhase clk (clkin, clkout);
        AQFPsendModule snd (out);
        always begin
            clk.waitForSamplingPt();
            in.sample(val);
            snd.send(val);
        end
        AQFPtimingcheck1 TC0 (clk.localClk, in.data);
    endmodule //AQFPbuf1

SV. 14: AQFP: Majority gate
    module AQFPmaj (interface clkin, clkout, a, b, c, out);
        logicAQFP vala, valb, valc, valout;
        AQFPclockPhase clk (clkin, clkout);
        AQFPsendModule snd (out);
        always begin
            clk.waitForSamplingPt();
            fork
                a.sample(vala);
                b.sample(valb);
                c.sample(valc);
            join
            // any idle or unknown input invalidates the output; otherwise take the majority
            valout = ( (vala == qZ) || (valb == qZ) || (valc == qZ) ||
                       (vala == qX) || (valb == qX) || (valc == qX) ) ? qZ :
                     ( ((vala == q1) && (valb == q1)) ||
                       ((vala == q1) && (valc == q1)) ||
                       ((valb == q1) && (valc == q1)) ) ? q1 : q0;
            snd.send(valout);
        end
        AQFPtimingcheck1 TCa (clk.localClk, a.data);
        AQFPtimingcheck1 TCb (clk.localClk, b.data);
        AQFPtimingcheck1 TCc (clk.localClk, c.data);
    endmodule //AQFPmaj

SV. 15: AQFP: Splitter gate
    module AQFPsplitter (interface clkin, clkout, in, out1, out2);
        logicAQFP val;
        AQFPclockPhase clk (clkin, clkout);
        AQFPsendModule snd1 (out1);
        AQFPsendModule snd2 (out2);
        always begin
            clk.waitForSamplingPt();
            in.sample(val);
            fork
                snd1.send(val);
                snd2.send(val);
            join
        end
        AQFPtimingcheck1 TC0 (clk.localClk, in.data);
    endmodule //AQFPsplitter

SV.
16: AQFP: An instantiation example
    ioAQFP sigIn1(), sigIn2(), sigIn3(), sigMid1(), sigMid2(), sigOut1(), sigOut2();
    clkAQFP clkIn(), clkMid1(), clkMid2(), clkOut();
    AQFPmaj g0 (clkIn, clkMid1, sigIn1.rx, sigIn2.rx, sigIn3.rx, sigMid1.tx);
    AQFPbuf1 g1 (clkMid1, clkMid2, sigMid1.rx, sigMid2.tx);
    AQFPsplitter g2 (clkMid2, clkOut, sigMid2.rx, sigOut1.tx, sigOut2.tx);

SV. 17: AQFP: Helper module: AQFPmoduleDelay
    module AQFPmoduleDelay (out, in);
        input in;
        output out;
        buf g0 (out, in);
        specify
            specparam delay = tDelay;
            (in => out) = delay;
        endspecify
    endmodule //AQFPmoduleDelay

SV. 18: AQFP: Helper module: AQFPsendModule
    module AQFPsendModule (interface out);
        logic inDATA, outDATA, inPW, outPW;
        AQFPmoduleDelay gDATA (outDATA, inDATA);
        AQFPmoduleDelay gPW (outPW, inPW);
        initial begin
            inDATA = 1'b0;
            inPW = 1'b0;
        end
        task send;
            input logicAQFP val;
            inDATA = !inDATA;
            wait (outDATA == inDATA);   // propagation delay has elapsed
            out.send(val);
            inPW = !inPW;
            wait (outPW == inPW);       // data pulse width has elapsed
            out.resetSend();
        endtask
    endmodule //AQFPsendModule

SV. 19: AQFP: Helper module: AQFPclockPhase
    module AQFPclockPhase (interface clkin, clkout);
        AQFPclockDelay gPD ();
        dirAQFP xDir = noDir;
        dirAQFP dcDir = noDir;
        phaseAQFP phase = noPhase;
        logic localClk = 1'b0;
        logic error = 1'b0;

        always @(clkin.dcio, clkout.dcio)
            if ( (clkin.dcio == 1'b1) && (clkout.dcio == 1'b0) ) begin //posedge dcin
                dcDir = inToOut;
                clkout.dcio = 1'b1;
            end
            else if ( (clkin.dcio == 1'b0) && (clkout.dcio == 1'b1) ) begin //posedge dcout
                dcDir = outToIn;
                clkin.dcio = 1'b1;
            end
            else //negedge dcin or negedge dcout
                error = 1'b1;

        always @(clkin.xio)
            if (xDir==noDir) begin
                xDir = inToOut;
                gPD.tPD();
                clkout.xio <= clkin.xio;
            end
            else if ( (xDir==inToOut) && (clkout.xio != clkin.xio) ) begin
                gPD.tPD();
                clkout.xio <= clkin.xio;
            end
            else if ( (xDir==outToIn) && (clkin.xio == clkout.xio) )
                ;
            else
                error = 1'b1;

        always @(clkout.xio)
            if (xDir==noDir) begin
                xDir = outToIn;
                gPD.tPD();
                clkin.xio <= clkout.xio;
            end
            else if ( (xDir==outToIn) && (clkin.xio != clkout.xio) ) begin
                gPD.tPD();
                clkin.xio <= clkout.xio;
            end
            else if ( (xDir==inToOut) && (clkin.xio == clkout.xio) )
                ;
            else
                error = 1'b1;

        // the phase follows from the directions of the AC and DC excitation currents
        always @(dcDir, xDir)
            if ( (dcDir!=noDir) && (xDir!=noDir) )
                if ( (xDir==inToOut) && (dcDir==inToOut) )
                    phase = phase1;
                else if ( (xDir==outToIn) && (dcDir==inToOut) )
                    phase = phase3;
                else if ( (xDir==inToOut) && (dcDir==outToIn) )
                    phase = phase2;
                else if ( (xDir==outToIn) && (dcDir==outToIn) )
                    phase = phase4;

        task waitForSamplingPt;
            begin
                wait(phase!=noPhase);
                case (phase)
                    phase1: @(posedge clkin.xio);
                    phase2: @(negedge clkin.xio);
                    phase3: @(negedge clkout.xio);
                    phase4: @(posedge clkout.xio);
                endcase
                localClk <= 1'b1;
                localClk <= #localClkPW 1'b0;
            end
        endtask

        always @(posedge error) begin
            $display ("\n(Error)~ illegal transition in module %m at time=%g",$realtime);
            $finish;
        end
    endmodule //AQFPclockPhase

SV. 20: AQFP: Helper module: AQFPclockDelay
    module AQFPclockDelay;
        logic in, out;
        AQFPmoduleDelay g0 (out, in);
        initial begin
            in = 1'b0;
        end
        task tPD;
            in = !in;
            wait (out==in);
        endtask
    endmodule //AQFPclockDelay

SV. 21: AQFP: Helper module: Timing check
    module AQFPtimingcheck1 (clk, data);
        input clk;
        input [1:0] data;   // 2-bit encoding of logicAQFP
        specify
            specparam hold = thold, setup = tsetup;
            $setuphold(posedge clk, posedge data[0], setup, hold);
            $setuphold(posedge clk, posedge data[1], setup, hold);
        endspecify
    endmodule //AQFPtimingcheck1

Note that the majority gate needs to first verify that the sampled input values are valid; this is required to match the functionality of AQFP gates [33, 135].

The usage of the helper modules is summarized as follows. The AQFPmoduleDelay works the same way as its SFQ counterpart in Section B.2.3. The most essential helper module is AQFPclockPhase, which interprets the direction/polarity of the AC and DC excitation currents. This module instantiates AQFPclockDelay to model the delay of the excitation current within a gate.
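The decision table that AQFPclockPhase applies in SV. 19 can be restated compactly. The following Python sketch is a plain lookup of the phase and of the clock edge on which waitForSamplingPt samples. The function names decode_phase and sampling_edge are hypothetical, and the sketch only mirrors the table: it does not model delays or the illegal-transition checks.

```python
# Keys are (xDir, dcDir), i.e. the directions of the AC and DC
# excitation currents, using the dirAQFP names from SV. 10.
PHASE_TABLE = {
    ("inToOut", "inToOut"): "phase1",
    ("inToOut", "outToIn"): "phase2",
    ("outToIn", "inToOut"): "phase3",
    ("outToIn", "outToIn"): "phase4",
}

# Edge and signal that waitForSamplingPt waits for in each phase.
SAMPLING_EDGE = {
    "phase1": ("posedge", "clkin.xio"),
    "phase2": ("negedge", "clkin.xio"),
    "phase3": ("negedge", "clkout.xio"),
    "phase4": ("posedge", "clkout.xio"),
}


def decode_phase(x_dir: str, dc_dir: str) -> str:
    """Return the phase name, or 'noPhase' while a direction is unknown."""
    if x_dir == "noDir" or dc_dir == "noDir":
        return "noPhase"
    return PHASE_TABLE[(x_dir, dc_dir)]


def sampling_edge(phase: str) -> tuple:
    """Return (edge, signal) for a resolved phase."""
    return SAMPLING_EDGE[phase]


if __name__ == "__main__":
    for x_dir in ("inToOut", "outToIn"):
        for dc_dir in ("inToOut", "outToIn"):
            phase = decode_phase(x_dir, dc_dir)
            print(x_dir, dc_dir, phase, sampling_edge(phase))
```

Listing the four combinations side by side shows that the AC direction selects which clock port is observed, while the DC direction selects the edge polarity on that port.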
AQFPsendModule differs from its SFQ counterpart because it has an additional delay to model: the time duration of the logic '1' or '0' before the output returns to its idle state. The timing check module takes advantage of the 2-bit nature of the enumerated data type logicAQFP, and distinguishes between the setup and hold of the rising transition of the current representing logic '1' and the falling transition of the current representing logic '0' [36, 135].

From an SDF compatibility point of view, AQFPmoduleDelay is instantiated in AQFPsendModule to support the pin-to-pin and interconnect delay modeling styles, as well as data pulse width propagation. In addition, this module is used in AQFPclockDelay to model the intra-gate clock skew.

The final point worth mentioning in the AQFP section is the boundaries of the excitation current connections. In the instantiation example shown in SV. 16, the output clock signals of a gate are used as inputs to the next one in the row. However, this is not the case at the borders of the row. Figure B.1 shows an example of 4-phase AQFP inter-row clocking as suggested in [37]. We suggest the use of special wiring cells to do this wiring intertwining. SV. 22 shows the only two possible ways of doing it for two-phase clocking, and an example of how to do it for four-phase clocking (same as in Figure B.1).

Figure B.1: Example of 4-phase AQFP inter-rows clocking [37].

SV.
22: AQFP: Wiring cells for rows borders
    //2-phase: type-I
    module AQFPclkWiring2P_I (clkAQFP clkin, clkout);
        // each always block waits for a change of its source signal
        always @(clkin.xio) clkout.xio <= clkin.xio;
        always @(clkin.dcio) clkout.dcio <= clkin.dcio;
    endmodule //AQFPclkWiring2P_I

    //2-phase: type-II
    module AQFPclkWiring2P_II (clkAQFP clkxin, clkxout);
        always @(clkxin.xio) clkxout.xio <= clkxin.xio;
        always @(clkxout.dcio) clkxin.dcio <= clkxout.dcio;
    endmodule //AQFPclkWiring2P_II

    //4-phase: type-I
    module AQFPclkWiring4P_I (clkAQFP clkin, clkout, clkdcout, clkxout);
        always @(clkin.dcio) clkxout.dcio <= clkin.dcio;
        always @(clkin.xio) clkdcout.xio <= clkin.xio;
        always @(clkxout.xio) clkout.xio <= clkxout.xio;
        always @(clkdcout.dcio) clkout.dcio <= clkdcout.dcio;
    endmodule //AQFPclkWiring4P_I

B.4 SFQ/AQFP Interface

As discussed in Section 7.3.1, SFQ offers low power at high performance, while AQFP offers ultra-low power at a lower performance. Hybrid SFQ-AQFP systems [34, 35] seem to be a good option when targeting both low power and high performance, a combination that many applications demand. For parts of the chip that are not critical to performance, AQFP could be used to reduce the overall power budget of the SoC (System-on-Chip). Moreover, SFQ PTLs could be used as a long interconnect option between AQFP blocks [34, 35], since long interconnects are indeed a challenge for AQFP [33].

The work in [35] proposed a circuit that functions as an interface from SFQ to AQFP; the circuit diagram is shown in Figure B.2. They use an AQFP buffer cell that probes, through mutual inductance, the internal state of an SFQ DFF cell. The AQFP cell needs to be AC clocked as usual, and the SFQ cell needs its own SFQ clock. The authors of [35] did not discuss any concerns about the time-domain crossing of such a design, but this is outside the scope of this manuscript; our exclusive focus is on modeling it. In the other direction, the work in [34] proposed a circuit to interface AQFP to SFQ; the circuit diagram is shown in Figure B.3. They use an AQFP buffer followed by special circuitry mounted on an SFQ DFF.
When the AQFP AC clock triggers the buffer, the SFQ DFF interferometer senses the correct value by mutual induction, and produces an output pulse accordingly.

Figure B.2: Circuit schematic of the SFQ/AQFP interface [35].

Figure B.3: Circuit schematic of the AQFP/SFQ interface [34].

With any of the models discussed in Section 7.3.2, given their complexity and lack of modularity, it would be cumbersome to write an HDL description of the cells that interface SFQ and AQFP. However, using our proposed interfaces, such a task becomes rather straightforward. SV. 23 and SV. 24 show the models of the above-mentioned cells.

SV. 23: SFQ-to-AQFP interface cell
    module SFQtoAQFP (interface SFQclkin, SFQin, AQFPclkin, AQFPclkout, AQFPout);
        SFQ dumpOut ();
        SFQbuf1 DFF0 (SFQclkin, SFQin, dumpOut);
        logicAQFP val;
        AQFPclockPhase clk (AQFPclkin, AQFPclkout);
        AQFPsendModule snd (AQFPout);
        always begin
            clk.waitForSamplingPt();
            // probe the stored state of the SFQ DFF at the AQFP sampling point
            val = SFQin.isStored(1'b1) ? q1 : q0;
            snd.send(val);
        end
        SFQtimingcheck1 TC0 (clk.localClk, SFQin.data);
    endmodule //SFQtoAQFP

SV. 24: AQFP-to-SFQ interface cell
    module AQFPtoSFQ (interface AQFPin, AQFPclkin, SFQout);
        logicAQFP val;
        clkAQFP dumpClkOut ();
        AQFPclockPhase clk (AQFPclkin, dumpClkOut);
        SFQpropagationDelay gPD ();
        always begin
            clk.waitForSamplingPt();
            AQFPin.sample(val);
            // only a logic '1' produces an SFQ output pulse
            if (val==q1) begin
                gPD.tPD();
                SFQout.send();
            end
        end
        AQFPtimingcheck1 TC0 (clk.localClk, AQFPin.data);
    endmodule //AQFPtoSFQ

Note that in the SFQ-to-AQFP cell, the interfaces SFQclkin and SFQin have to be of the SFQ type, AQFPout is of the ioAQFP type, and AQFPclkin and AQFPclkout are of the clkAQFP type. For the AQFP-to-SFQ cell, AQFPin is of the ioAQFP type, AQFPclkin is of the clkAQFP type, and SFQout is of the SFQ type.

References

[1] D. A. Reed and J. Dongarra, "Exascale computing and big data," Communications of the ACM, vol. 58, no. 7, pp. 56–68, 2015.
[2] O. A.
Mukhanov, "Energy-efficient single flux quantum technology," IEEE Transactions on Applied Superconductivity, vol. 21, no. 3, pp. 760–769, 2011.
[3] ITRS, "International technology roadmap for semiconductors 2.0: Beyond CMOS," 2015.
[4] K. Likharev and V. Semenov, "RSFQ logic/memory family: A new Josephson-junction technology for sub-terahertz-clock-frequency digital systems," IEEE Transactions on Applied Superconductivity, vol. 1, no. 1, pp. 3–28, 1991.
[5] K. Gaj, Q. P. Herr, V. Adler, A. Krasniewski, E. G. Friedman, and M. J. Feldman, "Tools for the computer-aided design of multigigahertz superconducting digital circuits," IEEE Transactions on Applied Superconductivity, vol. 9, no. 1, pp. 18–38, 1999.
[6] C. J. Fourie and M. H. Volkmann, "Status of superconductor electronic circuit design software," IEEE Transactions on Applied Superconductivity, vol. 23, no. 3, pp. 1 300 205–1 300 205, 2013.
[7] IRDS, "International roadmap for devices and systems (IRDS): 2017 edition, beyond CMOS," 2017.
[8] P. Bunyk, K. Likharev, and D. Zinoviev, "RSFQ technology: Physics and devices," International journal of high speed electronics and systems, vol. 11, no. 01, pp. 257–305, 2001.
[9] I. V. Vernik, Q. P. Herr, K. Gaj, and M. J. Feldman, "Experimental investigation of local timing parameter variations in RSFQ circuits," IEEE Transactions on Applied Superconductivity, vol. 9, no. 2, pp. 4341–4344, 1999.
[10] K. Gaj, E. G. Friedman, and M. J. Feldman, "Timing of multi-gigahertz rapid single flux quantum digital circuits," Journal of VLSI signal processing systems for signal, image and video technology, vol. 16, no. 2-3, pp. 247–276, 1997.
[11] Y. Kameda, S. Polonsky, M. Maezawa, and T. Nanya, "Primitive-level pipelining method on delay-insensitive model for RSFQ pulse-driven logic," in Advanced Research in Asynchronous Circuits and Systems, 1998. Proceedings. 1998 Fourth International Symposium on. IEEE, 1998, pp. 262–273.
[12] M. Ito, K. Kawasaki, N. Yoshikawa, A.
Fujimaki, H. Terai, and S. Yorozu, "20 GHz operation of bit-serial handshaking systems using asynchronous SFQ logic circuits," IEEE Transactions on Applied Superconductivity, vol. 15, no. 2, pp. 255–258, 2005.
[13] M. Dorojevets, P. Bunyk, and D. Zinoviev, "FLUX chip: Design of a 20-GHz 16-bit ultrapipelined RSFQ processor prototype based on 1.75-μm LTS technology," IEEE Transactions on Applied Superconductivity, vol. 11, no. 1, pp. 326–332, 2001.
[14] C. A. Mancini, N. Vukovic, A. M. Herr, K. Gaj, M. F. Bocko, and M. J. Feldman, "RSFQ circular shift registers," IEEE Transactions on Applied Superconductivity, vol. 7, no. 2, pp. 2832–2835, 1997.
[15] A. M. Herr, M. J. Feldman, and M. F. Bocko, "Timing jitter and bit errors in a 64-bit circular shift register," IEEE Transactions on Applied Superconductivity, vol. 9, no. 2, pp. 3721–3724, 1999.
[16] Y. Ando, R. Sato, M. Tanaka, K. Takagi, N. Takagi, and A. Fujimaki, "Design and demonstration of an 8-bit bit-serial RSFQ microprocessor: CORE e4," IEEE Transactions on Applied Superconductivity, vol. 26, no. 5, pp. 1–5, Aug 2016.
[17] N. Yoshikawa, F. Matsuzaki, N. Nakajima, K. Fujiwara, K. Yoda, and K. Kawasaki, "Design and component test of a tiny processor based on the SFQ technology," IEEE Transactions on Applied Superconductivity, vol. 13, no. 2, pp. 441–445, June 2003.
[18] Z. J. Deng, N. Yoshikawa, S. R. Whiteley, and T. Van Duzer, "Data-driven self-timed RSFQ digital integrated circuit and system," IEEE Transactions on Applied Superconductivity, vol. 7, no. 2, pp. 3634–3637, 1997.
[19] H. R. Gerber, C. J. Fourie, W. J. Perold, and L. C. Muller, "Design of an asynchronous microprocessor using RSFQ-AT," IEEE Transactions on Applied Superconductivity, vol. 17, no. 2, pp. 490–493, 2007.
[20] L. Müller, H. Gerber, and C. Fourie, "Review and comparison of RSFQ asynchronous methodologies," in Journal of Physics: Conference Series, vol. 97, no. 1. IOP Publishing, 2008, p. 012109.
[21] E. Sprangle and D.
Carmean, "Increasing processor performance by implementing deeper pipelines," in ACM SIGARCH Computer Architecture News, vol. 30, no. 2. IEEE Computer Society, 2002, pp. 25–34.
[22] P. A. Beerel, R. O. Ozdag, and M. Ferretti, A Designer's Guide to Asynchronous VLSI. Cambridge University Press, 2010.
[23] R. N. Tadros and P. A. Beerel, "A robust and tree-free hybrid clocking technique for RSFQ circuits – CSR application," 2017 16th International Superconductive Electronics Conference (ISEC), pp. 1–4, 2017.
[24] C. J. Fourie, W. J. Perold, and H. R. Gerber, "Complete Monte Carlo model description of lumped-element RSFQ logic circuits," IEEE Transactions on Applied Superconductivity, vol. 15, no. 2, pp. 384–387, 2005.
[25] K. Gaj, Q. Herr, and M. Feldman, "Parameter variations and synchronization of RSFQ circuits," in Conference Series-Institute of Physics, vol. 148. IOP Publishing LTD, 1995, pp. 1733–1736.
[26] M. E. Çelik and A. Bozbey, "A statistical approach to delay, jitter and timing of signals of RSFQ wiring cells and clocked gates," IEEE Transactions on Applied Superconductivity, vol. 23, no. 3, pp. 1 701 305–1 701 305, 2013.
[27] MIT Lincoln Laboratory, "MIT-LL 10 kA/cm² SFQ Fabrication Process: SFQ5ee Design Rules," 2015, version 1.2.
[28] T. Murata, "Petri nets: Properties, analysis and applications," Proceedings of the IEEE, vol. 77, no. 4, pp. 541–580, 1989.
[29] D. Bryan, "The ISCAS'85 benchmark circuits and netlist format," North Carolina State University, vol. 25, 1985.
[30] A. Barone and G. Paterno, Physics and applications of the Josephson effect. Wiley Online Library, 1982, vol. 1.
[31] T. Gheewala, "The Josephson technology," Proceedings of the IEEE, vol. 70, no. 1, pp. 26–34, 1982.
[32] J. Matisoo, "The tunneling cryotron - A superconductive logic element based on electron tunneling," Proceedings of the IEEE, vol. 55, no. 2, pp. 172–180, 1967.
[33] N. Takeuchi, D. Ozawa, Y. Yamanashi, and N.
Yoshikawa, "An adiabatic quantum flux parametron as an ultra-low-power logic device," Superconductor Science and Technology, vol. 26, no. 3, p. 035010, 2013.
[34] T. Narama, N. Takeuchi, T. Ortlepp, Y. Yamanashi, and N. Yoshikawa, "Design and demonstration of interface circuits between rapid single-flux-quantum and adiabatic quantum-flux-parametron circuits," IEEE Transactions on Applied Superconductivity, vol. 26, no. 5, pp. 1–5, 2016.
[35] N. Tsuji, T. Narama, N. Takeuchi, T. Ortlepp, Y. Yamanashi, and N. Yoshikawa, "Demonstration of signal transmission between adiabatic quantum-flux-parametrons and rapid single-flux-quantum circuits using superconductive microstrip lines," IEEE Transactions on Applied Superconductivity, vol. 27, no. 4, pp. 1–5, 2017.
[36] K. Inoue, N. Takeuchi, T. Narama, Y. Yamanashi, and N. Yoshikawa, "Design and demonstration of adiabatic quantum-flux-parametron logic circuits with superconductor magnetic shields," Superconductor Science and Technology, vol. 28, no. 4, p. 045020, 2015.
[37] Q. Xu, C. L. Ayala, N. Takeuchi, Y. Yamanashi, and N. Yoshikawa, "HDL-based cell library for AQFP logic using 4-phase clock," in 9th Superconducting SFQ VLSI Workshop (SSV 2016), 2016.
[38] N. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective. Addison Wesley Publishing Company Incorporated, 2011.
[39] D. Kirichenko, S. Sarwana, and A. Kirichenko, "Zero static power dissipation biasing of RSFQ circuits," IEEE Transactions on Applied Superconductivity, vol. 21, no. 3, p. 776, 2011.
[40] A. F. Kirichenko, I. V. Vernik, J. A. Vivalda, R. T. Hunt, and D. T. Yohannes, "ERSFQ 8-bit parallel adders as a process benchmark," IEEE Transactions on Applied Superconductivity, vol. 25, no. 3, pp. 1–5, 2015.
[41] V. Michal, E. Baggetta, M. Aurino, S. Bouat, and J.-C. Villegier, "Superconducting RSFQ logic: Towards 100 GHz digital electronics," in Radioelektronika (RADIOELEKTRONIKA), 2011 21st International Conference. IEEE, 2011, pp. 1–8.
[42] C. Antoine, M. Aburas, A. Four, H. Hayano, Y. Iwashita, S. Kato, T. Kubo, and T. Saeki, "Progress on characterization and optimization of multilayers," in SRF, 2017.
[43] J. Kang and S. Kaplan, "Current recycling and SFQ signal transfer in large scale RSFQ circuits," IEEE Transactions on Applied Superconductivity, vol. 13, no. 2, pp. 547–550, 2003.
[44] K. Sano, T. Shimoda, Y. Abe, Y. Yamanashi, N. Yoshikawa, N. Zen, and M. Ohkubo, "Reduction of the supply current of single-flux-quantum time-to-digital converters by current recycling techniques," IEEE Trans. Appl. Supercond., vol. 27, p. 1, 2017.
[45] I. Vernik, S. Kaplan, M. Volkmann, A. Dotsenko, C. Fourie, and O. Mukhanov, "Design and test of asynchronous eSFQ circuits," Superconductor Science and Technology, vol. 27, no. 4, p. 044030, 2014.
[46] Y. Polyakov, S. Narayana, and V. K. Semenov, "Flux trapping in superconducting circuits," IEEE Transactions on Applied Superconductivity, vol. 17, no. 2, pp. 520–525, 2007.
[47] B. Ebert, T. Ortlepp, and F. H. Uhlmann, "Experimental study of the effect of flux trapping on the operation of RSFQ circuits," IEEE Transactions on Applied Superconductivity, vol. 19, no. 3, pp. 607–610, 2009.
[48] A. Malakhov and A. Pankratov, "Influence of thermal fluctuations on time characteristics of a single Josephson element with high damping exact solution," Physica C: Superconductivity, vol. 269, no. 1, pp. 46–54, 1996.
[49] M. E. Çelik and A. Bozbey, "Analysis of delay and jitter of rapid single flux quantum wiring cells," Journal of superconductivity and novel magnetism, vol. 26, no. 5, pp. 1811–1819, 2013.
[50] D. Amparo, M. Çelik, and A. Inamdar, "Timing characterization for RSFQ/ERSFQ library cells," in Applied Superconductivity Conference (ASC). IEEE, October 2018.
[51] E. G. Friedman, "Clock distribution networks in synchronous digital integrated circuits," Proceedings of the IEEE, vol. 89, no. 5, pp. 665–692, 2001.
[52] R. N. Tadros and P. A.
Beerel, "A theoretical foundation for timing synchronous systems using asynchronous structures," 2019.
[53] H. Hulgaard, S. M. Burns, T. Amon, and G. Borriello, "An algorithm for exact bounds on the time separation of events in concurrent systems," IEEE Transactions on Computers, vol. 44, no. 11, pp. 1306–1317, 1995.
[54] W. Hua and R. Manohar, "Exact timing analysis for asynchronous systems," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2017.
[55] R. H. Havemann and J. A. Hutchby, "High-performance interconnects: An integration overview," Proceedings of the IEEE, vol. 89, no. 5, pp. 586–601, 2001.
[56] C. Svensson and M. Afghahi, "On RC line delays and scaling in VLSI systems," Electronics Letters, vol. 24, no. 9, pp. 562–563, 1988.
[57] W. C. Elmore, "The transient response of damped linear networks with particular regard to wideband amplifiers," Journal of applied physics, vol. 19, no. 1, pp. 55–63, 1948.
[58] M. R. Guthaus, G. Wilke, and R. Reis, "Revisiting automated physical synthesis of high-performance clock networks," ACM Transactions on Design Automation of Electronic Systems (TODAES), vol. 18, no. 2, p. 31, 2013.
[59] P. J. Restle and A. Deutsch, "Designing the best clock distribution network," in VLSI Circuits, 1998. Digest of Technical Papers. 1998 Symposium on. IEEE, 1998, pp. 2–5.
[60] A. Agarwal, D. Blaauw, and V. Zolotov, "Statistical clock skew analysis considering intra-die process variations," in Proceedings of the 2003 IEEE/ACM international conference on Computer-aided design. IEEE Computer Society, 2003, p. 914.
[61] G. Geannopoulos and X. Dai, "An adaptive digital deskewing circuit for clock distribution networks," in Solid-State Circuits Conference, 1998. Digest of Technical Papers. 1998 IEEE International. IEEE, 1998, pp. 400–401.
[62] A. L. Pankratov and B. Spagnolo, "Suppression of timing errors in short overdamped Josephson junctions," Physical review letters, vol. 93, no. 17, p. 177001, 2004.
[63] O. A. Mukhanov, "Rapid single flux quantum (RSFQ) shift register family," IEEE Transactions on Applied Superconductivity, vol. 3, no. 1, pp. 2578–2581, 1993.
[64] Z. J. Deng, N. Yoshikawa, J. A. Tierno, S. R. Whiteley, and T. Van Duzer, "Asynchronous circuits and systems in superconducting RSFQ digital technology," in Advanced Research in Asynchronous Circuits and Systems, 1998. Proceedings. 1998 Fourth International Symposium on. IEEE, 1998, pp. 274–285.
[65] M. Maezawa, I. Kurosawa, Y. Kameda, and T. Nanya, "Pulse-driven dual-rail logic gate family based on rapid single-flux-quantum (RSFQ) devices for asynchronous circuits," in Advanced Research in Asynchronous Circuits and Systems, 1996. Proceedings., Second International Symposium on. IEEE, 1996, pp. 134–142.
[66] M. Maezawa, I. Kurosawa, M. Aoyagi, H. Nakagawa, Y. Kameda, and T. Nanya, "Rapid single-flux-quantum dual-rail logic for asynchronous circuits," IEEE Transactions on Applied Superconductivity, vol. 7, no. 2, pp. 2705–2708, 1997.
[67] P. Patra, S. Polonsky, and D. S. Fussell, "Delay insensitive logic for RSFQ superconductor technology," in Advanced Research in Asynchronous Circuits and Systems, 1997. Proceedings., Third International Symposium on. IEEE, 1997, pp. 42–53.
[68] C. K. Teh and Y. Okabe, "New BSFQ circuit designs with wide margins," IEEE Transactions on Applied Superconductivity, vol. 11, no. 1, pp. 970–973, 2001.
[69] C. K. Teh and Y. Okabe, "A novel global self-timing methodology for BSFQ circuits," IEEE Transactions on Applied Superconductivity, vol. 13, no. 2, pp. 543–546, 2003.
[70] H. R. Gerber, C. J. Fourie, and W. J. Perold, "Optimised asynchronous self-timing for superconducting RSFQ logic circuits," in AFRICON, 2004. 7th AFRICON Conference in Africa, vol. 1. IEEE, 2004, pp. 551–556.
[71] H. R. Gerber, C. J. Fourie, and W. J. Perold, "RSFQ-asynchronous timing (RSFQ-AT): A new design methodology for implementation in CAD automation," IEEE Transactions on Applied Superconductivity, vol. 15, no. 2, pp. 272–275, 2005.
[72] H.
Gerber, C. Fourie, and W. Perold, "Optimised asynchronous timing for superconductive digital circuits," South African Institute of Electrical Engineers, vol. 97, pp. 255–260, 2006.
[73] M. Dorojevets, C. Ayala, and A. Kasperek, "Development and evaluation of design techniques for high-performance wave-pipelined wide datapath RSFQ processors," in Proceedings of the International Superconductive Electronics Conference (ISEC), 2009, p. 46.
[74] M. Dorojevets, C. L. Ayala, and A. K. Kasperek, "Data-flow microarchitecture for wide datapath RSFQ processors: Design study," IEEE Transactions on Applied Superconductivity, vol. 21, no. 3, pp. 787–791, 2011.
[75] P. Bunyk, M. Leung, J. Spargo, and M. Dorojevets, "FLUX-1 RSFQ microprocessor: Physical design and test results," IEEE Transactions on Applied Superconductivity, vol. 13, no. 2, pp. 433–436, 2003.
[76] M. Dorojevets and P. Bunyk, "Architectural and implementation challenges in designing high-performance RSFQ processors: A FLUX-1 microprocessor and beyond," IEEE Transactions on Applied Superconductivity, vol. 13, no. 2, pp. 446–449, 2003.
[77] A. Fujimaki, M. Tanaka, T. Yamada, Y. Yamanashi, H. Park, and N. Yoshikawa, "Bit-serial single flux quantum microprocessor CORE," IEICE transactions on electronics, vol. 91, no. 3, pp. 342–349, 2008.
[78] T. Filippov, M. Dorojevets, A. Sahu, A. Kirichenko, C. Ayala, and O. Mukhanov, "8-bit asynchronous wave-pipelined RSFQ arithmetic-logic unit," IEEE Transactions on Applied Superconductivity, vol. 21, no. 3, pp. 847–851, 2011.
[79] M. Dorojevets, C. L. Ayala, N. Yoshikawa, and A. Fujimaki, "16-bit wave-pipelined sparse-tree RSFQ adder," IEEE Transactions on Applied Superconductivity, vol. 23, no. 3, pp. 1 700 605–1 700 605, 2013.
[80] Y. Sakashita, Y. Yamanashi, and N. Yoshikawa, "High-speed operation of an SFQ butterfly processing circuit for FFT processors using the 10 kA/cm² Nb process," IEEE Transactions on Applied Superconductivity, vol. 25, no. 3, pp. 1–5, 2015.
[81] M. R. Guthaus, D. Sylvester, and R. B. Brown, "Clock tree synthesis with data-path sensitivity matching," in Proceedings of the 2008 Asia and South Pacific Design Automation Conference. IEEE Computer Society Press, 2008, pp. 498–503.
[82] P. A. Beerel, R. O. Ozdag, and M. Ferretti, A Designer's Guide to Asynchronous VLSI. Cambridge University Press, 2010.
[83] M. Amde, T. Felicijan, A. Efthymiou, D. Edwards, and L. Lavagno, "Asynchronous on-chip networks," IEE Proceedings-Computers and Digital Techniques, vol. 152, no. 2, pp. 273–283, 2005.
[84] M. Krstic, E. Grass, F. K. Gürkaynak, and P. Vivet, "Globally asynchronous, locally synchronous circuits: Overview and outlook," IEEE Design & Test of Computers, vol. 24, no. 5, pp. 430–441, 2007.
[85] E. Kasapaki, M. Schoeberl, R. B. Sørensen, C. Müller, K. Goossens, and J. Sparsø, "Argo: A real-time network-on-chip architecture with an efficient GALS implementation," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 2, pp. 479–492, 2016.
[86] I. David, R. Ginosar, and M. Yoeli, "Implementing sequential machines as self-timed circuits," IEEE Transactions on Computers, vol. 41, no. 1, pp. 12–17, 1992.
[87] D. Sokolov, J. Murphy, A. Bystrov, and A. Yakovlev, "Design and analysis of dual-rail circuits for security applications," IEEE Transactions on Computers, vol. 54, no. 4, pp. 449–460, 2005.
[88] J. Cortadella, M. Lupon, A. Moreno, A. Roca, and S. S. Sapatnekar, "Ring oscillator clocks and margins," in Asynchronous Circuits and Systems (ASYNC), 2016 22nd IEEE International Symposium on. IEEE, 2016, pp. 19–26.
[89] T. Murata, "Method for realizing the synchronic distance matrix of a marked graph," in Proc. IEEE Int. Symp. Circuits Syst., Rome, Italy, May 1982, vol. 2, 1982, pp. 609–612.
[90] C. Ramamoorthy and G. S. Ho, "Performance evaluation of asynchronous concurrent systems using Petri nets," IEEE Transactions on software Engineering, no. 5, pp. 440–449, 1980.
[91] D.
Leu, "Properties and applications of the token distance matrix of a marked graph," in Proceedings IEEE 1984 International Symposium on Circuits and Systems, 1984, pp. 1381–1385.
[92] J. Magott, "Performance evaluation of concurrent systems using Petri nets," Information Processing Letters, vol. 18, no. 1, pp. 7–13, 1984.
[93] R. Reiter, "Scheduling parallel computations," Journal of the ACM (JACM), vol. 15, no. 4, pp. 590–599, 1968.
[94] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain et al., "Loihi: A neuromorphic manycore processor with on-chip learning," IEEE Micro, vol. 38, no. 1, pp. 82–99, 2018.
[95] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G.-J. Nam et al., "Truenorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 10, pp. 1537–1557, 2015.
[96] H. Hulgaard, S. M. Burns, T. Amon, and G. Borriello, "An algorithm for exact bounds on the time separation of events in concurrent systems," Technical Report TR #94-02-02, Univ. of Washington, Dept. of Computer Science and Eng., Feb. 1994, (available via anonymous ftp:cs.washington.edu:tr/1994/02/UW-CSE-94-02-02.PS.Z).
[97] R. N. Tadros and P. A. Beerel, "A robust and self-adaptive clocking technique for RSFQ circuits - the architecture," in Circuits and Systems (ISCAS), 2018 IEEE International Symposium on. IEEE, 2018.
[98] R. N. Tadros and P. A. Beerel, "A robust and self-adaptive clocking technique for SFQ circuits," IEEE Transactions on Applied Superconductivity, vol. 28, no. 7, pp. 1–11, Oct 2018.
[99] N. K. Katam and M. Pedram, "Logic optimization, complex cell design, and retiming of single flux quantum circuits," IEEE Transactions on Applied Superconductivity, vol. 28, no. 7, pp. 1–9, Oct 2018.
[100] R. N. Tadros and P. A.
Beerel, "Optimizing (HC)²LC, a robust clock distribution network for SFQ circuits," 2019.
[101] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to algorithms. MIT press, 2009.
[102] A. B. Kahng, J. Lienig, I. L. Markov, and J. Hu, VLSI physical design: from graph partitioning to timing closure. Springer Science & Business Media, 2011.
[103] J. Kim, "Geometric partitioning for parallelizing post-placement VLSI design processes," Ph.D. dissertation, University of Michigan, 2006.
[104] S. N. Shahsavani, A. Shafaei, and M. Pedram, "A placement algorithm for superconducting logic circuits based on cell grouping and super-cell placement," in 2018 Design, Automation Test in Europe Conference Exhibition (DATE), March 2018, pp. 1465–1468.
[105] B. W. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs," The Bell system technical journal, vol. 49, no. 2, pp. 291–307, 1970.
[106] M. J. Berger and S. H. Bokhari, "A partitioning strategy for nonuniform problems on multiprocessors," IEEE Transactions on Computers, no. 5, pp. 570–580, 1987.
[107] C. Kingsford, "Lecture notes in bioinformatics," August 2010.
[108] S. Dutt, "New faster Kernighan-Lin-Type graph-partitioning algorithms," in Proceedings of the 1993 IEEE/ACM international conference on Computer-aided design. IEEE Computer Society Press, 1993, pp. 370–377.
[109] M. H. Volkmann, I. V. Vernik, and O. A. Mukhanov, "Wave-pipelined eSFQ circuits," IEEE Transactions on Applied Superconductivity, vol. 25, no. 3, pp. 1–5, 2015.
[110] T. Narama, Y. Yamanashi, N. Takeuchi, T. Ortlepp, and N. Yoshikawa, "Demonstration of 10k gate-scale adiabatic-quantum-flux-parametron circuits," in Superconductive Electronics Conference (ISEC), 2015 15th International. IEEE, 2015, pp. 1–3.
[111] S. Tam, S. Rusu, U. N. Desai, R. Kim, J. Zhang, and I. Young, "Clock generation and distribution for the first IA-64 microprocessor," IEEE Journal of Solid-State Circuits, vol. 35, no.
11, pp. 1545{1552, 2000. [112] J. Yuan and C. Svensson, \High-speed CMOS circuit technique," IEEE Journal of Solid-State Circuits, vol. 24, no. 1, pp. 62{70, 1989. [113] H. B. Bakoglu, Circuits, Interconnections, and Packaging for VLSI. Addison Wesley, 1990. [114] A. Mishchenko et al., \ABC: A system for sequential synthesis and verication," URL http://www. eecs. berkeley. edu/~ alanmi/abc, 2007. [115] \IEEE standard for SystemVerilog{unied hardware design, specication, and ver- ication language." IEEE Std 1800-2017 (Revision of IEEE Std 1800-2012), pp. 1{1315, February 2018, https://standards.ieee.org/standard/1800-2017.html. [116] R. N. Tadros, A. Fayyazi, M. Pedram, and P. A. Beerel, \SystemVerilog modeling of SFQ and AQFP circuits," 2019. [117] N. Katam, A. Shafaei, and M. Pedram, \Design of complex rapid single- ux- quantum cells with application to logic synthesis," in Superconductive Electronics Conference (ISEC), 2017 16th International. IEEE, 2017, pp. 1{3. [118] S. N. Shahsavani, T. Lin, A. Shafaei, C. J. Fourie, and M. Pedram, \An integrated row-based cell placement and interconnect synthesis tool for large SFQ logic cir- cuits," IEEE Transactions on Applied Superconductivity, vol. 27, no. 4, pp. 1{8, June 2017. [119] C. Fourie, \Single ux quantum circuit technology and CAD overview," in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Nov 2018, pp. 1{6. [120] \IARPA SuperTools Program," https://www.iarpa.gov/index.php/research- programs/supertools/supoertools-baa. 236 [121] C. J. Fourie, K. Jackman, M. M. Botha, S. Razmkhah, P. Febvre, C. L. Ayala, Q. Xu, N. Yoshikawa, E. Patrick, M. Law, Y. Wang, M. Annavaram, P. A. Beerel, S. Gupta, S. Nazarian, and M. Pedram, \Cold ux superconducting EDA and TCAD tools project: Overview and progress," IEEE Transactions on Applied Superconductivity, pp. 1{1, 2019. [122] E. S. 
Fang, \A josephson integrated circuit simulator (JSIM) for superconductive electronics application," in Extended Abstracts of 1989 International Superconduc- tivity Electronics Conf.(The Japan Society of Applied Physics, Tokyo, 1989), 1989. [123] P. Shevchenko, \PSCAN2 superconducting circuit simulator," 2016, http://www.pscan2sim.org. [124] \IEEE standard for standard delay format (SDF) for the electronic design process," IEEE Std 1497-2001, 2001. [125] A. Krasniewski, \Logic simulation of RSFQ circuits," IEEE transactions on applied superconductivity, vol. 3, no. 1, pp. 33{38, 1993. [126] O. Mukhanov, S. Rylov, V. Semonov, and S. Vyshenskii, \RSFQ logic arithmetic," IEEE Transactions on Magnetics, vol. 25, no. 2, pp. 857{860, 1989. [127] T. V. Filippov, A. Sahu, A. F. Kirichenko, I. V. Vernik, M. Dorojevets, C. L. Ayala, and O. A. Mukhanov, \20 GHz operation of an asynchronous wave-pipelined RSFQ arithmetic-logic unit," Physics Procedia, vol. 36, pp. 59{65, 2012. [128] K. Gaj, C.-H. Cheah, E. G. Friedman, and M. J. Feldman, \Functional modeling of RSFQ circuits using verilog HDL," IEEE transactions on applied superconductivity, vol. 7, no. 2, pp. 3151{3154, 1997. [129] N. Yoshikawa and J. Koshiyama, \Top-down RSFQ logic design based on a binary decision diagram," IEEE transactions on applied superconductivity, vol. 11, no. 1, pp. 1098{1101, 2001. [130] S. Intiso, I. Kataeva, E. Tolkacheva, H. Engseth, K. Platov, and A. Kidiyarova- Shevchenko, \Time-delay optimization of RSFQ cells," IEEE transactions on applied superconductivity, vol. 15, no. 2, pp. 328{331, 2005. [131] L. C. Muller and C. J. Fourie, \Automated state machine and timing characteristic extraction for RSFQ circuits," IEEE Transactions on Applied Superconductivity, vol. 24, no. 1, pp. 3{12, 2014. [132] C. J. Fourie, \Extraction of DC-biased SFQ circuit Verilog models," IEEE Trans- actions on Applied Superconductivity, vol. 28, no. 6, pp. 1{11, 2018. [133] C. L. Ayala, N. Takeuchi, and Q. 
Xu, \Timing extraction for logic simulation of VLSI adiabatic quantum- ux-parametron circuits," IEICE technical report, vol. 115, no. 242, pp. 7{12, 2015. 237 [134] V. Adler, C.-H. Cheah, K. Gaj, D. K. Brock, and E. G. Friedman, \A Cadence- based design environment for single ux quantum circuits," IEEE transactions on applied superconductivity, vol. 7, no. 2, pp. 3294{3297, 1997. [135] Q. Xu, C. L. Ayala, N. Takeuchi, Y. Yamanashi, and N. Yoshikawa, \HDL-based modeling approach for digital simulation of adiabatic quantum ux parametron logic," IEEE Transactions on Applied Superconductivity, vol. 26, no. 8, pp. 1{5, 2016. [136] F. Matsuzaki, N. Yoshikawa, M. Tanaka, A. Fujimaki, and Y. Takai, \A behavioral- level HDL description of SFQ logic circuits for quantitative performance analysis of large-scale SFQ digital systems," Physica C: Superconductivity, vol. 392, pp. 1495{1500, 2003. [137] \IEEE standard VHDL language reference manual," IEEE Std 1076-2002 (Revi- sion of IEEE Std 1076, 2002 Edn), pp. 1{300, 2002. [138] Q. P. Herr, A. Y. Herr, O. T. Oberg, and A. G. Ioannidis, \Ultra-low-power superconductor logic," Journal of applied physics, vol. 109, no. 10, p. 103903, 2011. [139] S. Rylov, \Clockless dynamic SFQ (DSFQ) AND gate with high input skew toler- ance," in Applied Superconductivity Conference (ASC). IEEE, October 2018. [140] A. D. Wong, K. Su, H. Sun, A. Fayyazi, M. Pedram, and S. Nazarian, \VeriSFQ: A semi-formal verication framework and benchmark for single ux quantum technol- ogy," in Quality Electronic Design (ISQED), 2019 20th International Symposium on. IEEE, 2019. [141] S. Walia, \Primetime advanced OCV technology{easy-to-adopt, variation-aware timing analysis for 65-nm and below," Synopsys Inc., Apr, 2009. [142] S. R. Nassif, \Design for variability in DSM technologies [deep submicron tech- nologies]," in Proceedings IEEE 2000 First International Symposium on Quality Electronic Design (Cat. No. PR00525). IEEE, 2000, pp. 451{454. [143] J. Xiong, V. 
Zolotov, and L. He, \Robust extraction of spatial correlation," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 4, pp. 619{631, 2007. [144] AWS Marketplace. FPGA Developer AMI. [Online]. Available: https: //aws.amazon.com/marketplace/pp/B06VVYBLZZ [145] N. A. Sherwani, Algorithms for VLSI Physical Design Automation. Springer US, 1999. [146] C. J. Alpert, D. P. Mehta, and S. S. Sapatnekar, Handbook of algorithms for physical design automation. Auerbach Publications, 2008. 238 [147] A. B. Kahng, J. Lienig, I. L. Markov, and J. Hu, VLSI physical design: from graph partitioning to timing closure. Springer Science & Business Media, 2011. [148] G. Pasandi, A. Shafaei, and M. Pedram, \SFQmap: A technology mapping tool for single ux quantum logic circuits," in International Symposium on Circuits and Systems (ISCAS). IEEE, May 27, 2018. [149] G. Pasandi and M. Pedram, \PBMap: A path balancing technology mapping algorithm for single ux quantum logic circuits," IEEE Transactions on Applied Superconductivity, vol. 29, no. 4, pp. 1{14, 2019. [150] B. Zhang, N. Katam, and M. Pedram, \qSTA: a static timing analysis tool for SFQ circuits." [151] T.-R. Lin, T. Edwards, and M. Pedram, \qGDR: A via minimization oriented routing tool for large-scale superconductive single ux quantum circuits." [152] F. Brglez, D. Bryan, and K. Kozminski, \Combinational proles of sequential benchmark circuits," in Circuits and Systems, 1989., IEEE International Sympo- sium on. IEEE, 1989, pp. 1929{1934. [153] D. K. Tala. (2014) SystemVerilog tutorial. [Online]. Available: http: //www.asic-world.com/systemverilog/tutorial.html 239
Abstract
On the verge of the IoT era, the world is excited about the potential of mining the monumental amounts of data that will become available in the near future. However, this abundance of data requires supercomputers that are faster and more powerful than ever, yet do not need a neighboring nuclear plant for power!
Single Flux Quantum (SFQ) technology has the potential to meet the booming demands for lower power consumption and higher operation speeds in the electronics industry and in future exascale supercomputing systems. Nevertheless, the promised benefits of three orders of magnitude lower power at an order of magnitude higher performance have yet to be attained. In particular, ultra-high-speed clocking of large-scale SFQ circuits in the presence of unprecedented levels of timing uncertainty remains a tough obstacle to the technology's advancement. In this thesis, we propose an innovative self-adaptive clocking technique designed to be resilient in such uncertain environments. Our proposed hierarchical chains of homogeneous clover-leaves clocking, (HC)²LC, inherits its robustness from spatially correlated cell delays and from the timing robustness of traditional SFQ counter-flow clocking.
Our simulations show that, averaged over the ISCAS'85 benchmark circuits, at the same speed and with an area overhead of only 9.00%, (HC)²LC achieves 52.3% and 211.8% yield improvement over zero-skew trees at low and medium ranges of σ of gate delays, respectively.
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
Electronic design automation algorithms for physical design and optimization of single flux quantum logic circuits
Average-case performance analysis and optimization of conditional asynchronous circuits
Designing efficient algorithms and developing suitable software tools to support logic synthesis of superconducting single flux quantum circuits
Multi-phase clocking and hold time fixing for single flux quantum circuits
Advanced cell design and reconfigurable circuits for single flux quantum technology
Formal equivalence checking and logic re-synthesis for asynchronous VLSI designs
An asynchronous resilient circuit template and automated design flow
Development of electronic design automation tools for large-scale single flux quantum circuits
Verification and testing of rapid single-flux-quantum (RSFQ) circuit for certifying logical correctness and performance
Power optimization of asynchronous pipelines using conditioning and reconditioning based on a three-valued logic model
Variation-aware circuit and chip level power optimization in digital VLSI systems
Radiation hardened by design asynchronous framework
Dynamic neuronal encoding in neuromorphic circuits
Trustworthiness of integrated circuits: a new testing framework for hardware Trojans
Towards green communications: energy efficient solutions for the next generation cellular mobile communication systems
QoS-aware algorithm design for distributed systems
Optimal redundancy design for CMOS and post‐CMOS technologies
Design and testing of SRAMs resilient to bias temperature instability (BTI) aging
Modeling astrocyte-neural interactions in CMOS neuromorphic circuits
Mixed-signal integrated circuits for interference tolerance in wireless receivers and fast frequency hopping
Asset Metadata
Creator
Tadros, Ramy Nagy (author)
Core Title
Clocking solutions for SFQ circuits
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
04/24/2019
Defense Date
03/21/2019
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
asynchronous, CAD, clock distribution networks, clocking, marked graph, OAI-PMH Harvest, Petri nets, SFQ, superconducting electronics, superconductivity, synchronous, timing, VLSI
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Beerel, Peter (committee chair), Golubchik, Leana (committee member), Pedram, Massoud (committee member)
Creator Email
ramynagytadros@gmail.com,rtadros@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c89-142544
Unique identifier
UC11675853
Identifier
etd-TadrosRamy-7231.pdf (filename), usctheses-c89-142544 (legacy record id)
Legacy Identifier
etd-TadrosRamy-7231.pdf
Dmrecord
142544
Document Type
Dissertation
Rights
Tadros, Ramy Nagy
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA