Multi-phase clocking and hold time fixing for single flux quantum circuits
MULTI-PHASE CLOCKING AND HOLD TIME FIXING FOR SINGLE FLUX QUANTUM CIRCUITS

by Xi Li

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL AND COMPUTER ENGINEERING)

December 2022

Copyright 2022 Xi Li

Dedication

This dissertation is dedicated to my grandmother Yuhua Fan, whom I lost on December 31, 2021. In memory of her continuous love and support.

Acknowledgements

First, I would like to express my gratitude to my PhD advisor, Prof. Peter A. Beerel, for his support and patience during the past four years. This thesis would never have been completed without his guidance and encouragement. He has always been dedicated to research and available for questions, even during holidays. His positive thinking and resilience in handling difficulties have encouraged me throughout the PhD journey. I learned from him not only how to conduct good research, but also how to present my thoughts effectively and develop a systematic scientific and engineering mindset.

I would also like to thank my defense committee, Prof. Sandeep Gupta, Prof. Pierluigi Nuzzo and Prof. Aiichiro Nakano, for their time, interest and invaluable comments. I am also grateful to my qualification committee, Prof. Akhilesh Jaiswal, for his time and insightful comments.

I would like to express my gratitude to my colleagues and mentors at Synopsys, Inc. During my 10-month internship in the CTS team, my manager Min Pan was always caring and supportive. Special thanks to my mentor Tong Liu for the detailed guidance and valuable suggestions on technical questions and career growth. Thanks to Evan Wegley for his generous help with the onboarding process and for sharing software engineering skills. Thanks to Yalan Zhang, Rajinder Singh and Luca Amaru for assisting with the tools and benchmarks used during my research.
I would also like to thank my senior Huimei Cheng, who introduced me to the research group and provided me with useful resources on many aspects. I worked with her on the 3-latch project, which is not included in this thesis since it is out of scope.

I would like to thank my collaborators on the research projects in this thesis. Thanks to Soheil Nazar Shahsavani for the valuable discussions on the hold time fixing in RSFQ circuits project. Thanks to Xuan Zhou for helping with the experiments and results analysis on the hold time fixing project. Thanks to Robert Aviles for the code integration of the open-source ILP solver on the multi-phase clocking in RSFQ circuits project. Thanks to Dylan Hand for his IT help with the Sierra server. Moreover, I would like to thank my lab mates at USC: Dake Chen, Sourya Dey, Gourav Datta, Moises Herrerabuitrago and Souvik Kundu. Special gratitude to Hsin-Ho Huang for providing career suggestions and a referral.

I also appreciate the time and help from the staff of the USC Ming Hsieh Department of Electrical and Computer Engineering, Diane Demetras and Annie Hua. Thanks to the staff of the USC Viterbi Office of Admission and Student Engagement (VASE), Tracy Charles and Andy Chen, for their help.

I would also like to thank my friends Kexuan Sun, Di Huang and Yue Niu from USC for their support. We overcame the difficulties and spent joyful time together. I am also grateful to my friends Chang Sun, Xun Sun, Huwan Peng, Xindi Liu, Ban Wang and Maolong Tang from the University of Washington, and Hao-Yen Tang from UC Berkeley. Sincere thanks to my friend and undergraduate roommate Jing Liu for always being responsive and supportive during my hardest time. Thanks to my friends Xin Chen, Xingyi Zhou, Siwen Wang and Guanghui Zhu in China. We have known each other for over ten years.

I would like to express my sincerest gratitude to my family. I am deeply thankful to my father Yaowu Li and mother Xiaojun Zhu in China. They have always been supportive of my education since childhood.
I started attending primary school before 4 years old, which would never have been possible without my mother's enormous support for me and her contribution to the family. My cousin Yanping Li, her husband Bi Han and their family in San Jose, CA gave me generous love and support during the difficult COVID-19 time. They always made me feel included and influenced me with their warm family atmosphere. I wish my nephews Boyan Han and Bohan Li find their passion in life and all the best in the future. I would also like to thank my aunts and my cousins in China for their care while I am in the US.

Finally, I specially dedicate this thesis to my beloved grandmother, Yuhua Fan, the most important person in my life. I lost her on December 31, 2021. I lived with her from earliest infancy until I left for undergraduate university. Because of COVID-19, I had not seen her for almost two years when she passed away. I was not able to attend the funeral due to the minimum quarantine period. Her kindness, honesty, patience and diligence shaped me into the person I am. Her endless love and strong support have always accompanied me and kept me going.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 SFQ Technology Fundamentals
    1.1.1 Introduction
    1.1.2 Josephson Junction
    1.1.3 Single Flux Quantum Pulse
    1.1.4 Interconnects in SFQ
    1.1.5 Clocked SFQ cells
    1.1.6 Fanout Limit
  1.2 SFQ Clocking Techniques
    1.2.1 Review of Clock Distribution Network (CDN)
    1.2.2 Single Clock Architecture
    1.2.3 Dual Clock Architecture
    1.2.4 Other Clock Architectures
  1.3 Path Balancing and Multi-Threading Feature in SFQ
    1.3.1 Full Path Balancing
    1.3.2 Multi-threading
  1.4 Thesis Contributions
  1.5 Thesis Organization
Chapter 2: Variation-Aware Hold Time Fixing in RSFQ Circuits
  2.1 Timing uncertainty in RSFQ circuits
  2.2 Definitions
    2.2.1 Data Path Delay
    2.2.2 Insertion Delay
    2.2.3 Clock Skew
    2.2.4 Hold Time
    2.2.5 Hold Time Margin
    2.2.6 Clock tree topology
    2.2.7 Common clock path (CCP)
  2.3 Variation Aware Common Path Pessimism Removal
    2.3.1 Motivating Example
    2.3.2 Problem Formulation
    2.3.3 Proposed Methodology - An Overview
    2.3.4 Variation Aware Common Path Pessimism Removal
  2.4 Placement Methodology
    2.4.1 Incremental Placement
    2.4.2 Placement from Scratch
  2.5 Evaluation Flow
    2.5.1 Variation Model
    2.5.2 Dynamic Simulation
    2.5.3 Experimental Results
  2.6 Conclusions
Chapter 3: Multi-Phase Clocking Methodology in RSFQ Circuits
  3.1 Related work
  3.2 Review of Multi-Phase Clocked System
  3.3 Multi-threading in RSFQ circuit
  3.4 Proposed Multi-Phase Clocking Methodology
    3.4.1 Problem formulation
    3.4.2 Proposed ILP Optimization
    3.4.3 Proof of Correctness
  3.5 Design Flow
  3.6 Experimental Results
    3.6.1 Comparison to Full Path Balancing
    3.6.2 Case Study
  3.7 Simulation Verification
    3.7.1 Simulation of Example Circuit
    3.7.2 Functional Simulation
    3.7.3 Simulation results - AMD2901
  3.8 Improved DFF sharing
    3.8.1 Modified ILP Formulation
    3.8.2 Experimental results
Chapter 4: Hold Time Safe Guaranteed RSFQ Circuits with Multi-Phase Clocks
  4.1 Hold Time Safe Guaranteed Circuits
    4.1.1 Timing Analysis of Two Phase Clocks
    4.1.2 Guarantee Hold Time Safe in Multi-Phase Clocks
  4.2 Modified ILP
    4.2.1 Problem formulation
    4.2.2 Number of Inserted DFFs under Hold Safe Condition
    4.2.3 Relaxed Clock Phase Constraints
    4.2.4 Objective Function
  4.3 Clock Phase Assignment
  4.4 Experimental Results
Chapter 5: Summary and Future Work
  5.1 Summary
  5.2 Future Work
Bibliography

List of Tables

2.1 Number of hold buffers (JTLs) and timing yield (%) for the fixed margin (7 ps), and the proposed variation-based 3σ and 2σ hold margin, with incremental placement
2.2 Number of hold buffers (JTLs) and timing yield (%) for the fixed margin (7 ps), and the proposed variation-based 3σ and 2σ hold margin, with placement from scratch
2.3 Number of hold buffers (JTLs) for placement from scratch vs incremental placement approaches
2.4 Clock Cycle Time (CCT) and area comparison between placement from scratch vs incremental placement (3σ)
2.5 Run-time in the fixed margin (7 ps), and the proposed variation-based 3σ and 2σ hold margin, with incremental placement
3.1 Number of required DFFs with given number of clock phases
3.2 Number of clock splitters with given number of clock phases
3.3 Total area with given number of clock phases
3.4 Improved numbers of required DFFs when sharing DFF as linear pipeline with modified ILP
3.5 Reduced numbers of required DFFs with improved DFF sharing and modified ILP, separately
4.1 Number of required DFFs with given number of clock phases, with hold safe condition
4.2 Number of total gates after logic synthesis with given number of clock phases, with hold safe condition
4.3 Number of total gates after logic synthesis under hold safe condition, compared with single clock without hold safe condition

List of Figures

Figure 1.1 Josephson Junction (JJ)
Figure 1.2 Current-voltage characteristics of Josephson junctions calculated from the RCSJ model [1]
Figure 1.3 SFQ pulses
Figure 1.4 The Schematic of Josephson Transmission Line
Figure 1.5 DFF Cell
Figure 1.6 Splitter Cell
Figure 1.7 Dual clocks architecture in SFQ circuits [2]
Figure 1.8 Two Phase Clocks
Figure 1.9 Two Phase Clocks Applied On a Linear Pipeline
Figure 1.10 Path balancing in a Combinational Block
Figure 1.11 Path balancing in a Sequential Block
Figure 1.12 Multi-threading in a Combinational Block
Figure 2.1 Example of a timing path in a RSFQ circuit
Figure 2.2 A balanced binary tree with 8 sinks
Figure 2.3 Overview flow of the proposed algorithm
Figure 2.4 Grid-based variation model with uniform grids
Figure 3.1 Multi-Phase Clocks
Figure 3.2 A Fully Path Balanced Gate-Level Pipelined Sequential Circuit
Figure 3.3 The circuit of Fig.
3.2 re-designed with two-phase clocking and two threads
Figure 3.4 The formalized DAG for the circuit of Fig. 3.2
Figure 3.5 The design flow of multi-phase clocks
Figure 3.6 Total Wire Length vs. Number of Clock Phases for Small to Medium Sized Benchmarks
Figure 3.7 Total Wire Length vs. Number of Clock Phases for Medium to Large Sized Benchmarks
Figure 3.8 AMD2901: Layout with five clock trees
Figure 3.9 Example simulation waveform of the circuit shown in Fig. 3.3
Figure 3.10 Simulation flow to verify the function correctness of the post-DFF-inserted RSFQ circuits
Figure 3.11 Golden result with input vectors in set 1
Figure 3.12 Golden result with input vectors in set 2
Figure 3.13 Golden result with input vectors in set 3
Figure 3.14 Golden result with input vectors in set 4
Figure 3.15 Test result with 10-phase clocks and 4 threads
Figure 3.16 DFF sharing of a node with four fanout nodes
Figure 4.1 Timing Checks in Two Phase Clocking
Figure 4.2 DAG in NWDST problem for hold safe DFF insertion
Figure 4.3 Solution tree $G_R$ and sub-tree $G_A$ for the example in Fig. 4.2b
Figure 4.4 Branch node $v_b$ with two branches in solution Steiner tree
Figure 4.5 Node connections in Condition 1
Figure 4.6 Node connections in Condition 2
Figure 4.7 Transition from (1, 3) to (15, 2) with minimum DFFs
Figure 4.8 $N_2$ vs. $S_j - S_i$
Abstract

Single flux quantum (SFQ) logic is a promising technology to replace CMOS logic for future exa-scale supercomputing, but it requires the development of reliable EDA tools tailored to the unique characteristics of SFQ circuits. SFQ circuits often have ultra-high clock frequencies, significant timing uncertainty, and deep clock trees, all of which make the design of clock distribution networks (CDNs) and timing closure extremely challenging. Moreover, the gate-level pipelining nature of SFQ requires many path-balancing registers to enable proper multi-threaded computation, which significantly increases chip area and power.

This thesis first presents a variation-aware hold time fixing methodology that considers both local and global timing uncertainties and uses common path pessimism removal to reduce the overhead of hold buffers while achieving timing yield competitive with a fixed time margin. It then presents a novel multi-phase clocking methodology targeting multi-threaded, gate-level pipelined sequential circuits. An integer linear programming (ILP) algorithm is formulated to minimize the number of required registers given the number of available clock phases. The proposed method reduces the number of path-balancing registers by 55.5% with two clock phases and by up to 95.5% with ten clock phases. The clock tree synthesis (CTS) and placement and route (PnR) results show that the decrease in registers yields a decrease in total gate area of 40.6% and in clock tree wire length of 54.9% with two clock phases, and of 69.6% and 69.8% with ten clock phases, respectively, despite the increase in the number of clock phases. To further reduce the number of required registers, we present an enhancement of the flow that maximizes register sharing across the fanout gates. The enhanced flow reduces the number of inserted registers by an average of 26.3% compared with the original flow.
Last, we propose a hold time safe extension of the multi-phase clocking methodology, which extends the benefits of multi-phase clocking to timing robustness.

Chapter 1

Introduction

1.1 SFQ Technology Fundamentals

1.1.1 Introduction

Superconductive electronics (SCE), and single-flux quantum (SFQ) [3] in particular, is a promising replacement for complementary metal–oxide–semiconductor (CMOS) technology for exascale supercomputing. With the increasing need for big data and supercomputing, the hundreds of megawatts of power needed by current exa-scale computing platforms is a growing concern [4]. Rapid SFQ technology was introduced in the late 1980s [3], with the theoretical potential to meet high performance needs at three orders of magnitude lower power than state-of-the-art semiconductor technologies [5].

SFQ technology is based on materials that show zero electrical resistance and magnetic field expulsion when they are cooled below a characteristic critical temperature. Unlike perfect conductivity in classical physics, superconductivity is a quantum mechanical phenomenon. It is characterized by the Meissner effect [6], under which the magnetic field in the interior of the superconductor cannot be nonzero.

SFQ technology has the theoretical potential for ultra-high operating frequency. At an ambient temperature of 4.2 Kelvin, pulses travel along passive superconductive microstrip lines at a speed approaching the speed of light [7]. A recent study presents a taped-out 4-bit processor in SFQ logic operating at 32 GHz [8]. Despite the cooling overhead, the devices in SFQ circuits still consume significantly less power than CMOS devices [9]. There have been studies on efficient cooling systems motivated by integrated SFQ systems [10]. The development of these advanced cooling systems helps the further application of SFQ technology.

However, the potential of SFQ is yet to be achieved for complex designs such as microprocessors, for a variety of reasons.
Most notably, the lack of established, reliable electronic design automation (EDA) tools has been a long-term obstacle to scaling SFQ technology [11,12]. Although mature tools have driven the continuous improvement of integrated circuits in CMOS technology, EDA tools for SCE circuits were not proposed until recently [13,14]. Due to several intrinsic differences between CMOS and SCE technologies, the advanced EDA tools for CMOS cannot be directly applied to SFQ circuits. We present the major differences between CMOS and SFQ in the rest of this chapter.

1.1.2 Josephson Junction

Superconductive electronics is based on Josephson junctions (JJs), which serve as ultrafast (picosecond) switches. Fig. 1.1a shows a JJ formed by two superconductors separated by a small insulator. The separating region can take the form of a constriction, a point contact, or a thin insulating tunnel barrier [15]. The commonly used superconductor material is niobium (Nb), with atomic number 41.

A simplified model called the "RCSJ model", shown in Fig. 1.1b, was proposed by W.C. Stewart and D.E. McCumber [1] to describe the current-voltage characteristics and the dynamics of a JJ. The circuit in Fig. 1.1b is a parallel connection of an ohmic resistance, a capacitor, and a (non-classical) device symbolizing the Josephson current, which is indicated by a cross. Currents smaller than the critical current $I_c$ can flow as supercurrents across the junction:

$$I_J = I_c \sin\gamma \qquad (1.1)$$

$$\dot{\gamma} = \frac{2\pi}{\Phi_0} U(t) \qquad (1.2)$$

where $\gamma$ is the gauge-invariant phase difference of the two superconductors, $\Phi_0$ denotes the flux quantum, and $U(t)$ is the voltage across the junction. Equations (1.1) and (1.2) are the current-phase and voltage-phase relations of the JJ.

For the circuit in Fig. 1.1b, Kirchhoff's law gives the total current $I$:

$$I = I_c \sin\gamma + \frac{U}{R} + C\dot{U} \qquad (1.3)$$

where $U/R$ is the current through the resistor $R$ and $C\dot{U}$ is the displacement current across the capacitor $C$. With the capacitor neglected, the model becomes the "RSJ model".
Using Equation (1.2), we can eliminate $U$ from Equation (1.3) and obtain

$$I = I_c \sin\gamma + \frac{\Phi_0}{2\pi R}\dot{\gamma} + \frac{C\Phi_0}{2\pi}\ddot{\gamma} \qquad (1.4)$$

Further, we reduce Equation (1.4) to dimensionless form, in which currents are measured in units of $I_c$, voltages in units of $V_c = I_c R$, and time in units of $\tau_c = \Phi_0 / (2\pi I_c R)$:

$$i = \sin\gamma + \dot{\gamma} + \beta_c \ddot{\gamma} \qquad (1.5)$$

Equation (1.5) contains a single material-dependent parameter, the Stewart-McCumber parameter:

$$\beta_c = \frac{2\pi I_c R^2 C}{\Phi_0} \qquad (1.6)$$

This parameter determines the behaviour of the Josephson junction. When $\beta_c > 1$, the JJ is underdamped. When $\beta_c < 1$, the JJ is overdamped. Fig. 1.2 shows the current-voltage characteristics of JJs from the RCSJ model [1]. In the underdamped state, the current-voltage characteristic is hysteretic with a return current, while in the overdamped state there is no hysteresis. In practice, we connect a small resistance in parallel with the JJ to steer current away from it and keep the JJ in the overdamped state. The SFQ cells and circuits in this thesis are built with overdamped JJs.

Figure 1.1: Josephson Junction (JJ). (a) A schematic of a JJ; (b) an RCSJ model of a JJ; (c) overdamped JJ.

Figure 1.2: Current-voltage characteristics of Josephson junctions calculated from the RCSJ model [1].

1.1.3 Single Flux Quantum Pulse

Unlike in CMOS logic, binary information in SFQ logic is not represented by a DC voltage, but by very short (picosecond) voltage pulses $V(t)$ of quantized area, as in Fig. 1.3:

$$\int V(t)\,dt = \Phi_0 = \frac{h}{2e} \approx 2.07\ \text{mV}\cdot\text{ps} \qquad (1.7)$$

where $h$ is the Planck constant and $e$ is the elementary charge.

These pulses can be naturally generated, reproduced, amplified, memorized, and processed by elementary circuits comprising overdamped Josephson junctions. As a general convention of SFQ logic, assuming the clock pulse always arrives periodically in time, the arrival of a pulse at a terminal $S_i$ represents a logic "1" during the current clock period, while the absence of a pulse represents a logic "0" value of the signal $S_i$.
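As a quick numerical check of Equations (1.6) and (1.7), the following sketch evaluates the flux quantum from the SI values of $h$ and $e$ and classifies a junction by its Stewart-McCumber parameter. The junction values in the usage note (100 µA, 0.5 pF, 5 Ω and 2 Ω) are hypothetical numbers chosen only for illustration; they are not taken from this thesis or from any cell library, and the function names are likewise made up for this example.

```python
import math

# Exact SI defining constants.
PLANCK_H = 6.62607015e-34            # Planck constant, J*s
ELEMENTARY_CHARGE = 1.602176634e-19  # elementary charge, C


def flux_quantum() -> float:
    """Phi_0 = h / (2e) in webers (V*s), as in Equation (1.7)."""
    return PLANCK_H / (2.0 * ELEMENTARY_CHARGE)


def stewart_mccumber(i_c: float, r: float, c: float) -> float:
    """beta_c = 2*pi*I_c*R^2*C / Phi_0, as in Equation (1.6).

    i_c in amperes, r in ohms, c in farads.
    """
    return 2.0 * math.pi * i_c * r ** 2 * c / flux_quantum()


def damping_regime(beta_c: float) -> str:
    """Classify the junction as described in the text.

    The text only covers beta_c > 1 and beta_c < 1; the label for
    beta_c == 1 is an assumption of this sketch.
    """
    if beta_c > 1.0:
        return "underdamped"
    if beta_c < 1.0:
        return "overdamped"
    return "boundary"


def to_mv_ps(volt_seconds: float) -> float:
    """Convert V*s to mV*ps (1 mV*ps = 1e-15 V*s)."""
    return volt_seconds / 1e-15
```

With the hypothetical values above, `stewart_mccumber(100e-6, 5.0, 0.5e-12)` lands in the underdamped regime, while lowering the effective resistance to 2 Ω (as a parallel shunt resistor would) moves the same junction into the overdamped regime, which is the effect of the shunt described in Section 1.1.2.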
Note that each arriving input pulse either changes or leaves unchanged the internal state of a cell, but the cell does not generate an output pulse immediately. Only when the clock pulse arrives can an output pulse be produced, depending on the internal state of the cell, after which the cell is reset. An elementary cell of an SFQ circuit is similar to an asynchronous logic circuit coupled with a register (flip-flop) storing its output bit(s) until the end of the clock period [16].

Figure 1.3: SFQ pulses. (a) An SFQ pulse; (b) a graphical representation of it.

1.1.4 Interconnects in SFQ

Josephson Transmission Lines (JTLs) and Passive Transmission Lines (PTLs) are usually used as interconnects to transmit signals in SFQ circuits [17].

As shown in Fig. 1.4 [16], the junctions in a JTL are connected in parallel by low-inductance strips. When an input pulse arrives, it switches J1. The inductance L1 is not large enough to hold an SFQ, so J2 also switches, producing an output pulse. A JTL allows the signal to pass from either direction. In a standard library, the JJs and inductors in JTL cells are usually uniformly sized, while the number of stages varies.

Figure 1.4: The Schematic of Josephson Transmission Line

A PTL usually contains a driving circuit and a receiving circuit. The driving circuit consists of a superconductive stripline, one JJ, and a small inductance, while the receiving circuit consists of two JJs with a small inductance [18]. The transmission of pulses approaches the speed of light. The main use of JTLs in real designs is interconnecting more complex cells over short distances. For long lines, PTLs have advantages over JTLs in terms of output delay, power and routing flexibility, since a PTL only requires active JJs in the driver and receiver circuits.

1.1.5 Clocked SFQ cells

Existing cell libraries include D-Flip-Flop (DFF), AND, NOT, OR, and NAND gates, etc. In SFQ logic, in addition to the sequential registers, all the cells that are combinational in CMOS logic require a clock signal to generate output signals.
Fig. 1.5 shows the schematic and the Moore FSM diagram of a DFF cell in SFQ logic, which is also known as an RS (reset-set) flip-flop. It is built around the loop J2-L-J4, which has two stable states, "1" and "0", with and without a magnetic flux quantum inside the loop, respectively. In state "0", a pulse arriving at the input "Set" enters and is stored inside the loop. In state "1", the DC current in the loop flows clockwise and, as a result, the junction J2 is biased far from its critical current value. If another SFQ pulse is applied to the input "Set", it flips the junction J1 and the DFF remains in state "1". However, if a pulse arrives at the "Clock" input when the DFF is in state "1", the junction J4 flips, releasing the stored flux quantum and thus clearing the quantizing loop. In state "0", junction J3 is closer than J4 to its threshold value and a clock pulse flips J3, so that the DFF remains in state "0".

Figure 1.5: DFF Cell. (a) Schematic of DFF in RSFQ logic; (b) Moore diagram of the DFF cell.

Note that SFQ cells need a current bias network to operate. Traditionally, bias currents are distributed using resistors, which is the case for Rapid SFQ, or Resistive SFQ, known as RSFQ. Other techniques in the SFQ family include efficient rapid single flux quantum (ERSFQ) [19], energy-efficient SFQ (eSFQ) [20] and reciprocal quantum logic (RQL) [9]. They eliminate the static power losses of RSFQ by avoiding the use of bias resistors.

Like the DFF, all the combinational cells in SFQ circuits are composed of JJs. Their basic operation needs the clock pulse to trigger an output pulse. This type of operation is the most natural use of the unique dynamics of JJ circuits, which explains why the combinational cells in SFQ logic need a clock. As a result, the number of clock sinks in SFQ circuits can be much larger than in CMOS.
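The DFF behaviour described above can be captured at the pulse level by a two-state machine. The sketch below is only a behavioural abstraction written for illustration: it ignores all analog junction dynamics and timing, and the class name `SfqDff`, its methods, and the `'S'`/`'C'` event encoding are hypothetical choices for this example rather than part of any SFQ library or of the thesis flow.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class SfqDff:
    """Pulse-level model of the RSFQ DFF (no analog JJ dynamics)."""

    state: int = 0  # 1 if a flux quantum is stored in the J2-L-J4 loop

    def set_pulse(self) -> None:
        # State "0": the pulse is stored in the loop.
        # State "1": J1 flips and the cell simply stays in state "1".
        self.state = 1

    def clock_pulse(self) -> int:
        # State "1": J4 flips, the stored quantum leaves as an output
        # pulse and the loop is cleared.
        # State "0": J3 flips, no output pulse, the cell stays in "0".
        out = self.state
        self.state = 0
        return out


def run(dff: SfqDff, events: Iterable[str]) -> List[int]:
    """Apply a sequence of 'S' (Set) and 'C' (Clock) pulses.

    Returns one output bit per clock pulse: 1 if an output pulse was
    emitted during that clock event, 0 otherwise.
    """
    outputs: List[int] = []
    for ev in events:
        if ev == "S":
            dff.set_pulse()
        elif ev == "C":
            outputs.append(dff.clock_pulse())
        else:
            raise ValueError(f"unknown event {ev!r}")
    return outputs
```

For example, the event string `"SCCSSC"` contains three clock pulses: the first follows a Set pulse, the second follows no input, and the third follows two Set pulses, which the cell treats the same as one.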
Moreover, the frequency of this clock is very high, as the delay between clocked elements is quite small. These two features make clock tree synthesis very challenging.

Moreover, unlike in CMOS, there is a special cell called a "splitter" which splits the pulse, as shown in Fig. 1.6. After some gate delay, the cell duplicates the pulse arriving at input "IN" onto the two outputs "OUT1" and "OUT2" without a decrease in the pulse voltage amplitude. The operation of the splitter does not need the clock signal.

Figure 1.6: Splitter Cell. (a) Schematic of splitter in SFQ logic; (b) operation of splitter.

1.1.6 Fanout Limit

Current logic cells in SFQ circuits have a fanout limit of two. Splitter cells, as in Fig. 1.6, are used to duplicate signals in the circuit, both in data paths and in clock trees. This feature significantly increases the number of gates compared with CMOS circuits. Taking the clock tree as an example, splitters are required to duplicate the clock signal from the root to all clock sinks. In the case of binary clock trees, $N - 1$ splitters are required for a circuit with $N$ clock sinks.

Moreover, to the best of our knowledge, the number of fanins of SFQ cells is mostly limited to two. Compared with CMOS logic, this also results in larger logic depth after synthesis, which affects the throughput of sequential SFQ circuits, as will be shown in the following chapters of this thesis.

1.2 SFQ Clocking Techniques

This section reviews the clock distribution network, then presents several existing clocking architectures in SFQ circuits.

1.2.1 Review of Clock Distribution Network (CDN)

In a synchronous digital system, the clock signal is used as a time reference for data movement within the circuit. The design of the clock distribution network (CDN) is critical to the performance and reliability of the synchronous system [21]. Even in a CMOS circuit, the design of the CDN is not an easy task, especially for high-frequency, large-scale circuits [22].
There are several important factors:

• Clock skew. In the ideal case, all the clocked gates (clock sinks) receive the clock signal at the same time. In reality, different clock signal paths can have different delays due to several factors, including the wire length from the clock source to the clock sinks, the delays of active buffers in the CDN, and mismatches in the passive and active interconnect parameters. These can cause a difference in the arrival times of the clock signals, referred to as clock skew. There are timing constraints on the arrival times of data signals and clock signals at the sequential gates. Violations of these constraints could cause the entire chip to fail.

• Delay uncertainty. At high frequencies, delay uncertainty increases due to process and environmental variations. This can make clock skew harder to control.

• Timing criticality aware CDN design. To provide more time for the critical worst-case data paths and improve performance, localized clock skew is used in techniques such as "double-clocking" [23], "deskewing data pulses" [24], "cycle stealing" [25,26], "useful clock skew" [27], and "prescribed skew" [28]. The use of these techniques is quite important in high-performance VLSI designs.

• Structured custom VLSI circuits CDN design. Although mature EDA tools targeting CMOS logic have been developed over many years, the design of a tightly controlled CDN on a large non-redundant hierarchically structured integrated circuit within specific temporal bounds is still difficult and problematic [21].

1.2.2 Single Clock Architecture

As stated in Section 1.2.1, CDN design becomes more important when operating at ultra-high clock frequencies, and SFQ logic is one such high-frequency design style. As a result, clocking techniques play a critical role in the SFQ design flow. Meanwhile, there are several differences between the CDN design of SFQ and CMOS logic.
• The number of clock sinks in SFQ logic is much larger than in CMOS logic. As a result, the synchronization of clock signals is relatively more challenging.

• The duplication of clock signals in SFQ circuits relies on splitter cells. Given that the splitter cell has a single input and two outputs and that the clock tree is a fully balanced binary tree, the number of splitter cells is $N - 1$ when there are $N$ clock sinks in the circuit. This means the size of the CDN is comparable to that of the original circuit, which not only increases the total area of the circuit but also affects the subsequent placement and routing.

• Due to the limited interconnect and cell types, the control of clock skew mostly depends on the placement of clock splitters. In CMOS circuits, by contrast, compensation techniques such as passive RC delay elements [24] and geometrically sized transistor widths [29] are commonly used.

As in CMOS circuits, the single clock architecture uses a global clock signal distributed to all clock sinks. The clock signal is a steady periodic sequence of SFQ pulses with a defined clock period. Since SFQ circuits are gate-level pipelined, to ensure correct data flow, all inputs of each clocked cell must be at the same logic level, which is called the path-balancing constraint. This feature is explained in detail in Section 1.3.1.

Different approaches have been reported in the literature to synthesize the clock tree with a single global clock. Multiple clock topologies have been proposed, and the trade-offs are discussed in [30]. A minimum-skew clock tree synthesis and placement algorithm is proposed in [31–33]. The algorithm first reads the placed netlist with the locations of all clock sinks and generates a fully balanced binary tree to reduce the maximum level difference in the clock tree among clock sinks to zero.
Second, the method of means and medians (MMM) algorithm and the deferred-merge embedding (DME) algorithm [34] are used to generate a zero-skew solution with minimum cost in terms of wirelength, assuming a linear delay model. Then, the splitter cells in the clock tree are placed initially at the embedding points of the clock network generated by the previous step. Finally, a mixed integer linear programming (MILP) based approach is used to map the clock splitters to the routing channels and legalize the final locations of the clock splitters. Considering the timing criticality of the data paths as well as the total wirelength, the work in [33] presents a bottom-up clock tree topology algorithm using an integer linear programming (ILP) approach.
An extended MILP algorithm based on [32] and a two-stage global placement method followed by a placement refinement method are proposed in [35]. That work modifies the objective function of the MILP problem in [32] to minimize both total wirelength and clock skew. The generated layout allows the clock splitters and logic cells to be placed in the same row, whereas in [32] the two types of cells are placed in neighboring rows.
Useful skew, a commonly used technique in CMOS circuits, is introduced into RSFQ circuits in [36]. The clock skew is optimized for robustness to parameter variations and converted into a schedule of clock arrival times by quadratic programming. The clock arrival times are satisfied by refining the clock splitters and delay elements and by tuning the interconnect paths.
A clocking structure based on hierarchical chains of homogeneous clover-leaves, (HC)$^2$LC, is described in [37]. The primary goals of this structure are to increase robustness, since the clock period of the system adapts to the slowest hierarchical chain, and to reduce race condition hazards through a counter-clocking structure.
1.2.3 Dual Clock Architecture
The single clock architecture requires path balancing on the datapath, which leads to the insertion of a large number of path-balancing registers. To address this issue, a dual clock architecture is proposed in [2]. The architecture in Fig. 1.7 utilizes two clock signals: a fast clock and a slow clock. The fast clock is used to propagate the logic through a possibly unbalanced network of logic, while the slow clock clocks the architectural registers and sets the throughput of the circuit. This method requires the addition of non-destructive read-out DFFs (NDROs) as a repeat band (the green band in Fig. 1.7) to repeat pulse signals. The mask band (the blue band in Fig. 1.7) prevents the propagation of invalid values in order to ensure correct operation with unbalanced datapaths. The depths of blocks 1 and 2 are $D_1$ and $D_2$, respectively.¹ The frequency of the fast clock is $D + 2$ times that of the slow clock, where $D = \max(D_1, D_2)$. Each set of primary inputs is repeatedly fed into the circuit for $D + 2$ fast clock cycles, while the valid primary outputs can be collected after $D + 2$ fast clock cycles and one cycle of the slow clock. This implies that the dual clock structure has a throughput degradation compared with the single clock architecture. The throughput degradation depends on the largest imbalance factor in the logic blocks. Although the dual clock architecture is likely extendable to sequential circuits, the existing literature has only demonstrated it on combinational circuits [2].
¹ A logic block with depth $D$ has an imbalance factor that ranges from 0 to $D$, where the imbalance factor is the maximum difference in logic levels among all fanin gates of a gate.
Figure 1.7: Dual clock architecture in SFQ circuits [2].

1.2.4 Other Clock Architectures
Self-timing Systems
Data-driven self-timing systems have been proposed in [38–41]. In these asynchronous approaches, the clock signals are generated from the data, and no global clock is used to drive the RSFQ system.
A 20 GHz RSFQ-based arithmetic-logic unit (ALU), the critical component of an 8-bit RSFQ processor datapath, is demonstrated in [41]. This approach benefits from needing no global clock distribution network and fewer clock skew considerations. However, the performance of the design is unstable due to its sensitivity to logic delays, and the handshaking circuitry increases the design area.
Two-Phase Clocks
The two-phase clock architecture is explained in [7]. In this clocking structure, the phases of the two clocks are shifted from each other by half of the clock period, as shown in Fig. 1.8 and Fig. 1.9. The major advantage of this structure is that the circuits can be made free of timing violations: for any clock skew, there exists a clock frequency below which the circuits always work correctly.
Figure 1.8: Two Phase Clocks
Figure 1.9: Two Phase Clocks Applied On a Linear Pipeline

1.3 Path Balancing and Multi-Threading Feature in SFQ
1.3.1 Full Path Balancing
Full path balancing (FPB) is a special constraint in SFQ circuits that arises because every logic gate is clocked. As defined in [42], the logic depth of a gate is the number of gates from a primary input to the gate itself. If, for every gate, all inputs have the same logic depth, the circuit is said to be fully path balanced.
As shown in Fig. 1.10a, the combinational block in the original circuit has primary inputs A-E connected to the logic gates, all injected in the same clock cycle. The logic depths of the inputs are marked in blue. The circuit does not function correctly since the input pulses do not arrive in the same clock cycle at each gate. A path-balanced circuit with DFFs inserted is shown in Fig. 1.10b. With the extra latching units inserted, each logic gate receives the correct input data. In the sequential block with a feedback loop in Fig. 1.11a, the registers in the original circuit are marked as “FF”. An extra FF must be inserted in the feedback loop, as shown in Fig. 1.11b.
In general, the depth of a feedback loop must be the same as the depth from a primary input or original register to another fanout original register.
Figure 1.10: Path balancing in a Combinational Block. (a) A combinational logic block without path balancing; input pulses arrive in different clock periods. (b) The logic block with path balancing; input pulses arrive in the same clock period, which guarantees correct operation.
Figure 1.11: Path balancing in a Sequential Block. (a) A sequential logic block without path balancing. (b) The logic block with path balancing.

1.3.2 Multi-threading
The FPB feature in Sec. 1.3.1 yields ultra-deep gate-level pipelines in SFQ circuits. To achieve high throughput, multi-threading is utilized. As in the path-balanced circuit in Fig. 1.12, in each clock period a new set of primary inputs is applied and the corresponding output is collected. In general, for fully path-balanced combinational circuits, multi-threaded behavior with as many threads as the depth of the circuit yields the maximum throughput.
Figure 1.12: Multi-threading in a Combinational Block

1.4 Thesis Contributions
The contributions of the thesis include a variation-aware hold time fixing methodology for SFQ circuits and a multi-phase clocking methodology that reduces the number of path-balancing DFFs using an integer linear programming (ILP) formulation. Building on this, a hold time safe flow that guarantees 100% timing yield is developed for multi-phase clocked SFQ circuits. More specifically, the contributions of this thesis include the following.
• We present a variation-aware hold time fixing methodology. The methodology considers both local and global timing uncertainties and effectively uses common path pessimism removal to reduce the number of inserted hold buffers on each timing path.
• We present an evaluation of the hold time fixing using dynamic timing analysis with a grid-based placement-aware variation model on multiple ISCAS’85 benchmark circuits.
• We propose a novel multi-phase clocking methodology targeting multi-threaded gate-level pipelined sequential circuits.
• We propose a novel integer linear programming (ILP) algorithm to minimize the number of required D flip-flops (DFFs) given the number of available clock phases and a corresponding number of processing threads. We enhance the ILP algorithm to maximize register sharing across the fanout gates, which further reduces the number of required DFFs.
• We evaluate the savings in the number of DFFs, area, and total clock tree wirelength after clock tree routing on a suite of sequential benchmarks, including AMD2901 and seven open-cores designs. We evaluate the savings in the number of DFFs achieved by the enhanced DFF sharing flow on the sequential benchmarks and several combinational benchmarks from ISCAS’85. The proposed method reduces the number of path-balancing DFFs by 55.5% with two clock phases and by up to 95.5% with ten clock phases, with savings in area and clock tree wirelength. The enhanced DFF sharing flow further reduces the number of DFFs by 26.3% compared with the original method.
• We verify the functional correctness of the sequential benchmarks with dynamic simulation, comparing the output results of the RTL Verilog with those of the netlist with multi-phase clocks.
• We present the benefits of the multi-phase clocking methodology for timing robustness. We present a hold time safe extension that guarantees 100% timing yield by reducing the system clock frequency when a timing violation exists.

1.5 Thesis Organization
The rest of the thesis is organized as follows. Chapter 2 describes the proposed variation-aware hold time fixing in SFQ circuits and the evaluation flow. Chapter 3 presents the multi-phase clocking methodology in SFQ circuits, including the ILP formulation and an evaluation with a design flow from synthesis to clock tree routing. An enhanced ILP algorithm and a maximum register sharing flow across the fanout gates are presented with experimental results.
Chapter 4 proposes a hold time safe flow as an extension of the multi-phase clocking methodology. Finally, Chapter 5 concludes the thesis and points out several promising directions for future research.

Chapter 2
Variation-Aware Hold Time Fixing in RSFQ Circuits
RSFQ circuits often have ultra-high clock frequencies and significant timing uncertainty [43,44], which makes the design of clock distribution networks (CDNs) and timing closure extremely challenging.
In particular, managing clock skew in high-frequency SFQ CDNs is extremely important [45] because, in addition to the storage gates, all logic gates are clocked and all fanouts are implemented with splitters, which significantly increases the size and insertion delay of the clock network. Moreover, due to variations in the fabrication process and in biasing or temperature, timing uncertainties in the clock trees are significant. Therefore, developing methodologies for the generation of high-quality clock trees and robust timing-uncertainty-aware timing closure algorithms is paramount.
A recent work [33] proposed a minimum-skew CDN topology generation algorithm using a fully balanced tree structure, which accounts for timing criticalities in the datapath and the total wirelength of the clock tree and minimizes the total negative slack in the presence of timing uncertainties. Even when deploying such CDN algorithms, EDA flows need to utilize efficient techniques to close timing, i.e., fix setup and hold timing issues. In particular, potential hold time violations are typically mitigated by adding clockless buffers into problematic short paths in the data path. The high clock frequency and gate-level pipelining of SFQ circuits make this task particularly challenging. Moreover, as in CMOS, the inserted buffers should account for expected timing variations and should be validated either through some form of static timing analysis (STA) [46] or classic Monte Carlo simulation [47].
2.1 Timing uncertainty in RSFQ circuits
The variation of gate propagation delay in the standard cell library of RSFQ circuits comes from several sources [48].
• Global variations. The delay of the RSFQ standard cells can be affected by the process corner. Due to a global scaling of process parameters, operating conditions differ among corners such as nominal, fast, and slow.
• Local variations. The variability of the fabrication process causes local on-chip variations. Local variation can affect each cell on the same chip separately, while global variation equally affects the parameters of all devices on a particular chip.
• Thermal noise [49]. Since resistors are used for shunting JJs and for bias current distribution, they induce thermal noise in RSFQ circuits, which becomes more significant as the number of cells increases.
• Input and output loads. The loads affect the delay of a cell through bias current redistribution, load cell inductance, and load JJ critical current.
• The state of the succeeding cells. If a succeeding cell is in state “1”, an RSFQ pulse is stored in its quantizing loop. The delay of the adjacent cells is affected since the circulating current induces additional current redistribution.

2.2 Definitions
In this section, we introduce some definitions and notations used throughout the chapter.
2.2.1 Data Path Delay
In conjunction with interconnect delay, combinational gates make up the data path delay between sequential SFQ elements. For example, Fig. 2.1 illustrates two sequential SFQ elements $G_1$ and $G_2$, while $S_1$ through $S_5$ are clock splitters that distribute the clock signal from a clock source to each sequential element in the circuit. The data path delay of the path between $G_1$ and $G_2$ refers to the sum of the clock-to-Q delay of $G_1$, the delay of the data path splitters, and the interconnect delays between the gates.
Figure 2.1: Example of a timing path in an RSFQ circuit.
2.2.2 Insertion Delay
The insertion delay refers to the total delay from the clock source to the clock input of a sequential element [50]. This is also known as the arrival time of the clock signal at the clock pin of each sequential gate. For $G_1$ in Fig. 2.1, it refers to the delay from the clock source (Clk Src) through splitters $S_1$, $S_3$, and $S_4$ as well as the delay of the interconnects connecting the clock source to $G_1$.

2.2.3 Clock Skew
Two sequential elements connected by data path splitters and interconnect are called sequentially adjacent gates. In Fig. 2.1, $G_1$ and $G_2$ are called the launching and capturing flip-flops, respectively. The difference between the clock arrival times at the launching and capturing flip-flops is defined as the clock skew between the pair of gates. For the pair of flip-flops illustrated in Fig. 2.1, the clock skew is the difference between the insertion delay of $G_1$ (the delay from Clk Src through splitters $S_1$, $S_3$, and $S_4$ plus the interconnect delay to $G_1$) and the insertion delay of $G_2$ (the delay from Clk Src through splitters $S_1$, $S_3$, and $S_5$ plus the interconnect delay to $G_2$).

2.2.4 Hold Time
Each sequential element requires some time to reliably capture input data at its data pin. Hold time is defined as the amount of time after the arrival of the clock pulse during which input data must remain stable. It imposes a timing constraint on every pair of sequentially adjacent gates $G_1$ and $G_2$. Let $T_{skew}(G_1, G_2)$ denote the clock skew and $T^{min}_{DP}(G_1, G_2)$ denote the minimum data path delay between $G_1$ and $G_2$. Let $T^{max}_{hold}$ represent the maximum hold time required for $G_2$. Then, hold slack is defined as the difference between when the data can change and how long it should remain stable, i.e.,
$$Slack_{hold}(G_1, G_2) = T_{skew}(G_1, G_2) + T^{min}_{DP}(G_1, G_2) - T^{max}_{hold} \tag{2.1}$$
A negative hold slack leads to a hold time violation.
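Equation (2.1) can be checked numerically with a minimal Python sketch. The function names and the picosecond values below are hypothetical and chosen only for illustration; the sign convention for the skew is taken from Equation (2.1) as written.

```python
def hold_slack(t_skew, t_min_dp, t_max_hold):
    """Hold slack per Equation (2.1); all arguments share one time unit (e.g., ps)."""
    return t_skew + t_min_dp - t_max_hold


def violates_hold(t_skew, t_min_dp, t_max_hold):
    """A negative hold slack means a hold time violation."""
    return hold_slack(t_skew, t_min_dp, t_max_hold) < 0


# Hypothetical values in picoseconds, chosen only for illustration.
example_slack = hold_slack(t_skew=-2.0, t_min_dp=6.0, t_max_hold=5.0)
example_violation = violates_hold(t_skew=-2.0, t_min_dp=6.0, t_max_hold=5.0)
```

With these inputs the slack is negative, so the pair would be flagged; with zero skew the same path would pass.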
As shown in Equation (2.1), hold slack is not a function of the clock cycle time. Therefore, it cannot be fixed by adjusting the clock frequency, and even a single hold time violation may lead to circuit malfunction.
In CMOS technology, multiple hold time refinement techniques such as gate down-sizing, switching between low-threshold and high-threshold gates, and wire sizing are commonly used. Such options are not available in current SFQ technology and existing cell libraries. Hence, hold buffer insertion is one of the most applicable approaches for SFQ circuits.

2.2.5 Hold Time Margin
Hold time margin (i.e., safety margin) is the extra delay added to data paths to avoid hold violations by accounting for variations in the timing parameters in Equation (2.1). Providing this “cushion” makes the design more robust to timing variations and possible hold time violations. The hold margin is satisfied when
$$Slack_{hold}(G_1, G_2) \geq T^{margin}_{hold}(G_1, G_2) \tag{2.2}$$

2.2.6 Clock tree topology
We define the clock tree topology (CTT) as a directed binary tree $T$ connecting the clock source to all the clock sinks. The root of $T$ denotes the clock splitter $S_0$ connected to the clock source, while the nodes $\{S_{n+1}, S_{n+2}, \ldots, S_{n+k}\}$ represent the leaf nodes (clock sinks), i.e., the sequential elements in a circuit. A clock topology generation algorithm creates nodes $\{S_1, \ldots, S_n\}$ representing the clock splitters, adds them to the clock tree, and connects the clock source to all the sink nodes. We assume the leaf nodes are at level 0, while the root is at the highest level of the tree. In Fig. 2.1, the CTT refers to the binary tree including Clk Src, $S_1$ through $S_5$, and the edges between clock splitters $S_4$, $S_5$ and sink nodes $G_1$, $G_2$, respectively.

2.2.7 Common clock path (CCP)
A clock path (CP) is a path in $T$ from the root node through tree nodes to a leaf node (clock sink).
For two leaf nodes in $T$, the common clock path (CCP) refers to the portion of the clock tree that is common to the CPs of both gates, while the non-common clock path (NCP) represents the non-common portion of the CPs. In Fig. 2.1, the common clock path includes nodes $S_1$ and $S_3$ and the edges connecting Clk Src to $S_3$.

2.3 Variation Aware Common Path Pessimism Removal
In this section, we provide a motivating example, formally define the problem, and illustrate the proposed methodology in detail.
2.3.1 Motivating Example
Consider the clock tree topology depicted in Fig. 2.2, where a clock signal is propagated to eight sink nodes using a balanced binary tree. Assume the propagation delay of internal nodes (i.e., splitters) as well as hold buffers is 5 ps with a ±20% variation due to timing variations. Consider a data path connecting node 7 to node 14. In the worst-case scenario in terms of hold slack, the delays of all clock splitter nodes on the launch CP decrease to 4 ps and all the splitter delays on the capture CP increase to 6 ps. With three splitters on each CP, the hold slack may be reduced by 6 × 1 ps = 6 ps. However, such an estimate is overly pessimistic, as node 0 is on the common CP to nodes 7 and 14 and its delay shifts both clock arrivals equally. Therefore, the worst-case change in hold slack cannot be worse than −4 ps. Consequently, accounting for CPPR during timing closure can reduce the number of required hold buffers on the path connecting node 7 to node 14 from 3 to 2, assuming each hold buffer adds a delay of 2 ps. This reduces the area overhead by 33% while achieving the same timing yield.
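The arithmetic of this example can be reproduced with a short Python sketch. It assumes that the nodes of Fig. 2.2 are numbered in heap order (the children of node i are 2i+1 and 2i+2), which matches node 0 being the root and nodes 7 through 14 being the sinks; the function names are hypothetical.

```python
import math


def clock_path(sink):
    """Splitter nodes from the root (node 0) down to the parent of `sink`."""
    path = []
    node = sink
    while node != 0:
        node = (node - 1) // 2  # heap-order parent (assumed numbering)
        path.append(node)
    return path[::-1]


def worst_case_slack_loss(launch, capture, delta_ps, use_cppr):
    """Worst-case hold slack reduction when every counted splitter shifts by delta_ps."""
    lp, cp = clock_path(launch), clock_path(capture)
    if use_cppr:
        # Splitters shared by both clock paths shift both arrivals equally.
        common = set(lp) & set(cp)
        lp = [n for n in lp if n not in common]
        cp = [n for n in cp if n not in common]
    return (len(lp) + len(cp)) * delta_ps


def buffers_needed(loss_ps, buffer_delay_ps):
    """Number of hold buffers whose total delay covers the slack loss."""
    return math.ceil(loss_ps / buffer_delay_ps)


# 20% of a 5 ps splitter delay is 1 ps; each hold buffer adds 2 ps.
loss_plain = worst_case_slack_loss(7, 14, delta_ps=1, use_cppr=False)
loss_cppr = worst_case_slack_loss(7, 14, delta_ps=1, use_cppr=True)
buffers_plain = buffers_needed(loss_plain, 2)
buffers_cppr = buffers_needed(loss_cppr, 2)
```

Under these assumptions the sketch gives a 6 ps loss and 3 buffers without CPPR, and a 4 ps loss and 2 buffers with CPPR, in line with the numbers quoted in the text.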
2.3.2 Problem Formulation
The input to this problem is a netlist graph ($G$) comprising the logic cells (i.e., nodes), their connections (i.e., edges), the location of each node on the layout area, a description of the clock network, including the clock tree topology ($T$) and the locations of the splitter nodes ($S_0 \ldots S_n$), and a variability model that defines the variations of gate delays in the presence of process, biasing, or temperature variations. The objective is to minimize the total number of required hold buffers such that the timing of the circuit with respect to hold constraints is satisfied or a target timing yield is reached. Therefore, the total negative hold slack in the presence of timing variations should be less than a predefined threshold value. In other words, we aim to satisfy the constraints in terms of worst negative hold slack while minimizing the number of inserted hold buffers, i.e., the area overhead of timing closure.
Figure 2.2: A balanced binary tree with 8 sinks. (a) At nominal condition, the max clock skew is zero. (b) If all the splitter delays on path $S_0 \to S_7$ decrease to the min propagation delay and all the splitter delays on path $S_0 \to S_{14}$ increase to the max propagation delay, a hold time violation may occur on data path $S_7 \to S_{14}$. (c) Two buffers are added to the path $S_7 \to S_{14}$ to increase its hold slack.

2.3.3 Proposed Methodology - An Overview
The overall flow of our proposed method is depicted in Fig. 2.3. The first phase follows the algorithms outlined in [32] and [47]. The Berkeley open source logic synthesis tool, ABC [51], is used to synthesize the netlist.
The qPlace tool, part of the qPALACE CAD tool suite for SFQ circuits [32,33], is employed to place the logic gates, construct a minimum-skew clock network, and place the clock splitters [32,33] to minimize the nominal clock skew, reducing the total negative slack associated with all sequentially adjacent gates. ABC then parses and translates the generated netlist and clock tree information, extracts the delays associated with the logic gates, splitters, and interconnect using a linear delay model [33], and performs timing analysis for all setup and hold constraints. Using our proposed variation-aware hold fixing algorithm, ABC then inserts hold buffers, i.e., Josephson transmission line (JTL) cells, between sequentially adjacent pairs in the netlist as needed.
Figure 2.3: Overview flow of the proposed algorithm.
As shown in Fig. 2.3, we then enter the second phase of the flow, in which the updated circuit, including the newly inserted JTLs, is physically placed by the qPlace tool, either using our proposed incremental placement technique or by re-placing the entire circuit. The IBM CPLEX v12.10 package [52] is used for solving the mixed integer linear programming (MILP) problems for clock tree placement and legalization. Since this step may modify the locations of logic gates, the clock tree is re-synthesized to minimize the nominal clock skew and generate a high-quality clock network, although we force it to use the same clock tree topology as in phase one. This is motivated by the fact that the variation-aware hold fixing algorithm takes advantage of the topology of the clock tree and common clock paths to minimize the area overhead of timing fixes. Although clock skews and hold slacks in the nominal condition may be modified, the timing variations will not lead to additional timing violations, as the worst-case variations as a function of the clock topology remain constant.
In addition, both to mitigate routing congestion caused by the relatively small number of routing layers and to facilitate a final pass of hold buffer insertion, during this second placement phase we reserve empty space next to each logic cell and existing hold buffer. After this second placement, we parse the re-placed circuit into ABC and run a final round of timing analysis. If any additional hold buffers are required, they are inserted into the reserved spaces without the need to move other gates. We note that our ABC timing fix algorithm and SystemVerilog translation with interconnect delay model have been added to the existing qPALACE tool suite [32,33]. In particular, the qPlace placement tool [32] has been extended to include an option for our incremental placement algorithm.
The next subsection details our novel hold time fixing techniques. The incremental placement technique is presented in Sec. 2.4.1. The third (evaluation) phase of the flow is described in Sec. 2.5.

2.3.4 Variation Aware Common Path Pessimism Removal
Common path pessimism removal (CPPR) is the removal of unnecessary pessimism in the timing analysis of the launching and capturing clock paths by accounting for the common portion of the launch and capture CPs [53,54]. In particular, the timing uncertainty of the propagation delay of the elements in the CCP affects the launching and capturing clock path delays in the same way.
Thus we propose to remove this unnecessary timing uncertainty from our timing analysis to more accurately capture the worst-case timing scenario. To account for the remaining timing variations, we add hold margins to each timing path and thereby make the design robust to hold time violations. Assume $G_1$ and $G_2$ are the launching and the capturing logic gates, respectively. The hold slack from $G_1$ to $G_2$ is subject to the timing constraint in Equation (2.2).
Instead of applying a constant hold margin to all hold slacks as proposed in [47], we propose a variation-aware strategy to determine the hold margin required for each timing path and apply CPPR to eliminate unnecessary pessimism.
For all gates in the circuit, we assume the gate delay follows a Gaussian distribution with the same standard deviation $\sigma$. In the worst-case scenario for a hold (short) path, the delays of the clock splitters in the launching clock path are biased low, while the delays of the clock splitters in the capturing clock path are biased high. At the same time, the delays of the gates in the data path are biased low. We consider the three-sigma rule [55], which states that 99.7% of the possible data are within three standard deviations of the mean, and assume a worst-case delay change of $3\sigma$ for each gate. Considering the worst-case scenario in Equation (2.2), the variation-aware hold margin can be set to the maximum reduction of $Slack_{hold}(G_1, G_2)$ caused by gate delay variations:
$$T^{margin}_{hold}(G_1, G_2) = 3\sigma \cdot \left(D_{NCP}(G_1, G_2) + D_{DP}(G_1, G_2)\right) \tag{2.3}$$
where $D_{NCP}(G_1, G_2)$ represents the sum of the delays of the clock splitters that are not common to the launching and capturing clock paths, while $D_{DP}(G_1, G_2)$ refers to the sum of the delays of the gates along the data path from $G_1$ to $G_2$.
$D_{NCP}(G_1, G_2)$ is determined by finding the lowest common ancestor in the balanced binary clock tree. We record the routes from the clock source to the clock sinks when building the clock tree. For any two sequentially adjacent gates, the algorithm traverses both the launching and capturing clock paths level by level, starting at the leaves (level 0), until reaching a common node, the lowest common ancestor (LCA).¹ The number of splitters in the NCP is simply the number of splitter nodes met during the traversal before reaching the LCA, which is equal to twice the level of the LCA minus one, i.e., $2 \cdot (level(LCA) - 1)$.
Thus, we can compute $D_{NCP}(G_1, G_2)$ in Equation (2.3) as:
$$D_{NCP}(G_1, G_2) = 2 \cdot \left(level(LCA(G_1, G_2)) - 1\right) \cdot D_{sp} \tag{2.4}$$
where $D_{sp}$ denotes the nominal splitter delay. For each timing path, the hold time constraint is checked and, in case of a negative slack, hold buffers are added before the corresponding input pin of $G_2$ until the constraint is met.
Consider as an example the circuit in Fig. 2.1 and the hold time for the path from $G_1$ to $G_2$. The clock path of $G_1$ consists of the following clock segments: the clock source and splitters $S_1$, $S_3$, $S_4$. Similarly, the clock path of $G_2$ comprises the clock source and splitters $S_1$, $S_3$, $S_5$. The CCP contains $S_1$ and $S_3$, while the NCP contains $S_4$ and $S_5$. $level(LCA(G_1, G_2))$ is 2 in this case. The hold time margin for $G_2$ is thus
$$T^{margin}_{hold}(G_1, G_2) = 3\sigma \cdot \left(2 \cdot D_{sp} + D_{G_1} + N^{data}_{sp} \cdot D_{sp}\right) \tag{2.5}$$
where $D_{G_1}$ denotes the nominal clock-to-Q delay of $G_1$, and $N^{data}_{sp}$ denotes the number of splitters in the data path from $G_1$ to $G_2$. $T^{margin}_{hold}(G_1, G_2)$ can be used to guide hold buffer insertion for the timing path from $G_1$ to $G_2$.
¹ Note that the common node must be at the same level in both the launch and capture paths because the clock tree is balanced.
Assuming there are $N$ clock sinks in the whole circuit, the height $H$ of the clock tree, a fully balanced binary tree, is $\lfloor \log_2 N + 1 \rfloor$. For a pair of clock paths, the time complexity of finding the LCA is $O(H)$. We denote the total number of pairs of sequentially adjacent gates as $E$. Consequently, the time complexity of finding the LCAs of all sequentially adjacent gates is $O(E \cdot \log_2 N)$. The fact that the clock tree is a fully balanced binary tree simplifies the LCA detection algorithm. In contrast, for traditional CMOS circuits, state-of-the-art LCA detection algorithms typically require a reduction to an instance of the range minimum query (RMQ) problem via an Euler walk of the clock tree and the storage of several extra tables [56–58].
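The level-by-level LCA search and Equations (2.3) and (2.4) can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the flow: the `parent` dictionary, the function names, and the numeric values are hypothetical, the clock tree is assumed to be balanced, and $\sigma$ is treated as a fraction of the nominal delay, which is how Equation (2.3) appears to use it.

```python
def lca_level(parent, g1, g2):
    """Climb both clock paths one level per step, starting at the sinks (level 0)."""
    level, a, b = 0, g1, g2
    while a != b:
        a, b = parent[a], parent[b]
        level += 1
    return level


def d_ncp(parent, g1, g2, d_sp):
    """Equation (2.4): total nominal delay of the non-common clock splitters."""
    level = lca_level(parent, g1, g2)
    return 2 * max(level - 1, 0) * d_sp


def hold_margin(sigma, d_ncp_value, d_dp):
    """Equation (2.3), with sigma taken as a fraction of the nominal delay (assumption)."""
    return 3 * sigma * (d_ncp_value + d_dp)


# Clock tree of Fig. 2.1, restricted to the paths reaching G1 and G2.
parent = {"G1": "S4", "G2": "S5", "S4": "S3", "S5": "S3", "S3": "S1", "S1": "ClkSrc"}

# Hypothetical numbers: 5 ps splitters, 10 ps data path delay, sigma = 10%.
level = lca_level(parent, "G1", "G2")
ncp_delay = d_ncp(parent, "G1", "G2", d_sp=5.0)
margin = hold_margin(sigma=0.1, d_ncp_value=ncp_delay, d_dp=10.0)
```

For the tree of Fig. 2.1 the search stops at $S_3$ with level 2, so the NCP contributes two splitter delays, matching the example in the text. Because both paths climb in lockstep, each query costs at most the tree height.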
2.4 Placement Methodology
After adding the hold buffers to the netlist, the locations of the logic cells and inserted hold buffers are determined. Note that since the timing closure algorithm is based on the original clock topology and the inserted hold buffers are not sequential elements, the same clock topology can be utilized after placement and legalization of the hold buffers. However, to minimize timing metrics such as maximum clock skew and the negative timing slacks, the locations of gates are adjusted. Similar to [32], we adopt the deferred merge embedding (DME) algorithm for calculating the locations of the tapping points of the clock network on the layout area and employ an integer linear programming algorithm to map the clock splitters to routing channels and remove the overlaps among the clock splitters and logic gates, such that the maximum clock skew is minimized. Finally, once a legal high-quality solution in terms of placement and timing metrics is generated, another iteration of timing fixes adjusts the timing slacks.
Accordingly, the locations of the placed logic and clock splitters are then modified to accommodate the placement of the hold buffers. The total number of inserted hold buffers is a function of the degree of timing uncertainty and can be a large fraction of the size of the original netlist. Therefore, to optimize the quality of results in terms of the total wirelength and timing metrics, we propose two placement strategies: (i) Incremental Placement and (ii) Placement from Scratch.
2.4.1 Incremental Placement
The incremental placement methodology helps preserve the original placement solution by constraining the placement of logic cells and data splitters to the same rows in which they were initially placed.
When the number of inserted hold buffers is relatively small, an incremental placement approach helps minimize the displacement of original netlist elements, eliminates the need for extensive modifications to the clock network, and facilitates converging to a high-quality solution without the need for multiple iterations of placement and clock synthesis. Alternatively, when the number of inserted hold buffers is large, executing the placement algorithm from scratch may yield better overall placement metrics such as the total wirelength, layout area, and routing congestion. As the number of required hold buffers depends on the structure of the circuit and the degree of variations, this work considers both.
In our incremental placement flow, the logic cells and data splitters remain in their initially placed logic rows, i.e., their y coordinates are preserved. However, their x coordinates are adjusted to accommodate the placement of hold buffers.
The pseudocode for the incremental placement flow is shown in Algorithm 1. Lines 1-23 show the assignment of the hold buffers to logic rows. First, the ideal number of gates per row is determined by computing the total number of gates (including hold buffers, logic cells, data splitters and clock splitters) and dividing this number by the number of rows (cf. line 1 in Algorithm 1). The algorithm tries to assign hold buffers to logic rows so as to minimize the final width of the layout, and thereby the total layout area after hold buffer insertion, using the number of cells assigned per row as a proxy for row width. This is achieved by setting a threshold value for the number of gates per row and trying to distribute the hold buffers among rows such that the final number of gates per row is below this threshold for all rows. Initially, this threshold value is set to the ideal number of gates per row and each row is marked as full or partially filled by comparing the existing gate count with this threshold (cf. line 7).
We then assign each hold buffer to the partially-filled row nearest to its fanout gate. In particular, if the row of the fanout gate is partially-filled, we assign the hold buffer to the same row with its x coordinate shifted from its fanout gate by the width of one hold buffer (cf. lines 8-10). Otherwise, the hold buffer is assigned to the nearest partially-filled row with its x coordinate set to the same value as the fanout gate (cf. lines 12-15). There may be scenarios in which all nearby rows for a hold buffer are full, which can increase the wire delay more than expected and force an increase in the clock cycle time. To avoid degrading the clock frequency when this situation is encountered, we instead increase the threshold by one buffer and accept a small increase in area (cf. lines 16-20).

Once the assignment of the hold buffers to the logic rows is determined, each logic row is legalized to remove all cell overlaps using the algorithm presented in [33] (cf. line 24 in Algorithm 1). In lines 25-26, the algorithm synthesizes and places the clock tree [32] and produces the final legal placement. Because the logic cells are mapped to their original logic rows, the locations of the sink nodes of the clock network are minimally modified. Therefore, the locations of the tapping points of the clock network, i.e., the placement of the clock splitters, will be similar to those of the original clock network.

In summary, the incremental placement algorithm tries to fulfill three objectives: (i) preserve the layout area by distributing the hold buffers among partially-filled rows; (ii) minimize the perturbation to the original placement solution to facilitate the clock tree synthesis and control the clock skew; and (iii) minimize the adverse effect of additional cell and interconnect delays, due to the insertion of hold buffers, on the maximum clock frequency of the circuit.
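The row-assignment step described above (lines 1-23 of Algorithm 1) can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this work: rows are assumed to be numbered so that the difference of row indices measures their distance, and the clock-frequency check of line 13 is replaced by a hypothetical `max_row_distance` limit, since the actual timing check is not specified here. The function name and its arguments are likewise illustrative.

```python
import math


def assign_hold_buffers(row_counts, buffer_rows, total_gates, max_row_distance=1):
    """Assign each hold buffer to a logic row.

    row_counts: dict mapping row index -> number of gates already in the row.
    buffer_rows: for each hold buffer, the row index of its fanout gate.
    total_gates: total gate count including the hold buffers.
    max_row_distance: hypothetical stand-in for the clock-frequency check.
    """
    counts = dict(row_counts)
    # Ideal number of gates per row, used as the initial threshold.
    threshold = math.ceil(total_gates / len(counts))
    assignment = []
    for fanout_row in buffer_rows:
        if counts[fanout_row] < threshold:
            # The row of the fanout gate is partially filled: stay in that row.
            target = fanout_row
        else:
            unfilled = [r for r in counts if counts[r] < threshold]
            closest = None
            if unfilled:
                closest = min(unfilled, key=lambda r: (abs(r - fanout_row), r))
            if closest is not None and abs(closest - fanout_row) <= max_row_distance:
                # A nearby partially-filled row exists: use it.
                target = closest
            else:
                # All nearby rows are full: relax the threshold by one buffer.
                threshold += 1
                target = fanout_row
        counts[target] += 1
        assignment.append(target)
    return assignment, counts, threshold
```

For instance, with three rows holding 2, 2 and 0 gates and three hold buffers whose fanout gates all sit in row 0, the third buffer finds only row 2 unfilled, which is farther than the assumed limit, so the threshold is raised and the buffer stays in row 0.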
Algorithm 1 Incremental Placement
INPUT: A netlist, logic cell placement, a clock tree topology (CTT), a list of inserted hold buffers
OUTPUT: A placed netlist with clock tree and hold buffers
 1: Th = ⌈#nodes / #rows⌉    ▷ the threshold (max) # of gates per row
 2: rowMap[rowId] = no. gates in row rowId
 3: for path from u to v do
 4:     currRowId = getRow(v)
 5:     for hold buffer j do
 6:         currGateNum = rowMap[currRowId]
 7:         filled = compare(Th, currGateNum)
 8:         if filled is false then
 9:             assign j to same row as v
10:             rowMap[currRowId] += 1
11:         else
12:             closestRowId ← closest unfilled row
13:             if clock frequency is met then
14:                 assign j to closest unfilled row
15:                 rowMap[closestRowId] += 1
16:             else    ▷ Increase threshold by 1
17:                 Th += 1
18:                 assign j to same row as v
19:                 rowMap[currRowId] += 1
20:             end if
21:         end if
22:     end for
23: end for    ▷ complete row assignment of hold buffers
24: legalize and remove overlaps of hold buffers
25: add clock splitters from original CTT to netlist
26: place clock splitters, legalize, and remove overlaps

2.4.2 Placement from Scratch

For some circuits and variation settings, the number of inserted hold buffers per row can be comparable to the initial number of gates. Thus, the incremental placement algorithm is forced to make substantial modifications to the original placement solution, resulting in a significant increase in the total wirelength. For these cases, it is beneficial to re-place the updated netlist without any row restrictions.

Once the placement and clock tree synthesis are completed, the timing closure flow adds a final round of hold buffer insertion to ensure timing requirements under nominal conditions are satisfied. Experimental results show that the total number of required hold buffers in this phase is much smaller than the size of the netlist and that these buffers can be placed in the reserved empty spaces next to existing logic cells without requiring multiple iterations.
2.5 Evaluation Flow

We evaluated the timing yield of our final circuits by translating them to SystemVerilog (SV) netlists using Python scripts, similar to [59], and running dynamic Monte Carlo (MC) co-simulations in Cadence NCsim, checking all setup and hold constraints. This section first presents the variation model used and then details the Monte Carlo co-simulation flow.

2.5.1 Variation Model

In this subsection, we summarize the variation model utilized to generate the random gate delays used in the MC simulations for evaluating timing yield [47].

Figure 2.4: Grid-based variation model with uniform grids.

To account for the spatial correlation of timing uncertainties, we employ a placement-aware variation model, utilizing a grid-based model similar to [47,60] illustrated in Fig. 2.4. In this model, the entire layout area is considered to be inside one grid bin (level 0) where we assume global variations affect all the gate delays. Subsequently, to account for the local variations on the chip, we divide the layout area level by level, with each bin subdivided into smaller bins. At each level, it is assumed that all process parameters and variations within the same bin have the same characteristics. The differences between bins are determined by the local variation of the level, and different hierarchy levels are assumed to be statistically independent [47]. By adding up the timing variations induced by all levels, the delay factor of each gate can be defined. The delays of all gates produce a data set that follows a Gaussian distribution. We set the targeted standard deviation of this Gaussian-distributed data set under process variations based on the process control monitor (PCM) data for the 350nm fabrication process SFQ5ee developed by MIT Lincoln Laboratory (MIT LL) [61]. Then the parameters for global and local variations can be set according to the targeted standard deviation [47,62,63].
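As a rough illustration of how such a hierarchical model can generate per-gate delay factors, the sketch below draws one independent Gaussian offset per bin and per level and sums the offsets of the bins that contain each gate. Several details are assumptions made only for this example and are not taken from the model of [47]: each level l is taken to split the layout into 2^l × 2^l uniform bins, the target variance is divided equally among the levels, and the function name `delay_factors` and its parameters are hypothetical.

```python
import math
import random


def delay_factors(gates, width, height, levels, sigma_target, seed=0):
    """Return a dict gate name -> multiplicative delay factor.

    gates: dict mapping gate name -> (x, y) placement coordinates.
    width, height: layout dimensions in the same unit as the coordinates.
    levels: number of hierarchy levels (level 0 is the global bin).
    sigma_target: targeted standard deviation of the delay factors.
    """
    rng = random.Random(seed)
    # Assumption: independent levels with equal variance, so the per-level
    # variances add up to sigma_target squared.
    sigma_level = sigma_target / math.sqrt(levels)
    samples = {}  # (level, bin_x, bin_y) -> random offset shared by the bin
    factors = {}
    for name, (x, y) in gates.items():
        factor = 1.0
        for level in range(levels):
            bins = 2 ** level  # assumption: 2^level bins per axis
            bx = min(int(x / width * bins), bins - 1)
            by = min(int(y / height * bins), bins - 1)
            key = (level, bx, by)
            if key not in samples:
                samples[key] = rng.gauss(0.0, sigma_level)
            factor += samples[key]
        factors[name] = factor
    return factors
```

Gates that fall into the same bin at every level receive the same factor, while gates that share only the coarser bins are partially correlated, which is the spatial-correlation effect the grid model is meant to capture.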
2.5.2 Dynamic Simulation

In order to run Monte Carlo (MC) simulations, we define gate delay scaling factors for each gate in the circuit and write them as SystemVerilog macros. According to our variation model, we generate a new set of random delay factors for each Monte Carlo simulation. In a plain simulation without random gate delays, the gate delay factors are set to one. This is used to verify the circuit in the nominal case before Monte Carlo simulations. A “golden” behavioral netlist is co-simulated with the same random input vectors to validate the circuit functionality. An MC run is considered a “pass” only when there are no setup or hold violations and no mismatches between the primary outputs of the model under simulation and those of the golden model.

2.5.3 Experimental Results

We compare the proposed method with a baseline approach proposed in [47] that applies a constant hold margin to each timing-critical path in the circuit. The fixed margin approach does not consider clock topology and requires, in general, multiple iterations to obtain a design-specific fixed margin that achieves a high timing yield while managing the incurred area overhead. In contrast, our proposed method adds more hold margin to paths where the lowest common ancestor of sequentially adjacent FFs is higher in the clock tree and thereby achieves superior results over all the benchmarks. We also compare the two proposed re-placement methods, i.e., placement from scratch and incremental placement.

We run Monte Carlo simulations on the placed ISCAS’85 benchmarks [64]. All experiments were run on two Intel Xeon E5-2450 v2 CPUs with 128 GB of RAM. We experimented with both 2σ and 3σ variations in hold time margin with random Gaussian-distributed gate delays to analyze the trade-off between area and timing yield. For the fixed margin approach, we evaluated hold margin values of 10 ps and 4 ps before settling on a compromise of 7 ps, which balances buffer area overhead and timing yield.
In all our experiments, the additional white-space around logic cells and hold buffers, which is designed to reduce local routing congestion (i.e., to ensure a routable solution is produced) and to allow inserting hold buffers in the final step of timing closure, is set to 8× the routing track (10µm). To solve the MILP problem for clock tree synthesis, the CPLEX time limit is set to 60 minutes.

Table 2.1 lists the timing yield values and the number of inserted hold buffers before and after re-placement, for the baseline, 2σ and 3σ approaches. In these experiments, the clock period is set to the delay of the longest timing path plus a small setup margin. In our simulation results, the clock frequencies of the fixed and variation-based hold fixing approaches differ by an average of 5% across all the benchmarks.

Compared with the fixed margin approach, our 3σ approach produces netlists with an average reduction of 8.4% in the number of hold buffers and an increase in timing yield of 6.2%. The 2σ approach achieves an average saving of 21.9% in the number of hold buffers with a 1.7% higher timing yield. The primary source of improvement over the baseline approach comes from the application of CPPR to the clock tree topology.

Note that these results assume no common clock path between PIs/POs and the logic cells. If we estimate the common clock path (CCP) length of PI/PO paths with the average CCP length of the circuit, the 3σ approach achieves an average saving of 10.2% in hold buffers over the fixed margin approach. Alternatively, if we simply exclude JTLs on PI/PO paths, the 3σ approach saves an average of 14.1% of the hold buffers.
Table 2.1: Number of hold buffers (JTLs) and timing yield (%) for the fixed margin (7ps) and the proposed variation-based 3σ and 2σ hold margins, with incremental placement

Design | Std Dev | Fixed Yield(%) | Fixed # JTLs plc | Fixed # JTLs replc | 3σ Yield(%) | 3σ # JTLs plc | 3σ # JTLs replc | 3σ Yield+(%) | 3σ # JTLs-(%) | 2σ Yield(%) | 2σ # JTLs plc | 2σ # JTLs replc | 2σ Yield+(%) | 2σ # JTLs-(%)
c432 | 0.0808 | 99.1 | 1508 | 1564 | 99.7 | 1012 | 1420 | 0.6 | 9.2 | 95.4 | 1012 | 1185 | -3.7 | 24.2
c499 | 0.0808 | 95.9 | 564 | 584 | 99.7 | 426 | 531 | 4.0 | 9.1 | 99.3 | 426 | 470 | 3.5 | 19.5
c880 | 0.0808 | 90.4 | 1031 | 1151 | 99.8 | 757 | 1018 | 10.4 | 11.6 | 95.4 | 757 | 871 | 5.5 | 24.3
c1355 | 0.0809 | 99.1 | 577 | 591 | 100.0 | 431 | 525 | 0.9 | 11.2 | 97.8 | 431 | 470 | -1.3 | 20.5
c1908 | 0.0808 | 97.5 | 1085 | 1150 | 99.7 | 739 | 985 | 2.3 | 14.3 | 98.3 | 739 | 830 | 0.8 | 27.8
c2670 | 0.0831 | 93.8 | 2831 | 3570 | 99.8 | 2513 | 3573 | 6.4 | -0.1 | 96.1 | 2513 | 3196 | 2.5 | 10.5
c3540 | 0.0829 | 93.7 | 2055 | 2294 | 99.7 | 1416 | 2190 | 6.4 | 4.5 | 93.5 | 1416 | 1770 | -0.2 | 22.8
c5315 | 0.0829 | 90.2 | 4199 | 5036 | 99.1 | 3055 | 4747 | 9.9 | 5.7 | 91.4 | 3055 | 3939 | 1.3 | 21.8
c6288 | 0.0829 | 92.5 | 4697 | 6039 | 99.2 | 3592 | 5404 | 7.2 | 10.5 | 93.0 | 3592 | 4685 | 0.5 | 22.4
c7552 | 0.0829 | 86.8 | 2601 | 2990 | 98.9 | 1794 | 2745 | 13.9 | 8.2 | 94.2 | 1794 | 2252 | 8.5 | 24.7
AVG. | 0.0819 | 93.9 | 2115 | 2497 | 99.6 | 1574 | 2313.8 | 6.2 | 8.4 | 95.4 | 1574 | 1967 | 1.7 | 21.9

The results of the placement from scratch algorithm are shown in Table 2.2. The 3σ approach shows a 2.0% saving in the number of hold buffers with a 5.9% improvement in timing yield, when compared with the baseline approach. Additionally, the 2σ approach shows a 19.4% saving in the number of hold buffers with a 0.9% drop in yield.
Table 2.2: Number of hold buffers (JTLs) and timing yield (%) for the fixed margin (7ps) and the proposed variation-based 3σ and 2σ hold margins, with placement from scratch

Design | Std Dev | Fixed Yield(%) | Fixed # JTLs plc | Fixed # JTLs replc | 3σ Yield(%) | 3σ # JTLs plc | 3σ # JTLs replc | 3σ Yield+(%) | 3σ # JTLs-(%) | 2σ Yield(%) | 2σ # JTLs plc | 2σ # JTLs replc | 2σ Yield+(%) | 2σ # JTLs-(%)
c432 | 0.0809 | 90.7 | 1508 | 1935 | 98.8 | 1012 | 1685 | 8.9 | 12.9 | 95.6 | 1012 | 1385 | 5.4 | 28.4
c499 | 0.0810 | 99.2 | 564 | 698 | 99.9 | 426 | 731 | 0.7 | -4.7 | 95.9 | 426 | 601 | -3.3 | 13.9
c880 | 0.0808 | 97.9 | 1031 | 1408 | 99.5 | 757 | 1394 | 1.6 | 1.0 | 92.6 | 757 | 1132 | -5.4 | 19.6
c1355 | 0.0810 | 96.9 | 577 | 695 | 99.5 | 431 | 678 | 2.7 | 2.4 | 95.8 | 431 | 573 | -1.1 | 17.6
c1908 | 0.0808 | 91.3 | 1085 | 1343 | 99.9 | 739 | 1274 | 9.4 | 5.1 | 92.5 | 739 | 1032 | 1.3 | 23.2
c2670 | 0.0819 | 95.7 | 2831 | 4170 | 99.5 | 2513 | 4155 | 4.0 | 0.4 | 95.1 | 2513 | 3615 | -0.6 | 13.3
c3540 | 0.0829 | 94.3 | 2055 | 2922 | 99.3 | 1416 | 3105 | 5.3 | -6.3 | 94.6 | 1416 | 2417 | 0.3 | 17.3
c5315 | 0.0828 | 90.8 | 4199 | 8507 | 97.8 | 3055 | 8621 | 7.7 | -1.3 | 89.3 | 3055 | 7143 | -1.7 | 16.0
c6288 | 0.0829 | 85.9 | 4697 | 7731 | 98.9 | 3592 | 7591 | 15.1 | 1.8 | 87.4 | 3592 | 6333 | 1.7 | 18.1
c7552 | 0.0829 | 94.7 | 2601 | 5115 | 98.3 | 1794 | 4658 | 3.8 | 8.9 | 89.1 | 1794 | 3742 | -5.9 | 26.8
AVG. | 0.0818 | 93.7 | 2115 | 3452 | 99.1 | 1574 | 3389 | 5.9 | 2.0 | 92.8 | 1574 | 2797 | -0.9 | 19.4

Table 2.3: Number of hold buffers (JTLs) for placement from scratch vs incremental placement approaches

Design | # logic gates+splitters | Fixed # JTLs scratch | Fixed # JTLs incre. | Fixed Saving (%) | 3σ # JTLs scratch | 3σ # JTLs incre. | 3σ Saving (%) | 2σ # JTLs scratch | 2σ # JTLs incre. | 2σ Saving (%)
c432 | 3760 | 1935 | 1564 | 19.2 | 1685 | 1420 | 15.7 | 1385 | 1185 | 14.4
c499 | 1916 | 698 | 584 | 16.3 | 731 | 531 | 27.4 | 601 | 470 | 21.8
c880 | 3585 | 1408 | 1151 | 18.3 | 1394 | 1018 | 27.0 | 1132 | 871 | 23.1
c1355 | 1916 | 695 | 591 | 15.0 | 678 | 525 | 22.6 | 573 | 470 | 18.0
c1908 | 3596 | 1343 | 1150 | 14.4 | 1274 | 985 | 22.7 | 1032 | 830 | 19.6
c2670 | 6841 | 4170 | 3570 | 14.4 | 4201 | 3573 | 14.9 | 3615 | 3196 | 11.6
c3540 | 7831 | 2922 | 2294 | 21.5 | 3105 | 2190 | 29.5 | 2417 | 1770 | 26.8
c5315 | 15575 | 8507 | 5036 | 40.8 | 8621 | 4747 | 44.9 | 7143 | 3939 | 44.9
c6288 | 14169 | 7731 | 6039 | 21.9 | 7591 | 5404 | 28.8 | 6333 | 4685 | 26.0
c7552 | 9420 | 5115 | 2990 | 41.5 | 4658 | 2745 | 41.1 | 3742 | 2252 | 39.8
AVG. | 6861 | 3452 | 2497 | 22.3 | 3394 | 2314 | 27.5 | 2797 | 1967 | 24.6

Finally, Tables 2.1, 2.2, and 2.3 together provide a comparison between the placement from scratch and incremental placement approaches. Under the assumed variations, incremental placement outperforms the placement from scratch approach, with hold buffer savings of 22.3%, 27.5% and 24.6% under the fixed, 3σ and 2σ margin approaches, respectively. Additionally, by reducing the number of inserted hold buffers, incremental placement reduces the power consumed in the DC bias resistors of each SFQ cell, i.e., the static power consumption, which is responsible for most of the circuit power dissipation in standard RSFQ logic [5]. The main advantage in terms of reducing the number of hold buffers originates from minimizing the perturbation to the original placement solution and layout area. As mentioned earlier, our flow utilizes the original timing-aware clock topology, which considers the criticality of the data paths in terms of timing slacks, after placement and legalization of the hold buffers. Therefore, fewer perturbations to the placement solution and the fixed clock topology lead to fewer changes in the location of clock splitters, fewer modifications to the clock arrival times, and hence less effort on timing closure and hold fixing.
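The percentage columns of Tables 2.1 to 2.3 are not defined explicitly in the text. The short sketch below shows one reading that reproduces the tabulated entries checked here: Yield+(%) as the yield gain relative to the fixed margin yield, and # JTLs-(%) and Saving (%) as the relative reduction in the number of hold buffers. The helper names are hypothetical and this reading is an inference from the numbers, not a stated definition.

```python
def relative_gain_pct(new, base):
    """Relative increase of `new` over `base`, in percent, to one decimal."""
    return round(100.0 * (new - base) / base, 1)


def saving_pct(new, base):
    """Relative reduction of `new` with respect to `base`, in percent."""
    return round(100.0 * (base - new) / base, 1)


# c432 with incremental placement, 3-sigma vs fixed margin (Table 2.1)
yield_gain = relative_gain_pct(99.7, 99.1)   # tabulated Yield+(%) is 0.6
jtl_saving = saving_pct(1420, 1564)          # tabulated # JTLs-(%) is 9.2

# c432 with fixed margin, incremental vs from scratch (Table 2.3)
placement_saving = saving_pct(1564, 1935)    # tabulated Saving (%) is 19.2
```

Under this reading, a negative Yield+(%) entry means the yield fell below that of the fixed margin baseline, and a negative # JTLs-(%) entry means more hold buffers were inserted than in the baseline.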
Table 2.4: Clock Cycle Time (CCT) and area comparison between placement from scratch and incremental placement (3σ)

Design | Std Dev | Scratch CCT (ps) | Scratch Area | Incremental CCT (ps) | Incremental CCT Overhead | Incremental Area | Incremental Area Overhead
c432 | 0.0809 | 116.6 | 3.72E+07 | 139.6 | 19.7% | 3.64E+07 | -2.3%
c499 | 0.0810 | 115.8 | 1.80E+07 | 121.7 | 5.2% | 1.82E+07 | 1.0%
c880 | 0.0808 | 118.9 | 3.19E+07 | 117.4 | -1.2% | 3.16E+07 | -1.0%
c1355 | 0.0810 | 82.0 | 1.79E+07 | 92.5 | 12.8% | 1.80E+07 | 0.1%
c1908 | 0.0808 | 88.6 | 3.20E+07 | 115.8 | 30.7% | 3.15E+07 | -1.6%
c2670 | 0.0819 | 249.3 | 7.17E+07 | 302.2 | 21.2% | 7.24E+07 | 0.8%
c3540 | 0.0829 | 146.3 | 7.11E+07 | 157.0 | 7.3% | 7.26E+07 | 2.0%
c5315 | 0.0828 | 299.5 | 1.44E+08 | 324.5 | 8.4% | 1.44E+08 | 0.4%
c6288 | 0.0829 | 204.5 | 1.30E+08 | 248.9 | 21.7% | 1.32E+08 | 1.6%
c7552 | 0.0829 | 306.3 | 9.81E+07 | 299.9 | -2.1% | 9.95E+07 | 1.5%
AVG. | 0.0818 | 172.8 | 6.52E+07 | 192.0 | 12.4% | 6.57E+07 | 0.2%

Table 2.4 presents the minimum clock cycle time and the layout area obtained by applying the two placement approaches under the 3σ margin approach. Under the assumed variations, incremental placement has an average overhead of 12.4% and 0.2% in terms of minimum clock cycle time and layout area, respectively. By distributing the hold buffers such that the logic gates and hold buffers are somewhat uniformly distributed among all the rows, incremental placement minimizes the impact on layout area. The reason behind the degradation of clock cycle times is that incremental placement sometimes creates setup-critical paths by adding to the wire delay of paths with inserted hold buffers, whereas the placement from scratch approach, aimed at minimizing the total wirelength without row restrictions, manages to shorten more of the wires and hence reduces the long wires on some paths, which improves the clock cycle time.

Note that if the timing uncertainties were significantly higher, more hold buffers would be needed and the perturbations to the original gate locations would be larger, increasing the overheads of incremental placement. For example, consider circuit c880 with variation increased from 0.080 to 0.134.
Then, incremental placement overheads increase to 0.4% in area and 8% in minimum clock cycle time.

Table 2.5: Run-time (sec) for the fixed margin (7ps) and the proposed variation-based 3σ and 2σ hold margins, with incremental placement

Design | fixed margin (7ps) | 3σ | 2σ
c432 | 101 | 92 | 93
c499 | 57 | 56 | 55
c880 | 95 | 87 | 86
c1355 | 59 | 56 | 57
c1908 | 90 | 87 | 86
c2670 | 206 | 199 | 199
c3540 | 245 | 235 | 237
c5315 | 618 | 594 | 593
c6288 | 516 | 484 | 489
c7552 | 383 | 363 | 361
AVG. | 237 | 225 | 226

Finally, to quantify the scalability of our proposed hold time fixing algorithm, Table 2.5 summarizes the run-time of our timing closure flow, beginning after clock tree synthesis and including incremental placement. Our variation-aware approach takes a similar time to the fixed margin approach and, even for benchmarks with a large gate count, takes less than 10 minutes.

2.6 Conclusions

This chapter presents a variation-aware hold time fixing flow for RSFQ circuits. We consider the worst-case scenario in terms of gate delay variations caused by timing uncertainties and apply the common path pessimism removal technique to the clock tree topology to reduce the number of hold buffers required for closing the timing of a circuit. Our flow allows a trade-off between timing yield and layout area of the circuit. Additionally, we present two placement techniques, incremental and from scratch, to efficiently place inserted hold buffers while reducing incurred overheads in terms of layout area and maximum clock frequency. The efficacy of the presented methodology is verified using Monte Carlo simulations on ISCAS’85 benchmark circuits and an SFQ5ee-based cell library.

Chapter 3
Multi-Phase Clocking Methodology in RSFQ Circuits

In this chapter, we first review multiple clocking methodologies for RSFQ circuits and multi-phase clocked systems in typical latch-based designs. Then we discuss the multi-threading feature in RSFQ sequential circuits and propose the multi-phase clocking methodology.
Next, we describe the design flow, experimental results and verification flow by simulation. Last, we present a modified ILP algorithm to maximize the DFF sharing, together with its experimental results.

3.1 Related work

As discussed in Sec. 1.3.1, the correct operation of RSFQ circuits with a single global clock needs full path balancing (FPB), which requires the insertion of a large number of path balancing DFFs, sometimes more than doubling the total number of clocked gates. As pointed out in [65], the number of path balancing DFFs is as high as 4.5× the original logic gate count in an IntDiv8 circuit (an 8-bit integer divider). This overhead in area and power has been a critical bottleneck when trying to extend SFQ technology to larger systems.

To address this concern, schemes that trade off maximum achievable throughput for lower overhead have also been explored. In particular, the dual clocking method (DCM) [2] described in Sec. 1.2.3 uses two clock signals, one fast and one slow. The slow clock clocks the architectural registers and sets the throughput of the circuit, while the fast clock is used to propagate the data through a possibly unbalanced network of logic. The maximum difference between the logic levels of the fanin gates of a gate is called its imbalance factor. The largest imbalance factor in the circuit determines the throughput degradation. However, this method requires the addition of non-destructive read-out DFFs (NDROs) as a repeat band to repeat pulse signals and 2-input AND gates as a mask band to prevent the propagation of invalid values in order to ensure correct operation with unbalanced data paths. Moreover, the DCM still requires clock tree synthesis and routing of a fast clock. Finally, although the DCM is likely extendable to sequential circuits, the approach has only been demonstrated on combinational circuits.
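To make the two quantities discussed above concrete, the following sketch computes logic levels on a small netlist graph, the number of DFFs that full path balancing would place on the edges, and the largest imbalance factor. It is a simplified illustration under stated assumptions: primary inputs are placed at level 0, each edge from level a to level b is assumed to need b - a - 1 DFFs, and neither output balancing nor DFF sharing across fanouts is modeled. The function names and the `fanins` dictionary format are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def logic_levels(fanins):
    """fanins: dict node -> list of fanin nodes (primary inputs map to [])."""
    levels = {}
    for node in TopologicalSorter(fanins).static_order():
        preds = fanins.get(node, [])
        # A node sits one level above its deepest fanin; inputs are level 0.
        levels[node] = 1 + max(levels[p] for p in preds) if preds else 0
    return levels


def fpb_dff_count(fanins):
    """DFFs needed on edges so that every edge spans exactly one level."""
    levels = logic_levels(fanins)
    return sum(levels[j] - levels[i] - 1
               for j, preds in fanins.items() for i in preds)


def max_imbalance_factor(fanins):
    """Largest spread of fanin levels over all gates."""
    levels = logic_levels(fanins)
    spreads = [max(levels[p] for p in preds) - min(levels[p] for p in preds)
               for preds in fanins.values() if preds]
    return max(spreads, default=0)


example = {"a": [], "b": [], "g1": ["a", "b"], "g2": ["g1", "a"], "g3": ["g2", "b"]}
```

In this example g3 is fed by g2 (level 2) and by input b (level 0), so its imbalance factor is 2, and the two short edges a→g2 and b→g3 together account for the three balancing DFFs.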
In this chapter, we propose a multi-phase clocking approach that significantly reduces the number of path balancing DFFs and enables efficient multi-threaded operation in which the frequency of each clock phase never exceeds the achieved throughput. In addition to the overall benefits of this clocking scheme, the algorithmic contributions of this chapter can be summarized as follows:

• We propose a novel integer linear programming (ILP) algorithm that minimizes the required path balancing DFFs in multi-phase gate-level-pipelined logic. Our algorithm applies to both combinational and sequential circuits and guarantees correct multi-threaded operation.
• We adopt a fanout optimization technique to improve the utilization of and reduce the number of required path-balancing DFFs.
• Experimental results are presented on eight SFQ benchmarks using a commercial-grade design flow that spans logic synthesis, clock tree synthesis, and placement and route (PnR).
• The effectiveness of the proposed approach is demonstrated with comparisons to both the full path balancing method and the state-of-the-art dual clocking method, with experimental results and a case study.
• The correct functionality of the multi-phase clocking benchmark circuits is verified by dynamic simulation with random input patterns compared with “golden” RTL Verilog.
• We present a modified ILP algorithm to maximize the DFF sharing across the fanout gates and further reduce the number of required path-balancing DFFs. The effectiveness of DFF sharing is demonstrated with experimental results.

3.2 Review of Multi-Phase Clocked System

Multi-phase clocks were first proposed for CMOS latch-based circuits, as formalized by Sakallah, Mudge, and Olukotun (SMO) [66]. An optimal timing model is proposed based on making a distinction between two types of signals: clock and data. SMO defines a general system of timing constraints (GSTC) that describes the relationship between data and clock signals as follows.
An N-phase clock is defined as a collection of N periodic signals $\phi_1, \phi_2, \ldots, \phi_N$ with a common cycle time $T$. The phase signals are ordered in a global time reference: $e_{i-1} < e_i$ and $e_N = T$, where $e_i$ is the time of the closing edge of phase $\phi_i$. The phase shift $E_{ij}$ from phase $i$ to phase $j$ is defined in (3.1). For SFQ logic, with the data and clock signals represented as voltage pulses instead of voltage levels, as shown in Fig. 3.1, we define the time of the pulse center $c_i$ as follows: $c_{i-1} < c_i$ and $c_N = \frac{N-1}{N} T$.

$$E_{ij} = \begin{cases} e_j - e_i, & \text{if } i < j \\ T + e_j - e_i, & \text{if } i \ge j \end{cases} \tag{3.1}$$

In CMOS circuits, multi-phase clocks can be implemented with a shift register or a delay-locked loop [67]. In RSFQ circuits, shift register circuitry has been studied, as in [68].

Figure 3.1: Multi-Phase Clocks

3.3 Multi-threading in RSFQ circuit

For combinational circuits (with no feedback loops), the throughput of the system is limited by the frequency of the global clock. However, for sequential circuits (with feedback loops), the throughput of a single thread of computation is limited by the latency (in terms of number of clock cycles) of the loop. In both cases, the FPB scheme enables the circuit to support, in principle, multiple threads of computation. The maximum number of threads of an FPB sequential circuit is the latency around the sequential loop.

As an illustrative example, Fig. 3.2 shows a fully path-balanced sequential circuit with four gates in a loop, which, as annotated, can thus support four independent threads. Inputs can be applied to the circuit every clock cycle as long as the inputs of each thread are separated by four clock cycles. This clocking style was recently used for a single instruction, multiple thread (SIMT) 4-bit SFQ processor [8].

Figure 3.2: A Fully Path Balanced Gate-Level Pipelined Sequential Circuit

The impact of such deep fine-grained multi-threaded pipelines is twofold.
The first is that these circuits require a large number of path balancing DFFs to ensure the computations of different threads do not interfere. The example in Fig. 3.2 requires three additional DFFs for path balancing (gates G8, G9, and G10). Note that we refer to the DFFs in the original circuit (e.g., gate G5) as registers to distinguish them from path-balancing DFFs. G3 and G7 are data path splitters. Similar to a pipeline stage, we use the notion of a clock stage to distinguish steps of the data flow. S1-S4 represent four clock stages.

The second impact of these deep gate-level pipelines is that architecting the environment of these circuits to support many threads and minimize data and control dependencies becomes critical, as otherwise the deep pipelines would remain largely empty and underutilized. In fact, for many practical systems (e.g., CPUs), support for more than a few threads is impractical because each thread requires a significant amount of private memory (e.g., a register file), a severely limited resource in current SFQ designs [8].

3.4 Proposed Multi-Phase Clocking Methodology

This section provides an overview of our proposed multi-phase clocking solution, followed by our integer linear programming (ILP) formulation.

3.4.1 Problem formulation

Our proposed N-phase clocking solution reduces the number of required extra DFFs and the number of threads needed to achieve maximum throughput. As an example, Fig. 3.3 demonstrates the application of this N-phase clocking scheme to the same circuit as shown in Fig. 3.2, where N = 2 in this case. In general, there are M clock stages and each stage is associated with a particular thread of operation. In this example, M = 2. Each clock stage is made up of N clock phases, where the various phases propagate the data through a given clock stage.
To formalize the correctness constraints of this scheme, we define a directed acyclic graph (DAG) as a pair of nodes V and edges E where each register in the circuit is modeled as a pair of distinct pseudo input and output nodes and each clocked logic gate is an internal node in the graph. We create a directed edge in E between pairs of nodes for every connection between associated inputs, outputs, gates, and registers, where data path splitters and interconnect wires (JTLs and/or PTLs) are abstracted into the edges. The corresponding DAG for the example circuit is shown in Fig. 3.4.

Figure 3.3: The circuit of Fig. 3.2 re-designed with two-phase clocking and two threads.

Figure 3.4: The formalized DAG for the circuit of Fig. 3.2.

Moreover, for each node $i \in V$ in the graph, we create two integer variables, clock stage $S_i$ and clock phase $CK_i$. $S_i$ indicates the clock stage of gate $i$ and $CK_i$ indicates with which phase of the clock the gate is clocked. Finally, we let $C_{ij}$ represent the number of additional DFFs required between nodes $i$ and $j$.

To guarantee correct multi-threaded operation, for each edge from node $i$ to node $j$, node $j$ must be at the same or one higher clock stage than node $i$. This guarantees that no path of logic passes through a clock stage without the associated data being properly captured in a clocked gate. In addition, if node $i$ feeds node $j$ and both are in the same clock stage, node $j$ should have a larger clock phase than node $i$. This guarantees that data within a clock stage properly propagates through the stage. All inputs and outputs, including pseudo inputs and outputs modeling registers, should have the same clock phase. All inputs are in the first clock stage and all outputs should be in the same clock stage. These last two constraints guarantee the consistency of users’ clock interfaces and that all paths have the same sequential depth, ensuring the computations in different threads of operation remain independent.
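The constraints listed above can be checked mechanically for a given stage and phase assignment. The sketch below is one possible checker, assuming the netlist is given as a list of directed edges and that stages and phases are stored in dictionaries keyed by node name; the function `check_assignment` and this data layout are hypothetical and are not part of the design flow described in this chapter.

```python
def check_assignment(edges, stage, phase, inputs, outputs):
    """Return a list of violated constraints (empty if the assignment is valid).

    edges: list of (i, j) pairs, meaning node i feeds node j.
    stage, phase: dicts mapping node -> clock stage S and clock phase CK.
    inputs, outputs: primary and pseudo inputs/outputs of the DAG.
    """
    violations = []
    for i, j in edges:
        gap = stage[j] - stage[i]
        if gap not in (0, 1):
            # Node j must be in the same or the next clock stage.
            violations.append(f"{i}->{j}: stage gap {gap} not in {{0, 1}}")
        elif gap == 0 and phase[j] <= phase[i]:
            # Within one stage the clock phase must strictly increase.
            violations.append(f"{i}->{j}: same stage but phase does not increase")
    boundary = list(inputs) + list(outputs)
    if len({phase[n] for n in boundary}) > 1:
        violations.append("inputs/outputs do not share one clock phase")
    if any(stage[n] != 1 for n in inputs):
        violations.append("an input is not in the first clock stage")
    if len({stage[n] for n in outputs}) > 1:
        violations.append("outputs are not all in the same clock stage")
    return violations


# Toy two-phase example (not the circuit of Fig. 3.4).
edges = [("a", "g1"), ("b", "g1"), ("g1", "g2"), ("a", "g2"), ("g2", "o")]
stage = {"a": 1, "b": 1, "g1": 1, "g2": 2, "o": 3}
phase = {"a": 1, "b": 1, "g1": 2, "g2": 1, "o": 1}
```

Calling `check_assignment(edges, stage, phase, ["a", "b"], ["o"])` on this toy assignment reports no violations; moving g1 to phase 1 would break the phase-ordering rule on both of its fanin edges.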
Notice that, for the example in Fig. 3.3, these constraints can be satisfied with no additional DFFs, compared to the three additional DFFs needed in Fig. 3.2. The resulting circuit has two clock stages and supports two associated threads of operation. Two shifted clock phases are necessary to propagate the data through each stage. A simulation waveform of this example is provided in Sec. 3.7.1.

3.4.2 Proposed ILP Optimization

We formulate an Integer Linear Programming (ILP) problem to minimize the number of added DFFs subject to the above constraints, as follows.

$$\text{Minimize} \sum_{E_{ij}=1} C_{ij} \tag{3.2}$$

subject to the following constraints. For each pair $(i,j) \in E$:

$$C_{ij} = \begin{cases} S_j - S_i, & \text{if } CK_j > CK_i \\ S_j - S_i - 1, & \text{if } CK_j \le CK_i \\ 0, & \text{otherwise} \end{cases} \tag{3.3}$$

$$S_j - S_i \ge 0 \tag{3.4}$$

$$(CK_j - CK_i > 0) \lor (S_j - S_i > 0) \tag{3.5}$$

For each gate $i$:

$$1 \le S_i \le S_{max} \tag{3.6}$$

$$1 \le CK_i \le N \tag{3.7}$$

To make the ILP problem compatible with commonly used ILP solvers, we introduce two binary variables, $\alpha_{ij}$ and $\beta_{ij}$.

$$\alpha_{ij} = \begin{cases} 1, & \text{if } CK_j - CK_i > 0 \\ 0, & \text{otherwise} \end{cases} \tag{3.8}$$

$$\beta_{ij} = \begin{cases} 1, & \text{if } S_j - S_i > 0 \\ 0, & \text{otherwise} \end{cases} \tag{3.9}$$

Finally, (3.5) can be converted to

$$\alpha_{ij} + \beta_{ij} \ge 1 \tag{3.10}$$

and each term $C_{ij}$ of the objective function (3.2) can be expressed as

$$C_{ij} = S_j - S_i - (1 - \alpha_{ij}) \tag{3.11}$$

ILP Solution of Example Circuit

For the example circuit in Fig. 3.4, we formulate the ILP constraints as above and solve the problem. One of the optimal solutions, with a minimum number of DFFs of 0, is given below. Note that in this DAG the maximum depth of all nodes is 5; with 2 given clock phases, we compute $S_{max} = \lceil 5/2 \rceil = 3$. For primary inputs/outputs and registers,

$$CK_A = CK_B = CK_{G5in} = CK_C = CK_D = CK_{G5out} = 1 \tag{3.12}$$

$$S_A = S_B = S_{G5in} = 1 \tag{3.13}$$

$$S_C = S_D = S_{G5out} = 3 \tag{3.14}$$

For the internal “combinational” gates,

$$CK_{G1} = CK_{G6} = CK_{G4} = 2, \quad CK_{G2} = 1 \tag{3.15}$$

$$S_{G1} = S_{G6} = 1, \quad S_{G2} = S_{G4} = 2 \tag{3.16}$$

3.4.3 Proof of Correctness

In this section, we prove that the constraints proposed in Section 3.4.2 guarantee the correctness of a well-formed multi-phase clocking circuit.
Our proof is by induction on the level of the nodes in our DAG, where the level $L_j$ of node $j$ is equal to $\max_{i \in FI(j)} L_i + 1$, with the base case $L_k = 0$ for all $k \in$ PIs. For the proof, we also use the notion of the driving node $D_j$ of a node $j$ in a multi-phase clocking circuit (MPCC). When node $j$ is an inserted DFF, $D_j$ is the non-DFF node that drives the chain of DFFs that includes node $j$; otherwise, $D_j = j$. We formalize our notion of correctness as follows.

Theorem of equivalence: A well-formed MPCC circuit M is thread flow equivalent to CMOS circuit C if, given that $V_i^0 = V_i$, $i \in$ PIs, it satisfies $V_j^{S} = V_j$, $j \in$ POs, where $V_i^S$ is the value of the output net of node $i$ at stage $S$ in M, and $V_i$ is the value of the corresponding net in C.

Proof by Induction:

Induction hypothesis: For all nodes $j$ in the MPCC with level $l < L$, $V_j^{S_j} = V_{D_j}$.

Basis: The induction hypothesis immediately holds for all PIs.

Induction step: Consider a node $j$ in the MPCC with level $L$. If $CK_j > CK_i$, the fanin node $i$ has sampled value $V_i^{S_j} = V_i^{S_i}$ since $S_i = S_j$, and $V_i^{S_i} = V_{D_i}$ by the induction hypothesis since $L_i < L$. If $CK_j \le CK_i$, on the other hand, the fanin node $i$ has sampled value $V_i^{S_j - 1} = V_i^{S_i}$ since $S_i = S_j - 1$, and $V_i^{S_i} = V_{D_i}$ by the induction hypothesis since $L_i < L$. Consequently, since all inputs of node $j$ have the sampled value $V_{D_i}$, the output of node $j$ should equal $V_{D_j}$, i.e., $V_j^{S_j} = V_{D_j}$. ■

The proof guarantees that for all POs, the output values in the MPCC and CMOS circuits match when the inputs match. Since the POs and the original registers have the same stage, the sequential behavior of the two circuits remains synchronized over time. While this proof focuses on a single thread of computation, it naturally extends to the multiple threads supported by an MPCC.

3.5 Design Flow

As shown in Fig. 3.5, our overall design flow [13] starts with a non-path-balanced netlist generated by logic synthesis from RTL Verilog.
Formal verification is performed to check that the initial netlist and the synthesized netlist are functionally equivalent. With this verified netlist we formulate a DAG and generate the constraints for the ILP problem. The ILP problem is solved, providing the clock phase and clock stage assignments for all clocked gates. The ILP solution is then parsed and we iterate through all edges in the DAG, inserting the required DFFs in the circuit. Here, we apply fanout optimization to merge DFFs on edges with the same source node. The gates in the original circuit are marked with their assigned clock phase. For a given edge, the inserted DFFs can be marked with the same clock phase as the fanout or fanin gate. In our implementation, when no fanout optimization is applied, the DFFs are given the same phase as the fanout. When fanout optimization is applied, because the fanouts may have different clock phases, we assign the added DFFs to the same phase as the fanin gate.

After this step, to validate the functional correctness of the circuit, we formulate a DAG including the newly inserted DFFs. Starting from all inputs with a clock stage of 1, and given the clock phase of all nodes, we set the clock stage for each node as follows. If the clock phase strictly increases along an edge, the clock stage of the target node is set to be the same as that of the source node. Otherwise, the clock stage of the target node is that of the source node increased by 1. If the calculated clock stage of a node differs across its fanins, an error is reported. This guarantees that valid data is computed and latched in every node. Lastly, we verify that all registers and outputs are set with the same clock phase and clock stage.

The next steps are placement, clock tree synthesis, and clock routing. With the initial placement solution generated, a distinct clock tree is synthesized for each clock phase. A global H-tree and local trees are built to distribute the clock signal.
First, we divide the floorplan into $P \times P$ grid cells, where $P$ is a power of 2, and insert one tap cell (i.e., a splitter) at the center of each grid cell to act as a local clock tree root. In particular, $P$ is chosen to be the largest power of 2 such that the number of clock sinks is at least four times the number of tap cells. Then, a symmetric H-tree is built to drive the tap cells. Next, each clock sink is assigned to its closest tap cell and a local tree is built under each tap cell with level balancing. A Method of Means and Medians (MMM) algorithm [69] is applied during the clustering of subtrees to reduce the total wire length. Note that when $P$ is 1, only one local tree is built without the need for an H-tree. Finally, routing is done for all clock nets. To verify the function of the circuit after the required DFFs are inserted, we generate the gate-level Verilog after that step and run functional simulation. The details of this part are described in Sec. 3.7.

Figure 3.5: The design flow of multi-phase clocks.

3.6 Experimental Results
This section evaluates the benefits of the proposed multi-phase clocking method. We compare our approach with the FPB method on a set of eight benchmark circuits and present a case study to compare with DCM. All experiments are run with the Synopsys EDA flow for Superconducting Electronics [70] as described in Section 3.5, using the SFQ5ee process with only two layers of metal for routing [71]. The ILP problems are solved by the IBM CPLEX v12.10 package [52] with a time-out limit of 10 minutes for the benchmarks in Fig. 3.6 and 50 minutes for the larger benchmarks in Fig. 3.7.

3.6.1 Comparison to Full Path Balancing
Tables 3.1, 3.2 and 3.3 list, respectively, the number of required path-balancing DFFs, the number of clock splitters and the total gate area after CTS and clock tree routing on eight sequential benchmarks, including AMD2901 and seven OpenCores [72] designs, varying the number of given clock phases up to ten.
FPB is equivalent to having the number of clock phases equal to one. Our results show that, compared to the FPB method and averaged over the eight benchmarks, the number of inserted DFFs is reduced by 55.0% with only two clock phases and by up to 95.5% with ten clock phases. With FPB, i.e., where the number of clock phases is one, the average number of required DFFs is approximately six times the number of original clocked gates. The average area improvement starts at 40.6% with two phases and reaches 69.6% with ten phases. As might be expected, although the number of required DFFs decreases with more clock phases, the area improvement diminishes because of the overhead of the multiple clock trees and the probability distribution of imbalance factors.

Table 3.1: Number of required DFFs with given number of clock phases

| Clock Phases | AMD2901 | ss_pcm | simple_spi | des_area | ethernet | pci_bridge32 | spi | mem_ctrl | Avg. | Impr. |
|---|---|---|---|---|---|---|---|---|---|---|
| Clocked gates | 1042 | 524 | 946 | 3947 | 275 | 22225 | 3004 | 8362 | 5041 | N/A |
| 1 | 3546 | 995 | 2995 | 4571 | 564 | 168810 | 13140 | 51321 | 30743 | N/A |
| 2 | 1539 | 305 | 1146 | 1510 | 210 | 76751 | 5851 | 23255 | 13821 | 55.0% |
| 3 | 982 | 105 | 599 | 756 | 90 | 47798 | 3477 | 13821 | 8454 | 72.5% |
| 4 | 600 | 24 | 288 | 399 | 61 | 33167 | 2304 | 9234 | 5760 | 81.3% |
| 5 | 533 | 21 | 132 | 262 | 32 | 22419 | 1577 | 6060 | 3880 | 87.4% |
| 6 | 334 | 9 | 103 | 194 | 26 | 18504 | 1149 | 4574 | 3112 | 89.9% |
| 7 | 229 | 8 | 60 | 116 | 25 | 15272 | 897 | 3143 | 2469 | 92.0% |
| 8 | 200 | 7 | 42 | 88 | 21 | 11755 | 813 | 3047 | 1997 | 93.5% |
| 9 | 159 | 7 | 36 | 64 | 0 | 9488 | 507 | 1715 | 1497 | 95.1% |
| 10 | 103 | 6 | 22 | 64 | 0 | 8714 | 426 | 1670 | 1376 | 95.5% |

The first data row gives the number of original clocked gates; the remaining rows give the number of required extra DFFs for each number of clock phases.

Table 3.2: Number of clock splitters with given number of clock phases

| Clock Phases | AMD2901 | ss_pcm | simple_spi | des_area | ethernet | pci_bridge32 | spi | mem_ctrl | Avg. | Pct. |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 5231 | 1707 | 4577 | 9276 | 850 | 194705 | 17061 | 63167 | 37072 | N/A |
| 2 | 3882 | 1096 | 3186 | 7109 | 509 | 104933 | 10576 | 36843 | 21017 | 43.3% |
| 3 | 3667 | 966 | 2578 | 6817 | 397 | 76526 | 8820 | 29910 | 16210 | 56.3% |
| 4 | 2375 | 922 | 1737 | 7076 | 377 | 62105 | 7919 | 24653 | 13396 | 63.9% |
| 5 | 2306 | 597 | 1579 | 7306 | 359 | 52141 | 7468 | 22514 | 11784 | 68.2% |
| 6 | 2242 | 706 | 1626 | 5212 | 367 | 48712 | 6823 | 21675 | 10920 | 70.5% |
| 7 | 2183 | 736 | 1747 | 5324 | 348 | 46901 | 7038 | 20516 | 10599 | 71.4% |
| 8 | 2120 | 749 | 1680 | 5578 | 355 | 43482 | 6966 | 15719 | 9581 | 74.2% |
| 9 | 2072 | 767 | 1749 | 5486 | 336 | 44706 | 6683 | 14563 | 9545 | 74.3% |
| 10 | 2035 | 766 | 1857 | 5813 | 329 | 42353 | 6625 | 15019 | 9350 | 74.8% |

Table 3.3: Total area (mm²) with given number of clock phases

| Clock Phases | AMD2901 | ss_pcm | simple_spi | des_area | ethernet | pci_bridge32 | spi | mem_ctrl | Avg. | Impr. |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 23.1 | 7.8 | 19.7 | 49.5 | 4.1 | 837.2 | 79.5 | 270.9 | 161.5 | N/A |
| 2 | 16.3 | 5.3 | 12.8 | 38.8 | 2.8 | 474.3 | 53.9 | 162.7 | 95.9 | 40.6% |
| 3 | 15.0 | 4.6 | 10.5 | 36.6 | 2.4 | 361.5 | 45.1 | 130.2 | 75.7 | 53.1% |
| 4 | 13.4 | 4.4 | 9.5 | 36.5 | 2.3 | 303.3 | 39.3 | 113.0 | 65.2 | 59.6% |
| 5 | 12.6 | 4.5 | 8.7 | 36.7 | 2.2 | 258.9 | 37.5 | 99.1 | 57.5 | 64.4% |
| 6 | 12.9 | 4.4 | 8.5 | 37.2 | 2.2 | 244.3 | 36.4 | 91.5 | 54.7 | 66.1% |
| 7 | 11.4 | 4.4 | 9.0 | 37.3 | 2.2 | 237.2 | 34.8 | 87.0 | 52.9 | 67.2% |
| 8 | 10.6 | 4.6 | 8.6 | 37.6 | 2.2 | 218.7 | 35.3 | 87.7 | 50.7 | 68.6% |
| 9 | 10.6 | 4.4 | 9.5 | 37.8 | 2.1 | 219.4 | 32.9 | 82.3 | 49.9 | 69.1% |
| 10 | 10.4 | 4.6 | 8.9 | 37.8 | 2.1 | 212.1 | 33.0 | 83.3 | 49.0 | 69.6% |

Figs. 3.6 and 3.7 show the total clock tree wire length (TWL) after CTS and clock tree routing as a function of the given number of clock phases. Since an independent clock tree is synthesized for each clock phase, there is some overhead associated with distributing the clock signals across the entire design when more clock phases are used. As shown in Fig. 3.7, for larger designs, when the number of reduced DFFs is comparable to the number of clocked gates in the original circuit (as reported in Table 3.1), the TWL can be further improved with additional clock phases. The TWL improvement has two main causes. First, the total area is reduced with the increased number of clock phases, as shown in Table 3.3, which reduces the wire lengths needed to route the clock tree. Second, the numbers of clock sinks and clock splitters are significantly reduced with additional clock phases, which reduces the number of elements to be connected in clock tree routing.

Figure 3.6: Total Wire Length vs. Number of Clock Phases for Small to Medium Sized Benchmarks

Figure 3.7: Total Wire Length vs. Number of Clock Phases for Medium to Large Sized Benchmarks

Fig. 3.8 shows the post clock tree routed layout for AMD2901 with the five clock phases highlighted.

3.6.2 Case Study
Due to the limited availability and support of the associated tools and libraries, we cannot make a direct comparison to the gate counts reported for DCM [2] with the same set of benchmark circuits. Instead, we perform a case study with a 16-bit Kogge-Stone adder (KSA) that is designed with four 4-bit KSAs and four stages. We adopt the same technology mapping of the 4-bit KSA described in [2]. Our ILP solution shows that the number of required DFFs is 169 with two clock phases. Assuming the same clock tree structure is used, the total number of extra gates including clock splitters is 337, compared to 310 reported in [2].
However, partially due to the need for the additional mask and repeat bands in DCM, its maximum throughput is degraded by 5x compared with FPB. In contrast, with our proposed multi-phase clocking method with $M = 2$, the throughput is degraded by 2x. Thus, our approach achieves 2.5x the throughput for a similar amount of gate area. Moreover, the frequency of our clocks is 2.5x lower than that of the fast clock in DCM. With three clock phases, the total number of extra gates is 163, with a throughput that is 67% higher than DCM. This shows that the throughput reduction that DCM requires is less efficient than using multi-phase clocks, incurring higher area and power.

Figure 3.8: AMD2901: Layout with five clock trees

3.7 Simulation Verification
3.7.1 Simulation of Example Circuit
As an illustrative example, Fig. 3.9 shows a simulation waveform of the circuit in Fig. 3.3 with WRSpice [73] using an RSFQ cell library [74]. As shown in the waveform, input signals A and B are set to 1, which would force the output D to toggle between 1 and 0 in CMOS logic. However, in SFQ logic with two-phase clocks, the outputs from the two threads are interleaved and generate a "1100" pattern on D, as shown in Fig. 3.9. Note that the last row of the waveform shows the output from G4 clocked by $clk_2$. However, since D is a primary output, it will be read at $clk_1$.

Figure 3.9: Example simulation waveform of the circuit shown in Fig. 3.3

3.7.2 Functional Simulation
To verify the functional correctness of the post-DFF-insertion netlists, we set up a simulation flow as follows. An RSFQ circuit under test may have $M$ clock phases, with $M \in [1, 10]$ in the experiments. We generate $N$ sets of random input vectors, where $N = \lceil \text{max logic depth} / M \rceil$. Taking one vector from each set as input per clock cycle, we simulate the post-synthesis multi-clocked netlist. A "golden" RTL Verilog/VHDL model is co-simulated with each set of input vectors individually, and the "golden" results are then generated by interleaving the $N$ sets of outputs.
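The thread count and the interleaving of golden outputs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each per-thread golden run is assumed to be available as a list of output vectors of equal length, and the names `num_threads` and `interleave_golden` are hypothetical, not part of the thesis flow.

```python
def num_threads(max_logic_depth: int, num_phases: int) -> int:
    """N = ceil(max_logic_depth / M), computed with integer arithmetic."""
    return -(-max_logic_depth // num_phases)


def interleave_golden(per_thread_outputs):
    """Merge N per-thread golden output streams round-robin.

    Output k of thread t lands at position k * N + t in the merged stream.
    """
    lengths = {len(stream) for stream in per_thread_outputs}
    if len(lengths) > 1:
        raise ValueError("all threads must produce the same number of outputs")
    return [vec for cycle in zip(*per_thread_outputs) for vec in cycle]


# Two threads that each toggle 1, 0 (as output D does in the Fig. 3.9 example)
# interleave into the 1, 1, 0, 0 pattern.
golden = interleave_golden([[1, 0], [1, 0]])
```

The merged stream is what the multi-phase netlist output would be compared against, one entry per clock cycle.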
A circuit is considered a "pass" when there are no mismatches between the test results and the "golden" results. To simplify the Verilog model for testing the functionality, we use unit delay as the gate delay, and set the setup/hold time to a negligible value. Fig. 3.10 illustrates the aforementioned simulation flow. To comprehensively test all the benchmarks, we use 1000 input vectors in each set, and compare the test results with the "golden" results as described. All eight benchmarks in Sec. 3.6 pass the simulation in our experiments.

Figure 3.10: Simulation flow to verify the functional correctness of the post-DFF-inserted RSFQ circuits

3.7.3 Simulation results - AMD2901
This section includes the simulation results of one of the benchmark circuits, AMD2901, which has 14 output bits. Figs. 3.11 - 3.14 are the "golden" outputs with four different sets of random input vectors, while Fig. 3.15 is the test result of the corresponding RSFQ circuit with 10-phase clocks and four threads/sets. As marked in the figures, the test result is exactly the interleaved version of the four golden results.

Figure 3.11: Golden result with input vectors in set 1
Figure 3.12: Golden result with input vectors in set 2
Figure 3.13: Golden result with input vectors in set 3
Figure 3.14: Golden result with input vectors in set 4
Figure 3.15: Test result with 10-phase clocks and 4 threads

3.8 Improved DFF sharing
In Section 3.4.2, the splitters in the data paths are inserted before the ILP is formulated. As a result, the retiming of inserted DFF buffers is limited by the existing splitters. If the splitter insertion is instead completed after DFF buffer insertion, and considering that there exist nodes with multiple fanout nodes as shown in Fig. 3.16, we can modify the ILP formulation as follows.

Figure 3.16: DFF sharing of a node with four fanout nodes

3.8.1 Modified ILP Formulation
We introduce an integer variable $C_i$ for each gate $i$.
For all pairs $(i,j) \in E$,
$$C_i \ge C_{ij} \tag{3.17}$$
The cost function is to minimize (3.18), where there are $n$ clocked gates in the circuit.
$$\sum_{i=1}^{n} C_i \tag{3.18}$$
The overall design flow is changed as follows. Starting with a netlist without splitters in the data paths and without DFF path-balancing buffers, we formulate the new ILP problem and solve it with the ILP solver. After parsing the ILP solution, we insert DFF buffers as a linear pipeline. To latch the correct data in each clock stage, the clock phases of the inserted DFF buffers are set to the same phase as the fanin gate. The inputs of the fanout gates are connected to the corresponding DFF buffers. As the last step, we insert splitters as needed on the output nets whose number of fanouts exceeds the fanout limit.

3.8.2 Experimental results
This section evaluates the reduction in the number of buffers achieved by DFF sharing. We compare the approach with the original ILP with fanout optimization as described in Section 3.4.2 and Section 3.5. The ILP problems are solved by the IBM CPLEX v12.10 package [52] with a time-out limit of 50 minutes.

Table 3.4 shows the improvement in the number of required DFFs when sharing DFFs as a linear pipeline over fanout gates. Among the six benchmarks, AMD2901, simple_spi and des_area are tested with the Synopsys EDA flow for Superconducting Electronics [70] using the SFQ5ee process. The other three benchmarks, ID8s, c5315 and c6288, are tested with the qPALACE tool suite [75] and technology mapping [76] without full path balancing. Our results show that, compared with the flow in Section 3.4.2 and Section 3.5, the number of inserted DFFs is reduced by an average of 26.3% across the six benchmarks, with consistent improvement for one to ten clock phases. Two factors contribute to this reduction. First, sharing of DFFs is improved across fanout gates with a linear pipeline. Second, the modified ILP objective function is a better model of the number of DFFs actually inserted.
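Constraint (3.17) and objective (3.18) can be illustrated with a small Python sketch. Because $C_i \ge C_{ij}$ for every fanout $j$ and the sum of the $C_i$ is minimized, each $C_i$ settles at the largest per-edge requirement among the fanouts of gate $i$, which is the length of the shared linear pipeline. The sketch assumes the per-edge counts $C_{ij}$ are already known; the dictionary `edge_dffs`, the gate names and both function names are hypothetical. The unshared sum is shown only for contrast and is not claimed to be the objective of the base ILP.

```python
def shared_pipeline_cost(edge_dffs):
    """Return {fanin: C_i} with C_i = max over fanouts j of C_ij (constraint 3.17)."""
    cost = {}
    for (fanin, _fanout), c_ij in edge_dffs.items():
        cost[fanin] = max(cost.get(fanin, 0), c_ij)
    return cost


def unshared_cost(edge_dffs):
    """Return {fanin: sum of C_ij}, i.e. one private DFF chain per edge (contrast only)."""
    cost = {}
    for (fanin, _fanout), c_ij in edge_dffs.items():
        cost[fanin] = cost.get(fanin, 0) + c_ij
    return cost


# Hypothetical gate g1 with four fanouts, in the spirit of Fig. 3.16.
edge_dffs = {
    ("g1", "g2"): 1,
    ("g1", "g3"): 3,
    ("g1", "g4"): 2,
    ("g1", "g5"): 3,
    ("g2", "g5"): 0,
}
shared = shared_pipeline_cost(edge_dffs)
total_shared = sum(shared.values())                      # value of objective (3.18)
total_unshared = sum(unshared_cost(edge_dffs).values())
```

In this toy case the shared pipeline needs three DFFs for `g1`, while private chains would need nine.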
Tables 3.5a and 3.5b show the reduction in the number of DFFs due to DFF sharing and to the modified ILP objective, respectively. The five benchmarks are tested with the qPALACE tool suite [75] with technology mapping [76], excluding full path balancing. On average, DFF sharing contributes a 19.3% improvement, while the modified ILP objective leads to a 6.0% reduction in the number of DFFs. For c3540 and c7552, the improvement tends to shrink as the number of clock phases increases up to ten. This is because the remaining inserted DFFs are necessary for the given clock phases and no more DFF sharing can be achieved.

Table 3.4: Improved numbers of required DFFs when sharing DFFs as a linear pipeline with modified ILP

Number of extra required DFFs, modified ILP with linear pipeline:

| Clock Phases | AMD2901 | simple_spi | des_area | ID8s | c5315 | c6288 |
|---|---|---|---|---|---|---|
| 1 | 3065 | 2510 | 3525 | 2457 | 3273 | 3921 |
| 2 | 1340 | 1004 | 1047 | 1118 | 1253 | 1568 |
| 3 | 797 | 524 | 539 | 694 | 660 | 908 |
| 4 | 498 | 244 | 261 | 488 | 410 | 625 |
| 5 | 383 | 97 | 175 | 384 | 205 | 441 |
| 6 | 284 | 87 | 128 | 302 | 182 | 339 |
| 7 | 185 | 50 | 71 | 248 | 94 | 257 |
| 8 | 170 | 35 | 56 | 206 | 84 | 210 |
| 9 | 131 | 33 | 32 | 180 | 38 | 173 |
| 10 | 82 | 17 | 32 | 158 | 32 | 139 |

Number of extra required DFFs, heuristic:

| Clock Phases | AMD2901 | simple_spi | des_area | ID8s | c5315 | c6288 |
|---|---|---|---|---|---|---|
| 1 | 3546 | 2802 | 4623 | 3864 | 4944 | 6021 |
| 2 | 1560 | 1119 | 1446 | 1752 | 1906 | 2393 |
| 3 | 964 | 583 | 780 | 1087 | 1010 | 1344 |
| 4 | 605 | 278 | 380 | 766 | 617 | 894 |
| 5 | 470 | 117 | 248 | 594 | 319 | 616 |
| 6 | 338 | 103 | 194 | 472 | 256 | 464 |
| 7 | 229 | 54 | 116 | 391 | 136 | 353 |
| 8 | 200 | 39 | 88 | 322 | 115 | 272 |
| 9 | 159 | 36 | 64 | 277 | 51 | 221 |
| 10 | 103 | 20 | 64 | 243 | 41 | 163 |

Improvement:

| Clock Phases | AMD2901 | simple_spi | des_area | ID8s | c5315 | c6288 | Avg. |
|---|---|---|---|---|---|---|---|
| 1 | 13.6% | 10.4% | 23.8% | 36.4% | 33.8% | 34.9% | 25.5% |
| 2 | 14.1% | 10.3% | 27.6% | 36.2% | 34.3% | 34.5% | 26.1% |
| 3 | 17.3% | 10.1% | 30.9% | 36.2% | 34.7% | 32.4% | 26.9% |
| 4 | 17.7% | 12.2% | 31.3% | 36.3% | 33.5% | 30.1% | 26.9% |
| 5 | 18.5% | 17.1% | 29.4% | 35.4% | 35.7% | 28.4% | 27.4% |
| 6 | 16.0% | 15.5% | 34.0% | 36.0% | 28.9% | 26.9% | 26.2% |
| 7 | 19.2% | 7.4% | 38.8% | 36.6% | 30.9% | 27.2% | 26.7% |
| 8 | 15.0% | 10.3% | 36.4% | 36.0% | 27.0% | 22.8% | 24.6% |
| 9 | 17.6% | 8.3% | 50.0% | 35.0% | 25.5% | 21.7% | 26.4% |
| 10 | 20.4% | 15.0% | 50.0% | 35.0% | 22.0% | 14.7% | 26.2% |
| Avg. | 16.9% | 11.7% | 35.2% | 35.9% | 30.6% | 27.4% | 26.3% |

Table 3.5: Reduced numbers of required DFFs with improved DFF sharing and modified ILP, separately

(a) Numbers of required DFFs with improved DFF sharing

Number of extra required DFFs, base ILP with improved DFF sharing:

| Clock Phases | ID8s | c5315 | c3540 | c7552 | c6288 |
|---|---|---|---|---|---|
| 1 | 2671 | 3862 | 1312 | 3137 | 4197 |
| 2 | 1221 | 1481 | 414 | 1034 | 1690 |
| 3 | 754 | 725 | 179 | 550 | 948 |
| 4 | 529 | 458 | 107 | 253 | 667 |
| 5 | 413 | 259 | 56 | 169 | 460 |
| 6 | 327 | 196 | 31 | 100 | 357 |
| 7 | 264 | 102 | 23 | 98 | 276 |
| 8 | 221 | 88 | 14 | 43 | 218 |
| 9 | 194 | 38 | 10 | 41 | 185 |
| 10 | 168 | 32 | 5 | 41 | 149 |

Number of extra required DFFs, base ILP without improved DFF sharing:

| Clock Phases | ID8s | c5315 | c3540 | c7552 | c6288 |
|---|---|---|---|---|---|
| 1 | 3864 | 4944 | 1714 | 3578 | 6021 |
| 2 | 1752 | 1906 | 504 | 1139 | 2393 |
| 3 | 1087 | 1010 | 209 | 646 | 1344 |
| 4 | 766 | 617 | 120 | 301 | 894 |
| 5 | 594 | 319 | 61 | 210 | 616 |
| 6 | 472 | 256 | 32 | 114 | 464 |
| 7 | 391 | 136 | 24 | 105 | 353 |
| 8 | 322 | 115 | 14 | 46 | 272 |
| 9 | 277 | 51 | 10 | 44 | 221 |
| 10 | 243 | 41 | 5 | 44 | 163 |

Improvement:

| Clock Phases | ID8s | c5315 | c3540 | c7552 | c6288 | Avg. |
|---|---|---|---|---|---|---|
| 1 | 30.9% | 21.9% | 23.5% | 12.3% | 30.3% | 23.8% |
| 2 | 30.3% | 22.3% | 17.9% | 9.2% | 29.4% | 21.8% |
| 3 | 30.6% | 28.2% | 14.4% | 14.9% | 29.5% | 23.5% |
| 4 | 30.9% | 25.8% | 10.8% | 15.9% | 25.4% | 21.8% |
| 5 | 30.5% | 18.8% | 8.2% | 19.5% | 25.3% | 20.5% |
| 6 | 30.7% | 23.4% | 3.1% | 12.3% | 23.1% | 18.5% |
| 7 | 32.5% | 25.0% | 4.2% | 6.7% | 21.8% | 18.0% |
| 8 | 31.4% | 23.5% | 0.0% | 6.5% | 19.9% | 16.2% |
| 9 | 30.0% | 25.5% | 0.0% | 6.8% | 16.3% | 15.7% |
| 10 | 30.9% | 22.0% | 0.0% | 6.8% | 8.6% | 13.6% |
| Avg. | 30.9% | 23.6% | 8.2% | 11.1% | 22.9% | 19.3% |

(b) Numbers of required DFFs with modified ILP objective

Number of extra required DFFs, modified ILP with improved DFF sharing:

| Clock Phases | ID8s | c5315 | c3540 | c7552 | c6288 |
|---|---|---|---|---|---|
| 1 | 2457 | 3273 | 1225 | 2687 | 3921 |
| 2 | 1118 | 1253 | 377 | 897 | 1568 |
| 3 | 694 | 660 | 167 | 520 | 907 |
| 4 | 488 | 410 | 101 | 240 | 618 |
| 5 | 384 | 205 | 54 | 164 | 440 |
| 6 | 302 | 182 | 30 | 98 | 339 |
| 7 | 248 | 94 | 23 | 96 | 257 |
| 8 | 206 | 84 | 14 | 43 | 210 |
| 9 | 180 | 38 | 10 | 41 | 173 |
| 10 | 158 | 32 | 5 | 41 | 139 |

Number of extra required DFFs, base ILP with improved DFF sharing:

| Clock Phases | ID8s | c5315 | c3540 | c7552 | c6288 |
|---|---|---|---|---|---|
| 1 | 2671 | 3862 | 1312 | 3137 | 4197 |
| 2 | 1221 | 1481 | 414 | 1034 | 1690 |
| 3 | 754 | 725 | 179 | 550 | 948 |
| 4 | 529 | 458 | 107 | 253 | 667 |
| 5 | 413 | 259 | 56 | 169 | 460 |
| 6 | 327 | 196 | 31 | 100 | 357 |
| 7 | 264 | 102 | 23 | 98 | 276 |
| 8 | 221 | 88 | 14 | 43 | 218 |
| 9 | 194 | 38 | 10 | 41 | 185 |
| 10 | 168 | 32 | 5 | 41 | 149 |

Improvement:

| Clock Phases | ID8s | c5315 | c3540 | c7552 | c6288 | Avg. |
|---|---|---|---|---|---|---|
| 1 | 8.0% | 15.3% | 6.6% | 14.3% | 6.6% | 10.2% |
| 2 | 8.4% | 15.4% | 8.9% | 13.2% | 7.2% | 10.6% |
| 3 | 8.0% | 9.0% | 6.7% | 5.5% | 4.3% | 6.7% |
| 4 | 7.8% | 10.5% | 5.6% | 5.1% | 7.3% | 7.3% |
| 5 | 7.0% | 20.8% | 3.6% | 3.0% | 4.3% | 7.7% |
| 6 | 7.6% | 7.1% | 3.2% | 2.0% | 5.0% | 5.0% |
| 7 | 6.1% | 7.8% | 0.0% | 2.0% | 6.9% | 4.6% |
| 8 | 6.8% | 4.5% | 0.0% | 0.0% | 3.7% | 3.0% |
| 9 | 7.2% | 0.0% | 0.0% | 0.0% | 6.5% | 2.7% |
| 10 | 6.0% | 0.0% | 0.0% | 0.0% | 6.7% | 2.5% |
| Avg. | 7.3% | 9.0% | 3.5% | 4.5% | 5.9% | 6.0% |

Chapter 4
Hold Time Safe Guaranteed RSFQ Circuits with Multi-Phase Clocks

The multi-phase clocking methodology has the potential to ease timing closure, i.e., achieving no setup time or hold time violations. As mentioned in Chapter 2, minimizing hold time violations is critical, as a single hold violation may render a chip unusable. This chapter proposes the formulation of 100% hold time safe multi-phase clocking SFQ circuits, with minimized overhead of path-balancing DFFs.

4.1 Hold Time Safe Guaranteed Circuits
4.1.1 Timing Analysis of Two Phase Clocks
In this section, we detail the timing analysis of the two-phase clocks in Sec. 1.2.4. This serves as a motivating example for the hold time safe flow with multi-phase clocks that follows. As shown in the timing diagram in Fig. 4.1, since all sequential gates are clocked by $clk_1$ or $clk_2$ alternately, timing constraints are always checked from $clk_1$ to $clk_2$, or from $clk_2$ to $clk_1$.

(a) Setup time check
(b) Hold time check
(c) Setup time and hold time check from $clk_2$ to $clk_1$
Figure 4.1: Timing Checks in Two Phase Clocking

For setup time constraints on both $clk_1$ to $clk_2$ and $clk_2$ to $clk_1$, as shown in Fig.
4.1a, we have
$$T > 2\,(T_{CQ} + T_{setup}) \tag{4.1}$$
For hold time constraints on both $clk_1$ to $clk_2$ and $clk_2$ to $clk_1$, as shown in Fig. 4.1b, we have
$$T_{CQ} + \delta > T_{hold} \tag{4.2}$$
$$T_{CQ} + (T - \delta) > T_{hold} \tag{4.3}$$
For both the setup and hold time constraints, taking $clk_2$ to $clk_1$ as an example, as shown in Fig. 4.1c, we have
$$T > T_{setup} + T_{hold} \tag{4.4}$$
From Equations (4.1), (4.2) and (4.3), we notice that a working clock period with phase shift $\delta$ can always be found in the two-phase clocking architecture. As claimed in [7], for any possible value of the clock skew, there exists a minimum clock period above which the circuit works correctly for any clock period. This means a 100% timing yield, i.e., zero timing violations, which is the key advantage of this clocking architecture.

4.1.2 Guarantee Hold Time Safe in Multi-Phase Clocks
The two-phase clocking architecture achieves 100% timing yield, as discussed in Sec. 4.1.1. Similarly, a multi-phase clocking architecture can also achieve 100% timing yield if it is guaranteed that no sequentially adjacent gates have the same clock phase. In this section, we ensure that the same clock phase is not assigned to sequentially adjacent gates.

4.2 Modified ILP
In particular, we modify the ILP formulation in Sec. 3.4.2 to minimize the overhead of generating hold safe circuits as described in Sec. 4.1.2.

4.2.1 Problem formulation
In this section, we formulate the problem of finding the minimum required number of DFFs under the guaranteed hold safe condition.

Steiner tree problem
Given a weighted undirected graph and a subset of nodes, usually referred to as terminals, the Steiner tree problem [77] seeks a minimum-weight subtree that contains all terminals (but may include additional nodes). As one of Karp's original 21 NP-complete problems [78], the decision variant of the Steiner tree problem in graphs is NP-complete and the optimization variant is NP-hard. A variation of the traditional edge-weighted Steiner tree problem is the node-weighted problem [79].
There are previous studies on practical problems such as routing [80] and power recovery [81] with the node-weighted Steiner tree. When the graph is directed, the problem is called the Node-Weighted Directed Steiner Tree problem.

Problem formalization
We present how to formalize the problem of finding the minimum number of DFFs as a Node-Weighted Directed Steiner Tree (NWDST) problem.
Fig. 4.2 shows an example directed graph with the fanin gate as node "src" and three fanout gates as "des". First, we define a directed graph $G$ as a pair of nodes $V$ and edges $E$. Given the fanin gate $v_s = (CK_s, S_s)$ and a set of fanout gates $V_d = \{(CK_{d_1}, S_{d_1}), \dots, (CK_{d_n}, S_{d_n})\}$, we create nodes $v = (CK, S)$ with $CK \in [1, m]$ and $S \in [S_s, \max(S_{d_1}, \dots, S_{d_n})]$. For each node $(CK, S)$, we create directed edges from $(CK, S)$ to $(CK+1, S), \dots, (m, S)$ and to $(1, S+1), \dots, (CK-1, S+1)$ whenever the destination node exists in $G$. For each node $v = (CK, S) \in V_d$, we create a special replica node $v_{rep} = (CK, S) \in V_{rep}$. The replica node only duplicates the in-edges to $v$, i.e., it has an outdegree of 0 (the outdegree of node $v$ is the number of edges going out from node $v$). Each node in $G$ has unit weight. The original problem can be formalized as an NWDST problem as in Problem 1.

(a) DAG with one source node and three destination nodes
(b) DAG with replica nodes for destination nodes
Figure 4.2: DAG in NWDST problem for hold safe DFF insertion

Problem 1. Find the Steiner tree with minimum weight in the directed graph $G$ with the terminal nodes $V_{term} = \{v_s\} \cup V_{rep}$. The Steiner tree has $v_s$ as the root node.

In the example circuit in Fig. 4.2b, the terminal nodes are the src node $(2,1)$ and three replica nodes $(3,2)_{rep}$, $(1,3)_{rep}$ and $(3,3)_{rep}$. To find the solution to Problem 1, we have

Definition 4.1. We define the hold safe edge $E = (v_s, v_d)$ as follows. Given a node $v_s = (CK_s, S_s)$, the destination node $v_d = (CK_d, S_d)$ should satisfy
Condition 1: $(CK_d > CK_s) \wedge (S_d = S_s)$, or
Condition 2: $(CK_d < CK_s) \wedge (S_d = S_s + 1)$.
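Definition 4.1 and the edge construction of $G$ can be written down directly. The following Python sketch represents a node as a `(clock_phase, clock_stage)` tuple; the function names and the `max_stage` parameter (the largest destination stage, which bounds the graph) are illustrative assumptions, not identifiers from the thesis.

```python
def is_hold_safe_edge(src, dst):
    """Definition 4.1 for nodes written as (clock_phase, clock_stage)."""
    ck_s, s_s = src
    ck_d, s_d = dst
    condition_1 = ck_d > ck_s and s_d == s_s        # later phase, same stage
    condition_2 = ck_d < ck_s and s_d == s_s + 1    # earlier phase, next stage
    return condition_1 or condition_2


def hold_safe_successors(node, m, max_stage):
    """All nodes reachable from `node` by one hold safe edge in G.

    m is the number of clock phases; stages above max_stage are not in G.
    """
    ck, s = node
    out = [(c, s) for c in range(ck + 1, m + 1)]
    if s < max_stage:
        out += [(c, s + 1) for c in range(1, ck)]
    return out


# Source node of the Fig. 4.2b example with three clock phases.
neighbours = hold_safe_successors((2, 1), m=3, max_stage=3)
```

Note that a node with clock phase 1 has no successor in the next stage, and a node with clock phase $m$ has none in its own stage.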
The conditions guarantee that sequentially neighboring nodes never have the same clock phase, and that the data is always correctly propagated, as discussed in Section 3.4.1. Note that in Problem 1, all edges in $G$ are hold safe edges.

Definition 4.2. Given the solution Steiner tree $G_R = (V_R, E_R)$ of Problem 1, we define a sub-tree $G_A = (V_A, E_A)$, where $V_A = V_R \setminus V_{term}$ and $E_A \subseteq E_R$ contains all the edges whose origin and destination nodes are both in $V_A$.

Fig. 4.3 shows the solution tree $G_R$ and the sub-tree $G_A$ for the example in Fig. 4.2b. $V_A$ are the nodes with dashed outlines and $E_A$ are the edges with dashed lines. Next, we show that $G_A$ has a special feature that simplifies finding the Steiner tree solution.

Figure 4.3: Solution tree $G_R$ and sub-tree $G_A$ for the example in Fig. 4.2b

Definition 4.3. Let $G = (V, E, \phi)$ be a directed graph [82]. Let $e_1, e_2, \dots, e_{n-1}$ be a sequence of elements of $E$ for which there is a sequence $v_1, v_2, \dots, v_n$ of distinct elements of $V$ such that $\phi(e_i) = (v_i, v_{i+1})$ for $i = 1, 2, \dots, n-1$. The sequence of edges $e_1, e_2, \dots, e_{n-1}$ is called a directed path in $G$. The sequence of vertices $v_1, v_2, \dots, v_n$ is called the vertex sequence of the path. Note that since the vertices are distinct, so are the edges.

Theorem 4.1. There exists an optimal solution $G_A$ of Problem 1 which is a directed path.

Figure 4.4: Branch node $v_b$ with two branches in solution Steiner tree

Proof. (by contradiction) To prove $G_A$ is a directed path, we show that $G_A$ does not contain any branching point.
Assume a branching node $v_b = (CK_b, S_b)$ exists as shown in Fig. 4.4. $B_1$ and $B_2$ are two branches, $B_1: \{v_b \to v_1 \to v_{d_1}\}$ and $B_2: \{v_b \to v_2 \to v_{d_2}\}$, where $v_{d_1} = (CK_{d_1}, S_{d_1})$ and $v_{d_2} = (CK_{d_2}, S_{d_2})$ are two nodes in $B_1$ and $B_2$, respectively. From Definition 4.1, we discuss two possible cases of $v_1 = (CK_1, S_1)$ and $v_2 = (CK_2, S_2)$.

1. $|S_1 - S_2| = 1$, i.e., $v_1$ and $v_2$ are in different clock stages. We assume $S_2 = S_b + 1 = S_1 + 1$.
The case of $S_1 = S_2 + 1$ can be proved similarly. From Definition 4.1, we know that $CK_1 > CK_b$ and $CK_2 < CK_b$. We discuss all the possible conditions of $v_{d_1} = (CK_{d_1}, S_{d_1})$. Apply Definition 4.1 on nodes $v_{d_1}$ and $v_1$:
• If $(CK_{d_1} > CK_1) \wedge (S_{d_1} = S_1)$, as shown in Fig. 4.5a with the dashed line, $v_b$ can be connected to $v_{d_1}$, making $v_1$ redundant. Contradiction!
• If $(CK_{d_1} < CK_b) \wedge (S_{d_1} = S_1 + 1)$, as shown in Fig. 4.5b with the dashed line, $v_b$ can be connected to $v_{d_1}$, making $v_1$ redundant. Contradiction!
• If $(CK_b \le CK_{d_1} < CK_1) \wedge (S_{d_1} = S_1 + 1 = S_b + 1)$, combining with $(CK_2, S_2)$, we have $CK_2 < CK_b \le CK_{d_1} < CK_1 \le m$ and $S_{d_1} = S_1 + 1 = S_b + 1 = S_2$. Therefore, as shown in Fig. 4.5c, node $v_2$ can be connected to $v_{d_1}$. Since $v_2$ is then connected to both $v_{d_1}$ and $v_{d_2}$, node $v_1$ is redundant. The Steiner tree with nodes $\{v_b, v_2, v_{d_1}, v_{d_2}\}$ is a better solution than $\{v_b, v_1, v_2, v_{d_1}, v_{d_2}\}$ in Fig. 4.4. Contradiction!

(a) (b) (c)
Figure 4.5: Node connections in Condition 1

2. $S_1 = S_2$, i.e., $v_1$ and $v_2$ are in the same clock stage. Apply Definition 4.1 on nodes $v_{d_1}$ and $v_1$:
• If $S_1 = S_2 = S_b$, we have $CK_1 > CK_b$ and $CK_2 > CK_b$. From the derivation above, we know $(CK_b \le CK_{d_1} < CK_1 \le m) \wedge (S_{d_1} = S_1 + 1)$ and $(CK_b \le CK_{d_2} < CK_2 \le m) \wedge (S_{d_2} = S_2 + 1)$. Assume $CK_1 \ge CK_2$; then $(CK_1 > CK_{d_2}) \wedge (S_{d_2} = S_1 + 1)$, and as shown in Fig. 4.6a with the dashed line, $v_1$ can be connected to $v_{d_2}$. Thus, $v_2$ is redundant in the solution. Similarly, if $CK_1 < CK_2$, $v_1$ is redundant. Contradiction!
• If $S_1 = S_2 = S_b + 1$, assume $CK_1 > CK_2$. As shown in Fig. 4.6b and Fig. 4.6c, $v_1$ can be connected to $v_{d_2}$, thus $v_2$ is redundant. Similarly, if $CK_1 < CK_2$, $v_1$ is redundant. Contradiction!

(a) (b) (c)
Figure 4.6: Node connections in Condition 2

In all possible cases, only one node from $\{v_1, v_2\}$ exists in the optimal solution, that is, the branching node $v_b$ does not exist.
When $L_B > 2$, we can prove similarly that there are unnecessary nodes and that no branching node needs to exist in the solution tree. Hence, $G_A$ is a directed path. ■

Given that $G_A$ is a directed path, we discuss the algorithm to find $G_A$ as follows. We start by analyzing the paths from the source node to all destination nodes. Then we prove that one specific path among these paths is the solution for $G_A$. First, we define the distances in the DAG as below.

Definition 4.4. Given the distances from source node $v_s$ to all the destination nodes (the distance between two nodes in the directed graph is the number of edges in a shortest path connecting them), we define the longest distance as $d_L$, the longest paths as $P_{L_1}, \dots, P_{L_k}$ and the corresponding destination nodes as $V_L = \{v_{L_1}, \dots, v_{L_k}\}$.

For the example in Fig. 4.2b, $d_L = 3$, $V_L = \{(3,3)_{rep}\}$, $P_L = \{(2,1), (3,1), (2,2), (1,3)_{des}, (3,3)_{rep}\}$.

Definition 4.5. Distance from a directed path to a node: In graph $G$, we define the distance from a directed path $P = (V, E, \phi)$ to node $v$ as the shortest distance from $v_k$ to $v$ over all $v_k \in V$.

Lemma 4.1. Given a destination node $v_{L_m} = (CK_{L_m}, S_{L_m})$, $v_{L_m} \in V_L$, and another destination node $v = (CK, S)$, $v \notin V_L$, then $S \le S_{L_m}$.

Proof. (by contradiction) Assume $S = S_{L_m} + 1$, and let $e(v_{pre} \to v_{L_m})$ be an edge on path $P_{L_m}$. From Definition 4.1, $d_{v_{pre} \to v} \ge 1$ (where $d_{v_{pre} \to v}$ is the distance from $v_{pre}$ to $v$). Thus, the length of the path from $v_s$ to $v$ is at least the longest distance. Contradiction! Hence $S \le S_{L_m}$. ■

Lemma 4.1 states that the destination nodes with the longest distance always have the largest clock stage among all destination nodes. For the example in Fig. 4.2b, among nodes $\{(3,3)_{rep}, (3,2)_{rep}, (1,3)_{rep}\}$, node $(3,3)_{rep}$ has the largest clock stage of 3. If there are multiple nodes with the largest clock stage ($(3,3)_{rep}$ and $(1,3)_{rep}$ in the example), we find the solution for $G_A$ according to Theorem 4.2.
Sort the destination nodes{v L 1 ,...,v L k } by the decreasing order of the clock stage, if there is a tie, then sort the nodes by the decreasing order of the clock phase. If the first node after sorting is v Lm , for any node v ∈{V term \v Lm }, the distance from P Lm to v is 1. 94 Proof. As an illustrative example, in Fig. 4.3, v Lm = (3,3) rep ,P Lm = {(2,1) src ,(3,1),(2,2),(1,3),(3,3) rep }, the distances fromP Lm to node (3,2) rep and (1,3) rep are 1. We prove the theorem stays true in the general cases. From Definition 4.1, we know that the edges in G are always between two nodes with same clock stage or neighboring clock stage. Given the node v Lm = (CK Lm ,S Lm ), and the longest distance path P Lm (V Lm ,E Lm ,ϕ Lm ), for any given clock phase S < S Lm , there always exists a node v ∈ V Lm with clock stage S. Hence, for any directed path from source node v s to destination node v d , at least one node exists at every clock stage S :S s CK d )∧(S pre =S d − 1), the distance from v pre to v d is 1. So, for any node v∈{V term \v Lm }, the distance from P Lm to v is 1. 95 ■ Theorem 4.3. Given the directed path P Lm in Theorem 4.2, the solution of Prob- lem 1 can be generated by connecting P Lm with the destination nodes{V term \v Lm }. Proof. As proved in Theorem 4.1, G A is a directed path. From definition of G A , L G A = L G R − 2, L G A ≥ d L − 2 ⇒ L G R ≥ d L . As shown in Theorem 4.2, the distance from P Lm to node v ∈ {V term \v Lm } is 1. Hence, L G R = d L , G R can be generated by connecting P Lm with the rest destination nodes. ■ In conclusion, we proved that we can find the solution of the formalized NWDST problem in two steps: First, sort the destination nodes in the decreasing order of clock stages and then decreasing order of clock phases if there is a tie. Second, find the shortest path from the source node to the first destination node after sorting, and connect all other destination nodes to the path. 
4.2.2 Number of Inserted DFFs under Hold Safe Condition
According to Theorem 4.3, the Steiner tree solution can be obtained by finding the shortest path from the source node to one specific destination node. Hence, the number of required inserted buffers can be calculated from the number of internal nodes on the path. Due to the particular structure of the directed graph $G$ in Problem 1, instead of using the typical breadth-first-search (BFS) or depth-first-search (DFS) algorithm, we propose that the shortest path can be easily found with a regular pattern. Assume we want to find the path $P$ from node $v_i = (CK_i, S_i)$ to $v_j = (CK_j, S_j)$, where $S_j > S_i$. Since $G$ has a layered structure where the nodes with the same clock stage are in the same layer, it is preferable that the clock stage increases with each internal node. From Definition 4.1, when a node has a clock phase of 1, no edge exists that connects it to a node in the next layer. Its child node must then have the same clock stage with a larger clock phase, and we call such a node an overhead node. We prefer using as few overhead nodes as possible, so that the length of the path is minimized.
Next, we present how to count the minimum number of required inserted buffers with a regular pattern that minimizes the number of overhead nodes.

Theorem 4.4. Given a pair of source node (fanin gate) $v_i = (CK_i, S_i)$ and destination node (fanout gate) $v_j = (CK_j, S_j)$, the minimum number of required inserted buffers $C_{ij}$ is
$$C_{ij} = \begin{cases} S_j - S_i - 1 + \left\lceil \dfrac{\max(S_j - S_i - (CK_i - CK_j),\, 0)}{m - 1} \right\rceil, & S_j > S_i \\ 0, & S_j \le S_i \end{cases} \tag{4.5}$$

Proof.
• When $S_j \le S_i$, from Equation (3.5), we have $CK_j > CK_i$. Thus $v_i$ is connected to $v_j$ without extra DFFs, and $C_{ij} = 0$.
• When $S_j > S_i$, we need to find the shortest path between nodes $v_i$ and $v_j$. As an illustrative example, Fig. 4.7 shows the shortest path from (1,3) to (15,2), with six clock phases.
As explained above, given a node, we always prefer to connect it to a node with a higher clock stage; if that is not possible, we connect it to the node with the same clock stage and the largest clock phase.

Figure 4.7: Transition from (1, 3) to (15,2) with minimum DFFs

Next, we count the number of internal nodes in this path. We discuss two conditions:
1. $CK_i \le CK_j$
– Baseline: When the clock stage is increased with every DFF, the number of DFFs is
$$N_1 = S_j - S_i - 1 \tag{4.6}$$
– Overhead: When there exist $N_2$ overhead DFFs,
$$N_2 = 1 + \left\lceil \frac{\max(S_j - S_i - (CK_i - 1) - (m - CK_j),\, 0)}{m - 1} \right\rceil \tag{4.7}$$
We prove Equation (4.7) by induction.
∗ Base case: When $0$ [...] $0$, since $S_j - S_i - (CK_i - CK_j) > 0$, we have
$$N = S_j - S_i + \left\lceil \frac{S_j - S_i - (CK_i - CK_j) - (m - 1)}{m - 1} \right\rceil = S_j - S_i - 1 + \left\lceil \frac{S_j - S_i - (CK_i - CK_j)}{m - 1} \right\rceil = C_{ij}.$$
Hence, $N = C_{ij}$.
2. $CK_i > CK_j$
– Sub-condition 2.1: When $S_j - S_i \le CK_i - CK_j$, $N_2 = 0$ and $N = S_j - S_i - 1$. Since $S_j - S_i - (CK_i - CK_j) \le 0$,
$$C_{ij} = S_j - S_i - 1 + \left\lceil \frac{\max(S_j - S_i - (CK_i - CK_j),\, 0)}{m - 1} \right\rceil = S_j - S_i - 1 = N.$$
– Sub-condition 2.2: When $S_j - S_i > CK_i - CK_j$,
∗ Baseline: $N_1 = S_j - S_i - 1$
∗ Overhead: Similar to Equation (4.7), $N_2 = 1 + \left\lceil \dfrac{\max(S_j - S_i - (CK_i - 1) - (m - CK_j),\, 0)}{m - 1} \right\rceil$.
Thus,
$$N = N_1 + N_2 = S_j - S_i + \left\lceil \frac{\max(S_j - S_i - (CK_i - CK_j + m - 1),\, 0)}{m - 1} \right\rceil = C_{ij}.$$
Hence, Equation (4.5) is the minimum number of required inserted DFFs under the hold safe condition. ■

4.2.3 Relaxed Clock Phase Constraints
In Section 3.4.1, we constrain all inputs and outputs, including pseudo inputs and outputs modeling registers, to have the same clock phase. Although this constraint guarantees the consistency of the user's clock interface, it could over-constrain the solution space of the ILP problem and worsen the solution quality. In the hold safe ILP, we relax it as follows: a pair of pseudo input and output should have the same clock phase as each other.

4.2.4 Objective Function
Similar to Section 3.8.1, we introduce an integer variable $C_i$ for each gate $i$.
For all pairs $(i,j) \in E$,

\[ C_i \ge C_{ij} \tag{4.9} \]

The cost function is to minimize Equation (4.10), where there are n clocked gates in the circuit.

\[ \sum_{i=1}^{n} C_i \tag{4.10} \]

4.3 Clock Phase Assignment

The modified ILP in Section 4.2 generates a solution for the clock phases and clock stages of the existing sequential gates in the circuit. In this section, we present how to generate the clock phase assignment for the inserted DFFs.

In Section 4.2.2, we show that inserting the DFFs as a linear pipeline generates the solution with the minimum number of DFFs. As shown in Fig. 4.7, the clock phase transition follows the regular patterns below.

• Without overhead node: $CK_i \to (CK_i - 1) \to \dots \to CK_j$
• With overhead node: $CK_i \to (CK_i - 1) \to \dots \to 1 \to \underbrace{\{m \to (m - 1) \to \dots \to 1\}}_{\text{repeating } K \text{ times}} \to m \to (m - 1) \to \dots \to CK_j$, where K is a constant with $K \ge 0$, $K \in \mathbb{N}$, and m is the number of given clock phases.

For each fanout gate, we find the corresponding buffer and connect them. Splitters are inserted as needed.

4.4 Experimental Results

In this section, we evaluate the proposed hold safe multi-phase clocking SFQ circuits. We compare the hold safe circuits with the circuits without the hold safe condition on a set of benchmark circuits to understand the changes after synthesis.

The experiments for benchmark circuits AMD2901 and ss_pcm are run with the Synopsys EDA flow for Superconducting Electronics [70], as described in Section 3.5, using the SFQ5ee process, while the other benchmark circuits are run with the qPALACE CAD tool suite for SFQ circuits [75]. The ILP problems are solved by the IBM CPLEX v12.10 package [52] with a time-out limit set to 50 minutes.

Tables 4.1 and 4.2 list, respectively, the number of required path-balancing DFFs and the total number of gates after logic synthesis (before clock tree synthesis) on eight benchmarks, including AMD2901 and ss_pcm from OpenCores [72] designs, an array multiplier (arrmult8s), integer dividers (ID4s and ID8s), and other circuits from the ISCAS'85 benchmarks [64].
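The overhead percentages in Tables 4.1 and 4.2 appear to be the relative increase of the hold safe count over the non hold safe count. The sketch below assumes that reading and recomputes the two-clock-phase row of Table 4.1 from the reported DFF counts; the variable and function names are illustrative only.

```python
# Hold safe / non hold safe DFF counts for two clock phases (Table 4.1).
two_phase_dffs = {
    "AMD2901": (2706, 1340),
    "ss_pcm": (364, 276),
    "ID4s": (245, 101),
    "ID8s": (2442, 1110),
    "c5315": (2994, 1096),
    "c6288": (3876, 1559),
}


def overhead_percent(hold_safe: int, non_hold_safe: int) -> float:
    """Relative increase of the hold safe count over the non hold safe count."""
    return 100.0 * (hold_safe - non_hold_safe) / non_hold_safe


per_design = {
    name: overhead_percent(hs, nhs) for name, (hs, nhs) in two_phase_dffs.items()
}
mean_overhead = sum(per_design.values()) / len(per_design)

for name, pct in per_design.items():
    print(f"{name}: {pct:.0f}%")
print(f"mean of per-design overheads: {mean_overhead:.0f}%")
```

Under this reading, the "Avg." overhead of 120% is the mean of the six per-design percentages, not the overhead of the averaged counts (2105 versus 914, which would be about 130%).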
The number of given clock phases varies from two to ten. Note that a single clock phase cannot satisfy the hold safe condition. Our results show that, compared with the circuits without the hold safe condition, the number of inserted DFFs increases by an average of 120% with two clock phases and 72% with three clock phases, and the overhead reduces to around 20% given ten clock phases. The total number of gates after logic synthesis increases by an average of 67% with two clock phases and 24% with three clock phases, and the increase reduces to 1% given ten clock phases.

Table 4.3 lists the total number of gates after logic synthesis of the multi-phase circuits under the hold safe condition with two, three, and ten clock phases, compared with the total number of gates with a single-phase clock. The results show that the total number of gates with two clock phases is on average 97.3% of that of the single clock circuits, while the ratio reduces to an average of 57.6% and 35.5% with three and ten clock phases, respectively. Note that single clock phase circuits may need a large number of hold buffers inserted under significant time uncertainty [83] after clock tree synthesis, while these are not required for the hold time safe multi-phase clocking circuits.

Table 4.1: Number of required DFFs with given number of clock phases, with hold safe condition

Number of extra required DFFs (Hold Safe / Non Hold Safe)
Clock Phases | AMD2901 | ss_pcm | ID4s | ID8s | c5315 | c6288 | Avg.
2 | 2706 / 1340 | 364 / 276 | 245 / 101 | 2442 / 1110 | 2994 / 1096 | 3876 / 1559 | 2105 / 914
3 | 1222 / 797 | 127 / 78 | 100 / 57 | 1102 / 680 | 1052 / 515 | 1528 / 870 | 855 / 500
4 | 709 / 498 | 51 / 23 | 55 / 36 | 680 / 480 | 457 / 270 | 873 / 572 | 471 / 313
5 | 461 / 383 | 27 / 20 | 35 / 24 | 477 / 366 | 264 / 155 | 558 / 405 | 304 / 226
6 | 356 / 284 | 10 / 9 | 23 / 16 | 361 / 289 | 109 / 69 | 409 / 296 | 211 / 161
7 | 256 / 185 | 9 / 8 | 16 / 12 | 287 / 233 | 37 / 27 | 298 / 220 | 151 / 114
8 | 170 / 170 | 8 / 7 | 10 / 8 | 233 / 191 | 25 / 22 | 224 / 174 | 112 / 95
9 | 153 / 131 | 8 / 7 | 8 / 6 | 191 / 164 | 3 / 3 | 177 / 139 | 90 / 75
10 | 134 / 82 | 7 / 6 | 6 / 5 | 160 / 143 | 2 / 2 | 139 / 110 | 75 / 58

Overhead of extra required DFFs
Clock Phases | AMD2901 | ss_pcm | ID4s | ID8s | c5315 | c6288 | Avg.
2 | 102% | 32% | 143% | 120% | 173% | 149% | 120%
3 | 53% | 63% | 75% | 62% | 104% | 76% | 72%
4 | 42% | 122% | 53% | 42% | 69% | 53% | 63%
5 | 20% | 35% | 46% | 30% | 70% | 38% | 40%
6 | 25% | 11% | 44% | 25% | 58% | 38% | 34%
7 | 38% | 13% | 33% | 23% | 37% | 35% | 30%
8 | 0% | 14% | 25% | 22% | 14% | 29% | 17%
9 | 17% | 14% | 33% | 16% | 0% | 27% | 18%
10 | 63% | 17% | 20% | 12% | 0% | 26% | 23%

Table 4.2: Number of total gates after logic synthesis with given number of clock phases, with hold safe condition

Number of gates after synthesis (Hold Safe / Non Hold Safe)
Clock Phases | arrmult8 | ID4s | ID8s | c5315 | c6288 | c7552 | Avg.
2 | 2277 / 1313 | 739 / 451 | 5981 / 3317 | 9485 / 5689 | 12329 / 7695 | 8705 / 5443 | 6586 / 3985
3 | 1291 / 1019 | 449 / 363 | 3301 / 2457 | 5601 / 4527 | 7633 / 6317 | 5111 / 4385 | 3898 / 3178
4 | 995 / 867 | 359 / 321 | 2457 / 2057 | 4411 / 4037 | 6323 / 5721 | 4267 / 4059 | 3135 / 2844
5 | 859 / 795 | 319 / 297 | 2051 / 1829 | 4025 / 3807 | 5693 / 5387 | 3953 / 3913 | 2817 / 2671
6 | 797 / 741 | 295 / 281 | 1819 / 1675 | 3715 / 3635 | 5395 / 5169 | 3825 / 3801 | 2641 / 2550
7 | 739 / 715 | 281 / 273 | 1671 / 1563 | 3571 / 3551 | 5173 / 5017 | 3801 / 3797 | 2539 / 2486
8 | 707 / 687 | 269 / 265 | 1563 / 1479 | 3547 / 3541 | 5025 / 4925 | 3717 / 3715 | 2471 / 2435
9 | 679 / 667 | 265 / 261 | 1479 / 1425 | 3503 / 3503 | 4931 / 4855 | 3715 / 3715 | 2429 / 2404
10 | 671 / 657 | 261 / 259 | 1417 / 1383 | 3501 / 3501 | 4855 / 4797 | 3715 / 3715 | 2403 / 2385

Overhead of gates after synthesis
Clock Phases | arrmult8 | ID4s | ID8s | c5315 | c6288 | c7552 | Avg.
2 | 73% | 64% | 80% | 67% | 60% | 60% | 67%
3 | 27% | 24% | 34% | 24% | 21% | 17% | 24%
4 | 15% | 12% | 19% | 9% | 11% | 5% | 12%
5 | 8% | 7% | 12% | 6% | 6% | 1% | 7%
6 | 8% | 5% | 9% | 2% | 4% | 1% | 5%
7 | 3% | 3% | 7% | 1% | 3% | 0% | 3%
8 | 3% | 2% | 6% | 0% | 2% | 0% | 2%
9 | 2% | 2% | 4% | 0% | 2% | 0% | 1%
10 | 2% | 1% | 2% | 0% | 1% | 0% | 1%

Table 4.3: Number of total gates after logic synthesis under hold safe condition, compared with single clock without hold safe condition

Designs | Non Hold Safe, 1 phase | Hold Safe, 2 phases | Hold Safe, 3 phases | Hold Safe, 10 phases | Ratio, 2 phases | Ratio, 3 phases | Ratio, 10 phases
arrmult8 | 2307 | 2277 | 1291 | 671 | 98.7% | 56.0% | 29.5%
ID4s | 751 | 739 | 449 | 261 | 98.4% | 59.8% | 35.3%
ID8s | 6011 | 5981 | 3301 | 1417 | 99.5% | 54.9% | 23.7%
c5315 | 10043 | 9485 | 5601 | 3501 | 94.4% | 55.8% | 36.9%
c6288 | 12419 | 12329 | 7633 | 4855 | 99.3% | 61.5% | 39.4%
c7552 | 9089 | 8705 | 5111 | 3715 | 95.8% | 56.2% | 42.7%
Avg. | 6770 | 6586 | 3898 | 2403 | 97.3% | 57.6% | 35.5%

The ratio columns give the hold safe gate count divided by the single-phase non hold safe gate count.

Chapter 5
Summary and Future Work

This chapter concludes the thesis. First, we provide a summary of the contributions and the work discussed in the thesis. Second, we discuss some promising future work.

5.1 Summary

RSFQ technology has the potential of meeting ultra-high operating frequency needs with three orders of magnitude lower power compared with CMOS technologies.
However, the benefits are yet to be attained for complex designs. The nature of RSFQ circuits requires ultra-deep gate-level pipelines, which must be properly handled. To achieve the highest possible throughput, full path balancing is required, which significantly increases the chip area and power dissipation due to the insertion of path-balancing DFFs. For sequential circuits with data dependency, this leads to the requirement of a very large number of threads, which can be very difficult to manage and leverage. The practical utilization of these highly pipelined circuits is low. Moreover, they require ultra-high frequency clock distribution networks, which makes timing closure in RSFQ circuits very challenging due to significant time uncertainty.

This thesis proposes a multi-phase clocking methodology which significantly reduces the number of required path-balancing DFFs and enables efficient multi-threading operation in which the frequency of each clock phase never exceeds the achieved throughput. The effectiveness of the proposed approach is demonstrated with experimental results using a commercial-grade design flow from logic synthesis to clock tree synthesis (CTS) and placement and route (PnR). The proposed method reduces the number of path-balancing registers by 55.5% with two clock phases and up to 95.5% with ten clock phases, with savings on area and clock tree wire length. The CTS and PnR results show that the decrease in registers yields a decrease in total gate area by 40.6% and clock tree wire length by 54.9% with two clock phases, and by 69.6% and 69.8% with ten clock phases, respectively. We further save an average of 26.3% of path-balancing DFFs compared to the original approach by optimizing DFF sharing across the fanout gates as a linear pipeline.

The multi-phase clocking methodology has the potential to make timing closure easier. We further present a clock phase assignment flow which guarantees 100% hold time safety with reasonable overhead on the gate count.
5.2 Future Work

To extend the benefits of the multi-phase clocking methodology, we suggest several potential directions for future work.

• Clock tree synthesis (CTS) for multi-phase clocks. Although single clock tree synthesis for RSFQ circuits has been investigated in recent literature [32,36], how to generate a high-quality CTS solution for multi-phase clocks is still an open question. Compared with single clock tree synthesis, the number of clock sinks is significantly reduced in multi-phase clocking, which eases CTS for high frequency clocks. However, several areas of future work are still needed. First, we need to implement efficient phase shifting logic in RSFQ circuits to generate the phase-shifted clocks with a specified number of phases and clock frequency. Second, we should optimize the clock arrival time specific to the requirements of multi-phase clocking. In particular, one interesting area of future work is to utilize useful skew across multiple clock trees. As with CMOS logic, the potential benefits include better performance of the circuit and better timing margin of the post-CTS circuit.

• In Chapter 4, we propose a 100% hold time safe flow for multi-phase clocking by avoiding assigning the same clock phase to neighboring sequential gates. This restriction leads to a smaller search space for assigning clock phases. Consequently, the required number of inserted DFFs increases. A potential direction for future work is to selectively add hold safe constraints to the parts of the circuit that are more vulnerable to hold time violations, by analyzing the data path delay between sequential elements. This approach could provide trade-offs between the overhead on DFF counts and the timing yield.

• The current variation-aware hold time fixing flow can be extended to the multi-phase clocking domain. Hold buffers may be required for fixing hold time violations with a given user clock frequency. As discussed in Chapter 4, the guaranteed hold time safety sacrifices performance if a hold time violation exists.
Hold buffers can be added to the data paths with the most critical timing margins. Consequently, system performance can be improved.

• Test pattern generation for multi-phase clocks. Different algorithms have been studied for generating test patterns for CMOS circuits. These techniques may need further investigation to be used on multi-phase clocked SFQ circuits.

This thesis introduces a multi-phase clocking methodology into SFQ circuits and shows its significant benefits on area, power, and reliability. In SFQ circuit designs, full path balancing causes large chip area and long interconnect wires, which challenge timing closure and reduce the promised system frequency. The ultra-deep gate-level pipeline requires a large number of independent threads, making it hard for sequential circuits to achieve the expected maximum throughput. Given these critical bottlenecks, this thesis shows that the multi-phase clocking methodology is a very promising research direction for SFQ circuits. We encourage the superconducting electronics community to seek more opportunities in this direction.

Bibliography

[1] W. Buckel, R. Kleiner, and R. Huebener, Superconductivity: Fundamentals and Applications, ser. Physics textbook. Wiley, 2004. [Online]. Available: https://books.google.com/books?id=Xb7vAAAAMAAJ
[2] G. Pasandi and M. Pedram, “An efficient pipelined architecture for superconducting single flux quantum logic circuits utilizing dual clocks,” in IEEE Transactions on Applied Superconductivity, vol. 30, no. 2, pp. 1–12, March 2020.
[3] K. K. Likharev and V. K. Semenov, “RSFQ logic/memory family: a new Josephson-junction technology for sub-terahertz-clock-frequency digital systems,” IEEE Trans. Appl. Supercond., vol. 1, no. 1, pp. 3–28, 1991.
[4] D. A. Reed and J. Dongarra, “Exascale computing and big data,” Commun. ACM, vol. 58, no. 7, p. 56–68, Jun. 2015. [Online]. Available: https://doi.org/10.1145/2699414
[5] O. A. Mukhanov, “Energy-efficient single flux quantum technology,” IEEE Trans. Appl. Supercond., vol. 21, no. 3, pp. 760–769, 2011.
[6] J. E. Hirsch, “The origin of the Meissner effect in new and old superconductors,” Physica Scripta, vol. 85, no. 3, p. 035704, Feb. 2012. [Online]. Available: https://dx.doi.org/10.1088/0031-8949/85/03/035704
[7] K. Gaj, E. G. Friedman, and M. J. Feldman, “Timing of multi-gigahertz rapid single flux quantum digital circuits,” Journal of VLSI signal processing systems for signal, image and video technology, vol. 16, no. 2, pp. 247–276, 1997. [Online]. Available: https://doi.org/10.1023/A:1007903527533
[8] K. I. al., “32 GHz 6.5 mW gate-level-pipelined 4-bit processor using superconductor single-flux-quantum logic,” in IEEE Symposium on VLSI Circuits, 2020, pp. 1–2.
[9] Q. P. Herr, A. Y. Herr, O. T. Oberg, and A. G. Ioannidis, “Ultra-low-power superconductor logic,” Journal of Applied Physics, vol. 109, no. 10, p. 103903, May 2011. [Online]. Available: https://doi.org/10.1063%2F1.3585849
[10] D. Brock, “RSFQ technology: Circuits and systems,” International Journal of High Speed Electronics and Systems, vol. 11, 03 2001.
[11] D. S. Holmes, A. L. Ripple, and M. A. Manheimer, “Energy-efficient superconducting computing—power budgets and requirements,” IEEE Trans. Appl. Supercond., vol. 23, no. 3, pp. 1701610–1701610, 2013.
[12] C. J. Fourie and M. H. Volkmann, “Status of superconductor electronic circuit design software,” IEEE Trans. Appl. Supercond., vol. 23, no. 3, pp. 1300205–1300205, 2013.
[13] S. I. Chase, J. Kawa, L. Amaru, S. Chen, J. C. Vujkovic, S. Whiteley, E. Mlinar, R. Freeman, A. Belov, T. Arifin, A. Salz, P. Moceyunas, M. Pan, T. Liu, S. Lu, Y. Zhang, and R. Singh, “First demonstration of a superconducting electronics microcontroller RTL-to-GDSII flow.” GOMACTech, 2021.
[14] C. J. Fourie, “Digital superconducting electronics design tools—status and roadmap,” IEEE Transactions on Applied Superconductivity, vol. 28, no. 5, pp. 1–12, 2018.
[15] T. Gheewala, “The Josephson technology,” Proceedings of the IEEE, vol. 70, no. 1, pp. 26–34, 1982.
[16] K. Likharev and V. Semenov, “RSFQ logic/memory family: a new Josephson-junction technology for sub-terahertz-clock-frequency digital systems,” IEEE Transactions on Applied Superconductivity, vol. 1, no. 1, pp. 3–28, 1991.
[17] T. Jabbari, G. Krylov, S. Whiteley, E. Mlinar, J. Kawa, and E. G. Friedman, “Interconnect routing for large-scale RSFQ circuits,” IEEE Transactions on Applied Superconductivity, vol. 29, no. 5, pp. 1–5, 2019.
[18] H. Suzuki, S. Nagasawa, K. Miyahara, and Y. Enomoto, “Characteristics of driver and receiver circuits with a passive transmission line in RSFQ circuits,” IEEE Transactions on Applied Superconductivity, vol. 10, no. 3, pp. 1637–1641, 2000.
[19] O. A. Mukhanov, “Energy-efficient single flux quantum technology,” IEEE Transactions on Applied Superconductivity, vol. 21, no. 3, pp. 760–769, 2011.
[20] M. H. Volkmann, A. Sahu, C. J. Fourie, and O. A. Mukhanov, “Implementation of energy efficient single flux quantum digital circuits with sub-aJ/bit operation,” Superconductor Science and Technology, vol. 26, no. 1, p. 015002, Nov. 2012. [Online]. Available: https://doi.org/10.1088%2F0953-2048%2F26%2F1%2F015002
[21] E. Friedman, “Clock distribution networks in synchronous digital integrated circuits,” Proceedings of the IEEE, vol. 89, no. 5, pp. 665–692, 2001.
[22] P. Restle, T. McNamara, D. Webber, P. Camporese, K. Eng, K. Jenkins, D. Allen, M. Rohn, M. Quaranta, D. Boerstler, C. Alpert, C. Carter, R. Bailey, J. Petrovick, B. Krauter, and B. McCredie, “A clock distribution network for microprocessors,” IEEE Journal of Solid-State Circuits, vol. 36, no. 5, pp. 792–799, 2001.
[23] J. Fishburn, “Clock skew optimization,” IEEE Transactions on Computers, vol. 39, no. 7, pp. 945–951, 1990.
[24] E. Friedman, “Method of deskewing data pulses,” IBM Tech. Disclosure Bull, vol. 28, pp. 2658–2659, Nov. 1995.
[25] R.-S. Tsay and I. Lin, “Robin Hood: a system timing verifier for multi-phase level-sensitive clock designs,” in [1992] Proceedings. Fifth Annual IEEE International ASIC Conference and Exhibit, 1992, pp. 516–519.
[26] I. Lin, J. Ludwig, and K. Eng, “Analyzing cycle stealing on synchronous circuits with level-sensitive latches,” in [1992] Proceedings 29th ACM/IEEE Design Automation Conference, 1992, pp. 393–398.
[27] H. B. Bakoglu, Circuits, Interconnections, and Packaging for VLSI. Reading, MA: Addison Wesley, 1990.
[28] Ting-Hai Chao and Yu-Chin Hsu and Jan-Ming Ho and Kahng, A.B., “Zero skew clock routing with minimum wirelength,” IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 39, no. 11, pp. 799–814, 1992.
[29] E. Friedman and S. Powell, “Design and analysis of a hierarchical clock distribution system for synchronous standard cell/macrocell VLSI,” IEEE Journal of Solid-State Circuits, vol. 21, no. 2, pp. 240–246, 1986.
[30] K. Gaj, E. G. Friedman, and M. J. Feldman, “Timing of multi-gigahertz rapid single flux quantum digital circuits,” J. VLSI Signal Process, vol. 16, no. 2, pp. 247–276, July 1997.
[31] S. N. Shahsavani, T.-R. Lin, A. Shafaei, C. J. Fourie, and M. Pedram, “An integrated row-based cell placement and interconnect synthesis tool for large SFQ logic circuits,” IEEE Transactions on Applied Superconductivity, vol. 27, no. 4, pp. 1–8, 2017.
[32] S. N. Shahsavani and M. Pedram, “A minimum-skew clock tree synthesis algorithm for single flux quantum logic circuits,” IEEE Transactions on Applied Superconductivity, vol. 29, no. 8, pp. 1–13, 2019.
[33] S. N. Shahsavani, B. Zhang, and M. Pedram, “A timing uncertainty-aware clock tree topology generation algorithm for single flux quantum circuits,” in 2020 Design, Automation Test in Europe Conference Exhibition (DATE), 2020, pp. 278–281.
[34] J. Cong, A. Kahng, C.-K. Koh, and C.-W. Albert Tsao, “Bounded-skew clock and Steiner routing under Elmore delay,” in Proceedings of IEEE International Conference on Computer Aided Design (ICCAD), 1995, pp. 66–71.
[35] C.-C. Wang and W.-K. Mak, “A novel clock tree aware placement methodology for single flux quantum (SFQ) logic circuits,” in 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD), 2021, pp. 1–9.
[36] R. Bairamkulov, T. Jabbari, and E. G. Friedman, “QuCTS–single flux quantum clock tree synthesis,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pp. 1–1, 2021.
[37] R. N. Tadros and P. A. Beerel, “A robust and tree-free hybrid clocking technique for RSFQ circuits - CSR application,” in 2017 16th International Superconductive Electronics Conference (ISEC), 2017, pp. 1–4.
[38] Z. Deng, N. Yoshikawa, S. Whiteley, and T. Van Duzer, “Data-driven self-timed RSFQ digital integrated circuit and system,” IEEE Transactions on Applied Superconductivity, vol. 7, no. 2, pp. 3634–3637, 1997.
[39] H. R. Gerber, C. J. Fourie, W. J. Perold, and L. C. Muller, “Design of an asynchronous microprocessor using RSFQ-AT,” IEEE Transactions on Applied Superconductivity, vol. 17, no. 2, pp. 490–493, 2007.
[40] Y. Nobumori, T. Nishigai, K. Nakamiya, N. Yoshikawa, A. Fujimaki, H. Terai, and S. Yorozu, “Design and implementation of a fully asynchronous SFQ microprocessor: SCRAM2,” IEEE Transactions on Applied Superconductivity, vol. 17, no. 2, pp. 478–481, 2007.
[41] T. V. Filippov, A. Sahu, A. F. Kirichenko, I. V. Vernik, M. Dorojevets, C. L. Ayala, and O. A. Mukhanov, “20GHz operation of an asynchronous wave-pipelined RSFQ arithmetic-logic unit, physics procedia,” Volume, vol. 36, pp. 1875–3892, 2012. [Online]. Available: https://doi.org/10.1016/j.phpro.2012.06.130
[42] G. Pasandi, A. Shafaei, and M. Pedram, “SFQmap: A technology mapping tool for single flux quantum logic circuits,” in 2018 IEEE International Symposium on Circuits and Systems (ISCAS), 2018, pp. 1–5.
[43] P. Bunyk, K. Likharev, and D. Zinoviev, “RSFQ technology: Physics and devices,” Int. J. High Speed Electron. Syst., vol. 11, 03 2001.
[44] I. V. Vernik, Q. P. Herr, K. Gaij, and M. J. Feldman, “Experimental investigation of local timing parameter variations in RSFQ circuits,” IEEE Trans. Appl. Supercond., vol. 9, no. 2, pp. 4341–4344, 1999.
[45] K. Han, A. B. Kahng, and J. Li, “Optimal generalized H-tree topology and buffering for high-performance and low-power clock distribution,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 39, no. 2, pp. 478–491, 2020.
[46] B. Zhang and M. Pedram, “qSTA: A static timing analysis tool for superconducting single-flux-quantum circuits,” IEEE Trans. Appl. Supercond., vol. 30, no. 5, pp. 1–9, 2020.
[47] R. N. Tadros and P. A. Beerel, “Optimizing (HC)^2LC, A robust clock distribution network for SFQ circuits,” IEEE Trans. Appl. Supercond., vol. 30, no. 1, pp. 1–11, 2020.
[48] D. Amparo, M. Eren Çelik, S. Nath, J. P. Cerqueira, and A. Inamdar, “Timing characterization for RSFQ cell library,” IEEE Transactions on Applied Superconductivity, vol. 29, no. 5, pp. 1–9, 2019.
[49] M. E. Çelik and A. Bozbey, “A statistical approach to delay, jitter and timing of signals of RSFQ wiring cells and clocked gates,” IEEE Transactions on Applied Superconductivity, vol. 23, no. 3, pp. 1701305–1701305, 2013.
[50] K. Hubert, Digital Integrated Circuit Design: From VLSI Architectures to CMOS Fabrication. Cambridge University Press, 2008.
[51] M. Shell. (2020) ABC: A system for sequential synthesis and verification. [Online]. Available: https://people.eecs.berkeley.edu/~alanmi/abc/
[52] (2019) IBM ILOG CPLEX 12.10. [Online]. Available: http://www.ilog.com/products/cplex
[53] J. Zejda and P. Frain, “General framework for removal of clock network pessimism,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Des., 2002, pp. 632–639.
[54] V. Garg, “Common path pessimism removal: An industry perspective: Special session: Common path pessimism removal,” in 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2014, pp. 592–595.
[55] E. W. Grafarend, Linear and Nonlinear Models: Fixed Effects, Random Effects, and Mixed Models. Walter de Gruyter, 2006.
[56] T. Huang, P. Wu, and M. D. F. Wong, “Fast path-based timing analysis for CPPR,” in 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2014, pp. 596–599.
[57] T. Huang and P. Wu and M. D. F. Wong, “UI-Timer: An ultra-fast clock network pessimism removal algorithm,” in 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2014, pp. 758–765.
[58] M. A. Bender and M. F. Colton, “The LCA problem revisited,” Proc. 4th Latin American Symposium on Theoretical Informatics, vol. 1776 LNCS, pp. 88–94, 2000.
[59] R. N. Tadros, A. Fayyazi, M. Pedram, and P. A. Beerel, “SystemVerilog modeling of SFQ and AQFP circuits,” IEEE Transactions on Applied Superconductivity, vol. 30, no. 2, pp. 1–13, 2020.
[60] J. Xiong, V. Zolotov, and L. He, “Robust extraction of spatial correlation,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 4, pp. 619–631, 2007.
[61] S. K. Tolpygo, V. Bolkhovsky, D. E. Oates, R. Rastogi, S. Zarr, A. L. Day, T. J. Weir, A. Wynn, and L. M. Johnson, “Superconductor electronics fabrication process with MoNx kinetic inductors and self-shunted Josephson junctions,” IEEE Trans. Appl. Supercond., vol. 28, no. 4, pp. 1–12, 2018.
[62] L. C. Müller, H. R. Gerber, and C. J. Fourie, “Review and comparison of RSFQ asynchronous methodologies,” Journal of Physics: Conference Series, vol. 97, p. 12109, Feb. 2008. [Online]. Available: https://doi.org/10.1088%2F1742-6596%2F97%2F1%2F012109
[63] K. Gaj, Q. Herr, and M. Feldman, “Parameter variations and synchronization of RSFQ circuits,” in Proc. Conf. Series-Inst. Phys., vol. 148, 1995, pp. 1733–1736.
[64] D. Bryan, “The ISCAS’85 benchmark circuits and netlist format,” Technischer Bericht, Microelectronics Center of North Carolina (MCNC), 1985.
[65] G. Pasandi and M. Pedram, “A dynamic programming-based, path balancing technology mapping algorithm targeting area minimization,” in 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2019, pp. 1–8.
[66] K. A. Sakallah, T. N. Mudge, and O. A. Olukotun, “Optimal clocking of synchronous systems,” TAU Workshop, pp. 1–21, 1990.
[67] X. Gao, E. A. M. Klumperink, and B. Nauta, “Advantages of shift registers over DLLs for flexible low jitter multiphase clock generation,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 55, no. 3, pp. 244–248, March 2008.
[68] O. Mukhanov, “Rapid single flux quantum (RSFQ) shift register family,” IEEE Transactions on Applied Superconductivity, vol. 3, no. 1, pp. 2578–2581, 1993.
[69] M. A. B. Jackson, A. Srinivasan, and E. S. Kuh, “Clock routing for high-performance ICs,” 27th ACM/IEEE Design Automation Conference, pp. 573–579, 1990.
[70] R. Freeman, “Synopsys’ journey to enable TCAD and EDA tools for superconducting electronics.” vol. 2020.
[71] C. J. Fourie, C. L. Ayala, L. Schindler, T. Tanaka, and N. Yoshikawa, “Design and characterization of track routing architecture for RSFQ and AQFP circuits in a multilayer process,” IEEE Transactions on Applied Superconductivity, vol. 30, no. 6, pp. 1–9, September 2020.
[72] OpenCores Benchmarks. [Online]. Available: http://www.opencores.org
[73] “WRspice circuit simulator.” [Online]. Available: http://wrcad.com/
[74] L. Schindler, “RSFQ cell library,” Stellenbosch University. [Online]. Available: https://github.com/sunmagnetics/RSFQlib
[75] M. Pedram, “qPALACE: A suite of EDA tools for synthesis and physical design optimization of single flux quantum logic circuits,” in Proc. of the 33rd International Symposium on Superconductivity, Dec. 2020.
[76] G. Pasandi and M. Pedram, “PBMap: A path balancing technology mapping algorithm for single flux quantum logic circuits,” IEEE Transactions on Applied Superconductivity, vol. 29, no. 4, pp. 1–14, 2019.
[77] M. Hazewinkel, “Steiner tree problem,” in Encyclopedia of Mathematics. Springer, 2001.
[78] R. M. Karp, Reducibility among Combinatorial Problems. Boston, MA: Springer US, 1972, pp. 85–103. [Online]. Available: https://doi.org/10.1007/978-1-4684-2001-2_9
[79] E. D. Demaine, M. Hajiaghayi, and P. N. Klein, “Node-weighted Steiner tree and group Steiner tree in planar graphs,” ACM Trans. Algorithms, vol. 10, no. 3, Jul. 2014. [Online]. Available: https://doi.org/10.1145/2601070
[80] M. T. Hajiaghayi, R. D. Kleinberg, H. Räcke, and T. Leighton, “Oblivious routing on node-capacitated and directed graphs,” ACM Trans. Algorithms, vol. 3, no. 4, p. 51–es, Nov. 2007. [Online]. Available: https://doi.org/10.1145/1290672.1290688
[81] S. Guha, A. Moss, J. S. Naor, and B. Schieber, “Efficient recovery from power outage (extended abstract),” in Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing, ser. STOC ’99. New York, NY, USA: Association for Computing Machinery, 1999, p. 574–582. [Online]. Available: https://doi.org/10.1145/301250.301406
[82] S. G. Bender, Edward A.; Williamson, Lists, Decisions and Graphs. With an Introduction to Probability. Unit GT: Basic Concepts in Graph Theory, 2010.
[83] X. Li, S. N. Shahsavani, X. Zhou, M. Pedram, and P. A. Beerel, “A variation-aware hold time fixing methodology for single flux quantum logic circuits,” ACM Trans. Des. Autom. Electron. Syst., vol. 26, no. 6, Aug. 2021. [Online]. Available: https://doi.org/10.1145/3460289
Abstract
Single flux quantum (SFQ) logic is a promising technology to replace CMOS logic for future exa-scale supercomputing, but it requires the development of reliable EDA tools that are tailored to the unique characteristics of SFQ circuits. SFQ circuits often have ultra-high clock frequencies, significant time uncertainty, and deep clock trees, all of which make the design of clock distribution networks (CDN) and timing closure extremely challenging. Moreover, the gate-level pipelining nature of SFQ requires many path-balancing registers to enable proper multi-threaded computation, which significantly increases the chip area and power.
This thesis first presents a variation-aware hold time fixing methodology that considers both local and global timing uncertainties and effectively utilizes common path pessimism removal to reduce the overhead of hold buffers with competitive timing yield, compared with a fixed time margin. It then presents a novel multi-phase clocking methodology targeting multi-threaded gate-level pipelined sequential circuits. An integer linear programming (ILP) algorithm is formulated to minimize the number of required registers given the number of available clock phases. The proposed method reduces the number of path-balancing registers by 55.5% with two clock phases and up to 95.5% with ten clock phases. The clock tree synthesis (CTS) and placement and route (PnR) results show that the decrease in registers yields a decrease in total gate area by 40.6% and clock tree wire length by 54.9% with two clock phases, and by 69.6% and 69.8% with ten clock phases, respectively, despite the increase in the number of clock phases. To further reduce the number of required registers, we present an enhancement of the flow to maximize register sharing across the fanout gates. The enhanced flow reduces the number of inserted registers by an average of 26.3%, compared with the original flow. Last, we propose a hold time safe extension of the multi-phase clocking methodology, which extends the benefits of multi-phase clocking to timing robustness.
Asset Metadata
Creator
Li, Xi (author)
Core Title
Multi-phase clocking and hold time fixing for single flux quantum circuits
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Degree Conferral Date
2022-12
Publication Date
11/14/2022
Defense Date
10/18/2022
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
DAG,gate-level pipelining,hold time,multi-phase clock,multithreading,OAI-PMH Harvest,path balancing,single flux quantum (SFQ),superconducting electronics,timing uncertainty
Format
theses (aat)
Language
English
Contributor
Electronically uploaded by the author (provenance)
Advisor
Beerel, Peter A. (committee chair), Gupta, Sandeep (committee member), Nakano, Aiichiro (committee member), Nuzzo, Pierluigi (committee member)
Creator Email
xili.xil20@gmail.com,xli497@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC112417962
Unique identifier
UC112417962
Identifier
etd-LiXi-11312.pdf (filename)
Legacy Identifier
etd-LiXi-11312
Document Type
Dissertation
Rights
Li, Xi
Internet Media Type
application/pdf
Type
texts
Source
20221115-usctheses-batch-991 (batch), University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright. The original signature page accompanying the original submission of the work to the USC Libraries is retained by the USC Libraries and a copy of it may be obtained by authorized requesters contacting the repository e-mail address given.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
cisadmin@lib.usc.edu