Designing efficient algorithms and developing suitable software tools to support logic synthesis of superconducting single flux quantum circuits
Designing Efficient Algorithms and Developing Suitable Software Tools to Support Logic Synthesis of Superconducting Single Flux Quantum Circuits

by Ghasem Pasandi

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Electrical Engineering)

December 2020

Copyright 2020 Ghasem Pasandi

Dedication

In memory of my mom Aramjan and my dad Ali, whom I lost several years ago. May they rest in peace.

Acknowledgements

First and foremost, I would like to express my deepest appreciation and thanks to my advisor, Professor Massoud Pedram, for the support, ideas, and guidance he has provided over the past five years. Being a member of his research group has significantly helped my PhD research, and my journey as a PhD student has been remarkable because of his leadership. I would also like to thank my defense committee, Professor Sandeep Gupta and Professor Ketan Savla, and my qualification exam committee, Professor Peter Beerel and Professor Pierluigi Nuzzo.

The work in this dissertation was part of a larger intercontinental project called ColdFlux (footnote 1). I would like to express my gratitude to the ColdFlux team, starting with the PI of the project, Professor Massoud Pedram from the University of Southern California (USC), Los Angeles, CA, and the co-PI of the project, Professor Coenrad Fourie from Stellenbosch University, Stellenbosch, South Africa, and the faculty members involved: Professor Peter Beerel, Professor Sandeep Gupta, Professor Shahin Nazarian, and Professor Murali Annavaram, all from USC; Professor Yanzhi Wang from Northeastern University, Boston, MA; Professor Nobuyuki Yoshikawa and Professor Christopher Ayala from Yokohama National University, Yokohama, Japan; Professor Mark Law from the University of Florida, Gainesville, FL; and Professor Pascal Febvre from the University of Savoie Mont Blanc, Chambery, France.
Also, I want to thank my colleagues here at USC: Dr. Naveen Katam, Dr. Ramy Tadros, Dr. Fangzhou Wang, Ting-Ru Lin, Soheil Nazarshahsavani, Bo Zhang, Haolin Cong, Souvik Kundu, Haipeng Zha, and Gourav Datta.

Footnote 1: coldflux.usc.edu

I have been lucky to work with a wonderful group of talented people and to be part of a great research group (SPORT lab) here at USC (past and current members): Dr. Shahin Nazarian, Dr. Xue Lin, Dr. Di Zhu, Dr. Alireza Shafaei, Dr. Tiansong Cui, Dr. Shuang Chen, Dr. Luhao Wang, Dr. Naveen Katam, Hassan Afzalikusha, Mahdi Nazemi, Soheil Nazarshahsavani, Ting-Ru Lin, Marzieh Vaeztourshizi, Amir Erfan Eshratifar, Amirhossein Esmaili, Bo Zhang, Saeed Abrishami, Haolin Cong, Souvik Kundu, Mingye Li, Zeming Cheng, and Mustafa Karamuftuoglu.

Within SPORT lab, I would specifically like to thank Saeed Abrishami, Soheil Nazarshahsavani, Mahdi Nazemi, and Dr. Alireza Shafaei. From the first day that I came to the United States until now, these individuals have helped me in numerous ways: picking me up when I arrived at the LAX airport, helping me prepare for placement exams here at USC, and even teaching me how to cook Persian food. They shared many experiences and good ideas, which in turn showed me how to write good papers and how to be a great PhD student.

I would also like to thank Dr. Shahin Nazarian, a faculty member at USC and a member of the SPORT lab. We have worked on many remarkable projects together and published our findings in several top conferences and journals. However, I did not include them in this dissertation because they are out of scope.

My sincerest thanks to my best friends here at USC: Ahmad Fallahpour, Mehdi Jafarnia, Mohammad Asghari, Seyed Mohammadreza Mousavi, Ali Zarei, Mehrdad Kiamari, and Pezhman Mamdouh.
I will never forget the memories we have created: our road trips, hiking, hangouts on Friday evenings, going to Westwood and getting Persian food, watching movies and playing video games, our short trips in and around LA to attend different events, birthday parties, and many more.

Last but not least, I am most grateful to my wonderful and encouraging family, my siblings, nieces, and nephews. Without their love, help, and support, I could not have succeeded and would not be where I am today. Student visas for Iranians here in the U.S. are normally single entry, so I couldn't travel back to my country during my PhD studies. Also, the current administration's travel ban made it impossible for my family to come to the United States to visit me. So, I did not get to see my family for more than four years, but their love has always been with me and kept me going.

The research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the U.S. Army Research Office grant W911NF-17-1-0120. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation herein.

Table of Contents

Dedication
Acknowledgements
List Of Tables
List Of Figures
Abstract
Chapter 1: Introduction
  1.1 Superconducting SFQ Technology
  1.2 Dissertation Contributions
  1.3 Dissertation Organization
Chapter 2: Preliminaries and background
  2.1 Background on SFQ
    2.1.1 Fanout in SFQ
    2.1.2 Gate-level pipeline
    2.1.3 Path Balancing
    2.1.4 Interconnects in SFQ
  2.2 Background on logic synthesis
    2.2.1 Terminology and definition of required concepts
    2.2.2 State-of-the-art technology mapping flow
      2.2.2.1 k-feasible cuts
      2.2.2.2 Computing Cut's Function
      2.2.2.3 Supergates
      2.2.2.4 Boolean Matching
      2.2.2.5 Best Matches and Best Covers
  2.3 Background on graph partitioning
Chapter 3: Balanced Factorization and Rewriting Algorithms
  3.1 Related Work
  3.2 Balanced Factorization
  3.3 Balanced Rewriting
  3.4 Experimental Results
  3.5 Conclusions
Chapter 4: Path Balancing Technology Mapping
  4.1 Minimizing Total DFF count
    4.1.1 Related Work
    4.1.2 Motivation
    4.1.3 Presenting Our Algorithm
      4.1.3.1 Terminology
      4.1.3.2 Discussion about the Algorithm
      4.1.3.3 Retiming
      4.1.3.4 DAG Mapping
      4.1.3.5 Clock Jitter Accumulation
    4.1.4 Experimental Results
  4.2 Minimizing Area
    4.2.1 Related Work
    4.2.2 Technology Mapping
      4.2.2.1 Terminology and Notation
      4.2.2.2 Presenting the Algorithm
    4.2.3 Proof of Optimality
    4.2.4 DAG Mapping
    4.2.5 Experimental Results
    4.2.6 Conclusions
  4.3 Minimizing Product of Stage Delay and Circuit Depth
    4.3.1 Introduction
    4.3.2 Technology Mapping
      4.3.2.1 Prior Work
      4.3.2.2 Motivation
      4.3.2.3 Depth Minimization with Path Balancing
      4.3.2.4 Peephole Optimization for Reducing the Sequential Depth
    4.3.3 Experimental Results
    4.3.4 Conclusions
Chapter 5: Dual Clocking Method for Realizing SFQ Circuits
  5.1 An Efficient Pipelined Architecture for SFQ Logic Circuits Utilizing Dual Clocks
    5.1.1 Introduction
    5.1.2 Throughput Versus Path Balancing DFFs
      5.1.2.1 Path Balancing Requirement in SFQ Circuits
      5.1.2.2 No Path Balancing (NPB)
      5.1.2.3 Partial Path Balancing (PPB)
      5.1.2.4 Timing Requirements of DCM
      5.1.2.5 Adder Design
    5.1.3 Simulation and Experimental Results
    5.1.4 Conclusions
  5.2 Depth-Bounded Levelized Graph Partitioning Algorithm
    5.2.1 Introduction
    5.2.2 Related Work
    5.2.3 Proposed Algorithms
    5.2.4 Realization of SFQ Circuits using Dual Clocks
    5.2.5 Experimental Results
    5.2.6 Conclusions
Chapter 6: Synthesizing Sequential Circuits in SFQ Technology
  6.1 Introduction
  6.2 Leveling Directed Cyclic Graphs
    6.2.1 Terminology, Notation, and Basic Definitions
    6.2.2 Prior Work
    6.2.3 Our Work
  6.3 Synthesizing Sequential Circuits in SFQ Technology
    6.3.1 Pipeline Circuits
      6.3.1.1 Dual Clocking Method
      6.3.1.2 Single Clocking Method
    6.3.2 Circuits with Arbitrary Latch Placement
    6.3.3 Circuits with Feedback Loops
      6.3.3.1 Single Clocking Method
      6.3.3.2 Dual Clocking Method
  6.4 Experimental Results
    6.4.1 Case Study: A Counter Circuit
    6.4.2 ISCAS89 Benchmark Circuits
  6.5 Conclusions
Chapter 7: qSyn: A Synthesis Tool for Superconducting SFQ Logic Circuits
  7.1 qYosys
    7.1.1 Behavioral Synthesis Commands
    7.1.2 Mini Synthesis Process
  7.2 qABC
  7.3 qView
  7.4 Demonstrative Examples
Chapter 8: Conclusions and future works
Reference List

List Of Tables

2.1 Average hit rate for 20 ISCAS benchmark circuits in the standard cut-enumeration-based technology mapping approach [38].
3.1 Different technology-independent optimization scripts.
3.2 Comparing the graph cost of different benchmark circuits generated by using different optimization scripts.
3.3 Comparing the area (mm²) of different benchmark circuits generated by using different optimization scripts.
4.1 Experimental results for PBMap and the baseline mapper (ABC's mapper). #DFFs is reported before and after applying the retiming algorithm. Area is in mm² and run-time is in seconds. Area and JJ count (#JJs) are reported after retiming. Logical depth is the maximum logic level in the network.
4.2 Area and delay results for BalancedMap (BM), baseline (Base) [75], and PBMap [123]. For area, two sets of numbers are reported: numbers inside parentheses include the area of gates, path balancing DFFs, and splitters, while numbers outside parentheses do not include the area of splitters. Because the delays of circuits generated by PBMap are very close to those generated by BM, and due to lack of space, they are omitted from the table.
4.3 Experimental results for SFQmap and the ABC mapper using the mcnc.genlib library. The run-time, which measures the amount of time it takes to generate the mapping solution, is measured in seconds.
4.4 Improvement percentages of "SFQmap -i 5" and "SFQmap -i 0" over the ABC mapper for key parameters.
These results are the average of all five tested benchmark circuits. ↓ shows a decrease, and ↑ shows an increase in the corresponding quantity.
5.1 Experimental results for FPB, PPB, and NPB. #PI, #PO, and #DFFs stand for PI count, PO count, and DFF count, respectively. DFF count is split into NDROs and DROs.
5.2 Experimental results for DLGP-based DCM, Baseline1 (FPB), and Baseline2 (FPB + retiming). Run-time is in seconds. For the DLGP-based dual clocking method (DCM), two cases, p=10 and p=5, are considered.
6.1 Experimental results for 1- to 4-bit versions of the counter circuit presented in Section 6.4.1, generated by applying technology-independent optimizations, technology mapping, and also Algorithm 10. L_max is the maximum level among feedback latches, which is the same as the maximum loop length in the circuit. The 1- to 3-bit versions are shown in Figures 6.7, 6.9, and 6.11.
6.2 Experimental results for 1- to 4-bit versions of the counter circuit presented in Section 6.4.1, generated by applying technology-independent optimizations, technology mapping (using supergates), and also Algorithm 10. L_max is the maximum level among feedback latches, which is the same as the maximum loop length in the circuit. These circuits are shown in Figures 6.14, 6.15, 6.17, and 6.19.
6.3 Experimental results for the first 15 benchmark circuits in the ISCAS89 [17] suite. L_max is the maximum loop length in the circuit.

List Of Figures

2.1 Schematic of an SFQ buffer, its IVC, and the pulse shape representation of data in SFQ (waveforms are adopted from [53]).
2.2 Schematic of an SFQ NOT gate and its waveforms.
When there is no input pulse, an output pulse is generated after the arrival of the clock pulse, representing a "logic-1". However, when there is an input pulse, no pulse is generated at the output, meaning a "logic-0".
2.3 (a) A splitter gate in SFQ technology, (b) waveforms corresponding to the operation of this gate, and a splitter tree providing four fanouts with depth (c) 2 and (d) 3. Even though the balanced structure in (c) is usually preferred, there can be cases in which the linear structure works better: the delay for generating out_1 in (d) is lower than in (c). Thus, using the structure of (d) is better in networks where the critical path goes through out_1.
2.4 Gate-level schematic of Example 1, showing the necessity of path balancing for correct operation in SFQ circuits.
2.5 Josephson Transmission Line (JTL).
3.1 A factoring tree for the expression given in Example 2.
3.2 Three AIGs obtained for the Boolean expression mentioned in Example 3. Nodes (shown by circles) are 2-input AND gates, and dashed lines are inverted edges. (a) Generated by applying the standard rewriting, balance, and refactoring commands of ABC [38], (b) generated by applying the resyn2 optimization script of ABC, and (c) a perfectly balanced tree generated by our blnc_syn1 optimization script (see Table 3.1).
3.3 Comparing imbalance factors of original graphs (no opt.) and those generated by different optimization scripts. For better exhibition purposes, the data for the priority, i10, and voter circuits are scaled down by a factor of 10.
3.4 Comparing logical depth of the mapped circuits optimized by using different optimization scripts and the original one (no opt.). For better exhibition purposes, the data for the priority and IntDiv8 circuits are scaled down by a factor of 10.
3.5 Post place-and-route of the ISCAS c7552 benchmark circuit synthesized by applying our technology-independent optimization script, blnc_syn2. Dimensions are 8440 μm × 8420 μm. Dimensions of the chip without applying our optimization scripts are 10090 μm × 9750 μm.
4.1 Two mapping solutions for F = a.b.(!c).d. The left circuit, generated by ABC's mapper [38], requires three path balancing DFFs. There is another mapping solution with only one DFF, as shown in the right graph.
4.2 Showing 3-feasible cuts of node v_i.
4.3 (a) A tree whose number of input pins we want to find, and (b) its extended tree. Fertile and sterile buffer nodes and imaginary nodes are shown in the extended tree. The left sub-tree (not shown) in the extended tree is a full binary tree rooted at node i+1. This sub-tree generates 24 input pins, and there is a single input pin feeding the only buffer node at the last level (x=4). Thus, n=9.
4.4 Two possible binary trees with height X=2, and the most unbalanced binary trees with height X=3 and X=4.
4.5 Two examples for Lemma 4.1.6.
4.6 The most balanced binary tree that we can get by increasing the height of the tree described in Lemma 4.1.5 without increasing the number of input pins.
4.7 Matches for a node in a subject graph: (a) before retiming, (b) after retiming.
4.8 Simulation results for a 4-bit Kogge-Stone adder (KSA4).
(a) Input/output signals for the KSA4 circuit generated by our algorithm, (b) output signals generated for the same inputs by a KSA4 circuit that is not path balanced. Four sets of random inputs are applied: a_0 = 1001, a_1 = 1111, a_2 = 1001, a_3 = 1111, b_0 = 1010, b_1 = 1100, b_2 = 1010, b_3 = 1100, c_in = 1001. The correct outputs are: S_0 = 1010, S_1 = 1010, S_2 = 1110, S_3 = 1010, C_out = 1101. As seen, only the results in (a) are correct.
4.9 Path balancing DFF count versus original gate count for a few benchmark circuits. KSA32 is a 32-bit Kogge-Stone adder; IntDiv8 is an 8-bit integer divider; and c5315, c7552, and c3540 are chosen from the ISCAS benchmark suite [59]. These circuits are mapped using the map command of ABC [38], and path balanced and retimed using full path balancing [75,123] and retiming [92] algorithms.
4.10 Two mapping solutions for a 2-bit Kogge-Stone adder using the mcnc.genlib library of gates (red squares are path balancing DFFs): (a) consuming 11 gates with an area of 37.0 units and requiring 10 path balancing DFFs, (b) consuming 11 gates with the same area but requiring only three path balancing DFFs. The total area of gates and path balancing DFFs in the second circuit is clearly much less than that of the first one.
4.11 An AIG tree with five nodes. 3-feasible cuts (C_1, C_2, and C_3) of node i in this subject tree are shown. Leaf nodes are shown with squares. Nodes are labeled in a Breadth First Search (BFS) order (root, left, right).
4.12 An example to verify the correctness of Eq. 4.12 for giving the total leaf node count of a general tree. The leaf node count of this tree is: 2 × 3^2 − 2{1 × 3 + 1} − 1{0 + 2} = 18 − 8 − 2 = 8 ✓.
4.13 An example of the most cost-efficient tree with n=7, H=4, and b=3.
The rightmost column in the table shows portions of the total cost function for a specific reverse level (r).
4.14 An example of the least cost-efficient tree with n=7, H=4, and b=3.
4.15 Value of the cost function (Eq. 4.19) for different circuits generated by three different mappers. For better exhibition purposes, data for the sin, priority, s35932, and s38584 circuits are scaled down by a factor of five.
4.16 Normalized total static power consumption (the main source of power consumption in SFQ circuits [111]). For better exhibition purposes, data for the sin, priority, s35932, and s38584 circuits are scaled down by a factor of five.
4.17 Total number of Josephson junctions for gates, path balancing DFFs, and splitters. For better exhibition purposes, data for the sin, priority, s35932, and s38584 circuits are scaled down by a factor of five.
4.18 Simulation waveforms for a 2-bit Kogge-Stone Adder (KSA2) generated by MB. For four sets of random inputs, the correct outputs for S_0, S_1, and C_out are generated, as shown.
4.19 Post place-and-route of the dec circuit, which is mapped by BM. Dimensions are 3960 μm × 3310 μm. The dimensions increase to 4870 μm × 3990 μm when the circuit is mapped using the map command of ABC.
4.20 Three mapping solutions for the expression F = ab(!c)d. (a) The circuit generated by the ABC mapper [38], which requires three path balancing DFFs and has a depth of three; two other mapping solutions requiring only one DFF, with a depth of (b) three and (c) two.
5.1 (a) An unbalanced circuit with incorrect operation, (b) path-balanced version of this circuit with correct operation.
5.2 Comparing the required number of path balancing DFFs to the original gate count for a few benchmark circuits when the FPB method is employed.
5.3 (a) A (linear) pipeline architecture for SFQ circuits in which each block is fully path-balanced, (b) the new architecture employing a fast and a slow clock signal together with repeat (green) and mask (blue) bands. In the new architecture, original circuit blocks are either partially path-balanced or not path-balanced at all. The frequency of the slow clock is D times lower than that of the fast clock, where D = max(D_1, D_2).
5.4 (a) Block diagram of an NDRO, (b) its state transition diagram, (c) its operation waveforms obtained by simulating an NDRO using the Josephson simulator (JSIM) [71].
5.5 Inverse of normalized throughput versus path balancing DFF count for the KSA32 circuit.
5.6 Partial path balancing of the circuit shown in Figure 5.1a with (a) = 1, (b) = 2.
5.7 Waveforms showing timing requirements for correct operation of DCM.
5.8 Design of a 4-bit KSA by using two 2-bit KSAs and following the presented DCM and the architecture shown in Figure 5.3b.
5.9 Simulation results for a 4-bit KSA (KSA4) generated by our dual clocking method (Figure 5.8). Inputs, outputs, and clock pulses are shown. Four sets of random inputs are applied to this circuit: a_0 = 0101, a_1 = 1011, a_2 = 0010, a_3 = 1001, b_0 = 0110, b_1 = 1000, b_2 = 1011, b_3 = 0110, and c_in = 1010.
The correct outputs, S_0 = 1001, S_1 = 0101, S_2 = 0011, S_3 = 0101, and C_out = 1010, are generated.
5.10 Total Josephson junction count for different benchmark circuits. For better exhibition purposes, data for the voter, priority, and sin benchmark circuits are scaled down by a factor of 10.
5.11 Total area in mm² for different benchmark circuits. For better exhibition purposes, data for the voter, priority, and sin benchmark circuits are scaled down by a factor of 10.
5.12 Total node count, including gates, DFFs, and splitters, for different benchmark circuits. For better exhibition purposes, data for the voter, priority, and sin benchmark circuits are scaled down by a factor of 10.
5.13 A 4-bit Kogge-Stone adder circuit implemented using the clock-follow-data clocking scheme [49]. The worst stage delay is 15.7 ps, which determines the minimum clock period (for post synthesis). Critical nodes with a stage delay of 15.7 ps are highlighted, and gates at different logic levels are shown. To generate a delay value equal to or larger than the clock period of this circuit, at least four series JTLs should be used. Therefore, to comply with the clock-follow-data clocking scheme, the clock pulse of a gate at level i needs to go through (i − 1) × 4 series JTLs. Only splitters that are needed for splitting clock pulses between different levels are shown; other splitters are not shown in this graph. Note that in the above circuit only one clock pulse, i.e., the slow clock, is needed.
5.14 An example directed acyclic graph used in explaining the DLGP and DCGP problems and the proposed algorithms for graph partitioning. This graph has 10 internal nodes, and the levels of these nodes range from 1 to 5. Triangular shapes at the bottom of the graph represent primary inputs, and those at the top show primary outputs.
Color-coding is used for easy identification of hyper-edges. 2-pin nets are shown in black.
5.15 The induced sub-graphs obtained by partitioning the graph in Figure 5.14 into three parts: V_1 = {v_1, v_2, v_3, v_4}, V_2 = {v_5, v_6}, and V_3 = {v_7, v_8, v_9, v_10}, with p = 2 and K = 3. (a) G_1, (b) G_2, and (c) G_3.
5.16 The optimal cuts for the graph in Figure 5.14 and the corresponding directed weighted chain graphs. (a) The optimal cuts of the graph, which give optimal parts minimizing TCS. The directed weighted chain graph with optimal parts is shown using (b) regular edges and (c) hyper edges for calculating weights. TCW = 7 in (b) and TCW = 5 in (c).
5.17 (a) A graph used in Example 15, corresponding to a 4-bit mapped adder. Color-coding is used to show hyper edges. 2-pin nets are shown in black. (b) The corresponding weighted chain graph with p = 2 and K = 3. The hyper-edges are used for calculating weights and therefore for finding optimal cuts and parts. For the shown partitioning solution, TCW = 26.
5.18 Induced sub-graphs for the graph shown in Figure 5.17a, corresponding to the partitioning solution with p = 2 and K = 3 shown in Figure 5.17b. (a) G_1, (b) G_2, and (c) G_3.
5.19 An example of (a) full path balancing (FPB) and (b) DLGP-based DCM (with p=6) for SFQ circuits. FPB uses DRO DFFs, whereas the DCM uses NDRO DFFs. FPB requires 9 DRO DFFs and DCM requires 5 NDRO DFFs.
5.20 Pulse-repeating gate: an SFQ gate consisting of an NDRO DFF, an AND gate, and two splitters to give the macro and micro clocks to these gates.
5.21 Total static power consumption (mW) of different benchmark circuits generated by DLGP-based DCM, Baseline1 (FPB), and Baseline2 (FPB + retiming). For better exhibition purposes, data for priority, voter, i10, and s13207 are scaled down by a factor of 10.
5.22 Area (mm²) of different benchmark circuits generated by DLGP-based DCM, Baseline1 (FPB), and Baseline2 (FPB + retiming). For better exhibition purposes, data for priority, voter, i10, and s13207 are scaled down by a factor of 10.
5.23 Post place-and-route results of the ISCAS c432 benchmark circuit synthesized by our proposed DCM. Dimensions of the chip are 5150 μm × 4330 μm. Note that synthesis and place-and-route of the slow clock tree are not done here (this will not have a large impact on the reported dimensions, since the number of sink nodes for the slow clock is very small for this circuit).
6.1 The cyclic digraph in Example 16. The triangular shapes on the left, labeled a, b, and c, are leaf nodes, and the one on the right is the root node of the graph. Levels of internal nodes in this graph, following Definition 2, are: v_1: 1, v_2: 1, v_3: 2, v_4: 3, v_5: 4.
6.2 A cyclic digraph used in the proof of Lemma 6.2.1.
6.3 Three different categories of sequential circuits to be synthesized in SFQ technology. Cloud-like shapes are combinational logic, squares denote individual memory elements (latches), and rectangles are columns of memory elements. (a) A linear nice pipeline architecture with three combinational logic blocks, (b) a circuit with memory elements in random places, and (c) a circuit that has (linear) pipeline latches, latches placed in random locations in the circuit, and latches in feedback loops.
6.4 Part of a VLSI circuit showing all feedback loops going through the feedback latch v_7. As Lemma 6.3.1 states, after applying Algorithm 10, lengths of all of these loops will be the same.
6.5 A VLSI circuit with one feedback latch and two feedback loops (7 → 3 → 2 → 4 → 5 and 7 → 6 → 5). The green squares are balancing latches which will be added by Algorithm 10, resulting in equalizing lengths of both loops. This is part of a real circuit.
6.6 The Verilog code for the counter circuit, yosys-counter, used in Section 6.4.1.
6.7 A 1-bit version of the counter in Figure 6.6 after applying technology-independent optimizations, technology mapping, and also Algorithm 10. This circuit has only one feedback loop of length three as follows: 4 → 1 → 3. Note that the worst stage delay in this implementation is determined by the intrinsic delay of an and2 gate to be 8.4 ps.
6.8 Waveforms showing correct functionality of the 1-bit counter in Figure 6.7. At first, since the reset signal (rst) is 1, the counter is reset and count becomes 0. Then, the enable signal (en) becomes 1 and causes count to get a value of 1 after three clock cycles. Note that for this implementation we should apply inputs and sample outputs every three fast clock cycles. For each pulse on the en pin every three clock cycles, the value on the count output will be inverted, i.e., if it is 0, it will become 1 and vice versa. This is how a 1-bit counter is supposed to work. The counter preserves its previous counted value when there is no pulse on en.
6.9 A 2-bit version of the counter in Figure 6.6 after applying technology-independent optimizations, technology mapping, and also Algorithm 10. The green squares are added by Algorithm 10 for balancing paths and for equalizing lengths of feedback loops. This circuit has four feedback loops as follows, with the maximum length of five for the first one: 7 → 3 → 2 → 4 → 5, 7 → 6 → 5, 15 → 14 → 11 → 13, and 15 → 10 → 12 → 13.
6.10 Waveforms showing correct functionality of the 2-bit counter in Figure 6.9. At first, since the reset signal (rst) is 1, the counter is reset and both count[0] and count[1] become 0. Then, the enable signal (en) becomes 1 and causes count[0] to get a value of 1 after five clock cycles. Note that for this implementation we should apply inputs and sample outputs every five fast clock cycles. The counter counts up by 1 for every new pulse on en until it reaches 11 and then it overflows to 00 and starts over. The counter preserves its previous counted value when there is no pulse on en.
6.11 A 3-bit version of the counter in Figure 6.6 after applying technology-independent optimizations, technology mapping, and also Algorithm 10. The green squares are added by Algorithm 10 for balancing paths and for equalizing lengths of feedback loops. This circuit has six feedback loops as follows, with the maximum length of six for the last one: 7 → 3 → 2 → 4 → 5, 7 → 6 → 5, 15 → 14 → 10 → 12 → 13, 15 → 11 → 13, 23 → 20 → 18, and 23 → 22 → 21 → 19 → 17 → 18.
6.12 Waveforms showing correct functionality of the 3-bit counter shown in Figure 6.11. Since the maximum level among feedback latches is six in this implementation, new inputs should be applied every six clock cycles, and every six clock cycles a new correct output will be generated. The counter counts up by 1 for every pulse on en until it reaches 111 and then it overflows to 000 and starts over. The counter preserves its previous counted value when there is no pulse on the en pin.
6.13 Waveforms showing correct functionality of the 4-bit counter generated by applying technology-independent optimizations, regular technology mapping (without using supergates), and also Algorithm 10. New inputs should be applied every seven clock cycles, and every seven clock cycles a new correct output will be generated. The counter counts up by 1 for every pulse on en until it reaches 1111 and then it overflows to 0000 and starts over. The counter preserves its previous counted value when there is no pulse on the en pin.
6.14 A 1-bit version of the counter in Figure 6.6 after applying technology-independent optimizations, technology mapping (using supergates), and also Algorithm 10. This circuit has only one feedback loop of length three as follows: 4 → 1 → 3. Note that the worst stage delay in this implementation is determined by the intrinsic delay of an or2 gate to be 6.1 ps, while in the implementation shown in Figure 6.7 it is determined by an and2 gate to be 8.4 ps. Having the same maximum level for feedback latches, therefore, this implementation has better throughput (see Tables 6.1, 6.2).
6.15 A 2-bit version of the counter in Figure 6.6 after applying technology-independent optimizations, technology mapping (using supergates), and also Algorithm 10. The green squares are added by Algorithm 10 for balancing paths and for equalizing lengths of feedback loops. Note that compared to the implementation shown in Figure 6.9, this implementation will have a better throughput (see Tables 6.1, 6.2) because the maximum level for feedback latches in it is smaller by one. This circuit has two feedback loops as follows: 7 → 4 → 1 → 3, and 8 → 6 → 4.
6.16 Waveforms showing correct functionality of the 2-bit counter in Figure 6.15. At first, since the reset signal (rst) is 1, the counter is reset and both count[0] and count[1] become 0. Then, the enable signal (en) becomes 1 and causes count[0] to get a value of 1 after four clock cycles. Note that we should apply inputs and sample outputs every four fast clock cycles. The counter counts up by 1 for every new pulse on en until it reaches 11 and then it overflows to 00 and starts over. The counter preserves its previous counted value when there is no pulse on en.
6.17 A 3-bit version of the counter in Figure 6.6 after applying technology-independent optimizations, technology mapping (using supergates), and also Algorithm 10. The green squares are added by Algorithm 10 for balancing paths and for equalizing lengths of feedback loops. Note that compared to the implementation shown in Figure 6.11, this implementation will have a better throughput (see Tables 6.1, 6.2) because the maximum level for feedback latches in it is smaller by one. This circuit has three feedback loops as follows: 3 → 1 → 2, 9 → 6 → 8, and 13 → 11 → 12.
6.18 Waveforms showing correct functionality of the 3-bit counter shown in Figure 6.17. Since the maximum level among feedback latches is five in this implementation, new inputs should be applied every five clock cycles, and every five clock cycles a new correct output will be generated. The counter counts up by 1 for every pulse on en until it reaches 111 and then it overflows to 000 and starts over. The counter preserves its previous counted value when there is no pulse on the en pin.
6.19 A 4-bit version of the counter in Figure 6.6 after applying technology-independent optimizations, technology mapping (using supergates), and also Algorithm 10. The green squares are added by Algorithm 10 for balancing paths and for equalizing lengths of feedback loops. This circuit has five feedback loops as follows: 3 → 1 → 2, 9 → 6 → 8, 13 → 11 → 12, 22 → 21 → 20 → 19, and 22 → 18 → 17 → 20 → 19.
6.20 Waveforms showing correct functionality of the 4-bit counter shown in Figure 6.19. New inputs should be applied every five clock cycles, and every five clock cycles a new correct output will be generated. The counter counts up by 1 for every pulse on en until it reaches 1111 and then it overflows to 0000 and starts over. The counter preserves its previous counted value when there is no pulse on the en pin.
6.21 Area and #JJs results for 1-4 bit counters in two cases of with (super) and without (map) using supergates during technology mapping. Area results belong to the left Y-axis, which is in logarithmic scale, and #JJs belong to the right Y-axis with normal linear scale.
6.22 Static power consumption results for 1-4 bit counters with (super) and without (map) using supergates in two technologies: RSFQ and ERSFQ. The power results for RSFQ belong to the left Y-axis while the power results for ERSFQ belong to the right Y-axis.
7.1 Eight different sub-circuits generated by Yosys for DFFs with asynchronous resets. RstLvl is the reset level, or the value that should be put on the reset pin in order to activate the reset operation. RstVal is the reset value to which the circuit will be reset.
7.2 Four different mappings that are generated by our mini synthesis process for synthesizing Yosys' sub-circuits that implement DFFs with asynchronous resets. The names under each circuit match those listed in the last column of the table shown in Figure 7.1.
7.3 Options for balanced factorization and rewriting commands in qABC.
7.4 Options for path balancing technology mapping commands (qmap and BM) in qABC.
7.5 Options for the addLr command in qABC that adds full/partial path balancing latches and performs standard retiming. In addition, this command implements our depth-bounded graph partitioning algorithm (DLGP) as well as our dual clocking method (DCM) for realizing SFQ circuits.
7.6 Options for the adj and adjPI commands in qABC that adjust fanout counts of nodes, latches, and primary inputs and guarantee limiting them to a maximum value given by the user.
7.7 Options for the CorrectLevels or cl command in qABC that assigns correct levels to gates, latches, primary inputs, and primary outputs.
7.8 Options for the BalanceFLatch or b command in qABC that balances feedback latches in an SFQ circuit with feedback loops. It can also add garbage collecting AND gates if the "-a" switch is selected.
7.9 Options for the ChangeName command in qABC that changes names of internal nodes, latches, primary inputs, and primary outputs to some standard names.
7.10 Options for the InsS command in qABC that inserts splitter trees at outputs of gates with more than one fanout.
7.11 Options for the qCut command in qABC that creates a two-stage pipeline circuit out of a given combinational circuit.
7.12 Options for the SPORTPrintStats command and its variations sprss and sprsl in qABC. This command is designed to print some statistics about the circuit.
7.13 Options for the qWrite blif or qwb command in qABC. This command is designed to write a synthesized, mapped, path balanced SFQ circuit with inserted splitters into a BLIF format.
7.14 Options for the const command in qABC. This command is designed to replace gate zeros and gate ones (ABC's implementations for constant-0 and constant-1, respectively) with primary inputs.
7.15 Options for the black command in qABC. This command is designed to handle black boxes.
7.16 Synthesized and mapped circuit for KSA4 by using the qmap command. Logic gates and primary inputs/outputs are shown by circles and triangles, respectively.
7.17 Synthesized, mapped, and fully path balanced circuit for KSA4 by using the qmap and addLr -L -R commands. Logic gates, path balancing latches, and primary inputs/outputs are shown by circles, rectangles, and triangles, respectively.
7.18 Messages printed to the command prompt in the qABC environment when synthesizing the KSA4 benchmark circuit (Example 20).
7.19 Synthesized, mapped, and fully path balanced circuit for KSA4. Logic gates and DFFs, and primary inputs/outputs are shown by circles and triangles, respectively. In this graph, the splitters are also inserted and shown. This graph is generated by using the qView.py script in the qSyn directory.
7.20 Verilog code of the circuit used in Example 21.
7.21 Test circuit with Verilog shown in Figure 7.20. (a) after applying blnc syn2 and qmap commands, (b) after applying cl -B command, (c) after applying addLr -L -R command.
7.22 Final mapped circuit for Example 21 generated using qView.
7.23 A 2-bit counter circuit generated by invoking the following commands in qABC: blnc syn2; qmap; cl -B -v;. To see the graph representation of a circuit with latches in the qABC environment, type show -s.
7.24 A 2-bit counter circuit generated by invoking the following commands: blnc syn2; qmap; cl -B -v; b -b -v; addLr -L -R;. To see the graph representation of a circuit with latches in the qABC environment, type show -s.
7.25 A 3-bit counter circuit generated by invoking the following commands in qABC: blnc syn2; qmap; cl -B -v;.
7.26 A 3-bit counter circuit generated by invoking the following commands in qABC: blnc syn2; qmap; cl -B -v; b -b -v; addLr -L -R;.
7.27 A 4-bit counter circuit generated by invoking the following commands in qABC: blnc syn2; qmap; cl -B -v;.
7.28 A 4-bit counter circuit generated by invoking the following commands in qABC: blnc syn2; qmap; cl -B -v; b -b -v; addLr -L -R;.
7.29 (a) A pipelined 4-input XOR circuit with two stages of pipeline. The rectangles are latches and the circles are logic gates. The first stage of the pipeline in this circuit consists of only one XOR gate (15) and the second stage consists of two XOR gates (16 and 17). (b) After applying the following commands: blnc syn2; qmap; adj -C 8; cl -d -B; b -b; addLr -L -R -n 0;, (c) After applying the following commands: blnc syn2; qmap; adj -C 8; cl -B; b -b; addLr -L -R -n 1;, (d) After applying the following commands: blnc syn2; qmap; adj -C 8; cl -B; b -b; addLr -L -R -n 2;.
7.30 A pipelined 4-input XOR circuit after applying the following commands: blnc syn2; qmap; adj -C 8; cl -B; b -b; addLr -L -R -i 1 -p;.
7.31 A 2-bit adder circuit after applying the following commands: blnc syn2; qmap;.
7.32 A 2-bit adder circuit after applying the following commands: blnc syn2; qmap; adj -C 8; cl -B; b -b; addLr -L -R -i 1 -g 3;.
7.33 Synthesized, mapped, and fully path balanced circuit for a 4-bit array multiplier. Logic gates and DFFs, and primary inputs/outputs are shown by circles and triangles, respectively. In this graph, the splitters are also inserted and shown. This graph is generated by using the qView.py script in the qSyn directory.

Abstract

Superconducting Single Flux Quantum (SFQ) devices have been proven to be great candidates to provide high-speed solutions in the post-silicon and post-Complementary Metal Oxide Semiconductor (CMOS) era, when the scaling of the minimum feature size is coming to an end. Also, these niobium-based devices are extremely low-power and, despite their cryo-cooling overhead, they still consume significantly less power than state-of-the-art silicon-based devices. SFQ circuits, while operating at liquid-helium temperature (4 K), have a switching delay of around 1 ps and a switching energy of about $10^{-19}$ J. Frequencies as high as 770 GHz for a T-Flip-Flop and 750 GHz for a digital frequency divider are reported in this technology.
Design and implementation of SFQ circuits have been mainly manual with limited automation, mostly by using some CMOS Computer-Aided Design (CAD) tools with minimal changes. Recently, a good amount of research has been done as part of a project funded by the U.S. government (called the SuperTools program) to develop a superconducting circuit design flow with a comprehensive set of Electronic Design Automation (EDA) and Technology Computer Aided Design (TCAD) tools for Very-Large-Scale Integration (VLSI) design of Superconducting Electronics (SCE). The work presented in this dissertation is part of that effort.

SFQ logic circuits have several unique properties and specific requirements that should be taken into account when CAD tools are being developed to automate their design flow. For logic synthesis, new technology-independent and technology mapping algorithms should be developed to improve important evaluation metrics of SFQ circuits, such as total Josephson junction count, total area, total power consumption, peak throughput, and local clock frequency, during this automation process. In this regard, we will be presenting a technology-independent optimization flow for SFQ circuits using a combination of different optimization functions, including our proposed balanced factorization and rewriting algorithms. This technology-independent flow will be followed by our novel path balancing technology mapping algorithm for mapping SFQ logic circuits into networks of logic gates. Our proposed path balancing technology mapping algorithm has three main modes: (i) minimizing the total number of path balancing D-Flip-Flops (DFFs), (ii) minimizing the total area, including the area of logic gates and path balancing DFFs, and (iii) minimizing the product of the worst stage delay and the depth of the circuit.
Next, we will present a Dual Clocking Method (DCM) for realization of SFQ circuits that removes the need for the expensive full path balancing step in the standard design and implementation process of SFQ circuits. The proposed DCM is accompanied by a new depth-bounded levelized graph partitioning algorithm that is one of the contributions of this dissertation; this graph partitioning algorithm uses dynamic programming to find the optimal parts and hence the optimal places for inserting the Non-Destructive Read Out DFFs (NDROs) into a mapped circuit, a step that is needed in the DCM.

Then, we will present theoretical bases and algorithmic support for synthesizing all different kinds of sequential SFQ circuits, including those with feedback loops. Synthesizing sequential circuits with feedback loops in the SFQ technology is a very challenging task that is done in this dissertation for the first time.

Finally, we will present qSyn, a behavioral and logic synthesis software tool written in C/C++/Python that has all of the aforementioned algorithms and methods for synthesizing SFQ circuits. qSyn uses two open source tools, namely ABC, UC Berkeley's logic synthesis and verification tool, and Yosys, a Verilog parser and behavioral synthesis tool.

Related Publications:
1. Ghasem Pasandi, Alireza Shafaei, and Massoud Pedram, "SFQmap: A Technology Mapping Tool for Single Flux Quantum Logic Circuits," in Proceedings of the 2018 IEEE International Symposium on Circuits and Systems (ISCAS), Florence, Italy, 27-30 May 2018.
2. Ghasem Pasandi and Massoud Pedram, "A Graph Partitioning Algorithm with Application in Synthesizing Single Flux Quantum Logic Circuits," arXiv preprint arXiv:1810.00134, 2018.
3. Ghasem Pasandi and Massoud Pedram, "Balanced Factorization and Rewriting Algorithms for Synthesizing Single Flux Quantum Logic Circuits," in Proceedings of the 2019 on Great Lakes Symposium on VLSI (GLSVLSI), pages 183-188, Washington D.C., May 2019.
4. Ghasem Pasandi and Massoud Pedram, "A Dynamic Programming-Based, Path Balancing Technology Mapping Algorithm Targeting Area Minimization," in IEEE/ACM International Conference on Computer Aided Design (ICCAD), Nov. 4-7, Westminster, CO, 2019.
5. Adam Holmes, Mohammad Reza Jokar, Ghasem Pasandi, Yongshan Ding, Massoud Pedram, and Frederic T. Chong, "NISQ+: Boosting quantum computing power by approximating quantum error correction," in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA).
6. Ghasem Pasandi and Massoud Pedram, "PBMap: A Path Balancing Technology Mapping Algorithm for Single Flux Quantum Logic Circuits," in IEEE Transactions on Applied Superconductivity, 29(4):1-14, 2018.
7. Ghasem Pasandi and Massoud Pedram, "An Efficient Pipelined Architecture for Superconducting Single Flux Quantum Logic Circuits Utilizing Dual Clocks," in IEEE Transactions on Applied Superconductivity, 30(2):1-12, 2019.
8. Ghasem Pasandi and Massoud Pedram, "Depth-Bounded Graph Partitioning Algorithm and Dual Clocking Method for Realization of Superconducting SFQ Circuits," to appear in ACM Journal on Emerging Technologies in Computing Systems (JETC), 2020.
9. Ghasem Pasandi and Massoud Pedram, "Synthesizing Sequential Circuits in Superconducting Single Flux Quantum Technology," under submission.
10. Ghasem Pasandi and Massoud Pedram, "An Efficient Pipelined Architecture for Superconducting Single Flux Quantum Logic Circuits Utilizing Dual Clocks," a pending US patent.

Chapter 1
Introduction

Advances in semiconductor manufacturing technology have provided a decades-long decrease in the minimum feature size of transistors and an increase in their switching speed [108]. However, in spite of the supporting voltage scaling, power densities on chips have been increasing, resulting in a practical upper limit of 4 GHz or so for the clock frequency of processors (a limit that was reached in the mid-2000s).
A related phenomenon is the dark silicon problem [36], which simply states that significant portions of a chip cannot be powered up at the same time due to power delivery and heat dissipation concerns. In addition, Moore's law is coming to an end because transistors are reaching their physical scaling limits, below which the classical principles that dictate their operation cease to be valid.

In addition to performance issues, the widely used standard CMOS technology is not as energy-efficient as will be needed in the near future. For example, a standard CMOS switch consumes around $1.6 \times 10^{-16}$ J of energy per switching event. A considerable portion of this energy consumption is due to charging interconnect capacitances. With the help of advanced interconnection technologies such as 3D interconnects, the energy consumption for switching can be reduced to around $10^{-17}$ J [111]. Even with low-power and energy-efficient device, circuit, and architectural methodologies such as SOI FinFETs [72], power gating [1, 126], hardware multi-threading/virtualization [8], power-aware software [33, 124, 139], and near-threshold computing [27, 34, 118], the power reduction will not be as large as is needed in 2027 [135]. For example, a typical System-on-Chip (SOC) will consume more than 10 W of power in 2027 [135], and an exascale supercomputer will require up to 200 MW of power in optimistic scenarios, which calls for a small power plant to meet this demand [111]. Therefore, to keep up with the ever-increasing demand for energy-efficient and high-speed electronics, new materials, devices, circuit fabrics, and architectures are needed.

1.1 Superconducting SFQ Technology

Superconducting Single Flux Quantum (SFQ) technology, with a combination of fast switching (1 ps) and high energy efficiency (switching energy consumption of $10^{-19}$ J, that is, 100× lower than CMOS), offers a promising alternative to CMOS technology [152].
Moreover, Josephson junction integrated digital circuits offer key features which make them uniquely suitable for high-speed processing of digital information:
(i) Availability of superconducting microstrip transmission lines capable of transferring picosecond waveforms over virtually any interchip distances with a speed approaching half of that of light, and with low attenuation and dispersion.
(ii) Availability of Josephson junctions which can serve as picosecond two-terminal devices. Moreover, these junctions can be impedance-matched with the superconducting microstrip lines, ensuring the ballistic transfer of generated waveforms along lines (more precisely, information between logic devices is passed ballistically along either passive microstrip lines or active Josephson transmission lines in the form of picosecond quantized voltage pulses with a fixed magnetic flux of 2.07 mV·ps, or alternatively stated, 2.07 µA·nH).
(iii) Even at these low impedances (e.g., 10 Ohms), the Josephson junction's static power consumption $P = V^2/R$ is 400 nW at a bias voltage of 2 mV (alternatively, with a bias current of 100 µA for the junction and a bias voltage of 2 mV, the static power dissipation per junction is 200 nW). As a result, chips with Josephson-junction integrated circuits generate little heat and can thus be packed very closely.

The first version of RSFQ logic relied on ohmic resistors for interconnection of Josephson Junctions (JJs). Later on, these resistors were replaced with JJs, resulting in improved parameter margins and an increase in operation speed from 30 GHz to 770 GHz for a T-Flip-Flop (TFF) [28, 86, 94]. For more complicated SFQ circuits, a 20 GHz asynchronous arithmetic unit [44], a 16-bit wave-pipelined sparse-tree RSFQ adder with a peak processing rate of 38.5 GHz [32], and an 8-bit ERSFQ Aligned-Front (AF) adder [83] with a tested clock frequency of up to 27 GHz are reported.
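The flux-quantum units and static-power figures quoted in items (ii) and (iii) above can be checked with a few lines of arithmetic. The following Python sketch is not part of the dissertation; the function and variable names are illustrative assumptions, and the physical constants are the exact SI values of the Planck constant and the elementary charge.

```python
# Sanity check of the numbers quoted in the text (illustrative sketch only).

PLANCK_H = 6.62607015e-34            # Planck constant in J*s (exact SI value)
ELEMENTARY_CHARGE = 1.602176634e-19  # elementary charge in C (exact SI value)


def flux_quantum_wb():
    """Magnetic flux quantum h/2e in webers (V*s)."""
    return PLANCK_H / (2 * ELEMENTARY_CHARGE)


def static_power_from_resistance(v_bias, r_ohm):
    """Static power P = V^2 / R in watts."""
    return v_bias ** 2 / r_ohm


def static_power_from_current(i_bias, v_bias):
    """Static power P = I * V in watts."""
    return i_bias * v_bias


phi0 = flux_quantum_wb()
phi0_mV_ps = phi0 / (1e-3 * 1e-12)  # expressed in mV*ps
phi0_uA_nH = phi0 / (1e-6 * 1e-9)   # expressed in uA*nH (same 1e-15 scale)

# 10 Ohm junction impedance at a 2 mV bias voltage
p_resistive_nW = static_power_from_resistance(2e-3, 10.0) / 1e-9
# 100 uA bias current at a 2 mV bias voltage
p_bias_nW = static_power_from_current(100e-6, 2e-3) / 1e-9

print(f"flux quantum: {phi0_mV_ps:.2f} mV*ps = {phi0_uA_nH:.2f} uA*nH")
print(f"V^2/R static power: {p_resistive_nW:.0f} nW")
print(f"I*V static power:   {p_bias_nW:.0f} nW")
```

Both unit products (mV·ps and µA·nH) equal $10^{-15}$ Wb, which is why the same numerical value of about 2.07 appears in either form.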
Design tools for superconducting SFQ, or more generally superconducting electronics (SCE), play an important role in the success of these devices. The first study on these design tools was published in 1990 [50], and the latest study on the status and roadmap of these tools was published in January 2018 [46]. C. Fourie [46] argued that significant improvements on the current best design tools are needed in different steps of the superconducting electronics design flow, starting from Technology Computer Aided Design (TCAD) and compact SPICE model extraction tools, all the way to HDL model generation tools, logic synthesizers and simulators, and static timing analysis tools. In this regard, there have been a few ongoing research efforts to improve the state-of-the-art tools for superconducting electronics, including: a depth minimization with path balancing algorithm for minimizing the depth and path balancing overhead during technology mapping [123, 125], clocking techniques for SFQ circuits [144], margin and yield calculation [134], SFQ-specific placement and routing [133], and SFQ library cell design [77]. This dissertation reports some of the most novel algorithms and tools designed to automate the synthesis process of superconducting SFQ devices.

1.2 Dissertation Contributions

Our contributions span different steps of the synthesis process for SFQ circuits with a focus on the logic synthesis part: balanced factorization and rewriting algorithms for the technology-independent phase of the logic synthesis, path balancing technology mapping for minimizing overheads of path balancing and also minimizing area, a dual clocking method for realizing SFQ circuits to eliminate the expensive full path balancing step, and handling feedback loops in sequential SFQ circuits.
Also, since this dissertation is one of the first concerned with the synthesis process of SFQ circuits, our initial effort was to identify important properties and requirements of these devices at the logic, Register Transfer (RT), and circuit levels, in order to incorporate them into the algorithms designed for synthesizing these circuits and to make sure that the circuits function correctly. These specific logic, RT, and circuit level requirements of SFQ circuits are discussed in more detail in Chapter 2 of this dissertation. In the following, we give a bit more detail on our algorithms designed for synthesizing and mapping SFQ circuits.

• Balanced factorization and rewriting algorithms: One important property of SFQ circuits, which played an important role in the design of almost all of the algorithms that we designed for synthesizing these circuits, is that the final mapped circuit should be fully path balanced. This means that in the final graph representing a mapped SFQ circuit, the length (in terms of the clocked element count) of any path from any primary input to any primary output should be the same. If this balancing requirement is not satisfied by nature, we have to insert some path balancing D-Flip-Flops (DFFs) to correct the length of shorter paths and bring them up to the maximum length. The balanced factorization and rewriting algorithms are designed to transform the graph that represents an SFQ circuit into one with a more balanced structure. This helps the next step, technology mapping, to provide technology mapped solutions with less path balancing overhead. In other words, the balanced factorization and rewriting algorithms try to change the playground for the technology mapper into a better one.
• Path balancing technology mapping: We take into account the cost of path balancing in the optimizations that are done during the technology mapping process of the logic synthesis.
We propose dynamic programming-based algorithms for generating mapping solutions for SFQ circuits that minimize the aforementioned cost. Our algorithms are able to generate mapping solutions that require the minimum number of path balancing DFFs and also the minimum total area. We have also designed a mapper that minimizes the product of the stage delay and the length of the longest path, an estimated metric for the post-synthesis performance of an SFQ circuit.
• Dual clocking method for realizing SFQ circuits: Although the previous two methods provide significant reductions in the overheads of full path balancing compared with baselines, these overheads (area, node count, etc.) are still high and exceed those of the original logic gates in the circuit. For the first time, we present a novel dual clocking method (DCM) for realizing SFQ circuits that completely removes the need for performing full path balancing, hence offering significant reductions in total Josephson junction count, path balancing DFF count, total area, and many other important metrics for SFQ circuits. DCM makes use of a micro (fast) clock and a macro (slow) clock and, with the help of Non-Destructive Read Out (NDRO) DFFs, repeats each set of inputs by a factor (called the imbalance factor) that is related to the structure of the circuit. Input repetition makes it necessary to add some garbage collecting gates to sample the valid outputs at the right time. Finally, we present an optimal graph partitioning algorithm for realizing parts and hence individual blocks in a given combinational circuit. DCM has the drawback of degrading the peak throughput of the circuit, which we fix by introducing and performing partial path balancing.
• Synthesizing sequential circuits in the SFQ technology: We provide full algorithmic and tool support for synthesizing all different kinds of sequential circuits in the SFQ technology; this is done for the first time in this dissertation.
Synthesizing sequential SFQ circuits requires handling feedback loops, which in turn needs level assignment for the nodes of a directed cyclic graph. For this purpose, we first give a solid definition for the level of a node in a directed cyclic graph; the old definitions for the level of a node work only for directed acyclic graphs. Then, we propose an optimal algorithm for level assignment of nodes in directed cyclic graphs. This builds the foundation for presenting algorithms that synthesize sequential SFQ circuits with feedback loops and for verifying their correct operation in exhaustive simulations.

We have also designed a behavioral and logic synthesis tool for SFQ circuits called qSyn; all of the aforementioned algorithms are implemented and available in qSyn. A chapter of this dissertation is devoted to explaining qSyn.

1.3 Dissertation Organization

The rest of this dissertation is organized as follows: Chapter 2 gives background knowledge on SFQ devices and their specific logic, RT, and circuit level properties and requirements. It also provides some preliminary knowledge on the technology-independent optimizations of logic synthesis, on technology mapping, and on graph partitioning. Finally, this chapter presents the state-of-the-art technology mapping flow. Chapter 3 presents our balanced factorization and rewriting algorithms for synthesizing SFQ logic circuits. Chapter 4 gives details of our path balancing technology mappers. Chapter 5 discusses the dual clocking method for realizing SFQ circuits and provides details on our depth-bounded levelized graph partitioning algorithm. Chapter 6 discusses the challenges of synthesizing sequential SFQ circuits, especially those with feedback loops, and presents our methods and algorithms to tackle these challenging problems. Chapter 7 gives details on our SFQ specific synthesis tool, qSyn, and provides several examples on how to use the tool and what outcomes to expect after running it.
Finally, Chapter 8 brings concluding remarks and future work, and it is followed by a list of acronyms used in this dissertation. The reference list is included at the end of the dissertation.

Chapter 2 Preliminaries and background

This chapter first provides background knowledge on SFQ logic families and gives the specific gate/circuit level requirements of these circuits. It then covers basic terminology and concepts in logic synthesis and graph partitioning which are used in the rest of this dissertation. It also summarizes the flow of technology mapping in state-of-the-art logic synthesis tools.

2.1 Background on SFQ

In SFQ logic, a single quantum of magnetic flux ($\Phi_0 = h/2e = 2.07$ mV·ps) is used for the representation of logic bits. In this representation, the presence of a pulse means "logic-1", while the absence of a pulse is considered a "logic-0". Operation of SFQ logic is based on overdamped Josephson junctions, and hence it does not experience the problem of a hysteretic I-V characteristic, which degrades the operation speed of "1" to "0" switching. Figure 2.1 shows the I-V Characteristic (IVC) and the pulse shape representation of data in an SFQ buffer. Figure 2.2 shows the circuit diagram of an SFQ NOT gate and waveforms demonstrating its correct functionality. As seen, after the arrival of a clock pulse, if there is no input pulse (which means "logic-0"), a pulse is generated at the output of the gate, representing a "logic-1". On the other hand, when there is an input pulse, no pulse is generated at the output, meaning a "logic-0".

Figure 2.1: Schematic of an SFQ buffer, its IVC and the pulse shape representation of data in SFQ (waveforms are adopted from [53]).

SFQ logic families are divided into two groups: ac-biased and dc-biased.
Adiabatic Quantum Flux Parametron (AQFP) [145, 146] and Reciprocal Quantum Logic (RQL) [63] are examples of ac-biased logic families, and RSFQ [95] is an example of a dc-biased logic family. The first version of the new SFQ logic relied on ohmic resistors for the interconnection of JJs, and hence is called Resistive Single Flux Quantum logic [94]. Using this logic, an operation speed of up to 30 GHz was reported, which was considerably higher than that of any other digital device of the same complexity at that time [86]. Later on, another version was proposed that uses JJs instead of ohmic resistors. This version is called Rapid SFQ (RSFQ) [95]. It improved the parameter margins of the first version and also increased its operation speed to 300 GHz [110]. In the following, we explain some properties and key circuit/gate level requirements of SFQ logic circuits.

2.1.1 Fanout in SFQ

In SFQ logic (RSFQ/ERSFQ/eSFQ), if a gate needs to have more than one fanout, a special SFQ gate called a splitter should be added to the output of this gate. A splitter is an asynchronous gate that accepts an SFQ pulse and produces two similar output pulses after its intrinsic delay. One splitter can produce only two fanouts. For additional fanouts, more splitters should be added in a binary tree structure (balanced and/or linear). To have n fanouts, n-1 splitters are needed. Figure 2.3 shows the circuit-level schematic of a splitter gate, its operating waveforms, and two examples of splitter binary trees for providing four fanouts (FO4). Please note that for AQFP, splitters are clocked buffers that can have 1-to-2, 1-to-3 and even 1-to-4 fanouts [113,155].

Figure 2.2: Schematic of an SFQ NOT gate and its waveforms. In case of not having any input pulses, an output pulse is generated after arrival of the clock pulse, representing a "logic-1". However, when there is an input pulse, no pulses are generated at the output, meaning a "logic-0".
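The splitter arithmetic above can be checked with a short sketch. The function names below are hypothetical and not part of the dissertation or of any tool; depth is counted in splitters on the path from the driving gate to an output, and the linear structure is assumed to be a simple chain in which every splitter but the last feeds one output and the next splitter.

```python
def splitter_count(fanouts: int) -> int:
    """Number of 1-to-2 splitters needed to drive `fanouts` loads (n - 1)."""
    if fanouts < 1:
        raise ValueError("a gate drives at least one fanout")
    return fanouts - 1


def balanced_tree_depth(fanouts: int) -> int:
    """Worst-case depth of a balanced binary splitter tree: ceil(log2(n))."""
    return (fanouts - 1).bit_length()


def linear_tree_depths(fanouts: int) -> list[int]:
    """Depth seen by each output of a linear (chain) splitter tree.

    Output 1 passes one splitter, output 2 passes two, and the last two
    outputs both hang off the final splitter of the chain.
    """
    return [min(i + 1, fanouts - 1) for i in range(fanouts)]


if __name__ == "__main__":
    # FO4: three splitters; balanced depth 2, linear worst-case depth 3.
    print(splitter_count(4), balanced_tree_depth(4), linear_tree_depths(4))
```

For four fanouts this gives three splitters in both structures, a uniform depth of two for the balanced tree, and a depth of only one for the first output of the linear chain, which is why the linear structure can be preferable when the critical path runs through that output.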
2.1.2 Gate-level pipeline

Unlike CMOS gates, most gates in SFQ logic receive a clock signal. There are three main methods for clock distribution in SFQ circuits: (i) counter-flow clocking, where the clock flows in the opposite direction of the data, (ii) concurrent-flow clocking, in which the clock and data flow in the same direction, and (iii) clock-follow-data, in which the clock arrives at a gate after its inputs have arrived and have been processed by the gate. For more information on clock distribution, please see [48,142,144].

Figure 2.3: (a) A splitter gate in SFQ technology, (b) waveforms corresponding to the operation of this gate, and a splitter tree providing four fanouts with depth (c) 2, and (d) 3. Even though the balanced structure in (c) is usually preferred, there can be cases in which the linear structure works better: the delay for generating out1 in (d) is lower than in (c). Thus, the structure of (d) is better in networks where the critical path goes through out1.

2.1.3 Path Balancing

For an SFQ gate to operate correctly, all of its fanin gates should have the same logic level (see footnote 1). If there is a difference among the logic levels of the fanins of a gate, some D-Flip-Flops (DFFs) should be inserted at the outputs of the fanin gates with smaller logic levels [119]. For example, if the first fanin (in1) of an AND2 gate has a logic level of three and the second fanin (in2) has a logic level of four, one DFF should be added to the output of in1. Without path balancing, correct pulses on in1 will be consumed by this AND2 gate one clock before the arrival of the corresponding pulses on the second input, and hence this gate will not be able to produce correct output values.
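A minimal sketch of this fanin-level rule follows. It assumes a combinational network given as a dictionary from each node to its list of fanins (a representation chosen here for illustration, not taken from the dissertation), places primary inputs at level 0, ignores splitters, and does not equalize primary outputs, so it covers only the per-gate part of full path balancing.

```python
def logic_levels(fanins: dict[str, list[str]]) -> dict[str, int]:
    """Longest path, in gate count, from any primary input to each node."""
    levels: dict[str, int] = {}

    def level(node: str) -> int:
        if node not in levels:
            ins = fanins.get(node, [])
            # Nodes without fanins are treated as primary inputs (level 0).
            levels[node] = 0 if not ins else 1 + max(level(i) for i in ins)
        return levels[node]

    for node in fanins:
        level(node)
    return levels


def balancing_dffs(fanins: dict[str, list[str]]) -> dict[tuple[str, str], int]:
    """DFFs to insert on each edge so all fanins of a gate share one level."""
    levels = logic_levels(fanins)
    dffs = {}
    for gate, ins in fanins.items():
        for src in ins:
            gap = levels[gate] - 1 - levels[src]
            if gap > 0:
                dffs[(src, gate)] = gap
    return dffs


# The AND2 case from the text: in1 at level three, in2 at level four.
network = {
    "p": [], "q": [],
    "x1": ["p"], "x2": ["x1"], "x3": ["x2"],              # in1 = x3
    "y1": ["q"], "y2": ["y1"], "y3": ["y2"], "y4": ["y3"],  # in2 = y4
    "g": ["x3", "y4"],                                      # the AND2 gate
}
print(balancing_dffs(network))  # {('x3', 'g'): 1}
```

The single entry in the result corresponds to the one DFF that the text places at the output of in1.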
Footnote 1: The logic level of a gate $g_i$ in a network N is the length of the longest path, in terms of the gate count, from any primary input of N to $g_i$.

Figure 2.4: Gate-level schematic of Example 1, showing the necessity of path balancing for correct operation in SFQ circuits. (a) Without path balancing the output is x0000...00 (error). (b) With a path balancing DFF on in1 the output is x1010...10 (correct).

Example 1. Suppose that there is a digital signal $a_i$ = 1010...10 and we want to AND it with the inverse of another digital signal $b_i$ = 0101...01. The correct output is x1010...10. The first bit in the output is not valid because in the first clock cycle the second input of the AND gate (in2) is unknown. Without path balancing, the values generated at the output of the AND gate will be x0000...00, which is not correct (Figure 2.4a). The error occurs because the signals on in2 are one level behind the signals on in1. By inserting a path balancing DFF on in1, all fanins of the AND gate will have the same logic level, and hence the circuit will be path balanced. In the path balanced circuit, as shown in Figure 2.4b, the correct sequence of bits is generated at the final output (out).

Input pulses to an SFQ gate can be modeled by "tokens" that must arrive within the same clock period and should be consumed by the clock pulse that arrives at the end of this period. Path balancing guarantees the correct arrival and consumption of these tokens.

2.1.4 Interconnects in SFQ

In SFQ circuits, there are two methods for the transmission of signals among gates: using a Josephson Transmission Line (JTL), and using a Passive Transmission Line (PTL). Most SFQ circuits use JTLs for transmitting signals among the gates, mainly because they can transfer the SFQ pulses without any distortion [60,127]. Figure 2.5 shows the circuit-level schematic of a JTL.
This circuit can also provide current and power gains (amplification). For this purpose, the critical current of the JJs should grow in the propagation direction ($I_{c1} < I_{c2} < I_{c3} < \dots$), and the inductance values should decrease ($L_1 > L_2 > L_3 > \dots$) in that direction [95]. PTLs are similar to microstrip and strip lines. Each PTL requires one transmitter and one receiver gate. Generally, JTLs are used for short interconnects and PTLs for long interconnects [60].

Figure 2.5: Josephson Transmission Line (JTL).

2.2 Background on logic synthesis

Logic synthesis is divided into technology-independent and technology-dependent (technology mapping) phases. In the first phase, several transformations are performed to reduce the number of literals in the final form of a given Boolean expression. In the second phase, which is the final stage of logic synthesis, suitable gates from a given library are assigned to nodes of the network in order to satisfy some constraints or to minimize some cost functions. Before technology mapping, the Boolean network is transformed into an And Inverter Graph (AIG). This step is called technology decomposition, and the resulting network is called the subject graph. In the following, several concepts and terms that are used throughout this dissertation are first defined and explained. Then, an overview of the flow of state-of-the-art technology mapping is presented.

2.2.1 Terminology and definition of required concepts

A k-feasible cone at node v of a network $N = (V, E)$, denoted by $C_v$, is defined as a sub-graph containing v and some of its predecessors that satisfies two conditions: (i) the number of inputs of this sub-graph is less than or equal to k, and (ii) every path connecting a node in $C_v$ to v lies entirely in $C_v$.
A cut $C = (X, X')$ with source s and sink t in a given network N is defined as a partition of the set V into two sets X and $X' = V - X$ such that $s \in X$ and $t \in X'$. $C = (X, X')$ is a trivial cut if the set X has only one member (the source s). The node cut-size of a cut $C = (X, X')$, denoted by $n(X, X')$ or $n(C)$, is defined as the number of boundary nodes in $X'$ which are adjacent to some node in X. These boundary nodes are called the leaf nodes of the cut. A cut $C = (X, X')$ is called k-feasible if its node cut-size is at most k (i.e. $n(C) \le k$). A k-feasible cut of a node v is defined as a valid k-feasible cut in which node v is the source node of the cut and the sink node is a Primary Input (PI).

A fanin (fanout) cone of a node v in a network $N = (V, E)$ is defined as the set of nodes in V that can be reached through the fanin (fanout) edges of v. The Maximum Fanout Free Cone (MFFC) of a node v is a subset of its fanin cone in which any path from a node in this subset to any Primary Output (PO) of the network goes through v. During optimizations, if a node is removed, all nodes in its MFFC can be removed as well.

A binary tree is a tree in which each non-leaf node has either one child or two children. A full or saturated binary tree with height (or depth) H is a binary tree which contains $2^H - 1$ nodes. A binary tree with height H but with fewer nodes is called a general binary tree.

A literal is a variable or its negation, e.g. $x$, $x'$. A cube is the AND (conjunction) of a set of literals. An algebraic expression $F = \{C_i\}$, which consists of a set of cubes, is an expression in which no cube contains another one, i.e., $C_i \not\subseteq C_j$ for all $i \ne j$. An expression that is not algebraic is called Boolean. The support of an expression F is the set of variables that F explicitly depends on.
The product of two cubes $C_i$ and $C_j$ is a cube defined by the following:

$$C_i C_j = \begin{cases} \emptyset & \text{if } \exists x:\ x \in C_i \cup C_j \ \text{and}\ x' \in C_i \cup C_j \\ C_i \cup C_j & \text{otherwise} \end{cases}$$

The product of two Sum-Of-Product (SOP) expressions F and G is an SOP expression denoted by FG, defined by $FG = \{C_i C_j \mid C_i \in F,\ C_j \in G,\ C_i C_j \ne \emptyset\}$. The product FG is an algebraic product if F and G are algebraic expressions and have disjoint variable support. The sum of two SOP expressions F and G, denoted by $F + G$, is a set defined by $F + G = \{C_i \mid C_i \in F \ \text{or}\ C_i \in G\}$. The sum $F + G$ is an algebraic sum if F and G are algebraic expressions and no cube in F contains a cube in G and vice versa.

A factored form is defined recursively as follows: a factored form is either a product or a sum, where a product (sum) is either a single literal or a product (sum) of factored forms. A factored form can also be defined as a parenthesized algebraic expression. For example, ab+c(a+b), a, and acd' are factored forms, but b(c+d)' is not a factored form.

2.2.2 State-of-the-art technology mapping flow

In the technology mapping flow of state-of-the-art mappers, as explained in [106], the k-feasible cuts and the cuts' functions in terms of their inputs are first computed for each node of the given network. Next, in a topological ordering traversal, and by using Boolean matching [99], the best matches and their best implementations using the pre-generated supergates [106] are extracted. At the end, the best cover for the given network is generated in a reverse topological ordering traversal. This approach is followed by ABC [38] as well. In the following, each of these steps is explained in more detail.

2.2.2.1 k-feasible cuts

k-feasible cuts for nodes in the given network are computed in a way similar to [29]. For each node, a trivial cut consisting of the node itself is added to the set of cuts. With this trivial cut and the existence of (N)AND and inverter gates in the library, the feasibility of finding a mapping solution for any given network is guaranteed.
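As an illustration of this step, the sketch below enumerates cuts bottom-up in the commonly used leaf-set form: the cuts of a node are its trivial cut plus every union of one cut per fanin whose leaf count does not exceed k. Whether this matches the procedure of [29] in every detail is an assumption, the network representation (a dictionary from node to fanin list) is hypothetical, and no dominated-cut filtering is done.

```python
from itertools import product


def enumerate_cuts(fanins: dict[str, list[str]], k: int) -> dict[str, set[frozenset]]:
    """Map each node to its set of k-feasible cuts, each cut a set of leaves."""
    cuts: dict[str, set[frozenset]] = {}

    def visit(node: str) -> set[frozenset]:
        if node in cuts:
            return cuts[node]
        result = {frozenset([node])}          # trivial cut: the node itself
        ins = fanins.get(node, [])
        if ins:
            # Combine one cut from every fanin and keep the k-feasible unions.
            for combo in product(*(visit(i) for i in ins)):
                leaves = frozenset().union(*combo)
                if len(leaves) <= k:
                    result.add(leaves)
        cuts[node] = result
        return result

    for node in fanins:
        visit(node)
    return cuts


# Small AND network: n1 = a & b, n2 = b & c, n3 = n1 & n2.
aig = {"a": [], "b": [], "c": [], "n1": ["a", "b"], "n2": ["b", "c"], "n3": ["n1", "n2"]}
for cut in sorted(enumerate_cuts(aig, k=4)["n3"], key=sorted):
    print(sorted(cut))
```

With k = 4, node n3 gets five cuts: its trivial cut, {n1, n2}, {n1, b, c}, {a, b, n2}, and {a, b, c}. Lowering k to 2 leaves only the first two.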
2.2.2.2 Computing Cut's Function

For all computed k-feasible cuts, except trivial cuts, the cut's function (sometimes called the truth table) is computed. The function of a trivial cut is the same as the Boolean expression of the source node of this cut. The function of a non-trivial cut is computed by assigning variables to the inputs of the cut. Using these variables, the truth table of the cut is computed by performing some simulations. One round of simulation includes the propagation of a set of inputs through the network [162]. Bit-parallel methods such as the one presented in [10] are used to increase the speed of simulations. Next, the function of a cut is stored inside a field in the data structure of this cut. Since usually 4- to 6-feasible cuts are considered, a variable with a length of 16 to 64 bits is enough for storing the function of a cut.

2.2.2.3 Supergates

A supergate is a small single-output combinational network composed of the original gates in the given library. Supergates are generated by exhaustively concatenating the original gates in the library. This is done as a pre-processing step after reading the library and before performing the technology mapping. The generation of a supergate is controlled by the following factors: the number of inputs of the supergate, the total run-time for generating all supergates, the area of the supergate, the critical path delay of the supergate, and the depth of the supergate. Other than addressing the structural bias problem [106] by looking deeper into the network, using supergates makes the cut-enumeration-based method for library-based technology mapping more reasonable by providing implementation choices for more cuts.

Table 2.1: Average hit rate for 20 ISCAS benchmark circuits in the standard cut-enumeration-based technology mapping approach [38].
         Supergate Level
  k      L = 1    L = 2    L = 3
  5      0.056    0.109    0.118
  6      0.047    0.057    0.067

2.2.2.4 Boolean Matching

In the state-of-the-art technology mapping flow, Boolean matching [11, 99] is used to identify whether the Boolean function of a cut can be implemented using the generated supergates. A hash table of the functions of the supergates is produced, and during the mapping phase the function of a cut is looked up in this hash table in constant time.

2.2.2.5 Best Matches and Best Covers

In a topological ordering traversal, the best matches for both phases (positive and negative) of each node for its best cut are computed, and a supergate that gives the least arrival time for that cut is selected. After finishing this traversal and reaching the POs, in another traversal in reverse order, the best supergates for implementing the functions of the gates connected to the POs are selected. Next, the best gates implementing the inputs of those supergates are chosen, and so on. After the PIs of the network are visited, a mapping solution for the entire network has been generated.

Cut-enumeration-based technology mapping using k-feasible cuts is best suited to LUT-based FPGA technology mapping as in [29,30]. This is because for any computed k-feasible cut, there is a k-LUT that can implement the function of this cut. However, for library-based technology mapping, theoretically, most of the time there will not be any gate in the library that implements the function of a cut. For example, with 20 gates in the original library, and using supergates up to level 3, there will be around 4000 supergates in the supergate library. Thus, the probability of having a supergate that implements the function of a cut in the set of k-feasible cuts for k = 5 and k = 6 will be $4000 / 2^{2^5} \approx 10^{-6}$ and $4000 / 2^{2^6} \approx 10^{-15}$, respectively. These calculations use the fact that there are at most $2^{2^k}$ different k-input functions. In this dissertation, the aforementioned probability is called the hit rate (see footnote 2).
Fortunately, for most practical circuits, the number of distinct cut functions is much smaller than the upper bound of $2^{2^k}$. This results in much better values for the hit rate. Table 2.1 shows the average hit rate of 20 ISCAS [59] benchmark circuits for two values of k and three levels of supergates. As seen, the practical hit rate, especially for L = 3, is much better than the aforementioned theoretical worst case value. Therefore, it is reasonable to use k-feasible cuts together with supergates for library-based technology mapping. In our developed technology mapping tool, the state-of-the-art flow is followed.

Footnote 2: The ratio of the total number of cuts that have at least one supergate in the supergate library capable of implementing their function to the number of cuts without any supergate that can implement their function.

2.3 Background on graph partitioning

Given a graph $G = (V, E)$ with non-negative edge weights $w : E \to \mathbb{R}^+$ and a size $s(v)$ for each vertex $v \in V$, the graph partitioning problem (GPP) is defined as dividing the set V into subsets $V_1, V_2, \dots, V_k$ such that Eqs. 2.1 and 2.2 hold and an objective function (described below) is minimized.

$$V_1 \cup V_2 \cup \dots \cup V_k = V \quad (2.1)$$

$$V_i \cap V_j = \emptyset \quad \forall i \ne j \quad (2.2)$$

The above problem is called the k-way partitioning problem. The size of part i is denoted by $|V_i|$ and is defined by $|V_i| = \sum_{v \in V_i} s(v)$. The bounded-size GPP is defined as a GPP in which the size of the i-th part is bounded by $B_i$ ($|V_i| \le B_i$). A special case is balanced partitioning, where the sizes of all parts should be equal up to a correction factor. More precisely, Eq. 2.3 should hold:

$$\forall i \in \{1, 2, \dots, k\}: \quad |V_i| \le \frac{1 + \epsilon}{k}\,|V| \quad (2.3)$$

This problem is commonly denoted as a $(k, 1 + \epsilon)$-balanced partitioning problem. If $\epsilon = 0$, the problem is called perfect partitioning, and the special case of k = 2 and $\epsilon = 0$ is called minimum bisection.
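The constraints of Eqs. 2.1 to 2.3 can be checked mechanically, as in the following sketch. It assumes that the balance bound of Eq. 2.3 reads $|V_i| \le (1 + \epsilon)|V|/k$, with $\epsilon$ as the name used here for the tolerance and $|V|$ taken as the total vertex size; the function names and the unit-size example are hypothetical.

```python
def is_partition(vertices: set, parts: list[set]) -> bool:
    """Eq. 2.1 (parts cover V) and Eq. 2.2 (parts are pairwise disjoint)."""
    covered = set().union(*parts) if parts else set()
    disjoint = sum(len(p) for p in parts) == len(covered)
    return covered == vertices and disjoint


def part_size(part: set, size: dict) -> float:
    """|V_i| as the sum of the vertex sizes s(v) in the part."""
    return sum(size[v] for v in part)


def is_balanced(parts: list[set], size: dict, eps: float) -> bool:
    """Eq. 2.3 under the assumed form |V_i| <= (1 + eps) * |V| / k."""
    total = sum(size.values())
    bound = (1 + eps) * total / len(parts)
    return all(part_size(p, size) <= bound for p in parts)


# Six unit-size vertices split two ways.
V = {1, 2, 3, 4, 5, 6}
s = {v: 1 for v in V}
even = [{1, 2, 3}, {4, 5, 6}]
skewed = [{1, 2, 3, 4}, {5, 6}]

print(is_partition(V, even), is_balanced(even, s, eps=0.0))      # True True
print(is_partition(V, skewed), is_balanced(skewed, s, eps=0.0))  # True False
print(is_balanced(skewed, s, eps=0.5))                           # True
```

The skewed split is a valid partition but fails perfect balance; it becomes acceptable once the tolerance allows parts of size up to 4.5.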
As mentioned before, in GPP an objective function should be minimized; the most commonly used objective function for GPP is the total cut size, which is defined by the following equation:

$$\sum_{e \in C} w(e), \qquad C = \{ e = \langle u, v \rangle \in E : u \in V_i,\ v \in V_j,\ i \ne j \} \quad (2.4)$$

In this case, the problem is called min-cut GPP. Here is a more general definition of min-cut GPP.

Min-cut k-way partitioning: Given a lower bound L and an upper bound U on the size of any part, the problem of min-cut k-way partitioning is defined as a GPP in which the objective is to minimize the cost function in Eq. 2.4, subject to $L \le |V_i| \le U$ for $1 \le i \le k$. In the special case of k = 2, the defined GPP is called bipartitioning, and min-cut k-way partitioning is called min-cut bipartitioning.

A hyper-edge is an edge which can connect more than two vertices. A hyper-graph $H = (N, L)$ is a generalization of a graph which consists of a set N of vertices and a set L of hyper-edges, in which each hyper-edge corresponds to a subset of distinct vertices $N_i \subseteq N$ with the condition $|N_i| \ge 2$. In a given partitioning assignment, a hyper-edge is cut if its vertices are located in more than one partition. The net cut of a partition is the total number of hyper-edges that are cut. The cut set size is defined as the number of edges which are cut; note that each hyper-edge which is cut is counted once in the cut set size.

The level of a node n in a graph G is defined as the length of the longest path, in terms of the node count, from the primary inputs of G to node n; if the nodes of the graph are logic gates, it is called the logic level. The depth of a graph is defined as the level of the node with the highest level. The depth of a part in a partitioning problem is the difference between the highest and lowest levels among the nodes of the part, plus one.

The graph clustering problem is similar to GPP, but usually the number of clusters is not given in advance [18]. Different from graph partitioning, in the graph clustering problem, clusters can overlap.
Therefore, in graph clustering, node duplication can occur [128].

Chapter 3 Balanced Factorization and Rewriting Algorithms

Standard CMOS-based rewriting and factorization algorithms fail to preserve the balancing property of SFQ circuits. Therefore, they end up generating circuits with huge path balancing overheads. Our proposed balanced factorization and rewriting algorithms are designed specifically to solve this problem. Experimental results show that a combination of balanced factorization and rewriting algorithms reduces the path balancing overhead by an average of 63% for 15 benchmark circuits, and the area by up to 23%, compared to state-of-the-art logic synthesis tools. Please note that the proposed algorithms would be of interest for any technology that requires path balancing or adheres to a clocked propagation of data (gate-level pipelined, see Section 2.1.2) throughout a circuit.

3.1 Related Work

Brayton [13] presented a few methods for obtaining different factored forms of a given logic expression. These methods range from fast purely algebraic methods to Boolean ones, which are slower but capable of providing more saving in the total literal count. Iman and Pedram [68] addressed the problem of reducing power consumption by extending algebraic procedures for node extraction and factorization, targeting minimization of a cost function that measures the power cost of Boolean expressions. Roy et al. [129] introduced a unique cost function based on the decomposed factored form representation of a given Boolean expression to guide clustering and factorization methodologies for minimizing the power consumption.

Mishchenko et al. [105] presented the Directed Acyclic Graph (DAG)-aware AIG rewriting algorithm by extending the DAG-aware circuit compression method in [12]. In DAG-aware AIG rewriting, the 4-feasible cuts of nodes are computed, and the function of each cut together with its NPN-class (see footnote 1) is determined using a hash-table lookup.
According to [112], for 4-input expressions there are 222 NPN-classes, among which around 100 are used more frequently in most of the well-known benchmark circuits [105]; these can be pre-computed and stored. Soeken and Thomsen [137] showed that it is possible to use standard expression rewriting rules to derive fairly complex formulas, which is beneficial in the algebraic operations used in reversible logic circuits. Haaswijk et al. [55] showed that the use of exact synthesis for logic rewriting can be improved by a better sub-network selection strategy, avoiding useless enumerations, and employing XOR majority graphs.

In this chapter, we present balanced factorization and rewriting, which are, to the best of our knowledge, the first structural factorization and rewriting approaches.

Footnote 1: Two expressions F and G are NPN-equivalent, or belong to the same NPN-class, if F can be obtained from G by Negation of its inputs, Permutation of its inputs, or Negation of its output.

3.2 Balanced Factorization

In the standard method of computing algebraic factored forms, given $F = G_1 G_2 + R$, a factorization value is calculated as follows:

$$fact\_val = lits(F) - \big(lits(G_1) + lits(G_2) + lits(R)\big) = (|G_1| - 1)\,lits(G_2) + (|G_2| - 1)\,lits(G_1) \quad (3.1)$$

in which it is assumed that $G_1$, $G_2$, and R are algebraic expressions, $lits(F)$ returns the literal count of F, and $|G_1|$ is the number of cubes in the SOP form of $G_1$. This value represents the number of literals that are saved by performing the corresponding factorization. The literal saving translates into a node count reduction in the subject graph, which results in a total area saving for CMOS circuits. Therefore, if a factored form gives the highest factorization value among all possible factored forms, it has a good chance of minimizing the total area in CMOS circuits.
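The two sides of Eq. 3.1 can be compared numerically with a small sketch. Cubes are modeled here as frozensets of literal names and an SOP as a set of cubes; this encoding, the helper names, and the sample expressions are illustrative assumptions, and the product shown is valid only for factors with disjoint support (the algebraic product of Section 2.2.1).

```python
def lits(sop: set) -> int:
    """Literal count of an SOP: total number of literals over all cubes."""
    return sum(len(cube) for cube in sop)


def support(sop: set) -> set:
    """Variables (here: literal names) that the SOP explicitly depends on."""
    return set().union(*sop)


def algebraic_product(f: set, g: set) -> set:
    """Product of two SOPs with disjoint support: union every pair of cubes."""
    assert support(f).isdisjoint(support(g)), "factors must have disjoint support"
    return {ci | cj for ci in f for cj in g}


def factorization_value(g1: set, g2: set, r: set) -> tuple[int, int]:
    """Return both sides of Eq. 3.1 for F = G1*G2 + R."""
    f = algebraic_product(g1, g2) | r
    saved = lits(f) - (lits(g1) + lits(g2) + lits(r))
    closed_form = (len(g1) - 1) * lits(g2) + (len(g2) - 1) * lits(g1)
    return saved, closed_form


# G1 = a + b + cd, G2 = e + f, R empty, so F = ae + af + be + bf + cde + cdf.
G1 = {frozenset("a"), frozenset("b"), frozenset("cd")}
G2 = {frozenset("e"), frozenset("f")}
print(factorization_value(G1, G2, set()))  # (8, 8)
```

Here the flat SOP has 14 literals and the factored form has 6, so both sides of the equation give a saving of 8 literals.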
However, in SFQ circuits, due to the requirement of path balancing, the said standard method for selecting the best factored forms will not necessarily result in a circuit with the least area. It can even increase the total area compared with the case of not applying any factorization algorithm. This is because the standard factorization algorithm does not preserve the balance of the network being generated during the factorization process.

To solve the above problem, we propose a balance preserving factorization algorithm, and consider the cost of path balancing when calculating the factorization value. More precisely, we first generate a factoring tree for each factored form of a given expression. A factoring tree is a labeled tree in which each internal node is labeled with either × or +, expressing a conjunctive or disjunctive operation, respectively. Also, in a factoring tree, each leaf node is a literal. Next, we compute the level (see footnote 2) of each node in the said factoring tree. Then, we compute an imbalance factor (denoted by $\rho$) for this factoring tree. To define $\rho$, we first need to define the imbalance factor of a node in a factoring tree. If the maximum level of the immediate fanin nodes of a node v in a factoring tree is $L_{max}$, the imbalance factor of this node is calculated as follows:

$$\rho_v = \sum_{v_i \in fanins(v)} \big(L_{max} - level(v_i)\big) \quad (3.2)$$

where $fanins(v)$ is the set of immediate fanin nodes of node v, and $level(v_i)$ is the level of one of these fanins. The imbalance factors of leaf nodes are 0.

Footnote 2: The level of a node v in a network $N = (V, E)$ which is modeled by a DAG is defined as the length of the longest path, in terms of the node count, from any leaf node to v.

Figure 3.1: A factoring tree for the expression given in Example 2.
The imbalance factor of a tree $t = (V_t, E_t)$, denoted by $\rho_t$, is calculated by summing up the imbalance factors of the nodes of this tree:

$$\rho_t = \sum_{v \in V_t} \rho_v \quad (3.3)$$

Finally, to consider the effect of both the node (or literal) count and the imbalance factor, a new factorization value is defined and computed as follows:

$$blnc\_overhead\_val = |V_t| + \rho_t \quad (3.4)$$

where $|V_t|$ is the total node count of the factoring tree t. Please note that unlike the standard factorization method, in which a factored form with the maximum value of fact_val is selected, in balanced factorization we select a factored form which minimizes blnc_overhead_val. This is because a factored form with a smaller value of blnc_overhead_val consumes fewer nodes and has a smaller path balancing overhead (a smaller imbalance factor).

Example 2. Suppose F = age + agf + bge + bgf + cdge + cdgf. One possible factoring tree for this expression is shown in Figure 3.1. The total node count of this tree is equal to six and its imbalance factor is $\rho$ = 0 + 0 + 1 + 1 + 2 + 1 = 5. Therefore, the balanced factorization value of this factored form is equal to 11.

Algorithm 1 shows the pseudocode of balanced factorization.

Algorithm 1: Balanced Factorization
    Input: A Boolean network N = (V, E)
    Output: Optimized network N_OPT = (V_OPT, E_OPT)
    // pre-processing:
    1   Convert the given network into AIG;
    2   Sort nodes in V to be in topological order;
    3   Compute k-feasible cuts of each node;
    4   for each node v in V do
    5       for each cut C_j in k-feasible cuts of v do
    6           Extract function of v based on inputs of C_j;
    7           Generate factoring trees and compute their blnc_overhead_val (Eq. (3.4));
    8       Select a factored form for node v with the least value for blnc_overhead_val;
    // Generate a new network using the best factored forms for each node:
    9   N_OPT = Generate_Ntk(N);
    10  return N_OPT;
As seen, after converting the given network into the AIG representation (line 1), the nodes are sorted in topological order (a node is visited only when all of its predecessors have been visited) in line 2, and the k-feasible cuts (k = 4) of each node are computed (line 3). Then, a factored form which gives the least balanced factorization value is selected for each node (lines 4-8), and finally, the optimized network is generated in line 9.

3.3 Balanced Rewriting

Balanced rewriting is a step in the technology-independent combinational logic synthesis flow of SFQ circuits which uses fast local transformations of AIG nodes. The input to this algorithm is an AIG representation of the given network. In the balanced rewriting algorithm, similar to the DAG-aware AIG rewriting in [105], the given AIG is traversed in topological order starting from the leaf nodes of the graph. For each node v in this graph, the 4-feasible cuts are computed as in [29]. For each computed cut of node v, all pre-computed 4-variable Boolean expressions that can implement the function of this node are tried. Different from the DAG-aware AIG rewriting algorithm presented in [105], in the balanced rewriting algorithm, in addition to extracting the node count of the new rewritten expression, we calculate an imbalance factor for the corresponding sub-graph (similar to Section 3.2). To calculate the saving that is achieved by the rewritten version of the Boolean function of node v, we keep track of the sum of the number of nodes and the imbalance factor in the current sub-graph ($m_1$) and the new value of this sum after substituting a candidate sub-graph ($m_2$). If $m_2 < m_1$, there is a positive gain and the new rewritten version is accepted. After visiting all nodes of the network and trying the rewritten versions of their Boolean functions, the final optimized AIG is constructed by tracing back the best solutions for the nodes connected to the POs (reverse topological ordering traversal).
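Both balanced factorization and balanced rewriting rank candidates by the same quantity: node count plus imbalance factor. The sketch below computes it for a tree given as a dictionary from each operator node to its fanins, with leaves at level 0 and only operator nodes counted, which is what the count of six in Example 2 suggests. The first tree is one shape consistent with the numbers of Example 2 (the exact shape of Figure 3.1 is an assumption), the second is a hypothetical, more balanced alternative for the same expression, and all names are illustrative.

```python
def tree_cost(fanins: dict[str, list[str]]) -> tuple[int, int, int]:
    """Return (node count, imbalance factor, their sum) for a factoring tree."""
    level: dict[str, int] = {}

    def lvl(node: str) -> int:
        if node not in level:
            kids = fanins.get(node, [])
            # Literals (absent from the dict) are leaves at level 0.
            level[node] = 0 if not kids else 1 + max(lvl(c) for c in kids)
        return level[node]

    imbalance = 0
    for kids in fanins.values():
        l_max = max(lvl(c) for c in kids)
        imbalance += sum(l_max - lvl(c) for c in kids)
    return len(fanins), imbalance, len(fanins) + imbalance


def accept(current: dict, candidate: dict) -> bool:
    """Positive gain test: the candidate's sum m2 must be below m1."""
    return tree_cost(candidate)[2] < tree_cost(current)[2]


# F = g(a + b + cd)(e + f), built as a chain: a + (b + cd).
chain = {
    "cd": ["c", "d"], "ef": ["e", "f"],
    "b+cd": ["b", "cd"], "a+b+cd": ["a", "b+cd"],
    "g*ef": ["g", "ef"], "F": ["a+b+cd", "g*ef"],
}
# Same function with the sum regrouped as (a + b) + cd.
regrouped = {
    "ab": ["a", "b"], "cd": ["c", "d"], "ef": ["e", "f"],
    "ab+cd": ["ab", "cd"], "g*ef": ["g", "ef"], "F": ["ab+cd", "g*ef"],
}
print(tree_cost(chain))      # (6, 5, 11)
print(tree_cost(regrouped))  # (6, 1, 7)
print(accept(chain, regrouped))  # True
```

The chain reproduces the value 11 of Example 2, while the regrouped tree has the same node count but a lower imbalance factor, so it would be the one selected or accepted under the criteria described above.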
3.4 Experimental Results

The balanced rewriting and factorization algorithms are implemented inside an open source logic synthesis and verification tool called ABC [38]. Two sets of experimental results are extracted. The first set is the node count, the imbalance factor, and their sum before technology mapping, and the second set is the total area and logical depth after technology mapping. The baseline for comparing our technology-independent optimizations is ABC's built-in optimization scripts, including resyn, resyn2, and resyn2a. For mapping, the standard cut-based technology mapping command (map) of ABC is employed with an SFQ library of gates as in [45], consisting of and2, or2, xor2, DFF, splitter, and inverter gates. Several ISCAS [59], EPFL [37], MCNC [157], and arithmetic benchmark circuits are considered.

Figure 3.2: Three AIGs obtained for the Boolean expression mentioned in Example 3. Nodes (shown by circles) are 2-input AND gates, and dashed lines are inverted edges. (a) Generated by applying the standard rewriting, balance, and refactoring commands of ABC [38]; (b) generated by applying the resyn2 optimization script of ABC; (c) a perfectly balanced tree generated by our blnc_syn1 optimization script (see Table 3.1).

Notice that, similar to the definition of the imbalance factor for a factoring tree in Section 3.2, we can define an imbalance factor for a DAG too. Also, to compare different technology-independent optimization algorithms, we assign a total cost to a subject graph. The total cost of a subject graph is defined as the sum of its imbalance factor and the total number of nodes in its AIG representation.
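For a DAG, the total cost has to count a shared node only once. A minimal sketch follows, again under the assumption (not restated in this section) that the imbalance factor of an AIG node is the absolute level difference of its two fanins, with primary inputs at level 0. The dictionary-based AIG encoding, the function name `total_cost`, and both example graphs are hypothetical; inverted edges are left out because they do not change node levels in an AIG.

```python
from typing import Dict, Tuple

# AIG as {node: (fanin0, fanin1)}; any name that is not a key is a primary input.
Aig = Dict[str, Tuple[str, str]]


def total_cost(aig: Aig) -> Tuple[int, int, int]:
    """Return (node count, imbalance factor, total cost) of an AIG."""
    levels: Dict[str, int] = {}

    def level(name: str) -> int:
        if name not in aig:       # primary input
            return 0
        if name not in levels:    # memoize so a shared node is evaluated once
            f0, f1 = aig[name]
            levels[name] = 1 + max(level(f0), level(f1))
        return levels[name]

    # Assumed node imbalance: level difference between the two fanins.
    imbalance = sum(abs(level(f0) - level(f1)) for f0, f1 in aig.values())
    return (len(aig), imbalance, len(aig) + imbalance)


# Node n1 feeds both n2 and n3 but contributes to the node count once.
shared = {
    "n1": ("a", "b"),
    "n2": ("n1", "c"),
    "n3": ("n1", "d"),
    "n4": ("n2", "n3"),
}

# A perfectly balanced seven-node graph (illustrative shape only).
balanced = {
    "n1": ("a", "b"), "n2": ("c", "d"), "n3": ("e", "f"), "n4": ("g", "a"),
    "n5": ("n1", "n2"), "n6": ("n3", "n4"),
    "n7": ("n5", "n6"),
}

print(total_cost(shared))
print(total_cost(balanced))
```

The second graph only mirrors the counts quoted for a perfectly balanced seven-node AIG (imbalance factor 0, total cost seven); it is not a reconstruction of any figure.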
Interleaving the balanced rewriting and factorization commands with the balance command of ABC, which performs algebraic AND-balancing [103], provides a higher reduction in the total node count and imbalance factor of a given subject graph. Moreover, our experiments show that the order of applying these optimization commands can make a huge difference for some benchmark circuits. In the following, we provide an example from [56] to demonstrate this.

Example 3 Suppose F = abg + acg + adf + aef + afg + bd + be + cd + ce. This expression is mentioned on page 434 of [56] for comparing the literal savings that different factorization algorithms can provide. Applying the standard rewriting, balance, and refactoring commands in this order generates the optimized AIG shown in Figure 3.2a, with eight nodes, an imbalance factor of four, and a total cost of 12. Applying ABC's resyn2 script (see Table 3.1) generates the subject tree shown in Figure 3.2b, with nine AIG nodes, an imbalance factor of three, and a total cost of 12. However, if balanced rewriting, balanced factorization, and the balance command of ABC are applied in this order, the resulting subject tree is the one shown in Figure 3.2c, which is perfectly balanced and has seven AIG nodes, an imbalance factor of 0, and a total cost of seven.

We introduce two optimization scripts called blnc_syn1 and blnc_syn2, and also consider three optimization scripts of ABC. Table 3.1 shows these scripts and their corresponding sequences of commands. In these scripts, b is an alias for the balance command of ABC, rw is an alias for standard rewriting, and rf is an alias for standard refactoring. Also, rwz and rfz are aliases for standard rewriting and refactoring, respectively, with zero gains accepted. In our scripts, brf stands for balanced factorization and brw stands for balanced rewriting. Also, brwz and brfz are brw and brf, respectively, with zero gains accepted.

Table 3.1: Different technology-independent optimization scripts.
Script           Corresponding sequence of commands
ABC's resyn      b; rw; rwz; b; rwz; b
ABC's resyn2     b; rw; rf; b; rw; rwz; b; rfz; rwz; b
ABC's resyn2a    b; rw; b; rw; rwz; b; rwz; b
Our blnc_syn1    brw; brf; b
Our blnc_syn2    brf; b; brw; brwz; b; brfz; brwz; b

Figure 3.3 shows the imbalance factors of the graphs generated by each of the optimization scripts listed in Table 3.1, together with the case when no optimization is employed. On average for 15 benchmark circuits, our blnc_syn1 and blnc_syn2 optimization scripts reduce the imbalance factor by 1.06× and 1.38×, respectively, compared with the case of applying no optimizations, and by 41% and 63%, respectively, compared with resyn2 (the best among resyn, resyn2, and resyn2a).

Figure 3.3: Comparing imbalance factors of original graphs (no opt.) and those generated by different optimization scripts (resyn, resyn2a, resyn2, blnc_syn1, blnc_syn2). For better exhibition purposes, the data for the priority, i10, and voter circuits are scaled down by a factor of 10.

Table 3.2 lists the graph cost function of different benchmark circuits generated by the various optimization scripts. On average for 15 benchmark circuits, our blnc_syn1 and blnc_syn2 reduce this parameter by 68% and 83%, respectively, compared with the case of not applying any optimizations, and by 19% and 30%, respectively, compared with resyn2.

Figure 3.4 shows the logical depth (critical path length) of different circuits optimized by the different optimization scripts and mapped by ABC's cut-based technology mapper (plus splitter insertion and path balancing [123]). On average for 15 benchmark circuits, our blnc_syn1 and blnc_syn2 optimization scripts reduce the logical depth by 26% and 36%, respectively, compared with the case of not applying any technology-independent optimizations. When compared with resyn2, the average improvements are 10% and 18%, respectively, for blnc_syn1 and blnc_syn2.

Figure 3.4: Comparing logical depth of the mapped circuits optimized by using different optimization scripts (resyn, resyn2, resyn2a, blnc_syn1, blnc_syn2) and the original one (no opt.). For better exhibition purposes, the data for the priority and IntDiv8 circuits are scaled down by a factor of 10.

Table 3.3 lists the total area of these circuits. The total area includes the area of gates, path balancing DFFs, and splitters. On average for 15 benchmark circuits, blnc_syn1 reduces area by 21% compared with the case when no technology-independent optimizations are employed, and it has almost the same average area as resyn2. blnc_syn2 reduces area by 28% compared with not using any technology-independent optimizations, and by 4% compared with resyn2. For the x4 circuit, blnc_syn2 reduces area by 23% compared with resyn2.

Figure 3.5 shows the post place-and-route layout of the ISCAS c7552 benchmark circuit optimized using the blnc_syn2 script. The dimensions are 8440 µm × 8420 µm, which is around 28% less chip area than the case of not applying blnc_syn2 (dimensions: 10090 µm × 9750 µm); equivalently, the unoptimized chip is around 38% larger. A smaller chip usually has a shorter critical interconnect, which increases the frequency of the local clock. For this reason, the chip generated by applying our blnc_syn2 enjoys an increase in the local clock frequency from 13 GHz to 14 GHz.

Table 3.2: Comparing the graph cost of different benchmark circuits generated by using different optimization scripts.

Circuit     no opt.   resyn   resyn2   resyn2a   blnc_syn1   blnc_syn2
c499           1884    1604     1604      1604        1389        1362
c7552          6879    4275     4438      4792        3638        3039
c5315          9275    5428     5176      5609        4665        4743
c3540          4569    3131     2958      3108        2642        2560
c2670          1785    1544     1503      1545        1257        1345
i2c            4567    3596     3608      3587        3334        2932
priority      43083   37486    25772     36652       18505       15371
voter         25536   21519    18565     20350       19104       15544
cavlc          2271    1893     1882      1891        1867        1659
int2float       825     580      590       580         573         507
KSA32          1477    1437     1340      1436        1180        1156
IntDiv8        5677    5296     5295      5296        4175        3854
i10           18851    8580     8415      8489        6937        6191
x4             1555    1082     1093      1082         683         661
apex6          2069    1817     1825      1819        1405        1307

Table 3.3: Comparing the area (mm^2) of different benchmark circuits generated by using different optimization scripts.

Circuit     no opt.    resyn   resyn2   resyn2a   blnc_syn1   blnc_syn2
c499           5.37     4.91     4.91      4.91        4.85        4.59
c7552         41.11    35.95    36.76     39.23       34.55       34.92
c5315         45.73    41.4     38.54     40.15       35.35       39.26
c3540         22.11    21.52    20.99     20.78       22.06       21.82
c2670         28.12    23.93    25.8      24.28       25.22       25.39
i2c           35.91    31.72    30.7      31.72       32.2        29.72
priority     171.22   158.11    91.41    154.73      107.38       93.06
voter        258.93   168.15   138.24    169.78      162.65      132.45
cavlc         12.15    11.79    11.93     11.75       11.85       12.09
int2float      4.95     4.6      4.55      4.6         4.64        4.34
KSA32          9.16     8.38     9.32      8.42        9.61       10.17
IntDiv8       21.24    20.38    20.42     20.38       20.95       20.13
i10          118.71    82.6     80.97     80.57       78.12       72.27
x4            11.1      8.6      8.8       8.85        7.71        7.13
apex6         20.85    19.22    19.24     19.08       19.48       17.2

Figure 3.5: Post place-and-route layout of the ISCAS c7552 benchmark circuit synthesized by applying our technology-independent optimization script, blnc_syn2. Dimensions are 8440 µm × 8420 µm. Dimensions of the chip without applying our optimization scripts are 10090 µm × 9750 µm.

3.5 Conclusions

In this chapter, balanced factorization and rewriting algorithms are presented. Unlike the standard rewriting and refactoring algorithms, our technology-independent optimization algorithms preserve the balance of the given graph while optimizing it. Experimental results show that our optimization scripts reduce the imbalance factor by an average of 1.38×. Using an SFQ library of gates, our optimization scripts reduce the total area and logical depth after technology mapping by an average of 28% and 36%, respectively, for 15 benchmark circuits.

Chapter 4

Path Balancing Technology Mapping

Path balancing technology mapping is a new method of mapping a Register Transfer Level (RTL) description such as a Boolean network into a gate-level netlist. For a network generated by the path balancing mapper, the average logic level difference among the fanin gates of each gate in the mapped netlist is reduced (ideally to zero). Path balancing technology mapping is required in standard SFQ logic families, including RSFQ [95], eSFQ [152], and ERSFQ [84], for correct circuit operation.

Due to some key differences between SFQ and CMOS circuits, such as gate-level pipelining and fanout limitation in SFQ circuits, existing Computer-Aided Design (CAD) tools for CMOS technology cannot be directly used for SFQ circuits [46, 76]. Therefore, to make use of the benefits that SFQ circuits provide in generating high performance and low-power solutions, new design concepts, automation tools, and architectures are needed [65].
One difference between SFQ and CMOS gates is that most SFQ gates (except for confluence buffers, splitters, TFFs, and I/O cells) receive a clock signal. This makes the clock distribution network in SFQ more complex than in CMOS; the clock network is much bigger in SFQ and should be designed more carefully to guarantee delivery of clock signals to all gates with acceptable amounts of jitter and skew [48, 132].

Another difference between SFQ and CMOS circuits is the requirement of path balancing in SFQ circuits; if there is a difference among the logic levels of the fanin gates of a gate in an SFQ circuit, path balancing DFFs should be inserted at the outputs of the fanin gates with smaller logic levels. This is done to guarantee that all input signals of a gate arrive in the same clock period. Otherwise, input pulses which arrived in earlier clock periods will be consumed, generating wrong output values. For some small circuits, one could add a few asynchronous delay elements (e.g., a chain of JTLs) to make sure that all gates receive their inputs in the right clock periods, hence guaranteeing correct circuit operation. However, this solution does not scale and is hard to automate, because it requires information about routed wires (available only after place and route) during logic synthesis. Therefore, path balancing should be considered in logic synthesis (e.g., during the technology mapping phase) of SFQ circuits, not only to meet the balancing requirement but also to minimize the path balancing overhead by reducing the number of required path balancing DFFs.

4.1 Minimizing Total DFF count

In this section, we present PBMap: a path balancing technology mapping algorithm which provides optimal solutions for mapping tree-like dc-biased SFQ logic (including RSFQ, eSFQ, and ERSFQ) circuits. Note that ERSFQ logic was developed to eliminate the static power losses of RSFQ by replacing bias resistors with inductors and current-limiting Josephson junctions.
Similarly, eSFQ logic, which is also powered by direct current, differs from ERSFQ in the size of the bias current limiting inductor and in how the limiting JJs are regulated. So, although there are key differences among RSFQ, ERSFQ, and eSFQ in terms of their biasing network designs, these differences do not affect the proposed mapping algorithm.

In our algorithm, DFF insertion to achieve path balancing is done to enable gate-level wave-pipelining. In other words, in a circuit generated by our algorithm, the length of all paths from any PI to any PO will be the same. The benefit of using our path balancing algorithm is that it reduces the total number of required path balancing DFFs and, as a result, the total JJ count and total area (Table 4.1).

The rest of this section is organized as follows: Section 4.1.1 reviews some related works. Section 4.1.2 provides a motivating example for path balancing technology mapping. Section 4.1.3 presents our path balancing technology mapping algorithm, gives its proof of optimality for trees, considers retiming, generalizes the technology mapping algorithm to Directed Acyclic Graphs (DAGs), and discusses the clock jitter accumulation problem. Finally, Section 4.1.4 gives the experimental results.

4.1.1 Related Work

In the literature on logic synthesis and verification, there are many papers addressing the technology-independent and technology mapping phases. Some of these papers developed useful algorithms and tools and invented effective heuristics for optimizing objective functions such as literal count [7, 14-16, 80, 89] or for increasing the verification speed by presenting fast SAT solvers [35, 109]. Examples are SIS [131], MVSIS [54], and Chaff [109]. Furthermore, there are many innovative methods such as the integration of technology mapping and retiming [104, 117], or logic decomposition during technology mapping [91].
A logic synthesis and verification tool, ABC [38], has been developed by the Berkeley verification and synthesis research group to provide a flexible programming environment for implementing recent innovations.

In the literature, there are also papers which present logic synthesis algorithms and tools for specific applications. For example, in [150], the authors proposed tree mapping and decomposition algorithms to generate a power-efficient mapping solution for a given network. In [148, 149, 151, 158], other technology mapping methods targeting the reduction of power consumption are presented. In [107], a priority-cut-based technology mapping is presented in which the priority of selecting matches can be set to delay, area, or any other metric. In [23], a near-optimal algorithm for technology mapping is proposed. This algorithm minimizes area under delay constraints by generating area-delay curves.

There are a few papers addressing logic synthesis for SFQ circuits [75, 125, 156, 160]. In [156], a framework is developed by constructing a virtual cell called "2-AND/XOR". This framework allows usage of CMOS logic synthesis tools for SFQ circuits, as claimed in [156]. In [160], a Binary Decision Diagram (BDD)-based top-down design methodology is used for SFQ circuits. In [75], the required path balancing DFFs and splitter cells are added to the netlist generated by ABC [38], followed by applying the standard retiming algorithm [92] to reduce the required number of path balancing DFFs. In [125], a technology mapping tool for SFQ logic circuits (called SFQmap) is presented which provides two main optimizations: (i) logical depth minimization with path balancing, and (ii) peephole optimization for minimizing the product of the worst-case stage delay and the logical depth (PSD).

In this section, we present a path balancing technology mapping algorithm which favors generating mapping solutions with balanced structures.
In addition, several closed form formulas are developed which relate the number of leaf nodes in a tree to the required path balancing DFF count at each level of this tree. Thanks to these formulas, the optimality of the path balancing tree mapping algorithm is proven for SFQ logic circuits. Path balancing can be considered during different phases, including technology-independent optimization, technology decomposition, and technology mapping. In this section, we focus on the technology mapping phase.

4.1.2 Motivation

Suppose that we want to map the following expression: $F = a \cdot b \cdot (!c) \cdot d$. As shown in Figure 4.1, state-of-the-art mappers (such as ABC [38]) produce the left circuit, which requires three path balancing DFFs. However, it is possible to have a better mapping solution with fewer required path balancing DFFs, as shown in the right graph in Figure 4.1. This is because no algorithm is implemented in the current state-of-the-art technology mappers for controlling the balancing of the network which is being mapped. In the following, we present a novel path balancing technology mapping approach which generates mapping solutions with the minimum number of required path balancing DFFs for mapping trees. From now on, we denote the total number of required path balancing DFFs by #DFFs.

Figure 4.1: Two mapping solutions for $F = a \cdot b \cdot (!c) \cdot d$. The left circuit is generated by ABC's mapper [38] and requires three path balancing DFFs. There is another mapping solution with only one DFF, as shown in the right graph.

4.1.3 Presenting Our Algorithm

We present the problem of path balancing tree mapping as a dynamic programming (DP) problem. The input of the technology mapper is a network of two-input (N)AND gates and inverters, which is called the subject graph. ABC uses And-Inverter Graphs (AIGs) to represent subject graphs. In AIGs, all nodes are two-input AND gates. An inverter is modeled as a field in the data structure of the node.
Therefore, if the subject graph is a tree, it can be modeled as a binary tree in which all nodes have two children. A child can be a node or a PI. The goal is to find a mapping solution for the given subject graph with the fewest #DFFs.

In the path balancing technology mapping algorithm, the optimal solution for mapping a tree rooted at node $v_i$ is defined as a solution which minimizes #DFFs. Suppose that the set of all k-feasible cuts of node $v_i$ is $K_i$, and for a k-feasible cut $C_j \in K_i$, $L_{C_j}$ denotes its set of leaf nodes (inputs). The value of the optimal solution, $\mathrm{OPT}(v_i)$, is calculated recursively using the following equation:

$$\mathrm{OPT}(v_i) = \min_{C_j \in K_i} \Big\{ \sum_{v \in L_{C_j}} \mathrm{OPT}(v) + B(L_{C_j}) \Big\} \tag{4.1}$$

in which $B(L_{C_j})$ is a function which receives the set of leaf nodes of a cut ($C_j$ here) and returns the required number of DFFs for balancing the inputs of this cut. This balancing is required if there is a difference among the logic levels of the inputs of the cut. For example, suppose that $C_1$ has two leaf nodes $v_1$ and $v_2$ ($L_{C_1} = \{v_1, v_2\}$). If the levels of $v_1$ and $v_2$ are three and five, respectively, $B(L_{C_1})$ will return two.

Figure 4.2: Showing 3-feasible cuts of node $v_i$.

Example 4 Consider the binary tree shown in Figure 4.2. The 3-feasible cuts of node $v_i$ are shown in this figure. Using Eq. (4.1) with k = 3, the value of the optimal solution for node $v_i$ is computed as follows:

$$\begin{aligned}
\mathrm{OPT}(v_i) = \min \{\, & \mathrm{OPT}(v_{i+1}) + \mathrm{OPT}(v_{i+2}) + B(\{v_{i+1}, v_{i+2}\}), \\
& \mathrm{OPT}(v_{i+1}) + \mathrm{OPT}(v_{i+5}) + \mathrm{OPT}(v_{i+6}) + B(\{v_{i+1}, v_{i+5}, v_{i+6}\}), \\
& \mathrm{OPT}(v_{i+2}) + \mathrm{OPT}(v_{i+3}) + \mathrm{OPT}(v_{i+4}) + B(\{v_{i+2}, v_{i+3}, v_{i+4}\}) \,\}
\end{aligned} \tag{4.2}$$

The optimal path balancing tree mapping solution is generated as follows: In a traversal in topological order starting from the level-1 nodes, the k-feasible cuts of each node are computed in a way similar to [29], and each cut's function based on its inputs is computed in a way similar to [106].
Afterwards, the best valid solution for each node, which minimizes #DFFs, is found in a DP approach using Eq. (4.1). The tree traversal is continued until the root of the tree is visited. After visiting the root, the optimal path balancing mapping solution for the whole tree is calculated. This solution can be generated by tracing the tree from its root all the way back to its PIs. We will prove that the presented DP-based algorithm for path balancing technology mapping provides optimal solutions for mapping trees when an SFQ library of gates is used.

The complexity of computing the k-feasible cuts is $O(kmn)$, where m is the edge count and n is the node count [29]. The complexity of computing cut functions is linear in the size of the network [162]. Given the k-feasible cuts, the complexity of the path balancing tree mapping algorithm is $O(K' n g)$, in which $K'$ is the maximum number of k-feasible cuts for any node of the subject tree, n is the node count, and g is the number of supergates in the supergate library. The overall complexity of the algorithm is $O(kmn)$.

In order to use DP for finding the optimal solution to a problem, the problem should satisfy the DP's principle of optimality. That is, the optimal solution to the problem should be built from the optimal solutions to its sub-problems. At first glance, it seems possible to find examples in which the optimal path balancing mapping solution for a tree rooted at node $v_i$ is not built from the optimal solutions to its sub-problems. For node $v_i$ in Figure 4.2, assume that its best cut is C1, and suppose that for the tree rooted at node $v_{i+2}$ there is a single match $(7, 4)$, and for node $v_{i+1}$ there are two matches, $(3, 2)$ and $(5, 3)$. A match is shown with a pair $(x, y)$. The first attribute stands for the height or depth of the match, and the second attribute stands for #DFFs. The best mapping solution for node $v_i$ will contain $(5, 3)$ for node $v_{i+1}$. This is because it gives 3 + 4 + (7 - 5) = 9 required path balancing DFFs, while the other mapping solution for $v_{i+1}$ gives 2 + 4 + (7 - 3) = 10 required path balancing DFFs. Therefore, in this scenario, the best mapping solution for node $v_i$ is not built from the best solution (the one with the least #DFFs) for node $v_{i+1}$. Such an example would violate the DP's principle of optimality for path balancing tree mapping.

We will prove that counterexamples of this kind do not exist in actual circuits. For this purpose, we need to prove that by increasing the height of a sub-tree from H = X to H = X + p, #DFFs for internal balancing of the sub-tree will be increased by at least p, where p is a natural number. Unfortunately, with gates with k > 2 inputs in the library, the problem becomes very complicated. In the following, we provide a proof of optimality for the case of having gates with only two inputs in the library. This is valid for the used SFQ library of gates [74]. The proof needs a closed form formula for the total number of input pins of the mapped tree based on #DFFs for that tree. This formula is developed in Section 4.2.3.

We use #DFFs as a metric for measuring how balanced a graph is; if a graph has a smaller value for #DFFs, it is more balanced. Therefore, minimizing #DFFs for a graph during technology mapping results in the most balanced mapping solution for this graph.

Figure 4.3: (a) A tree for which we want to find #input pins, and (b) its extended tree. Fertile and sterile buffer nodes and imaginary nodes are shown in the extended tree. Panel (b) is annotated with the levels x = 1 to x = 4 = H, the calculation $n = 2^4 - 1 \times 2^2 - 1 \times 2^1 - 1 = 9$, and a legend (fertile buffer node, sterile buffer node, imaginary node, regular node). The left sub-tree (not shown) in the extended tree is a full binary tree rooted at node i+1.
This sub-tree generates $2 \times 4 = 8$ input pins, and there is a single input pin feeding the only buffer node at the last level (x = 4). Thus, n = 9.

4.1.3.1 Terminology

input pin: Primary input (or leaf node) of a tree.
inputs vs. input pins: The first term refers to the fanins of a gate, while the second refers to the leaf nodes of a tree.
n: Total number of input pins of a tree.
#input pins: Total number of leaf nodes (or PIs) of a tree (= n).
N: Total number of internal nodes of a tree.
x: Level of a node in a tree. The root of the tree is at level one and the levels of other nodes are higher than one.
H: Height of a tree (the last level of a tree, furthest from the root).
Buffer Node: A node which has one father node and one child node. A DFF sits in the place of a buffer node.
Imaginary Node: A node which does not have a father node and does not actually exist in the tree (it is only a concept). This concept is used for calculating #input pins of a tree.
Sterile Buffer Node: A buffer node which is not capable of generating imaginary nodes.
Fertile Buffer Node: A buffer node which is capable of generating imaginary nodes. A fertile buffer node also generates one sterile buffer node per level, starting from the level after its own down to the last level of the tree.
Extended Tree: A tree obtained by adding all buffer nodes and imaginary nodes to the original tree. The extended tree is only conceptual and is used for model development. We do not actually construct an extended tree during technology mapping.
Generation of imaginary nodes: Each fertile buffer node generates some imaginary nodes at the higher numbered levels (further down from the root). We are interested in the number of imaginary nodes generated at the last level of the tree. An imaginary node belongs to the buffer node with the smallest level (closest to the root) that can be reached from this imaginary node in the extended tree.
$y_i$: Total number of buffer nodes at level i.
The sum of all $y_i$ in a mapped tree is equal to the total number of required path balancing DFFs for that tree.
Y: Total number of required path balancing DFFs in a mapped tree, $Y = \sum_{i=2}^{H} y_i$.

4.1.3.2 Discussion about the Algorithm

The total number of nodes at the last level (x = H) of a full binary tree is equal to $2^{H-1}$, in which H is the height of the tree. Since all nodes have two inputs (Section 4.1.3), #input pins for a full binary tree is $2^H$. A general binary tree has fewer internal nodes, fewer nodes in the last level, and fewer #input pins than a full binary tree. In the following, a closed form formula for #input pins of a general binary tree is developed.

In a general binary tree, there are some missing nodes compared with a full binary tree of the same height; wherever there is a missing node, a buffer node sits in its place. If this node (at level x) were not missing, it could create $2 \times 2^{H-x}$ input pins at the last level of the tree. So, this amount of #input pins should be deducted from #input pins of a full binary tree to obtain #input pins for this general binary tree. This contributes to the reduction of #input pins by $2 \times 2^{H-x} - 1$. The "-1" is due to the fact that each chain of buffer nodes, starting from a fertile buffer node all the way down to the last level, needs one input pin.

We should be careful not to over-count the number of fertile buffer nodes. Referring to the definition of fertile and sterile buffer nodes, if the total number of buffer nodes at level x+1 is the same as at level x, no new fertile buffer node is generated at level x+1. Therefore, the total number of fertile buffer nodes at level x+1 is $y_{x+1} - y_x$.
Based on the above discussion, we can write the following formula for #input pins (n) of a general binary tree with height H:

$$n = 2^H - y_2 2^{H-1} - (y_3 - y_2) 2^{H-2} - (y_4 - y_3) 2^{H-3} - \cdots - (y_{H-1} - y_{H-2}) 2^{2} - (y_H - y_{H-1}) 2^{1} + y_H \tag{4.3}$$

In Eq. (4.3), each $y_i$ with $2 \le i < H$ appears once with coefficient $-2^{H-i+1}$ and once with coefficient $+2^{H-i}$, which combine to $-2^{H-i}$, while $y_H$ appears with coefficient $-2^1 + 1 = -1$. With these simplifications, the final closed form formula for #input pins of a binary tree is as follows:

$$n = 2^H - y_2 2^{H-2} - y_3 2^{H-3} - y_4 2^{H-4} - \cdots - y_{H-1} 2^{1} - y_H \tag{4.4}$$

Figure 4.3 shows a tree and its extended tree, and displays the calculation of #input pins for this tree using Eq. (4.4) and the concepts of buffer nodes and imaginary nodes. Note that even though an imaginary node is connected to the buffer node at x = 3, it belongs to the buffer node at x = 2 based on the definitions and terminology presented in Section 4.1.3.1. Therefore, the buffer node at level x = 3 is not considered fertile. For future use, we rearrange the above equation to obtain the following one:

$$y_H + y_{H-1} 2^{1} + \cdots + y_4 2^{H-4} + y_3 2^{H-3} + y_2 2^{H-2} = 2^H - n \tag{4.5}$$

Now we are ready to present the required lemmas and the main theorem in order to prove the optimality of the algorithm presented in Section 4.1.3. From now on, we use tree and binary tree interchangeably.

Lemma 4.1.1 The total number of input pins (n) of a binary tree is one more than the total number of nodes in this tree, i.e., n = N + 1. Recall that in our problem, all nodes of a binary tree have two children.

Proof: We prove Lemma 4.1.1 by induction. Base case: a tree with one node has two input pins. Induction step: assume that a binary tree with N internal nodes has N + 1 input pins. For the step to N + 1 nodes, we add one more node to the previous tree by replacing an input pin with a new node. If an input pin is replaced by a new node, both the number of nodes and the number of input pins increase by one (one input pin is lost and two new input pins are gained in return for the newly added node).

Suppose that there is a binary tree $t_1$ with height X and #input pins of n.
Suppose that the height of this tree is increased from X to X + p while #input pins remains the same. The resulting tree is called $t_2$.

Theorem 4.1.2 If the total number of buffer nodes of the binary tree $t_2$ is more than the total number of buffer nodes of the binary tree $t_1$ by a positive integer value y, then $y \ge p$ (p is a natural number).

Generally, going from a binary tree with height X to a binary tree with height X + p while preserving #input pins, the total number of buffer nodes will be increased. This is because the total number of nodes is the same for both trees (Lemma 4.1.1); thus, we need to remove at least p nodes from the internal nodes of the first tree and put one at each new level to increase the height of the tree from X to X + p. There are different trees with height X + p and #input pins of n. We want to prove that if the total number of buffer nodes (for the new tree with height X + p) is increased, it cannot increase by less than p. So, a valid assumption is to consider the most balanced tree for $t_2$ and the least balanced tree for $t_1$. If the theorem is proven for this case, then it is valid for all other cases.

Figure 4.4: Two possible binary trees with height X = 2, and the most unbalanced binary trees with height X = 3 and X = 4.

First, we need to find a lower bound for the total number of buffer nodes of a tree with height X + p and #input pins of n, and also an upper bound for the total number of buffer nodes of a tree with height X and #input pins of n.

Lemma 4.1.3 The maximum value for p (the difference between the heights of $t_2$ and $t_1$) in Theorem 4.1.2 is n - 1 - X ($p \le n - 1 - X$). The minimum value is 1.

Proof: Since #input pins is n, based on Lemma 4.1.1, the total number of nodes is fixed at n - 1.
Now, if we want to create a tree with the maximum height (to maximize p), a single node should be put at each level, because at least one node has to be present at each level. Thus, the maximum height will be n - 1, hence $X + p \le n - 1$. Therefore, $p \le n - 1 - X$.

Please note that, as mentioned before, in the following the most and the least balanced binary trees are the ones with the minimum and maximum values for the total number of buffer nodes, respectively.

Lemma 4.1.4 The most unbalanced (least balanced) binary tree with height $1 \le X \le 3$ has X nodes, and its total number of buffer nodes equals $Y = (X-1)X/2$.

Proof: Since X is not a large number, it is easy to manually check the correctness of this lemma. Figure 4.4 shows two possible binary trees for X = 2, and the most unbalanced binary tree for X = 3. The buffer nodes are shown in these figures too. In the most unbalanced case there is one buffer node at level two, two buffer nodes at level three, three buffer nodes at level four, ..., and X - 1 buffer nodes at the last level. So, the total number of buffer nodes is equal to the sum of the natural numbers from 1 to X - 1, which is $(X-1)X/2$.

Lemma 4.1.5 The most unbalanced (least balanced) binary tree with height $X \ge 4$ has 2X - 1 nodes and $Y = (X-2)(X-1)$ total buffer nodes.

Proof: The most unbalanced binary tree with height $X \ge 4$ is achieved when we start placing a node at each level from level x = X to level x = 1 (so far, X nodes are consumed), and return along the right side of the tree with a similar method, consuming X - 1 more nodes. So, the total number of nodes will be 2X - 1, and based on Lemma 4.1.1, #input pins is 2X. From now on, adding more nodes means removing one buffer node, which makes the tree more balanced. The total number of buffer nodes is computed similarly to Lemma 4.1.4.

Lemma 4.1.6 The most balanced binary tree with height X and #input pins of n is obtained as follows: Start with $y_2$ and maximize it (= 1).
If the left hand side of Eq(4.5) is not larger than its right hand side, we keep $y_2 = 1$ and move on to maximizing $y_3$. Otherwise, $y_2 = 0$. Continuing this way and choosing the maximum valid values for the $y_i$ that satisfy Eq(4.5), the resulting tree will be the most balanced tree with height X and #input pins of n.

Proof: To have the most balanced binary tree with height X and #input pins of n, we need to find a tree with the minimum total buffer node count, or equivalently the minimum value for Y (the sum of the $y_i$). By fixing the values of H and n in Eq(4.5), the minimum value for Y is obtained when a $y_i$ with a larger coefficient contributes more in this equation. In other words, we should start with $y_2$ and maximize it; then, if more values are needed to satisfy Eq(4.5), $y_3$ should be maximized. We should continue this way until the left hand side of Eq(4.5) becomes equal to its right hand side. If this happens at level i, then $y_j = 0$ for all $j > i$. It may also be necessary to add a few buffer nodes to level i+1 to satisfy Eq(4.5) without maximizing $y_{i+1}$. Figure 4.5 shows two examples for X=4. As seen in the left tree with n=15, the maximum value for $y_2$ cannot be used because it does not generate a valid tree under the given constraints for height and #input pins. Thus, $y_2$ is set to 0. For this tree, the only scenario that satisfies Eq(4.5) is when there is a buffer node at the last level, as shown in this figure. However, for n=12, we can set $y_2 = 1$, which generates the right tree in Figure 4.5.

Figure 4.5: Two examples for Lemma 4.1.6.
Lemma 4.1.7 The most balanced binary tree with height X+p and #input pins of n that can be generated from the most unbalanced binary tree with height X and the same #input pins (the tree in Lemma 4.1.5) has $(X-p-1)(X-p-2)/2 + 2pX + p - 2p^2$ total buffer nodes.

Proof: Based on Lemma 4.1.6, to find the most balanced binary tree, we should start consuming buffer nodes at the lower levels (closer to the root of the tree). In the case of the tree described in Lemma 4.1.5, there are 2X-1 nodes. X+p nodes should be placed, one per level, to generate the height of X+p. The rest of the nodes, X-p-1, are put at the last level of the tree. So, a tree similar to what is shown in Figure 4.6 will be obtained.

Figure 4.6: The most balanced binary tree that we can get by increasing the height of the tree described in Lemma 4.1.5 without increasing the number of input pins.

Now, we need to count the number of buffer nodes of this tree. It consists of two groups. The first group belongs to the X-p-1 nodes at the last level. The first of these nodes needs 0 buffer nodes, the second one needs 1, the third one 2, ..., and the (X-p-1)th one needs X-p-2 buffer nodes. This sums up to (X-p-2)(X-p-1)/2. The second group of buffer nodes corresponds to the long wires which start from the nodes at levels 1 to 2p and end at the last level of the tree. The first wire in this group, which starts from the node at level x=1, needs X-p-1 buffer nodes because it travels from level 2 to the last level of the tree and at each level it needs one buffer node. The second wire in this group needs X-p-2 buffer nodes, ..., summing up to $2pX + p - 2p^2$. Therefore, the total number of buffer nodes is the sum of these two groups, which is the expression given in the lemma.

Now, we are ready to prove Theorem 4.1.2.
By transforming the statements in Theorem 4.1.2 into mathematical expressions, we basically need to prove that the following inequality is not valid for any natural number p: $1 < Y_{diff} < p$, in which $Y_{diff}$ is the difference between the total number of buffer nodes of the second tree ($t_2$) and the first one ($t_1$). The following equation shows the expression for $Y_{diff}$:

$$Y_{diff} = \{(X-p-1)(X-p-2)/2 + 2pX + p - 2p^2\} - \{(X-2)(X-1)\} = (-X^2 + 4(p+1)X - 2p - 3p^2 - 3)/2 \tag{4.6}$$

in which $X, p \ge 1$ are natural numbers. Another constraint, as mentioned in the previous lemmas, is $1 \le p \le X-1$. By solving these inequalities, it is easy to see that there is no valid value of p that satisfies all inequalities.

What we just proved has the following meaning: having gates in the library with no more than two inputs (as in the SFQ library of gates [74]), it is not possible to provide a counter example to disprove the optimality of the dynamic programming based approach presented for path balancing tree mapping. Therefore, the path balancing tree mapping algorithm presented in Section 4.1.3 gives the optimal solution for 2-input gates and serves as an effective heuristic for multi-input gates.

Figure 4.7: Matches for a node in a subject graph: (a) before retiming, (b) after retiming.

4.1.3.3 Retiming

After finishing the technology mapping and inserting the path balancing DFFs, a standard retiming algorithm [92] as in [75] can be used to reduce the total number of path balancing DFFs. In our path balancing technology mapping algorithm, we considered the retimed versions of matches for each node during the tree traversal. In other words, the number of path balancing DFFs that is considered in Eq(4.1) is for retimed matches. Figure 4.7 shows a match for a node in a subject graph before and after applying the retiming algorithm.
In our path balancing technology mapping algorithm, the retimed version (Figure 4.7b) is used for counting the number of DFFs for a match. Therefore, it should be proven that the formulas developed in Section 4.1.3.2 for #input pins are valid for retimed matches as well. For this purpose, it is enough to show that after applying the retiming algorithm to a match, Eq(4.4) remains valid for relating its #input pins to its buffer node count.

Lemma 4.1.8 Eq(4.4) is valid for a retimed match.

Proof: Referring to Section 4.1.3.2, the formula of #input pins for a binary tree is developed by considering the effect of each buffer node at each level of the tree in reducing the total number of input pins compared with a full binary tree. After retiming, a subset of buffer nodes will be moved from higher levels (closer to the leaves of the tree) to lower levels (closer to the root of the tree). Therefore, it should be proven that the contributions of buffer nodes in Eq(4.4) for the new architecture (after retiming) and for the old one (before retiming) are the same. In other words, Eq(4.4) is valid both before and after retiming. For this purpose, suppose that there is a node (node j) at level x=X of a binary tree and, for simplicity, suppose that all of its inputs are coming from PIs (e.g., node 3 in Figure 4.7a). This node generates two buffer nodes per level starting from x=X+1 all the way down to the last level (x=H). The contribution of those buffer nodes to the right-hand side of Eq(4.4) is as follows:

$$2\{2^{H-(X+1)} + 2^{H-(X+2)} + \dots + 2^{1} + 1\} \tag{4.7}$$

After retiming, node j will be moved to the last level (x=H), and it will generate a single buffer node at each level starting from x=H-1 all the way up to x=X. For example, after retiming is applied to the tree in Figure 4.7a, node 3 will be moved to the last level (x=4) and it will generate one buffer node at level three and one at level two.
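The two buffer placements just described can be compared numerically. The following Python sketch is only an illustration: it assumes, as read off Eq(4.7), that a buffer node at level x contributes a weight of 2^(H-x), and the function names are hypothetical rather than taken from the dissertation's implementation. It sums the contributions for two buffers per level on levels X+1 to H (before retiming) and for one buffer per level on levels X to H-1 (after retiming), and compares both with the geometric-series value 2^(H-X+1) - 2.

```python
def buffer_contribution(levels, buffers_per_level, height):
    """Sum the weights 2**(height - x) of the buffers placed on the given levels.

    The per-buffer weight is an assumption read off Eq(4.7).
    """
    return sum(buffers_per_level * 2 ** (height - x) for x in levels)


def before_retiming(node_level, height):
    # Two buffers per level on levels X+1 .. H.
    return buffer_contribution(range(node_level + 1, height + 1), 2, height)


def after_retiming(node_level, height):
    # One buffer per level on levels X .. H-1.
    return buffer_contribution(range(node_level, height), 1, height)


def closed_form(node_level, height):
    # Value of both geometric series.
    return 2 ** (height - node_level + 1) - 2


def contributions_agree(max_height):
    """Check all 1 <= X <= H <= max_height."""
    return all(
        before_retiming(x, h) == after_retiming(x, h) == closed_form(x, h)
        for h in range(1, max_height + 1)
        for x in range(1, h + 1)
    )
```

For the node 3 example above (X=2, H=4), both placements contribute the same amount, and `contributions_agree(12)` checks every pair of levels up to height 12.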
Contributions of the new buffer nodes generated after retiming to the right-hand side of Eq(4.4) are as follows:

$$1\{2^{H-X} + 2^{H-(X+1)} + \dots + 2^{1}\} \tag{4.8}$$

It is easy to see that both Eq(4.7) and Eq(4.8) have the same value, equal to $2^{H-X+1} - 2$. This shows that Eq(4.4) is valid both before and after retiming; hence, Lemma 4.1.8 is proven.

4.1.3.4 DAG Mapping

For finding path balancing mapping solutions for DAGs, a cut-enumeration-based method similar to what is presented in [25, 29], followed by a dynamic programming approach similar to what is discussed in Section 4.1.3, is used. The subject graph in this case is a DAG, as opposed to Section 4.1.3 where it was a tree. As the experimental results in Section 4.1.4 show, this method provides considerable improvements in reducing the total number of path balancing DFFs and the total area compared with the state-of-the-art technology mappers. Note that for most of the benchmark circuits the subject graph is actually a DAG.

Algorithm 2: PBMap
Input: Given network: N = (V, E)
Output: Mapped network with minimum path balancing overhead: N_Map
1  // pre-mapping computations:
2  Compute k-feasible cuts for each node.
3  Compute truth tables for each cut.
4  Construct and initialize the mapping manager, pMan.
5  Generate the library of supergates.
6  for each node v in N do
7      Find the best mapping solution based on Eq(4.1).
8  Depth Minimization (pMan, N)
9  Area Optimization (pMan, N)
10 // generating the mapped network:
11 N_Map = Network From Map (pMan, N)
12 return N_Map

4.1.3.5 Clock Jitter Accumulation

Clock jitter accumulation is a measurement of the timing uncertainty at the user defined time offset over the course of a few clock cycles [64]. In the worst case, this can result in obtaining erroneous outputs. Therefore, it is crucial to design a clock distribution network with an acceptable amount of accumulated jitter.
We believe that in our proposed path balancing technology mapping and retiming algorithm, to a first order, the accumulated clock jitter along input-output paths of the circuit will not be changed compared to conventional path balancing methods. This is because our algorithm reduces the number of required path balancing DFFs for a given circuit without changing the gate-level wave-pipelined structure of the circuit.

4.1.4 Experimental Results

The path balancing technology mapping algorithm (PBMap) is implemented inside ABC [38]. Algorithm 2 shows its pseudo code. After finding balanced matches for each node, the depth of the best match is minimized using an algorithm similar to what is presented in [29]. This depth minimization is done without degrading the best achieved path balancing solution (without increasing the balancing overhead, i.e., #DFFs).

Figure 4.8: Simulation results for a 4-bit Kogge-Stone adder (KSA4). (a) Input/output signals for the KSA4 circuit generated by our algorithm, (b) output signals generated for the same inputs by a KSA4 circuit which is not path balanced. Four sets of random inputs are applied: a0=1001, a1=1111, a2=1001, a3=1111, b0=1010, b1=1100, b2=1010, b3=1100, cin=1001. The correct outputs are: S0=1010, S1=1010, S2=1110, S3=1010, Cout=1101. As seen, only the results in (a) are correct.

Faster system operation, in the sense of finishing a given task in a shorter amount of time, directly depends on the logical depth. In fact, the operation latency is the product of the number of cycles needed to do the operation and the clock cycle time. Shorter logical depth directly translates to lower cycle latency for the operation, but its effect on clock cycle time is hard to characterize before place and route is done. That is why we only talked about reducing the logical depth as our objective.
One could consider the total number of gates and DFFs as the cost function and develop theorems similar to what is presented in Section 4.1.3 for the new cost function. However, to minimize the total area, we added an extra area optimization pass, as in line 9 of the shown pseudo code. In this area optimization pass, a match with the least area which preserves the best obtained #DFFs and the minimum depth is chosen for each node.

To see if circuits generated by PBMap operate correctly, we simulated a few benchmark circuits, including a 4-bit Kogge-Stone adder (KSA4), using JSIM [71]. Figure 4.8 shows the input/output waveforms for the KSA4 circuit for two cases: (a) when our path balancing algorithm is applied, (b) without any path balancing. Four random values are considered for input a, input b, and carry in ($c_{in}$): $a_0$=1001, $a_1$=1111, $a_2$=1001, $a_3$=1111, $b_0$=1010, $b_1$=1100, $b_2$=1010, $b_3$=1100, $c_{in}$=1001. The correct sum ($S_0$ to $S_3$) and carry out ($C_{out}$) for these inputs are as follows: $S_0$=1010, $S_1$=1010, $S_2$=1110, $S_3$=1010, $C_{out}$=1101. Please note that a subscript of 0 marks the least significant bit of a signal and a subscript of 3 marks the most significant bit. As seen in Figure 4.8a, the circuit generated by PBMap produces the correct outputs, while, as Figure 4.8b shows, a circuit with no added path balancing DFFs produces erroneous outputs. Notice that since the depth of the KSA4 circuit is 6, we have to wait at least 6 clock periods after applying the first set of inputs to see the first round of correct outputs.

An SFQ library of gates as in [74], consisting of and2, or2, xor2, DFF, splitter, and inverter gates, was used, and several ISCAS [59], EPFL [37], MCNC [157], and arithmetic benchmark circuits were considered. Table 4.1 shows the experimental results for PBMap and a baseline mapper. The baseline mapper is ABC's mapper plus inserting path balancing DFFs and applying the standard retiming algorithm [92] for minimizing DFF count.
The total number of path balancing DFFs is reported both before and after applying the retiming algorithm, mainly to show the effectiveness of our path balancing algorithm in reducing DFF count. The retiming algorithm helps in reducing the DFF count in netlists generated by both PBMap and the baseline mapper. In Table 4.1, KSA16 and ID8 are a 16-bit Kogge-Stone adder and an 8-bit integer divider, respectively.

PBMap was able to reduce #DFFs by 2.7× and 1.2× before and after retiming, respectively, for one of the EPFL benchmark circuits (priority) compared to the baseline. PBMap reduces area, total JJ count (#JJ), logical depth, and run-time by 1.11×, 1.08×, 96%, and 7.66×, respectively, over the baseline for the same circuit. Area in Table 4.1 is the total area of gates, path balancing DFFs, and splitters. On average over all benchmark circuits, PBMap improves the run-time over the baseline by 49.78%, mainly because its run-time for retiming is lower than the baseline's, since fewer path balancing DFFs need to be inserted. #DFFs (before retiming), #DFFs (after retiming), area, #JJs, and logical depth are reduced by an average of 20.64%, 15.06%, 12.22%, 11.22%, and 14.56%, respectively, for PBMap compared with the baseline.

To compare circuits generated by PBMap with other published results, we include experimental results of a 16-bit wave-pipelined sparse-tree RSFQ adder [32]. The fabrication results published in [32] show that the JJ count for this design is 9941. Using the same cell library (CONNECT cell library [159]), PBMap consumes 8901 JJs for mapping a 16-bit Kogge-Stone adder, which is a 10.5% reduction in JJ count even though the Kogge-Stone adder is itself more complex than the sparse-tree adder. The difference between these two sets of numbers is due to the following reasons: (i) our algorithm is highly effective in reducing total JJ count.
(ii) the results we presented are post-synthesis, i.e., we did not account for any JJs used in JTL connections, while the results in [32] account for such connection costs. Please note that using the CONNECT cell library, the JJ count for KSA16 is increased compared with the case of using the cell library in [74]. One important reason for this difference is that there is no OR gate in the CONNECT cell library, and since the inverter is expensive (it consumes 10 JJs, while an XOR has 11 JJs and an AND gate has 13 JJs), implementing an OR gate, which frequently appears in arithmetic circuits, using AND and inverter gates (De Morgan's law) consumes at least three times as many JJs. In [43], an 8-bit RSFQ ALU is presented, which supports 12 sets of operations including ADD, ADD-Invert A, AND, NOR, and XNOR. Therefore, this ALU is much more complex than our adders, and comparing its JJ count with that of our adders is not fair; hence, it is not included in this section.

4.2 Minimizing Area

We have observed that if the path lengths of a given SFQ circuit are not controlled during technology mapping, we may end up with a circuit that needs the insertion of many path balancing DFFs to satisfy the path balancing requirement. Figure 4.9 compares original logic gate and required path balancing DFF counts for a few benchmark circuits. These circuits are mapped using a state-of-the-art open-source academic logic synthesis tool called ABC [38], and are path balanced and retimed [92] similar to [75]. For the IntDiv8 circuit (an 8-bit integer divider), the DFF count is as high as 4.5× the original logic gate count in the circuit. On the other hand, if the balancing requirement of the circuit is considered during technology mapping, it is possible to come up with much better mapping solutions which need fewer path balancing DFFs. For example, for a 2-bit adder, as shown in Figure 4.10, two mapping solutions with the same gate count and area exist.
However, one of them requires 10 path balancing DFFs whereas the other one needs only three path balancing DFFs.

In this section, we present a technology mapping algorithm which takes the balancing overhead into account and tries to find the most cost-efficient mapping solutions for a given circuit. Thanks to this algorithm, the total gate count (accounting for both the original logic gates and the path balancing DFFs) is reduced, which results in decreasing the total area and static power consumption of the circuit (static power is the main source of power consumption in SFQ circuits [111]). The proposed algorithm provides an optimal solution for balanced tree mapping, and a modified version of it acts as a very effective heuristic for path balancing general Directed Acyclic Graph (DAG) mapping. The optimality of the algorithm for tree mapping is proven by developing models relating the number of required path balancing DFFs at each level of a given subject tree to the leaf node count of this tree and to its height.

Figure 4.9: Path balancing DFF count versus original gate count for a few benchmark circuits. KSA32 is a 32-bit Kogge-Stone adder; IntDiv8 is an 8-bit integer divider; and c5315, c7552, and c3540 are chosen from the ISCAS benchmark suite [59]. These circuits are mapped using the map command of ABC [38], and path balanced and retimed using full path balancing [75, 123] and retiming [92] algorithms.

Our path balancing technology mapping algorithm can be used in wave-pipelined circuits to increase the rate at which data can propagate through the circuit (throughput) by decreasing the differences between the lengths of the shortest and longest paths [20, 31]. It can also be used in any technologies or design methods that require full path balancing.
For example, in stateful logic [140], each gate combines logic and memory, and if all logic paths have equal length, the throughput of the circuit can be increased by orders of magnitude at the expense of the area overhead of the required path balancing buffers [140]. Our path balancing technology mapping algorithm helps reduce this area overhead. In the rest of this section, we use path balancing DFFs or buffers to refer to the elements that should be inserted onto short paths to satisfy the full path balancing property.

Figure 4.10: Two mapping solutions for a 2-bit Kogge-Stone adder using the mcnc.genlib library of gates (red squares are path balancing DFFs): (a) consuming 11 gates with an area of 37.0 units and requiring 10 path balancing DFFs, (b) consuming 11 gates with the same area but requiring only three path balancing DFFs. The total area of gates and path balancing DFFs in the second circuit is much less than that of the first one.

The main contributions of this section are as follows:
• Capturing the effect of required path balancing DFFs by proposing a new cost function and performing simultaneous technology mapping and path balancing.
• Developing new formulas relating the number of required path balancing DFFs at each level of a tree to the leaf node count of this tree and its height.
• Presenting new lemmas and algorithms to prove optimality of the presented technology mapping algorithm for tree-like structures.
• Presenting an effective DAG mapping heuristic for area minimization in path balancing technology mapping.
The rest of this section is organized as follows: Section 4.2.1 summarizes several previous logic synthesis and technology mapping works, with a focus on papers targeting area minimization. Section 4.2.2 presents our path balancing technology mapping framework targeting area minimization in SFQ circuits. It presents the new cost function, algorithms, lemmas, and theorems for proving optimality of our algorithm for trees, and a heuristic for area minimization in path balancing DAG mapping. Finally, Section 4.2.5 gives experimental results.

4.2.1 Related Work

Keutzer [80] developed DAGON: a technology binding tool with local optimizations. DAGON partitions general DAGs into forests of trees and finds optimal solutions for these trees by pattern matching. Cong and Ding [29] developed FlowMap: the first load-independent delay optimal DAG mapping algorithm for FPGAs. Chaudhary and Pedram [22] presented an optimal area-delay mapper by constructing the Pareto optimal frontier using dynamic programming. Mishchenko et al. [107] developed a priority-cut-based technology mapping tool in which the priority of selecting matches for individual nodes can be chosen as delay, area, or any other desired metric. In [4], majority-based logic synthesis is introduced, and it is further developed in [5]. Soeken et al. [136] proposed an effective algorithm for exact logic synthesis of Boolean networks using majority-inverter graphs, which improved both area and delay after lookup table (LUT)-based technology mapping. In [81], a gate sizing and buffer insertion algorithm is proposed to achieve path balancing in CMOS circuits; this algorithm reduces load capacitance and glitches at the same time. Pasandi et al. [125] developed a depth minimization with path balancing algorithm for technology mapping, which also optimizes the product of the worst stage delay and the length of the longest path.
In [123], a path balancing technology mapping algorithm is presented for single flux quantum logic circuits with the goal of minimizing the total required number of path balancing DFFs; this algorithm is proven to provide optimal tree mapping solutions when the given library has up to 2-input gates. In [103], an SOP-balancing algorithm that generalizes an AND-balancing approach is presented; while the AND-balancing algorithm is limited to multi-input AND gates, the SOP-balancing algorithm supports more complex functions and therefore has opportunities for further delay minimization.

Regarding area minimization in technology mapping, there are several published papers. Farrahi and Sarrafzadeh [40] proved that even restricted cases of lookup-table count minimization, as a measure of area in FPGAs, are NP-complete for DAGs. However, these authors presented a polynomial time algorithm for minimum-area tree mapping and a polynomial time heuristic for area minimization in general Boolean networks. Chaudhary and Pedram [23] presented a near optimal algorithm for technology mapping that minimizes area under delay constraints; this is achieved by generating area-delay curves in a topological ordering traversal of the given subject graph and selecting solutions from these curves in a reverse topological ordering traversal. Chen and Cong [25] studied the technology mapping problem for FPGAs targeting area minimization; they considered the potential node duplication during the cut-enumeration process such that the mapping cost is encoded into cuts, and after the timing constraints are met, the constraints on non-critical paths are relaxed in order to minimize the area. Manohararajah et al.
[100] presented IMap, an iterative technology mapping tool that supports depth-oriented, area-oriented, and duplication-free mapping modes; in the depth-oriented mapping mode, area is a secondary objective, while in the area-oriented mapping mode, area is the first objective.

In this section, we present a technology mapping framework targeting area minimization in logic circuits which require full path balancing. For this purpose, our new technology mapping cost function captures the effect of the path balancing buffers, and our algorithm minimizes the total area, including the area of gates and of these buffers. We will prove that our algorithm provides optimal tree mapping solutions and that a modified version of it, which captures node duplication encoded in the information of logic levels, acts as a very effective heuristic for DAG mapping.

There are two main differences between the mapper presented in this section and the one in the previous section: (i) In the mapper presented in the previous section, the total number of path balancing DFFs is minimized and the total gate count is not considered in the optimization. This may increase the sum of gate and DFF counts, which can potentially increase the total area. To solve this problem, in this section, we define and minimize a new cost function which captures the effect of gates and path balancing buffers (DFFs) at the same time, thereby ensuring total area reduction. (ii) In the mapper presented in the previous section, the optimality of the technology mapping algorithm is proven for trees, assuming that the gates in the library have up to two inputs. However, in this section, we provide a proof of optimality for the general case of having any multi-input gates in the library. In addition, in this section we propose a novel heuristic for mapping general DAGs by encoding duplication of nodes into their logic levels, which helps generate more area-efficient solutions for DAGs.
Moreover, for the benefit of CMOS circuits, we present a modified version of the path balancing technology mapping algorithm which preserves the best obtained critical path delay while still reducing the path balancing overhead.

4.2.2 Technology Mapping

4.2.2.1 Terminology and Notation

Buffer Node: A single-fanin, single-fanout node which is added by the path balancing technology mapper to some selected edges of the mapped circuit to ensure that the circuit is fully path balanced. A buffer node is replaced by a DFF in SFQ circuits.

And-Inverter Graph (AIG): A subject graph in which nodes are 2-input AND gates. Inverters are modeled as a field in the data structure of the node [38]. Therefore, if the AIG is a tree, it will be a binary tree rooted at node t with every node having exactly two inputs.

Reverse level $R_i$ of node i in a tree T: Length of the longest path (in terms of the node count, including the node itself) from node i to the root node t in T. The root is at reverse level $r = R_t = 1$.

Leaf node of a tree T: An input to the tree (we assume a tree input is modeled as a fanin-free, dummy node).

Logic level $L_i$ of node i in a tree T: Length of the longest path (in terms of the node count, including the node itself) from any tree input to this node. The tree inputs are at logic level 0.

Height H of a tree T: Number of nodes on the longest path from any tree input to the root (this is the same as the largest reverse level of any node).

n: Denotes the number of leaf nodes in tree T.

N: Denotes the number of internal nodes of tree T, including the root node t.

y(z, r, i): Denotes the number of nodes with z inputs at reverse level r of tree T rooted at node i.

Y(z, i): Denotes the number of nodes with z inputs in tree T rooted at node i: $Y(z, i) = \sum_{r=1}^{H} y(z, r, i)$.

4.2.2.2 Presenting the Algorithm

The balanced tree mapping problem is solved using dynamic programming (DP).
The input of the technology mapper is an AIG with tree structure, and the pattern graphs are gates in the library [74], or supergates (footnote 1) [106] generated using these gates. Matches at node i are obtained by enumerating all k-feasible cuts [29, 107] of the node and examining all supergates that can implement the function of this node based on the inputs of each computed cut. In path balancing technology mapping, the cost of any added buffer nodes (DFFs in SFQ circuits) is included in the total area cost of the mapping solution. The cost of mapping a sub-tree rooted at node i in a given subject tree can thus be written as follows:

$$Cost^{*}(i) = \min\{Cost(m, i)\} \quad \forall m \in matches(i)$$
$$Cost(m, i) = Cost(m) + SLD(SuppC(m, i)) \times Cost_{BFR} + \sum_{\forall k \in SuppC(m, i)} Cost^{*}(k) \tag{4.9}$$

where $Cost^{*}(i)$ is the dynamic programming (i.e., minimum) cost of the mapping of the AIG sub-tree rooted at node i, Cost(m, i) is the cost of a particular match m at node i, and matches(i) and SuppC(m, i) denote the possible matches for node i and the support (boundary nodes) of the cut (among the k-feasible cuts of node i) that corresponds to the match m. In the above equation, SLD(S) denotes a function that returns the sum of the absolute values of the level differences of pairs of nodes in a node set S. For example, $SLD(\{i, j, k, l\})$ where $L_i = 1$, $L_j = L_k = 2$, $L_l = 4$ is equal to $(4-1) + 2 \times (4-2) = 7$. The cost of a path balancing buffer is denoted by $Cost_{BFR}$. The product of SLD(SuppC(m, i)) and $Cost_{BFR}$ thus gives the total cost of the required path balancing buffers.

Footnote 1: Supergates are small trees, which are generated by exhaustively concatenating original gates in the given library.

Example 5 Consider the AIG tree shown in Figure 4.11.
The value of the optimal solution for mapping the tree rooted at node i when k=3 (in the cut computation procedure) is calculated as follows:

$$Cost^{*}(i) = \min\{\; Cost(m_{C_1}) + SLD(\{i+1, i+2\}) \times Cost_{BFR} + Cost^{*}(i+1) + Cost^{*}(i+2),$$
$$Cost(m_{C_2}) + SLD(\{i+1, i+3, i+4\}) \times Cost_{BFR} + Cost^{*}(i+1) + Cost^{*}(i+3) + Cost^{*}(i+4),$$
$$Cost(m_{C_3}) + SLD(\{a, b, i+2\}) \times Cost_{BFR} + Cost^{*}(i+2) + Cost^{*}(a) + Cost^{*}(b) \;\} \tag{4.10}$$

where $m_{C_1}$, $m_{C_2}$, and $m_{C_3}$ are supergates (there can be several for each cut) which implement the function of node i based on the inputs of cuts $C_1$, $C_2$, and $C_3$, respectively. $Cost^{*}(\cdot)$ is equal to 0 for leaf nodes. Squares in Figure 4.11 denote leaf nodes. Please note that in the rest of this section (Figures 4.12, 4.13, 4.14), the leaf nodes are not shown.

Figure 4.11: An AIG tree with five nodes. The 3-feasible cuts (C1, C2, and C3) of node i in this subject tree are shown. Leaf nodes are shown with squares. Nodes are labeled in a Breadth First Search (BFS) order (root, left, right).

The minimum-cost fully-path-balanced mapping solution for a given tree is generated using the following algorithm, which is called BalancedMap: First, k-feasible cuts and their truth tables are computed for each node [29, 107]. Next, in a topological ordering traversal starting from the PIs of the tree, nodes are visited and, for each node, the best solution, i.e., the one which gives the least value for the said cost function, is computed using dynamic programming, with Eq. 4.9 defining the value of the optimal solution. When the root is visited, the optimal mapping cost for the whole tree is known, and the corresponding mapping solution is generated by a reverse topological ordering traversal from the root to the leaf nodes of the tree. Finally, splitters and buffers are inserted. Algorithm 3 shows the pseudo code of our path balancing technology mapping algorithm.
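The recurrence of Eq. 4.9 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the BalancedMap implementation: the match table, gate costs, buffer cost, and node names are all hypothetical; matches are given directly instead of being derived from cut enumeration and supergate matching; the logic level of a mapped node is taken to be one more than the deepest node in the chosen support; and SLD is implemented as the sum of level differences to the deepest support node, which is what the worked example (4-1) + 2 × (4-2) = 7 computes. Memoized recursion stands in for the topological traversal, which is equivalent on a tree.

```python
from functools import lru_cache

COST_BFR = 1.0  # assumed cost of one path balancing buffer (DFF)

# Hypothetical toy instance shaped like the five-node tree of Example 5:
# node -> list of (gate_cost, support) pairs. Names and costs are made up.
MATCHES = {
    "i":  [(1.0, ("i1", "i2")),         # match on cut C1
           (1.5, ("i1", "i3", "i4")),   # match on cut C2
           (1.5, ("a", "b", "i2"))],    # match on cut C3
    "i1": [(1.0, ("a", "b"))],
    "i2": [(1.0, ("i3", "i4"))],
    "i3": [(1.0, ("c", "d"))],
    "i4": [(1.0, ("e", "f"))],
}


def sld(levels):
    """Sum of level differences between each support node and the deepest one."""
    deepest = max(levels)
    return sum(deepest - level for level in levels)


@lru_cache(maxsize=None)
def best(node):
    """Return (cost, logic_level, support) of the cheapest match at this node."""
    if node not in MATCHES:  # leaf node: cost 0, logic level 0
        return 0.0, 0, ()
    candidates = []
    for gate_cost, support in MATCHES[node]:
        sub = [best(k) for k in support]
        levels = [level for _, level, _ in sub]
        cost = gate_cost + sld(levels) * COST_BFR + sum(c for c, _, _ in sub)
        candidates.append((cost, max(levels) + 1, support))
    return min(candidates, key=lambda cand: cand[0])


root_cost, root_level, root_support = best("i")
```

In this toy instance the match on the second cut wins because its three support nodes sit at the same logic level and need no buffers, whereas the other two matches pay for one and four buffers, respectively.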
In line 1, the mapping manager is constructed and a few pre-processing steps, such as generating supergates and computing k-feasible cuts and their truth tables, are done. These steps are similar to ABC [38]. In lines 2-3, the most cost-efficient solutions minimizing the cost function in Eq. 4.9 are calculated. In line 4, the optimally mapped circuit is obtained by traversing the tree back from its root to its PIs. In line 5, the mapped circuit is given to a function that inserts path balancing buffers (DFFs) wherever they are needed (footnote 2) and performs standard retiming [92] (footnote 3). Finally, if this algorithm is used for mapping SFQ circuits, in line 6, splitters are inserted at the outputs of gates with more than one fanout; this step can be skipped for technologies that do not require splitters for generating more than one fanout.

Footnote 2: This function traverses all gates; for each gate, it finds the maximum logic level among its immediate fanin gates ($L_{max}$), and inserts $L_{max} - L_i$ buffers at the immediate fanin gate $G_i$, where $L_i$ is the logic level of this fanin gate.

Footnote 3: To minimize the register count; path balancing buffers are treated as registers in the retiming algorithm.

Algorithm 3: BalancedMap
Input: Tree T = (V, E) rooted at node i comprising node set V and edge set E, and gate library L
Output: The optimally mapped, path balanced circuit with inserted splitters, N_Map
1  Start the mapping manager and perform pre-processing steps.
2  for each node v in V do
3      Find the most cost efficient mapping solution based on Eq. 4.9.
   // generating the mapping tree:
4  T_i = Network From Map (T)
   // inserting path balancing DFFs and performing retiming:
5  N_2 = add Buffers Retime (T_i)
   // inserting splitters:
6  N_Map = InsertSplitters (N_2)
7  return N_Map

The complexity of computing k-feasible cuts is O(KMN), where K is a constant, N is the node count, and M is the edge count [29].
The complexity of selecting the best solution for a node, given its k-feasible cuts, is O(K'Np), in which K' is the maximum number of k-feasible cuts for a node in the given subject graph and p is the number of gates in the library. The complexity of generating the mapped circuit is O(N), because each node is visited at most once. The complexity of adding path balancing buffers is O(M + N), because each node is visited once and, for a visited node, each input edge is visited once. The complexity of retiming is O(MN log(N)) [92]. Since each node is visited once in the splitter insertion process, the complexity of inserting splitters is O(p'N), where p' is a constant representing the time required for inserting splitters at the output of a gate. Therefore, the complexity of the whole algorithm is determined by the retiming step to be O(MN log(N)).

4.2.3 Proof of Optimality

In this section, we prove that the principle of optimality of dynamic programming (footnote 4) is satisfied in our problem formulation and therefore the aforesaid algorithm produces the optimal tree mapping solution in polynomial time. For this purpose, we first need to develop a few models relating height, leaf node count, and path balancing buffer count in a tree.

Footnote 4: The optimal solution for a subset of the problem should be built from the optimal solutions for its sub-problems.

The total leaf node count of a full binary tree with height $H$ is $2^H$. For a full b-ary tree (footnote 5), this number is equal to $b^H$. A tree which is not full has fewer leaf nodes. If a node with in-degree $b' < b$ is present at reverse level $r$ of a tree, it reduces the leaf node count by the following amount: $b^{H-r+1} - b' \times b^{H-r}$. As a sanity check, a node with maximum in-degree ($b' = b$) does not reduce the leaf node count.
Based on this fact and using some basic properties of trees, a closed-form formula for the leaf node count of a tree $T$ rooted at $t$, based on its height and the number of nodes (with more than one input) and buffers at different levels of the tree, can be written as follows:

$$
b^H - (b-1)\left[y(1,2,t)\,b^{H-2} + y(1,3,t)\,b^{H-3} + \dots + y(1,H,t)\right] - (b-2)\left[y(2,2,t)\,b^{H-2} + y(2,3,t)\,b^{H-3} + \dots + y(2,H,t)\right] - \dots = n \tag{4.11}
$$

The above equation starts with the leaf node count of a full b-ary tree ($b^H$) and first accounts for the reduction of the total leaf node count due to nodes with in-degree of one (coefficient $b-1$), then due to nodes with in-degree of two (coefficient $b-2$), and so on, all the way to nodes with in-degree of $b-1$. This equation can be rewritten as follows:

$$
\sum_{z=1}^{b-1}\left\{(b-z)\sum_{r=2}^{H} y(z,r,t)\,b^{H-r}\right\} = b^H - n \tag{4.12}
$$

Note that in the above equation, it is assumed that the root node $t$ has in-degree $b$. If the root node has fewer inputs (e.g., $b' < b$), the right-hand side of Eq. 4.12 should be replaced with $b' \times b^{H-1} - n$.

Footnote 5: The tree with the maximum number of nodes for the given height, in which all nodes have in-degree of b.

Figure 4.12: An example to verify the correctness of Eq. 4.12 for giving the total leaf node count of a general tree. The leaf node count of this tree is $2 \times 3^2 - 2\{1 \times 3 + 1\} - 1\{0 + 2\} = 18 - 8 - 2 = 8$ ✓. (Legend: buffer node, regular node; reverse levels x = 1, x = 2, x = H = 3.)

Example 6 Suppose that there are up to 3-input gates in the library; the left-hand side of Eq. 4.12 will be $\sum_{z=1}^{2}\{(3-z)\sum_{r=2}^{H}[y(z,r,t) \times 3^{H-r}]\}$. Figure 4.12 shows a tree with height H=3. The leaf node count of this tree is calculated as follows:

$$
2\{1 \times 3 + 1\} + 1\{0 + 2\} = 2 \times 3^2 - n \;\Rightarrow\; n = 18 - 8 - 2 = 8 \;\checkmark \tag{4.13}
$$

Figure 4.12 verifies the correctness of these calculations.

Lemma 4.2.1 The total leaf node count of a tree with $N$ internal nodes is calculated as follows:

$$
n = \left\{\sum_{i=1}^{N}(b_i - 1)\right\} + 1 \tag{4.14}
$$

where $b_i$ is the in-degree of the internal node $i$.

Proof by induction: Base case: a tree with only one internal node (node 1) has $b_1$ leaf nodes.
Induction step: assume that a tree with $j$ internal nodes has $\{\sum_{i=1}^{j}(b_i - 1)\} + 1$ leaf nodes. If a leaf node is replaced with a new internal node (node $j+1$), one leaf node is lost, but $b_{j+1}$ new leaf nodes are generated by the added node. In total, $b_{j+1} - 1$ leaf nodes are added. So, the total leaf node count will be as in Eq. 4.14.

A special case of the above lemma (as mentioned in lemma 1 in [123]) for $b=2$ is: $n = N + 1$. Instead of going through all internal nodes and summing up their in-degree minus one values (as in Eq. 4.14), we can traverse a tree level by level, account for the contributions of gates with a specific in-degree to the total leaf node count, and repeat this for all in-degree values larger than or equal to two (we start with in-degree of two because, for in-degree of one, $b_i - 1 = 0$). Therefore, using the $y(z,r,t)$ terminology, for a tree with height $H$, Eq. 4.14 can be equivalently rewritten as follows:

$$
n = \sum_{z=2}^{b}\left\{(z-1)\sum_{r=1}^{H} y(z,r,t)\right\} + 1 \tag{4.15}
$$

Furthermore, the cost function introduced in Eq. 4.9 can be simplified as follows. Let $T_i$ denote the mapping solution corresponding to Cost(i) in Eq. 4.9. By assuming that the cost of a library gate is proportional to the number of its inputs, the normalized cost of $T_i$ (denoted by nCost(i)) may be calculated as follows:

$$
\mathrm{nCost}(i) = \sum_{z=1}^{b} z \times Y(z,i) \tag{4.16}
$$

where $b$ is the maximum in-degree of any library gate. Please note that since the iterator $z$ starts from 1, the cost of the path balancing buffers (as single-input nodes) is taken into account. This cost function is equal to the total edge count in $T_i$. Clearly, if $i$ is selected to be the root node $t$, then nCost(t) denotes the total cost of the best mapping solution for the given AIG tree. As mentioned before, by assuming that the area of a library gate with $2z$ inputs is $2\times$ the area of a gate with $z$ inputs, Eq. 4.16 gives the total area cost of the best mapping solution for an AIG tree (including any path balancing buffers).
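The leaf-count relations above can be checked numerically. The following sketch builds a tree whose per-level node counts match the numbers quoted for Figure 4.12 (b = 3, H = 3, root in-degree 2) and compares a direct leaf count against Eq. 4.15 (the level-wise form of Lemma 4.2.1) and against Eq. 4.12 with the reduced-root right-hand side. The nested-list encoding and the assignment of children to parents are assumptions for illustration; only the counts per reverse level come from the text.

```python
LEAF = None  # a leaf; an internal node is the list of its children


def count_leaves(node):
    """Count leaves by direct traversal."""
    if node is LEAF:
        return 1
    return sum(count_leaves(child) for child in node)


def height(node):
    """Number of internal-node levels on the longest root-to-leaf path."""
    if node is LEAF:
        return 0
    return 1 + max(height(child) for child in node)


def level_profile(node, r=1, y=None):
    """y[(z, r)] = number of internal nodes with in-degree z at reverse level r."""
    if y is None:
        y = {}
    if node is not LEAF:
        key = (len(node), r)
        y[key] = y.get(key, 0) + 1
        for child in node:
            level_profile(child, r + 1, y)
    return y


# Hypothetical arrangement: the root has a buffer and a 3-input gate as children.
tree = [
    [[LEAF, LEAF, LEAF]],                    # buffer feeding a 3-input gate
    [[LEAF], [LEAF, LEAF], [LEAF, LEAF]],    # 3-input gate: buffer + two 2-input gates
]

b, H = 3, height(tree)
y = level_profile(tree)
n_direct = count_leaves(tree)

# Eq. 4.15: each internal node of in-degree z contributes z - 1 leaves, plus one.
n_eq_4_15 = sum((z - 1) * cnt for (z, _r), cnt in y.items()) + 1

# Eq. 4.12, with b' * b^(H-1) - n on the right because the root has in-degree b' < b.
lhs_4_12 = sum((b - z) * cnt * b ** (H - r)
               for (z, r), cnt in y.items() if r >= 2 and z < b)
rhs_4_12 = len(tree) * b ** (H - 1) - n_direct
```

With these counts, both formulas agree with the direct traversal, which mirrors the check in Example 6.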
The above formula can be further rewritten as follows:

$$
\mathrm{nCost}(i) = \sum_{z=1}^{b}\left\{z\sum_{r=1}^{H} y(z,r,i)\right\} \tag{4.17}
$$

Using Eqs. (4.15, 4.16), the above cost function is rewritten as follows:

$$
\begin{aligned}
\mathrm{nCost}(i) &= \sum_{r=1}^{H} y(b-1,r,i) + 2\sum_{r=1}^{H} y(b-2,r,i) + 3\sum_{r=1}^{H} y(b-3,r,i) + \dots + (b-1)\sum_{r=1}^{H} y(1,r,i) + \mathrm{Const.} \\
&= \sum_{r=1}^{H}\left\{\sum_{z=1}^{b-1}(b-z)\,y(z,r,i)\right\} + \mathrm{Const.}
\end{aligned}
\tag{4.18}
$$

where Const. refers to some constant values with no impact on the minimization procedure, which can thus be ignored. Therefore, the final form of the cost function for mapping a tree rooted at node $t$ is expressed by Eq. 4.19.

$$
\mathrm{nCost}(t) = \sum_{r=1}^{H}\left\{\sum_{z=1}^{b-1}(b-z)\,y(z,r,t)\right\} \tag{4.19}
$$

On the other hand, Eq. 4.12, which relates the total leaf node count of a tree to its height and the number of nodes at different reverse levels, can be rewritten as follows:

$$
\sum_{r=2}^{H}\left(\left\{\sum_{z=1}^{b-1}(b-z)\,y(z,r,t)\right\} \times b^{H-r}\right) = b^H - n \tag{4.20}
$$

The expression inside braces in Eq. 4.19 is the portion of the total cost function which corresponds to the reverse level $r$; hence, it can be denoted by $CostFunc_r$. Eq. 4.20 acts as a constraint for minimizing the value of the cost function in Eq. 4.19. The next two lemmas introduce trees which give lower/upper bounds for the cost function.

Lemma 4.2.2 Among all b-ary trees with height $H$ and total leaf node count of $n$, the tree which maximizes $CostFunc_r$ at each reverse level $1 \le r \le H$ (starting with smaller reverse levels), subject to satisfying Eq. 4.20 for the whole tree, gives the lower bound for the value of the cost function in Eq. 4.19.

Figure 4.13: An example of the most cost efficient tree with n=7, H=4, and b=3. The right-most column in the table shows portions of the total cost function for a specific reverse level (r). (Legend: buffer node, regular node; reverse levels x = 1, 2, 3, and x = H = 4.)
Table in Figure 4.13, with CostFunc_j = 2×y(1,j) + y(2,j):
  j    y(1,j)  y(2,j)  CostFunc_j
  2    0       2       2
  2    1       1       3
  2    1       0       2
  3    2       1       5
  3    2       0       4
  3    1       2       4
  3    0       3       3

Outline of the Proof: The intuition behind this lemma is that since the right-hand side of Eq.
4.20 is fixed, in order to minimize the cost function (which is equal to the sum of all $CostFunc_r$ at different reverse levels), the $CostFunc_r$ that is multiplied by a larger power of $b$ on the left-hand side of Eq. 4.20 (which corresponds to smaller reverse levels) should have a larger value. Therefore, if we start with smaller reverse levels and maximize their $CostFunc_r$, the total cost function will be minimized. Please note that if maximizing $CostFunc_r$ for reverse level $r$ gives rise to a final tree topology that violates Eq. 4.20, it is not accepted and a smaller value should be used for that level.

Example 7 Suppose that we want to generate the most cost-efficient tree with n=7 and H=4, rooted at $t$, with b=3. With these values, $CostFunc_r$ will be $2 \times y(1,r,t) + 1 \times y(2,r,t)$. For the root, a 2-input gate should be used. For the next level, there are some options for choosing gates; three of them are listed in the first set of numbers in the table shown in Figure 4.13. The second option, which gives the highest valid value for $CostFunc_2$, is selected. This is implemented by using one 2-input gate and one buffer node at reverse level r=2. Note that at this reverse level, there are more options that are not valid. For example, one may use two 3-input gates at r=2. This will generate six leaf nodes at reverse level 2. We have to generate two more reverse levels (to reach H=4) by putting at least one 2-input gate per level. This will generate at least two more leaf nodes for the final tree according to Lemma 4.2.1. Since six leaf nodes are generated up to now, only one more leaf node can be generated to reach n=7. Thus, using two 3-input gates at reverse level 2 of this tree, the maximum achievable height for n=7 will be H=3; therefore, using two 3-input gates at reverse level 2 does not lead to a valid solution. For reverse levels 3 and 4, a similar gate assignment procedure can be used to generate a valid tree.
Due to space limitations, the options for reverse level 4 are not shown in the table of Figure 4.13. Using Eq. 4.19, the value of the cost function for the tree shown in Figure 4.13 is $(1 \times 1) + (2 \times 1 + 1 \times 1) + (2 \times 2 + 1 \times 1) + (2 \times 2 + 1 \times 1) = 14$.

Lemma 4.2.3 Among all b-ary trees with height $H$ and total leaf node count of $n$, the tree which minimizes $CostFunc_r$ at each reverse level $1 \le r \le H$ (starting with smaller reverse levels), subject to satisfying Eq. 4.20 for the whole tree, gives the upper bound for the value of the cost function in Eq. 4.19.

Outline of the Proof: Similar to the explanation given for Lemma 4.2.2, the intuition in this lemma is to give smaller values to $CostFunc_r$ at smaller reverse levels because they are multiplied by larger powers of $b$ in Eq. 4.20, resulting in maximization of the sum of all of them, which is equivalent to maximizing the cost function.

Example 8 The least balanced tree with n=7 and H=4, rooted at $t$, with b=3 (the same values as in Example 7) is shown in Figure 4.14. Note that for reverse level r=2, there is another option which makes $CostFunc_2$ equal to two, but it is not selected as the final choice. This option corresponds to having one 3-input and two 2-input gates at this reverse level. This will generate seven leaf nodes up to now. We cannot generate more leaf nodes, and two more reverse levels still have to be generated, which is not possible. Thus, this option is not valid. The value of the cost function for this example is $(0) + (2 \times 2) + (2 \times 4 + 1 \times 1) + (2 \times 5 + 1 \times 1) = 24$, which is larger than 14 in Example 7.

Figure 4.14: An example of the least cost efficient tree with n=7, H=4, and b=3. (The accompanying table lists candidate values of y(1,j), y(2,j), and CostFunc_j = 2×y(1,j) + y(2,j) for reverse levels j = 2, 3, 4.)

In the above two lemmas, it is assumed that the buffer nodes are only used for path balancing. Note that in Eqs. 4.19 and 4.20, the upper bound for the iterator $z$ can be increased to $b$ without changing the values of the equations ($b - b = 0$).
This allows selection of gates with the maximum in-degree $b$ in the above lemmas.

Lemma 4.2.4 If $T_1$ (with height $X$) and $T_2$ (with height $X+p$) are two valid solutions for mapping a tree rooted at node $t$ of a given subject tree, and $Cost_1$ and $Cost_2$ are their cost functions based on Eq. 4.19, then $Cost_2 - Cost_1$ cannot be less than $p$, where $p$ is a positive integer.

Outline of the Proof: It is enough to prove the lemma for the following $T_1$ and $T_2$ trees with the cost functions mentioned below (the proof is omitted):

$T_1$: The least cost-efficient tree with height $X$.

$$
Cost_1 = b \times n_b + 0 = b \times \frac{b^X - 1}{b - 1} \tag{4.21}
$$

$T_2$: The most cost-efficient tree with height $X+p$.

$$
Cost_2 = 2 \times n_2 + b \times n_b + BFRcnt = 2 \times (X+p-1) + b \times (X+p) + (X+p-2) \times (X+p-1)/2 \tag{4.22}
$$

where BFRcnt is the total path balancing buffer count.

Theorem 4.2.5 The presented minimum-cost fully-path-balanced tree mapping algorithm satisfies the principle of optimality of DP; therefore, it gives the optimal solution in polynomial time.

Proof: Let $S_t$ refer to the optimal solution for mapping a tree rooted at node $t$. For simplicity and without loss of generality, suppose that $S_t$ is built of two sub-parts $i$ and $i'$. Let us call the part of $S_t$ related to mapping the sub-tree rooted at $i$ ($i'$), $S_i$ ($S_{i'}$). The above theorem claims that $S_i$ ($S_{i'}$) is the optimal solution for mapping the sub-tree rooted at $i$ ($i'$). By contradiction, suppose that there is another solution for one of these sub-problems ($S'_i$ for $i$) which is more expensive than the original optimal solution ($S_i$) for mapping the sub-tree rooted at this node, but which gives rise to a better solution for mapping the tree rooted at node $t$. This can only happen if using $S'_i$ results in SLD() in Eq. 4.9 returning a value which is smaller than the difference between the costs of $S_i$ and $S'_i$. Based on Lemma 4.2.4, this cannot happen. Therefore, the theorem is proven.
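The cost values quoted in Examples 7 and 8 can be re-derived from Eq. 4.19, and the same level profiles can be checked against the constraint of Eq. 4.20 and the leaf count of Eq. 4.15. In the sketch below, each reverse level is a dictionary from in-degree to node count. The counts for in-degrees 1 and 2 are read off the cost expressions in the two examples; the counts of 3-input gates are an inference from the number of child slots available at each level, not values stated in the text. The right-hand side of Eq. 4.20 uses the root in-degree, as noted after Eq. 4.12.

```python
def cost_func(level, b):
    """CostFunc_r: sum over in-degrees z of (b - z) * y(z, r)."""
    return sum((b - z) * count for z, count in level.items())


def n_cost(levels, b):
    """Eq. 4.19: total cost is the sum of CostFunc_r over all reverse levels."""
    return sum(cost_func(level, b) for level in levels)


def leaf_count(levels):
    """Eq. 4.15: every node of in-degree z contributes z - 1 leaves, plus one."""
    return sum((z - 1) * c for level in levels for z, c in level.items()) + 1


def constraint_sides(levels, b):
    """Both sides of Eq. 4.20, with b' * b^(H-1) - n on the right for root in-degree b'."""
    H = len(levels)
    root_in_degree = next(iter(levels[0]))
    lhs = sum(cost_func(level, b) * b ** (H - r)
              for r, level in enumerate(levels, start=1) if r >= 2)
    rhs = root_in_degree * b ** (H - 1) - leaf_count(levels)
    return lhs, rhs


B = 3
# Reverse levels 1..4 as {in_degree: count}; 3-input counts are inferred.
example7 = [{2: 1}, {1: 1, 2: 1}, {1: 2, 2: 1}, {1: 2, 2: 1, 3: 1}]
example8 = [{3: 1}, {1: 2, 3: 1}, {1: 4, 2: 1}, {1: 5, 2: 1}]

cost7, cost8 = n_cost(example7, B), n_cost(example8, B)
sides7, sides8 = constraint_sides(example7, B), constraint_sides(example8, B)
```

Both profiles describe trees with seven leaves and height four, yet their costs differ by ten, which illustrates how far apart the bounds of Lemmas 4.2.2 and 4.2.3 can be for the same n and H.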
As mentioned in Section 4.2.2.2, the above proof is based on the assumption of a linear relation between the cost of a gate and its number of inputs (which is a reasonable assumption). If this relation is not linear (e.g., exponential), then the optimality of the proposed algorithm cannot be guaranteed.

4.2.4 DAG Mapping

Most of the terminology and methods presented for trees in this section can be easily extended to DAGs. For example, the reverse level of node $i$ can be defined as the length of the longest path from node $i$ to any root node of the graph. For DAG mapping, we developed the following heuristic: after computing k-feasible cuts and their functions for each node, similar to what is presented in Section 4.2.2.2, the value of the cost function is computed for each node using Eq. 4.9 in a topological-order traversal. The main difference is that, while choosing the best cut for a node, if the following inequality holds for one of the inputs (boundary nodes) of a cut (e.g., $q$), Cost(q) in Eq. 4.9 will be set to 0.

$$
(\mathrm{rep}(q) - 1) \times \frac{L_q}{D} \ge 1 \tag{4.23}
$$

where $D$ is the depth of the AIG representation of the given Boolean network, which is defined as the largest logic level among its nodes, $\mathrm{rep}(q)$ is the number of times node $q$ has been used up to now, and $L_q$ is the logic level of node $q$. The intuition behind this heuristic is to prevent replication of big cones of logic which are already implemented and used for implementing other nodes. Note that multiplication by the normalized logic level (i.e., $L_q / D$) should not be removed from the said inequality; otherwise, the generated mapping solutions will be very unbalanced. For mapping DAGs, line 3 of Algorithm 3 should be modified to capture the revision given by Eq. 4.23.

4.2.5 Experimental Results

The presented path balancing technology mapping algorithm is implemented inside ABC [38]. We implemented two versions of path balancing mappers, BalancedMap (BM) and BalancedMapDelay (BMD).
In BM, the most cost-efficient solutions as in Section 4.2.2 are computed. BMD chooses the most cost-efficient matches for each node of the network subject to not degrading the best delay achieved in a prior delay optimization pass. For comparison, we have included results from [123], which is referred to as PBMap from here on, and also extracted experimental results using the synthesis approach presented in [75] for SFQ circuits; in this synthesis approach, circuits are mapped using the default cut-based technology mapper of ABC, followed by inserting path balancing DFFs, applying standard retiming to reduce the total DFF count, and finally, inserting splitters. In the rest of this section, we refer to this baseline synthesis approach as Base. An SFQ library of gates as in [74], consisting of and2, or2, xor2, not, splitter, JTL, DFF, and MUX21 logic gates, is used.

Several benchmark circuits from the ISCAS [59] and EPFL [37] benchmark suites, and some arithmetic circuits, are used. The complexity of these circuits ranges from s38584 with 12/278 I/Os, 19407 nodes, 32910 edges, and 25185 cubes, and sin with 24/25 I/Os, 5416 nodes, 10832 edges, and 5416 cubes, to dec with 8/256 I/Os, 304 nodes, 608 edges, and 304 cubes.

Table 4.2 shows results for area and delay. For Area, two sets of numbers are reported; the numbers inside parentheses are the total area of gates, path balancing DFFs, and splitters, while the area numbers outside parentheses do not include the area of splitters. The reason behind reporting two sets of results for area is to show that our algorithms not only reduce the total area of gates and DFFs as our cost function demands, but also provide area reduction when the area of splitters is considered too. Delay (in ps) is the critical path delay, which is the sum of the delays of gates in the critical path of the mapped circuit. Figures 4.15, 4.16, and 4.17 compare the value of the cost function (Eq.
4.16), normalized total static power consumption, and total Josephson junction count for BM, PBMap, and Base. In the following, some statistics are reported. Similar to the method in Table 4.2, two sets of numbers are reported for area; the one inside parentheses is for the area of gates, DFFs, and splitters. The results for the total number of path balancing DFFs are not included in this section, but their statistics are mentioned in the following.

For the priority circuit, the cost function, area, total JJ count (#JJs), total number of required path balancing DFFs (#DFFs), and static power consumption are reduced by factors of 1.07, 1.12 (1.11), 1.09, 1.18, and 1.10 compared to the Base, but its delay is degraded by 6.0%. On average for all benchmark circuits, BM decreases the value of the cost function, area, #JJs, #DFFs, and static power consumption by 86.1%, 29.5% (26.3%), 25.6%, 27.8%, and 25.1% over Base, while the average delay is increased by 5.6%. As seen, the average area reduction drops from 29.5% to 26.3% when the area of splitters is considered. This is because, on average, the path balancing algorithm generates circuits with higher fanout counts and therefore requires more splitters. BM also shows superiority over PBMap; it reduces area, #JJs, and static power consumption by an average of 6.0%, 6.2% (5.9%), and 5.8% and up to (for the ISCAS c1908 benchmark circuit) 24.6%, 24.9% (24.7%), and 24.3% compared with PBMap. BMD is able to offer savings on all of these metrics over the baseline without degrading the critical path delay.

Figure 4.15: Value of the cost function (Eq. 4.19) for different circuits generated by three different mappers. For better exhibition purposes, data for the sin, priority, s35932, and s38584 circuits are scaled down by a factor of five. (Legend: BM, BMD, ABC.)
On average for all benchmark circuits, BMD reduced the value of the cost function, area, #JJs, #DFFs, and static power consumption by 5.7%, 3.4% (2.3%), 3.8%, 6.0%, and 3.3% compared to the Base.

To verify the correct functionality of circuits generated by our algorithm, we simulated a few circuits generated by our algorithm, including a 2-bit Kogge-Stone Adder (KSA2), using JSIM [71]. Figure 4.18 shows the corresponding waveforms for this adder. Four sets of random inputs ($a_0$=1010, $a_1$=1100, $b_0$=0101, $b_1$=1001, $c_{in}$=0011) and their correct outputs generated by this adder ($S_0$=1100, $S_1$=0110, $C_{out}$=1001) are shown in this figure. Please note that in these waveforms, the presence of a pulse means "1" and the absence of a pulse means "0".

Figure 4.16: Normalized total static power consumption (the main source of power consumption in SFQ circuits [111]). For better exhibition purposes, data for the sin, priority, s35932, and s38584 circuits are scaled down by a factor of five. (Legend: BM, BMD, ABC.)

Finally, post place-and-route results show that circuits generated by our algorithm will provide a considerable improvement in total chip area as well. For example, the chip area for the dec circuit from the EPFL suite mapped by BM is reduced by around 33% compared to the Base. Figure 4.19 shows the post place-and-route result of the dec circuit.

4.2.6 Conclusions

In this chapter, a dynamic programming-based path balancing technology mapping algorithm is presented. This algorithm is designed to minimize the area of gates and required path balancing D-Flip-Flops (DFFs) in Single Flux Quantum (SFQ) logic circuits, and it can be used for reducing full path balancing overhead in any technology which requires having the same length for all logic paths.
The optimality of the algorithm is proven for circuits with tree structure, and it is shown that a modified version of the algorithm acts as a very effective heuristic in generating cost-efficient solutions for general Directed Acyclic Graphs. Experimental results in SFQ technology showed that, on average for 15 benchmark circuits, our technology mapper reduced area, Josephson junction count, DFF count, and static power consumption by 26.3%, 25.6%, 27.8%, and 25.1% compared to the state-of-the-art academic technology mappers.

Figure 4.17: Total number of Josephson junctions for gates, path balancing DFFs, and splitters. For better exhibition purposes, data for the sin, priority, s35932, and s38584 circuits are scaled down by a factor of five. (Legend: BM, PBMap, Base.)

4.3 Minimizing Product of Stage Delay and Circuit Depth

Single flux quantum (SFQ) logic is a promising candidate to replace CMOS logic for high-speed and low-power applications due to its superiority in providing high-performance and energy-efficient circuits. However, developing effective Electronic Design Automation (EDA) tools, which cater to special characteristics and requirements of SFQ circuits such as depth minimization and path balancing, is essential to automate the whole process of designing large SFQ circuits. In this section, a novel technology mapping tool, called SFQmap, is presented, which provides optimization methods for minimizing first the circuit depth and path balancing overhead and then the worst-case stage delay of mapped SFQ circuits. Compared with the state-of-the-art technology mappers, SFQmap reduces the depth and path balancing overhead by an average of 14% and 31%, respectively.

Figure 4.18: Simulation waveforms for a 2-bit Kogge-Stone Adder (KSA2) generated by BM. For four sets of random inputs, the correct outputs for S0, S1, and Cout are generated and shown. (Signals: clk, a0, a1, b0, b1, cin, s0, s1, cout.)
4.3.1 Introduction

Logic synthesis is an important step in the design flow of digital circuits and systems, with a big impact on the final delay and area of the chip. Logic synthesis has two main phases: technology-independent optimizations and technology mapping. The first phase includes several algebraic or Boolean optimizations such as restructuring, resubstitution, common subexpression extraction, and node minimization. The second phase performs mapping or binding of logic expressions to actual gates in a given cell library to minimize the circuit delay, area, or other desired metrics.

Rapid SFQ (RSFQ) is a family of Single Flux Quantum circuits which was developed in the late 1980s [95]. More recent versions of the SFQ logic family include ERSFQ [82, 110] and eSFQ [111] as well as AQFP [146]. Generally, SFQ logic has been touted as a good candidate for achieving energy-efficient and high-performance circuits [65]. SFQ devices are made of Josephson Junctions (JJs), which are superconducting devices that exhibit the Josephson effect, i.e., the phenomenon of a current (called super-current) that flows indefinitely long without any applied voltage. The propagation delay through these devices is as low as 1 ps, while each switching action consumes on the order of $10^{-19}$ J of energy [19]. The JJ switching energy is 2-3 orders of magnitude lower than that of the end-of-CMOS-scaling devices as stated in [147], which again shows the promise of SFQ circuits as a replacement for CMOS circuits. SFQ gates can operate as fast as 370 GHz at T = 4.2 K [19].

Figure 4.19: Post place-and-route of the dec circuit which is mapped by BM. Dimensions are 3960 µm × 3310 µm. The dimensions are increased to 4870 µm × 3990 µm when the circuit is mapped using the map command of ABC.

Although the advantages of SFQ logic in achieving super-fast and very low-power circuits have been proven, the state of the art in Electronic Design Automation (EDA) technologies is not advanced.
On the other hand, due to key differences between SFQ and CMOS logic, CMOS Computer Aided Design (CAD) tools cannot be directly used to optimize SFQ circuits. Therefore, it is critical to develop appropriate CAD tools, including synthesis, place and route, timing and power analysis, and verification tools, to fully automate the design and test process of SFQ circuits.

In this section, we present SFQmap, a technology mapping tool for SFQ circuits. SFQmap includes two main phases: (i) logical depth minimization with path balancing, and (ii) peephole optimization to reduce the worst-case delay of any stage of the logic circuit. SFQmap improves key parameters of SFQ circuits considerably compared with the state-of-the-art technology mappers. For example, it decreases the number of path balancing D-Flip-Flops (DFFs), and the Product of the worst-case Stage delay with the logical Depth of the circuit (which we shall denote as PSD), by up to 43% and 63%, respectively (see Section 4.3.3). Due to the gate-level pipelining nature of SFQ logic, the input-to-output latency of an SFQ circuit is determined as the product of the clock cycle time and the logical depth of the circuit. The clock cycle time is in turn set by the worst-case delay of any single stage (gate plus any present splitters plus interconnect) of logic in the circuit.

4.3.2 Technology Mapping

4.3.2.1 Prior Work

In [24, 80], a tree mapping and decomposition method is presented to generate minimum-area or low-power circuits. In [25], a cut-enumeration-based method which involves cut generation and cut selection was presented, targeting depth-optimal area optimization for FPGAs. For SFQ circuits, there are a couple of papers addressing logic synthesis [75, 156, 160]. In [156], a framework is proposed for synthesizing SFQ circuits by constructing a virtual cell called "2-AND/XOR". In [160], a top-down design methodology for SFQ circuits based on the Binary Decision Diagram (BDD) is presented.
In [75], an academic logic synthesis tool (ABC) [38] is modified to meet the requirements of SFQ logic without developing any SFQ-specific optimization algorithms for the technology mapping.

4.3.2.2 Motivation

To map the Boolean expression F = ab(!c)d, the ABC mapper [38] produces the circuit shown in Figure 4.20(a). As explained in Section 2.1.3, SFQ circuits need to be path balanced. Thus, for the mapped circuit generated by ABC, three path balancing DFFs have to be inserted into the network. Another circuit with fewer required path balancing DFFs can be found to implement the given Boolean expression, as shown in Figure 4.20(b). On the other hand, to reduce the computation latency, the PSD should be minimized. In practice, between the logical depth and the worst-case stage delay, the former has more impact on the overall computation latency; therefore, we target the minimization of the logical depth (along with minimization of the path balancing DFF overhead) first; only after this optimization is done do we turn to direct minimization of the product of the stage delay and logical depth. Based on this explanation, the circuit shown in Figure 4.20(c) is more desirable than the other two. This is because it requires only one DFF while its logical depth is smaller. We need to develop a technology mapper for SFQ which prefers the circuit shown in Figure 4.20(c) over the other two mapping solutions.

Figure 4.20: Three mapping solutions for the expression F = ab(!c)d. (a) The circuit generated by the ABC mapper [38], which requires three path balancing DFFs and has a depth of three; two other mapping solutions requiring only one DFF, with depth of (b) three and (c) two.

Our proposed technology mapping for SFQ circuits consists of two main phases: depth minimization and path balancing, and peephole optimization for reducing the PSD. The aforesaid phases are discussed in more detail in the following.
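The preference order described above (logical depth first, then path balancing DFF count) amounts to a lexicographic comparison, and the PSD is a simple product. The sketch below applies both to the three candidates of Figure 4.20, using the depth and DFF counts given in its caption. The class and function names are illustrative assumptions, and the 10 ps stage delay is a hypothetical value chosen only to show the PSD computation; it is not a measured number.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    depth: int   # logical depth of the mapped circuit
    dffs: int    # number of path balancing DFFs it needs


def pick_preferred(candidates):
    """Smaller depth wins; ties are broken by fewer path balancing DFFs."""
    return min(candidates, key=lambda c: (c.depth, c.dffs))


def psd(worst_stage_delay_ps, depth):
    """Product of worst-case stage delay and logical depth."""
    return worst_stage_delay_ps * depth


# Depth and DFF counts as stated in the caption of Figure 4.20.
candidates = [
    Candidate("a", depth=3, dffs=3),
    Candidate("b", depth=3, dffs=1),
    Candidate("c", depth=2, dffs=1),
]
best = pick_preferred(candidates)

# Hypothetical worst-case stage delay of 10 ps, for illustration only.
latency_ps = psd(10.0, best.depth)
```

Under this ordering, solution (c) is selected, matching the preference stated in the text.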
4.3.2.3 Depth Minimization with Path Balancing

In this section, we present the problem of tree mapping as a Dynamic Programming (DP) problem. The optimal solution is the one with the least depth and, in the case of a tie, the one which requires fewer path balancing DFFs. A cut-enumeration method similar to that in [29, 107] is used. As mentioned in [29], this method provides the optimal depth solution for general DAGs. The main difference between the method presented in [29] and our depth minimization and path balancing method is that we optimize both depth and balancing, while in [29] only depth is considered.

Figure 4.2 shows a binary tree and the 3-feasible cuts for node $i$ (refer to [29] for the definition of k-feasible cuts). Eq. 4.24 defines the value of the optimal depth solution, $D[i]$, as the minimum achievable logical depth for mapping a tree rooted at node $i$. $D[i]$ is calculated recursively as follows:

$$
\begin{aligned}
D[i] = \min\{\; & \max(D[i+1],\, D[i+2]) + 1, \\
& \max(D[i+1],\, D[i+5],\, D[i+6]) + 1, \\
& \max(D[i+2],\, D[i+3],\, D[i+4]) + 1 \;\}
\end{aligned}
\tag{4.24}
$$

The three terms in Eq. 4.24 correspond to the depths of cuts C1-C3 (Figure 4.2), respectively. Basically, in Eq. 4.24, a mapping solution with the least depth ($D_{min}$) among all choices corresponding to the k-feasible cuts is computed. Among all the solutions with depth $D_{min}$, the one which requires fewer path balancing DFFs is selected recursively in a Dynamic Programming approach, as shown in Eq. 4.25. In this equation, $DFF[i]$ gives the DFF count for a solution with depth $D_{min}$ and the least number of path balancing DFFs (#DFFs).

$$
\begin{aligned}
DFF[i] = \min\{\; & DFF[i+1] + DFF[i+2] + B(i+1,\, i+2), \\
& DFF[i+1] + DFF[i+5] + DFF[i+6] + B(i+1,\, i+5,\, i+6), \\
& DFF[i+2] + DFF[i+3] + DFF[i+4] + B(i+2,\, i+3,\, i+4) \;\}
\end{aligned}
\tag{4.25}
$$

In the above equation, B() accounts for the number of DFFs required for balancing the inputs of the corresponding cut because of the level difference among the cut's inputs.
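The recurrences in Eqs. 4.24 and 4.25 can be sketched as a single bottom-up pass that keeps, for every node, the pair (depth, DFF count) and compares candidates lexicographically. This is a minimal illustration under several assumptions: the cuts are supplied explicitly rather than enumerated, every cut is assumed to have a matching library gate, the level of a cut input is taken to be its mapped depth D[], and B() is modeled as the sum of differences to the deepest cut input. The tree and all identifiers are hypothetical.

```python
def balance_cost(levels):
    """B(): DFFs needed to bring every cut input up to the deepest one."""
    top = max(levels)
    return sum(top - lv for lv in levels)


def map_tree(cuts, order):
    """cuts: node -> list of cuts, each a tuple of the cut's input nodes.
    order: internal nodes in topological order (cut inputs come first).
    Primary inputs have depth 0 and need no DFFs."""
    depth, dff = {}, {}
    for node in order:
        best = None
        for cut in cuts[node]:
            levels = [depth.get(x, 0) for x in cut]
            candidate = (
                max(levels) + 1,                                          # Eq. 4.24 term
                sum(dff.get(x, 0) for x in cut) + balance_cost(levels),   # Eq. 4.25 term
            )
            # Tuple comparison: least depth first, then fewest DFFs.
            if best is None or candidate < best:
                best = candidate
        depth[node], dff[node] = best
    return depth, dff


# Hypothetical chain n1 = f(a, b), n2 = f(n1, c), root = f(n2, d) with 3-feasible cuts.
cuts = {
    "n1": [("a", "b")],
    "n2": [("n1", "c"), ("a", "b", "c")],
    "root": [("n2", "d"), ("n1", "c", "d")],
}
depth, dff = map_tree(cuts, ["n1", "n2", "root"])
```

In this hypothetical chain, both root cuts reach depth two, and the tie is broken in favor of the cut that needs one balancing DFF instead of two.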
For example, if the level of node i+1 is 2 and that of node i+2 is 3, B(i+1, i+2) returns 1.

The complexity of computing the k-feasible cuts is O(kmn), where m is the edge count and n is the node count [29]. The complexity of our depth minimization and path balancing algorithm, given the k-feasible cuts, is O(K'gn), where K' is the maximum number of k-feasible cuts for a node in the network, g is the number of gates in the library, and n is the total number of nodes in the network. Usually K' and g are small numbers. Thus, the complexity of this algorithm is linear in the number of nodes in the network, i.e., O(n).

4.3.2.4 Peephole Optimization for Reducing the Sequential Depth

As explained before, unlike in CMOS, the PSD has to be reduced in SFQ circuits in order to reduce the computation latency. To reduce the PSD, we perform the following heuristic: after finding the minimum-depth and most balanced solution for each node of the network as in Section 4.3.2.3, a gate with the worst stage delay is found. This gate is usually a gate with a high fanout count. Next, a temporary network consisting of this gate and its immediate fanins and fanouts is generated. The temporary network is re-mapped while the fanout count of any node is limited. If the product of the worst stage delay and the length of the longest path decreases, the change is accepted. Otherwise, we increase the fanout count limit and redo the process. This process is repeated p times. Experimental results show that for p = 5, there is a considerable decrease in the PSD of the circuit while the run-time is acceptable (Tables 4.3, 4.4).

The complexity of the peephole optimization is O(m + n), where m is the edge count and n is the node count of the network. This is because, for calculating the fanout-dependent stage delay of all nodes (to find the worst stage delay), a breadth-first search should be done. The peephole optimization determines the overall complexity of the SFQmap tool to be O(m + n).

Algorithm 4 describes the pseudo code of SFQmap. In this algorithm, the main procedure as well as functions for combined depth minimization and path balancing, and then peephole optimization for reducing the PSD, are shown.

Algorithm 4: SFQmap
Input: Network pNtk
Output: mapped network pNtkMap
   // perform pre-mapping computations:
1  Compute k-feasible cuts for pNtk;
2  Initialize the mapping manager, pMan, to map the input network, pNtk;
   // two phases of SFQmap:
   // (i) Minimizing depth:
3  for each node n in pNtk do
4      Find min-depth and balanced mapping solution based on Eqs. 4.24, 4.25.
   // (ii) Peephole optimization for PSD:
5  Set(InitFanoutCount, MaxFanoutCount);
6  pNtkNew = Copy(pNtk)
   // initial fanout count
7  FanoutCount = InitFanoutCount;
8  while p > 0 do
9      if FanoutCount ≤ MaxFanoutCount then
10         Find the node with worst stage delay, pNodeWorst;
11         Generate a network comprising pNodeWorst and its FanIOs, pNtkTemp;
12         Remap pNtkTemp subject to the FanoutCount limit;
13     else
14         Increase FanoutCount;
15     Decrease p by 1;
   // generating the mapped network
16 pNtkMap = NetworkFromMap(pMan, pNtkNew);
17 return pNtkMap;

4.3.3 Experimental Results

We used several ISCAS benchmark circuits [59] for testing our developed technology mapper. The generic mcnc.genlib library is used. Table 4.3 lists the key parameters of different benchmark circuits for the SFQmap and ABC technology mapping tools.
As explained before, our developed technology mapper focuses on improving three important parameters of SFQ circuits: logical depth, #DFFs, and PSD. As shown in Tables 4.3 and 4.4, SFQmap provides considerable improvements in all of these critical parameters, averaged over all five benchmark circuits. However, its average run-time is 34% higher than that of the ABC mapper. This is mainly because of the peephole optimization phase, which dominates the run-time of SFQmap. We implemented the peephole optimization phase so that the number of iterations can be specified. The experimental results in Table 4.3 are for five iterations of peephole optimization (SFQmap -i 5). If we skip the peephole optimization phase to trade PSD for run-time, the run-time overhead over the ABC mapper decreases to less than 12%. In Table 4.4, "SFQmap -i 0" denotes a run without any peephole optimization; i stands for the number of iterations.

4.3.4 Conclusions

In this section, a novel technology mapping tool, SFQmap, developed for SFQ circuits, is presented. This mapper performs two main optimizations to improve the most important parameters of SFQ circuits, including the logical depth, the number of path balancing DFFs, and the PSD. The implementation of this mapper allows the number of iterations of the optimization runs to be selected. "SFQmap -i 5" improves the logical depth, #DFFs, and the PSD by 14%, 31%, and 35%, respectively, over the ABC [38] mapper for five ISCAS benchmark circuits, with a 34% increase in run-time. "SFQmap -i 0" reduces the run-time overhead to less than 12% and provides even larger improvements in the logical depth and #DFFs over ABC (see Table 4.4).

Table 4.1: Experimental results for PBMap and the baseline mapper (ABC's mapper). #DFFs is reported before and after applying the retiming algorithm. Area is in mm^2 and run-time is in seconds. Area and JJ count (#JJs) are for after retiming.
Logical depth is the maximum logic level in the network.

| Circuit | #DFFs before (PBMap) | #DFFs before (Baseline) | #DFFs after (PBMap) | #DFFs after (Baseline) | Area (PBMap) | Area (Baseline) | #JJs (PBMap) | #JJs (Baseline) | Logical depth (PBMap) | Logical depth (Baseline) | Run-time (PBMap) | Run-time (Baseline) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c1908 | 1033 | 1216 | 696 | 844 | 8.7 | 9.3 | 12013 | 12785 | 20 | 24 | 0.14 | 0.21 |
| c5315 | 5289 | 6146 | 2908 | 3575 | 37.2 | 42.1 | 52033 | 58661 | 23 | 28 | 1.4 | 2.1 |
| c7552 | 3681 | 4354 | 2429 | 2867 | 34.3 | 37.4 | 48482 | 52641 | 19 | 22 | 1.04 | 1.9 |
| c3540 | 2683 | 3187 | 1159 | 1372 | 20.3 | 21.8 | 28300 | 30165 | 31 | 37 | 0.56 | 0.73 |
| c499 | 674 | 632 | 476 | 444 | 5.6 | 5.6 | 7758 | 7734 | 13 | 13 | 0.064 | 0.066 |
| c880 | 1406 | 1663 | 774 | 957 | 9.3 | 10.4 | 12909 | 14415 | 22 | 26 | 0.16 | 0.22 |
| s1196 | 1226 | 1328 | 746 | 817 | 11 | 11.8 | 15332 | 16443 | 18 | 20 | 0.29 | 0.25 |
| s38417 | 15929 | 21289 | 8405 | 12306 | 143 | 168.7 | 208289 | 243091 | 21 | 30 | 10.72 | 17.13 |
| s1238 | 1558 | 1665 | 864 | 984 | 12.6 | 13.8 | 17617 | 19171 | 19 | 23 | 0.26 | 0.32 |
| int2float | 507 | 528 | 270 | 274 | 4.5 | 4.8 | 6432 | 6725 | 16 | 16 | 0.082 | 0.04 |
| cavlc | 1514 | 1544 | 522 | 565 | 11.6 | 12.2 | 16339 | 17115 | 17 | 17 | 0.19 | 0.02 |
| priority | 9313 | 35040 | 9064 | 19925 | 71.9 | 152.3 | 102085 | 212467 | 127 | 249 | 41.9 | 363.2 |
| decoder | 51 | 51 | 8 | 8 | 4 | 6.2 | 5469 | 8340 | 4 | 5 | 0.012 | 0.012 |
| sin | 75861 | 89481 | 13666 | 16858 | 153.8 | 176.9 | 215318 | 245736 | 182 | 229 | 409.8 | 589.5 |
| i10 | 11212 | 15007 | 7776 | 10182 | 81.5 | 99.7 | 114306 | 139263 | 33 | 43 | 5.6 | 10.7 |
| frg2 | 2796 | 2974 | 1375 | 1470 | 21.7 | 23.9 | 30340 | 33237 | 12 | 13 | 0.53 | 0.62 |
| pj1 | 66490 | 83007 | 36897 | 43631 | 411.1 | 468.2 | 585751 | 663755 | 34 | 44 | 115.8 | 186.6 |
| i9 | 1275 | 1612 | 647 | 876 | 12.8 | 14.9 | 17842 | 20734 | 12 | 15 | 0.26 | 0.3 |
| 9sym | 327 | 353 | 143 | 149 | 3.4 | 3.6 | 4859 | 5041 | 14 | 14 | 0.05 | 0.05 |
| KSA4 | 30 | 30 | 25 | 25 | 0.5 | 0.5 | 692 | 692 | 6 | 6 | 0.02 | 0.0175 |
| KSA16 | 233 | 235 | 199 | 200 | 3.4 | 3.5 | 4797 | 4842 | 10 | 10 | 0.08 | 0.07 |
| ID8 | 4505 | 5494 | 1854 | 2140 | 16.1 | 19.4 | 22752 | 27020 | 77 | 85 | 2.32 | 3.32 |

Average improvement (decrease) of PBMap over the baseline: #DFFs before retiming 20.64%, #DFFs after retiming 15.06%, area 12.22%, #JJs 11.22%, logical depth 14.56%, run-time 49.78%.

Table 4.2: Area and delay results for BalancedMap (BM), baseline (Base) [75], and PBMap [123]. For area, two sets of numbers are reported: the numbers inside parentheses include the area of gates, path balancing DFFs, and splitters, while the numbers outside parentheses do not include the area of splitters.
Since the delays of circuits generated by PBMap are very close to those generated by BM, and due to the lack of space, they are omitted from the table.

| Circuit | Area, Base (mm^2) | Area, PBMap (mm^2) | Area, BM (mm^2) | Delay, Base (ps) | Delay, BM (ps) |
|---|---|---|---|---|---|
| c5315 | 26.7 (42.1) | 23.4 (37.3) | 20.8 (33.3) | 173.6 | 182.2 |
| c3540 | 13.6 (21.8) | 12.5 (20.4) | 12.5 (20.5) | 204.8 | 226.6 |
| c7552 | 23.6 (37.4) | 21.4 (34.4) | 19.6 (31.5) | 125.4 | 142.6 |
| c2670 | 16.2 (25.1) | 14.7 (23.1) | 13.9 (21.7) | 123.8 | 136.8 |
| c1908 | 5.9 (9.3) | 5.5 (8.8) | 4.4 (7.0) | 125.8 | 126.8 |
| c6288 | 25.0 (39.9) | 26.2 (41.6) | 24.6 (34.4) | 526.8 | 533.6 |
| s35932 | 57.4 (97.4) | 44.1 (76.2) | 41.7 (72.4) | 225.6 | 234.3 |
| s38584 | 119.2 (194.4) | 105.2 (173.9) | 100.8 (167.1) | 273.6 | 305.6 |
| s5378 | 17.6 (28.4) | 15.6 (25.6) | 13.7 (22.2) | 162.2 | 165.3 |
| sin | 113.0 (177.0) | 97.9 (154.9) | 94.2 (150.4) | 1133.2 | 1274.2 |
| dec | 3.4 (6.2) | 1.9 (4.0) | 1.9 (4.0) | 29.4 | 30.0 |
| priority | 100.9 (152.3) | 47.5 (72.0) | 47.7 (72.0) | 1890.4 | 2004.8 |
| i2c | 21.1 (33.7) | 19.7 (31.6) | 19.5 (31.5) | 128.4 | 129.4 |
| KSA32 | 5.4 (8.9) | 5.4 (8.8) | 5.4 (8.8) | 88.0 | 88.0 |
| IntDiv8 | 12.6 (19.4) | 10.5 (16.2) | 9.9 (15.3) | 619.4 | 640.4 |

Table 4.3: Experimental results for SFQmap and the ABC mapper using the mcnc.genlib library. The run-time, which measures the amount of time it takes to generate the mapping solution, is given in seconds.

| Circuit | #DFFs (SFQmap) | #DFFs (ABC) | #Gates (SFQmap) | #Gates (ABC) | Logical depth (SFQmap) | Logical depth (ABC) | PSD (SFQmap) | PSD (ABC) | Run-time (SFQmap) | Run-time (ABC) |
|---|---|---|---|---|---|---|---|---|---|---|
| s4863 | 3381 | 4274 | 1183 | 1010 | 30 | 36 | 297 | 378 | 0.198 | 0.15 |
| c5315 | 2519 | 4437 | 1399 | 1206 | 20 | 25 | 179 | 234 | 0.2 | 0.12 |
| c7552 | 2603 | 3639 | 1786 | 1457 | 19 | 21 | 267.9 | 718.2 | 0.25 | 0.22 |
| s6669 | 7621 | 9638 | 1517 | 1462 | 46 | 52 | 510.6 | 842.4 | 0.258 | 0.145 |
| s38417 | 13953 | 23253 | 8363 | 7471 | 14 | 16 | 180.6 | 252.8 | 0.667 | 0.32 |

Table 4.4: Improvement percentages of "SFQmap -i 5" and "SFQmap -i 0" over the ABC mapper for key parameters. These results are the average over all five tested benchmark circuits. ↓ shows a decrease, and ↑ shows an increase in the corresponding quantity.

| Mapper | Logical depth | #DFFs | PSD | Run-time |
|---|---|---|---|---|
| SFQmap -i 5 | ↓ 14% | ↓ 31% | ↓ 35% | ↑ 34% |
| SFQmap -i 0 | ↓ 15% | ↓ 37% | ↓ 13% | ↑ 12% |

Chapter 5
Dual Clocking Method for Realizing SFQ Circuits

In this chapter, we present a new architecture for realizing SFQ circuits by employing fast and slow clock signals; hence, it is called the Dual Clocking Method (DCM). In this new architecture, there is no need to insert any path balancing DFFs to ensure the correct operation of an SFQ circuit. Considering that the DFF count can be very high (e.g., 4.5× the gate count in an 8-bit integer divider), this new architecture results in large savings in terms of the total JJ count and chip area. Consequently, the local clock frequency can be increased due to a shortening of the transmission lines needed to connect SFQ logic gates. However, the new architecture degrades the peak throughput of the circuit. The degree of throughput degradation can be systematically reduced by partially path balancing the circuit, resulting in a trade-off between the path balancing DFF overhead and the peak throughput. Notice that due to instruction data dependencies, program branches, etc., the actual (sustainable) throughput is typically a lot less than the peak throughput (of course, the amount of deviation between actual and peak throughput is application dependent). Therefore, some throughput loss is acceptable.

In the second part of this chapter, we present a new graph partitioning problem and provide an optimal solution for it by using a dynamic programming-based algorithm. This algorithm finds the optimal places for inserting the NDROs that are needed in the DCM into the circuit.

5.1 An Efficient Pipelined Architecture for SFQ Logic Circuits Utilizing Dual Clocks

5.1.1 Introduction

In this section, we present a new architecture for realizing SFQ circuits by employing macro and micro (slow and fast) clock pulses.
In this new architecture, there is no need to insert any path balancing DFFs to ensure the correct operation of an SFQ circuit. Considering that the DFF count can be very high (4.5× the gate count in an 8-bit integer divider), this new architecture results in large savings in terms of the total JJ count and chip area. Consequently, the local clock frequency can be increased due to a shortening of the transmission lines needed to connect SFQ logic gates. However, the new architecture degrades the peak throughput of the circuit. The degree of throughput degradation can be systematically reduced by partially path balancing the circuit, resulting in a trade-off between the path balancing DFF overhead and the peak throughput. Notice that due to instruction data dependencies, program branches, etc., the actual (sustainable) throughput is typically a lot less than the peak throughput (of course, the amount of deviation between actual and peak throughput is application dependent). Therefore, some throughput loss is acceptable.

The remainder of this section is organized as follows. In Section 5.1.2, the need for path balancing in standard SFQ circuits is explained, our new architecture for fully (partially) removing path balancing DFFs is presented, and a Pareto-optimal trade-off curve of the peak throughput and the path balancing DFF overhead is derived. That section also discusses the timing requirements of the new architecture and presents the design of adders using the new architecture. Section 5.1.3 presents the simulation set-up for verifying correct operation, and post-synthesis experimental results for circuits generated by the new architecture. It also gives qualitative and quantitative results of designing an example circuit using the clock-follow-data and concurrent clocking methods and compares them with our new architecture. Finally, Section 5.1.4 concludes the manuscript.
5.1.2 Throughput Versus Path Balancing DFFs

5.1.2.1 Path Balancing Requirement in SFQ Circuits

Each RSFQ logic gate in an RSFQ circuit has two or more stable flux states. The logic gate is fed by SFQ pulses, which can arrive on input lines and a clock line. Each clock pulse marks a boundary between two adjacent clock periods by setting the cell into some known initial state. During a new clock period, SFQ pulses may (or may not) arrive at each of the cell inputs. The arrival of an SFQ pulse at any input line during the current clock period defines a logic value "1" for the corresponding input signal, whereas the absence of a pulse during this period defines the logic value "0" of this signal. (Input pulses can arrive in any sequence.) Each pulse may change the internal state of the cell, but it cannot produce any immediate response at the gate output. Only the clock pulse is able to generate the output pulse based on the internal state of the gate (which itself is determined by the input signal pulses that have arrived during this period). The same clock pulse defines the end of the clock period and resets the logic gate into its initial state. Thus, an elementary logic gate of the RSFQ family is equivalent to a conventional combinational logic gate coupled with a DFF storing its output value until the end of the clock period. In other words, any input pulses to a logic gate may be treated as tokens that must arrive in the same clock period and are consumed by the clock pulse that arrives at the end of the period.

According to the standard SFQ logic circuit design methodology, path balancing DFFs must be inserted to ensure that there is the same number of clocked circuit elements on any path from a PI to a PO of the circuit. This is called the Full Path Balancing (FPB) method.
A fully path-balanced circuit generates valid logic values at its internal nodes and POs by guaranteeing the delivery of pulses at all inputs of a logic gate in the same clock cycle, so that the clock pulse that arrives at the end of this cycle can read the correct output value. If a circuit is not fully path-balanced, there will exist at least one gate in the circuit with one early input pulse (i.e., a pulse that arrives during a previous clock period). As stated above, this early input pulse will be consumed by the clock pulse at the end of the corresponding clock period and before the arrival (or not) of pulses on the other inputs of the said logic gate, thereby (potentially) generating a wrong value at the gate output in the current period.

Figure 5.1: (a) An unbalanced circuit with incorrect operation, (b) path-balanced version of this circuit with correct operation.

For example, in Figure 5.1a, the OR of the first (PI_1) and the second (PI_2) inputs is supposed to be passed to the following AND gate (g_2) together with the third (PI_3) input. The OR of the first and the second inputs in this example is "1", which will appear as a pulse at the output of this gate (g_1). However, before this pulse arrives at g_2, the value on PI_3 (which initially was a "1") is consumed in the first clock pulse by g_2. Therefore, the pulse generated by g_1 is evaluated with the second round of inputs on PI_3 (a "0" in this example), and causes generation of a wrong value of "0" at the output of g_2. The same problem shows up with regard to the fourth input pulse (value of "1" on PI_4). These wrong values will propagate through the circuit and finally create a wrong "0" at the final output of the circuit (output of the XOR gate). By inserting path balancing DFFs into this circuit (cf. Figure 5.1b), there will be no early signals at the input lines of any logic gate. Therefore, the circuit
Therefore, the circuit 96 0 1000 2000 3000 4000 KSA32 Mult8 C3540 IntDiv8 C5315 #Gates #DFF Figure 5.2: Comparing the required number of path balancing DFFs to the original gate count for a few benchmark circuits when the FPB method is employed. functions correctly and at the fourth clock cycle, it generates a pulse representing a \1" at the output of the XOR gate. 5.1.2.2 No Path Balancing (NPB) The FPB method requires insertion of many path balancing DFFs, which can exceed the total number of gates in the original circuit. Figure 5.2 compares the path balancing DFF count with the original gate count for a few ISCAS [59] and arithmetic circuits synthesized and mapped using FPB+retiming method (see Section 5.2.4). KSA32 is a 32-bit Kogge-Stone Adder, Mult8 is an 8-bit array multiplier, and IntDiv8 is an 8-bit integer divider. For IntDiv8, DFF count is nearly 4:5 as the gate count. Fast/Slow Clocks and Logic Bands: Our key observation is that to avoid insertion of any path balancing DFF, inputs of the circuit should be presented c times where c is related to the logical depth of the circuit. This way, correct values will be generated at outputs of the circuit periodically. For example, considerg 1 andg 2 gates in Figure 5.1a. If values onPI 1 ,PI 2 , andPI 3 are repeated once (having two copies for each data value), when the correct value is generated at the output of g 1 (at the arrival of the rst clock pulse), it will be passed to g 2 together with 97 the second copy of a valid value on PI 3 . This will generate a valid value of \1" at the output of g 2 at the arrival of the second clock pulse. Similarly, if values on PI 1 -to-PI 4 are repeated twice (resulting in three copies of each data value), correct values at the output of g 3 gate will also be generated at the arrival of the third clock pulse. Finally, if all inputs are repeated three times, every fourth clock pulse, a valid value will be generated at the output of the XOR gate. 
Consider a circuit partitioned into a set of computational blocks, where the outputs of one block feed directly to the inputs of the next block according to some linear ordering of these blocks (Figure 5.3a). It is crucial to design the said circuit blocks such that each receives its inputs and produces correct outputs at the right times. In other words, having received all inputs of a block at time t, the block produces its correct outputs at the right time (at t_tgt > t; see the discussion below for what t_tgt should be) and prevents propagation of "garbage" values to the next circuit block at times t' < t_tgt. Evidently, if this garbage collection at the output of the current logic block is not performed, invalid pulses will be injected into the inputs of the next logic block, which will eventually cause errors at the circuit outputs.

To perform input repetition and valid output collection for each block in the proposed architecture for realization of SFQ circuits, we rely on two different clock pulse streams (one fast, the other slow) and (guard) bands (fences) around each block. More precisely, given a standard pipelined SFQ circuit as in Figure 5.3a, we use a repeat band (green) at the input of each block and a mask band (blue) at the output of each block, as shown in Figure 5.3b. Both repeat and mask bands receive a fast clock and a slow clock signal. In the repeat band, a non-destructive read-out register (NDRO) is used on the path of each arriving input to repeat its value. Figure 5.4 shows a block diagram of an NDRO, its state transition diagram, and its operation waveforms. To write a "1" into this NDRO, a pulse should be applied to its "set" pin before the clock pulse that does the reading. To write a "0" into this NDRO, a pulse should be applied to its "reset" pin before the said clock pulse.
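The NDRO behavior described above amounts to a two-state machine, which can be captured in a few lines. The class below is a behavioral sketch only (the class and method names are chosen here for illustration); it ignores all analog and timing effects and simply models that reading does not clear the stored value.

```python
class NDRO:
    """Behavioral model of a non-destructive read-out cell."""

    def __init__(self):
        self.state = 0          # no flux stored initially (assumption)

    def set(self):
        """A pulse on the "set" pin writes a 1."""
        self.state = 1

    def reset(self):
        """A pulse on the "reset" pin writes a 0."""
        self.state = 0

    def clk(self):
        """A clock pulse reads the state without destroying it:
        returns 1 if an output pulse is produced, else 0."""
        return self.state


cell = NDRO()
cell.set()                                   # input pulse arrives once
repeated = [cell.clk() for _ in range(3)]    # three fast-clock reads
cell.reset()                                 # slow clock clears the cell
print(repeated, cell.clk())                  # [1, 1, 1] 0
```

This is the property the repeat band relies on: one input pulse on "set" is turned into one output pulse per fast-clock pulse until the cell is reset.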
Each NDRO in a repeat band receives a fast clock on its "clk" pin, and a slow clock on its "reset" pin. Note that the inputs of the circuit block (which may be PIs or outputs of a previous circuit block) are connected to the "set" pins of these NDROs.

Figure 5.3: (a) A (linear) pipeline architecture for SFQ circuits in which each block is fully path-balanced, (b) The new architecture employing a fast and a slow clock signal together with repeat (green) and mask (blue) bands. In the new architecture, the original circuit blocks are either partially path-balanced or not path-balanced at all. The frequency of the slow clock is D times lower than that of the fast clock, where D = max(D1, D2).

Inside a mask band, there are 2-input AND gates, which operate at the speed of the fast clock and have the slow clock as one of their inputs. The other input of these gates comes from the preceding circuit block (e.g., Block 1 in Figure 5.3b). A mask band prevents propagation of invalid values to the succeeding logic block until such time that valid values are generated and can be passed on to the next block. Recall that in each cycle of the fast clock, a value (presence or absence of a pulse) will be generated at each output of the first circuit block. However, only the last generated value is valid and should be allowed to propagate to the second logic block.

Figure 5.4: (a) Block diagram of an NDRO, (b) its state transition diagram, (c) its operation waveforms obtained by simulating an NDRO using the Josephson simulator (JSIM) [71].

If the depth of an original block (e.g.
Block 1 in Figure 5.3b) is D, then the inputs to this block should be presented D + 2 times in order to guarantee generation of a valid value at all of its outputs before the arrival of the next pulse on the slow clock. This can be achieved if we set the frequency of the slow clock to be 1/(D + 2) of that of the fast clock. When there are two or more logic blocks in the system, D will be the maximum of the logic depths of these blocks. Subject to some timing constraints, the factor of D + 2 may be reduced to a factor of D (see Section 5.1.2.4).

Feedforward Wires: In the architecture shown in Figure 5.3b, there may be some wires with no clocked elements connected to them in the first circuit block. These wires are primary inputs of the circuit that are not used in the first block but must be fed to the second circuit block. Similarly, there may be some wires with no clocked elements connected to them in the second circuit block. These wires are primary outputs of the circuit that are generated by the first block and simply pass through the second block. Let us call these types of wires feedforward wires.

In a multi-block architecture similar to Figure 5.3b with N circuit blocks, suppose that a feedforward input wire exists in the i-th block (i.e., one that goes directly from an output of circuit block i − 1 to an input of circuit block i + 1). In this case, there is no need to repeat the value on this wire as it goes through circuit block i, i.e., there is no need to use an NDRO for this input in the corresponding repeat band of the i-th block. Moreover, there is no need for a garbage-collecting AND gate for this wire in the corresponding mask band of the i-th block, because there is no generation of repeated pulses on this wire. However, to ensure arrival of all pulses at the input lines of circuit block i + 1 in the same period of the slow clock, we must insert a DRO DFF in the repeat (or mask) band of circuit block i. This DRO DFF operates at the slow clock.
As an interesting special case, consider a circuit primary input x that is fed forward directly to circuit block j and used there. In this case, we insert DRO DFFs in the repeat bands of circuit blocks 1, ..., j − 1 but an NDRO DFF in the repeat band of circuit block j. As another interesting case, consider an output y of circuit block j that is fed forward directly (after it has gone through the mask band of this block) as a primary output of the circuit. In this case, we use a garbage-collecting AND gate in the mask band of circuit block j but insert DRO DFFs in the mask bands of circuit blocks j + 1, ..., N.

To summarize, using slow and fast clocks, one can design a pipeline architecture for SFQ circuits, whereby the circuit is decomposed into a set of logic blocks, each operating at a fast clock frequency without using any internal path balancing DFFs. However, the circuit is still fully path-balanced with respect to the slow clock and uses a combination of NDRO and DRO DFFs and AND gates to ensure correct circuit operation.

5.1.2.3 Partial Path Balancing (PPB)

As explained in the previous subsection, the NPB method degrades the peak throughput of a circuit by a factor of D + 2, where D is the maximum logic depth of all individual logic blocks in the circuit. This throughput loss may not be acceptable in some applications. So, in the following, we present a graceful throughput degradation scheme, in which it is possible to control the throughput loss by partially balancing the logic blocks of the circuit. In fact, the inverse of the throughput and the path balancing DFF overhead for a circuit block exhibit the relationship shown in Figure 5.5.

Figure 5.5: Inverse of normalized throughput versus path balancing DFF count for the KSA32 circuit.

Figure 5.6: Partial path balancing of the circuit shown in Figure 5.1a with (a) δ = 1, (b) δ = 2.

Therefore, if we desire a higher peak throughput, we may insert some path balancing DFFs into
the logic blocks [starting with the one that has the maximum imbalance factor (see below) among all circuit blocks].

The imbalance factor of a logic gate in a given circuit is the maximum difference between the logic levels of its fanin gates. The imbalance factor of a single-output circuit may be defined as the maximum imbalance factor of any gate in the circuit. For a multi-output circuit, one may add a dummy node and connect the outputs of the original circuit to this dummy node, thereby producing a single-output circuit. In this way, the imbalance factor of a multi-output circuit can easily be calculated. Evidently, for a circuit of depth D, the imbalance factor ranges from 0 to D [note that an imbalance factor of D implies a circuit with at least two outputs, one with a logic depth of D and the other with a logic depth of 0 (a feedforward wire)].

In partially path-balanced circuits, we make sure that the imbalance factor of the circuit is upper bounded by an integer value δ, called the imbalance bound. If the imbalance factor of the circuit is greater than δ, then we must insert path balancing DFFs into the circuit to meet the partial path balancing requirement, as explained below.

We have devised a simple heuristic for doing the said partial path balancing as follows. We traverse the circuit nodes in topological order from the circuit inputs toward the circuit outputs. If the lengths of the longest and shortest paths from any PI to a traversed node v_i are L_vi and S_vi, and Δ_vi = L_vi − S_vi > δ, then Δ_vi path balancing DFFs are inserted onto the shortest path. Partially path-balanced versions of the circuit in Figure 5.1a with δ = 1 and δ = 2 are shown in Figure 5.6a and Figure 5.6b, respectively.

The peak throughput of a partially path-balanced logic block for a given δ will be 1/(δ + 1) times that of the fully path-balanced version of the block. Therefore, by adjusting δ, one can control the degradation in the peak throughput.
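The path-length bookkeeping behind this heuristic can be sketched as follows. This is a minimal illustration under assumptions, not the author's implementation: it only computes the longest and shortest PI-to-node path lengths and reports the nodes whose gap exceeds a given imbalance bound, leaving out the DFF insertion and update steps. The function names and the example netlist (a hypothetical chain patterned on Figure 5.1a) are made up for the example; `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def path_lengths(fanins):
    """Longest and shortest path length from any PI to every node.

    `fanins` maps each gate to the nodes driving it; nodes that never
    appear as keys are treated as primary inputs.
    """
    longest, shortest = {}, {}
    # static_order() yields every node after all of its fanins.
    for node in TopologicalSorter(fanins).static_order():
        drivers = fanins.get(node, ())
        if not drivers:                      # primary input
            longest[node] = shortest[node] = 0
        else:
            longest[node] = max(longest[d] for d in drivers) + 1
            shortest[node] = min(shortest[d] for d in drivers) + 1
    return longest, shortest


def nodes_over_bound(fanins, bound):
    """Return {node: gap} for nodes whose longest-minus-shortest path
    length exceeds the imbalance bound."""
    longest, shortest = path_lengths(fanins)
    gaps = {node: longest[node] - shortest[node] for node in longest}
    return {node: gap for node, gap in gaps.items() if gap > bound}


# Hypothetical unbalanced chain: each gate mixes a deep and a shallow input.
FANINS = {
    "g1": ("PI1", "PI2"),
    "g2": ("g1", "PI3"),
    "g3": ("g2", "PI4"),
    "g4": ("g3", "PI5"),
}

print(nodes_over_bound(FANINS, bound=1))  # {'g3': 2, 'g4': 3}
print(nodes_over_bound(FANINS, bound=2))  # {'g4': 3}
```

A larger bound flags fewer nodes, so fewer DFFs would be needed, at the price of a lower peak throughput.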
In the extreme case of δ = 0, the circuit block is fully path-balanced and the repeat and mask bands are unnecessary. In the other extreme of δ = D (or δ = D − 1 for a single-output circuit), the circuit block is not path-balanced at all, and the peak throughput drops to 1/(δ + 1) of that of the fully path-balanced block. For example, in the KSA32 circuit, the NPB method decreases the peak throughput by a factor of 12, while the gate and DFF count is decreased from 998 (for a fully path-balanced and retimed circuit) to 596. Based on Figure 5.5, the PPB method limits the peak throughput degradation to a factor of four by inserting 200 path balancing DFFs, increasing the gate and DFF count to 894.

Algorithm 5: Partial Path Balancing Given an Imbalance Bound
Input: A mapped network N = (V, E); δ: imbalance bound
Output: Partially balanced network N_PB = (V_PB, E_PB)
   // Initializing variables used for lengths of longest/shortest paths:
 1 for each node V_i in V do
 2     if V_i is a PI then
 3         L_Vi = 0;
 4         S_Vi = 0;
 5     else
 6         L_Vi = INT_MIN;
 7         S_Vi = INT_MAX;
 8 Sort V to be in topological order.
   // Finding lengths of shortest/longest paths before inserting DFFs:
 9 for each node V_i in V do
10     if V_i is not a PI then
11         L_Vi = max_j {L_Vj} + 1, for all V_j in fanins(V_i);
12         S_Vi = min_j {S_Vj} + 1, for all V_j in fanins(V_i);
   // Inserting DFFs:
13 for each node V_i in V do
14     Δ_Vi = L_Vi − S_Vi;
15     if Δ_Vi > δ then
16         Insert Δ_Vi DFFs onto the shortest path of V_i;
17         Update L_Vj and S_Vj for all V_j in TFO(V_i);
18 N_PB = N;
19 return N_PB;

Algorithm 5 shows the pseudo code for partially path balancing a mapped circuit. After initializing the variables in lines 1-7, the nodes are sorted into topological order in line 8. The topological sorting ensures that in the next two for loops, whenever a node is visited, its children have already been visited. Next, in lines 9-12, the lengths of the longest and shortest paths from any PI to a node in the given network are found.
In lines 13-16, partial path balancing DFFs are inserted onto the shortest path that does not satisfy the said requirement for a given partial balancing bound. In line 17, the lengths of the longest and shortest paths of all nodes in the Transitive Fanout Cone (TFO)¹ of node V_i are updated.

Since both the PPB and NPB methods use slow and fast clocks, the Dual Clocking Method (DCM) suits both of them.

¹ In a network N = (V, E), which is modeled by a DAG, TFO(V_i) is the set of nodes in V that are reachable from V_i by traversing edges only in their forward direction.

5.1.2.4 Timing Requirements of DCM

Let δ denote the maximum of the imbalance factors of all logic blocks of the circuit. To make sure that the architecture in Figure 5.3b works correctly, some timing requirements must be met, which are listed below (see Figure 5.7):

1. Δt_1 ≥ the intrinsic reset delay of an NDRO.
2. Δt_2 ≥ the setup time of an NDRO.
3. Δt_1 + Δt_2 ≤ T_FastClock.
4. T_FastClock = T_SlowClock / (δ + 1).

The first requirement ensures that inputs are applied to the "set" pin of the NDROs after the reset operation is done. The second requirement is a simple statement of the setup time constraint for the NDROs in the repeat bands. The third requirement ensures that the valid value will not be lost by late or early arrival of the second input of a masking AND gate, i.e., t_1 ≤ t_2 ≤ t_3, as shown next.

Valid outputs of the current block (e.g., Block 1 in Figure 5.3b) are generated between two pulses of the fast clock at t_1 and t_3. To ensure that the masking AND gates process these valid outputs correctly, pulses on the slow clock should arrive at the second input of these AND gates before t_3. Otherwise, the only set of guaranteed valid values on these outputs will be consumed and will not have a chance to enter the repeat band of the next block (e.g., Block 2). Let T = T_SlowClock.
Based on Figure 5.7, we have:

\[
t_1 = \Delta t_1 + \Delta t_2 + \frac{\delta}{\delta + 1} T
\]

Figure 5.7: Waveforms (slow clock, inputs, fast clock) showing the timing requirements for correct operation of DCM.

The third term in the above equation comes from the fact that the period of the slow clock is δ + 1 times that of the fast clock and that t_1 denotes the time at which the last pulse of the fast clock within a window of T occurs. According to Figure 5.7:

\[
t_2 = T, \qquad t_3 = t_1 + \frac{1}{\delta + 1} T = \Delta t_1 + \Delta t_2 + T
\]

Because t_1 ≤ t_2 ≤ t_3, we can write:

\[
\Delta t_1 + \Delta t_2 + \frac{\delta}{\delta + 1} T \;\le\; T \;\le\; \Delta t_1 + \Delta t_2 + T
\]

which leads to the following inequality:

\[
\Delta t_1 + \Delta t_2 \le \frac{1}{\delta + 1} T \tag{5.1}
\]

For example, if the maximum imbalance factor among all blocks is δ = 2, and Δt_1 = Δt_2 = 2.5 ps, then the period of the slow clock must satisfy T ≥ 15 ps. Using the terminology presented in this subsection, t_tgt, which is mentioned above, should satisfy t_1 ≤ t_tgt ≤ t_3.

It is important to ensure that the last correct output pulse (within a window of T) of a block is generated in less than t'' seconds after the pulses of the slow clock for this stage arrive, where t'' < T_FastClock. In other words, a valid output pulse of a block and the pulse of the slow clock going to the mask band and repeat band following this block should fall between two consecutive pulses of the fast clock. If this does not hold true, then, by an argument similar to the one given for the third requirement above, the only correct value on an output pin of this block cannot successfully pass the AND gate in the following mask band. To ensure this property, the slow clock signals of the mask band and repeat band following this block may be delayed.

Example 9. Suppose that the depth of a block in stage one of a circuit with the architecture of Figure 5.3b is four. Thus, the maximum logic level among the outputs of this block will be five, because the preceding repeat band adds one to the maximum logic level.
Therefore, after the fifth fast clock pulse arrives, correct values at the outputs of this block are ready. Now, if the imbalance factor is $\alpha = 4$, these correct values will successfully go through the corresponding masking AND gates. However, if $\alpha = 3$, the correct outputs are generated at the arrival of the 5th, 9th, 13th, ... fast clock pulses, while pulses of the slow clock come at the arrival of the 4th, 8th, 12th, ... fast clock pulses. Therefore, the correct output values cannot pass the masking AND gates. To solve this issue, the slow clock signals going to the mask band and repeat band following this block should be delayed by one cycle of the fast clock.

In general, if $D_{max}$ is the maximum logic level of the outputs of a block in a circuit with the architecture of Figure 5.3b, the slow clock signals in the mask band and repeat band following this block should be delayed by $(D_{max} \,\%\, (\alpha + 1)) \times T_{\text{Fast Clock}}$, where % is the modulo operation. To delay a slow clock signal by $r$ multiples of the fast clock cycle, it should be passed through $r$ series DRO DFFs operating with the fast clock. In the above example, where $D_{max} = 5$ and $\alpha = 3$, the slow clock of the following mask band and repeat band was delayed by $1 \times T_{\text{Fast Clock}}$ by passing it through a single DRO DFF.

Note that when satisfying the above requirements is difficult for a design, or when we do not want to insert extra DFFs for delaying the clock signals of some mask and repeat bands, another solution is to asynchronously delay the slow clock signals of these bands to meet the requirements.

5.1.2.5 Adder Design

Adders are important datapath blocks that are prevalent in all kinds of processing circuitry. We dedicate this subsection to adder design using the proposed DCM.

Node Count Analysis: Suppose that we want to create an $n$-bit adder using $m$ smaller $k$-bit adders ($n = mk$) using an architecture similar to what is shown in Figure 5.3b. In this design, there will be some overheads regarding the mask and repeat bands.
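As a quick numerical check of the timing rules of Section 5.1.2.4, the following minimal sketch evaluates the bound of Eq. 5.1 and the slow-clock delay rule. The function and parameter names (`min_slow_clock_period`, `slow_clock_delay_cycles`, `alpha` for the maximum imbalance factor, `dt1_ps` and `dt2_ps` for the two delay intervals) are chosen here for illustration only and are not part of any tool described in this work.

```python
def min_slow_clock_period(alpha, dt1_ps, dt2_ps):
    """Smallest slow-clock period T (in ps) permitted by Eq. 5.1,
    i.e. dt1 + dt2 <= T / (alpha + 1)."""
    return (alpha + 1) * (dt1_ps + dt2_ps)


def slow_clock_delay_cycles(d_max, alpha):
    """Number of fast-clock cycles (one series DRO DFF per cycle) by which
    the slow clock of the following mask and repeat bands is delayed."""
    return d_max % (alpha + 1)


if __name__ == "__main__":
    # Numbers from the text: alpha = 2 and both intervals equal to 2.5 ps.
    print(min_slow_clock_period(alpha=2, dt1_ps=2.5, dt2_ps=2.5))
    # Example 9: D_max = 5 with alpha = 4 and with alpha = 3.
    print(slow_clock_delay_cycles(d_max=5, alpha=4))
    print(slow_clock_delay_cycles(d_max=5, alpha=3))
```

With the values of Example 9, the second function returns zero delay cycles for an imbalance factor of 4 and one delay cycle for an imbalance factor of 3, matching the single DRO DFF mentioned in the text.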
However, since a $k$-bit adder is less complex than an $n$-bit adder ($k < n$), and path balancing overhead is removed (relaxed) in DCM, there is a potential for decreasing the total node count. In the following, we extract the exact node count for this new design.

For an $n$-bit adder, there are $2n+1$ inputs, out of which $2k+1$ will be used in the first block (corresponding to the first $k$-bit adder), generating $k+1$ outputs. $k$ of these outputs are the final sum bits, whereas one of these outputs is the carry-out of the first $k$-bit adder block, which is used as the carry-in to the second $k$-bit adder block. The remaining inputs, $2(n-k)$, need to go to the next blocks as feedforward wires. In total, there will be $2(n-k) + k + 1 = 2n - k + 1$ wires going from the first stage of the pipeline to the second stage. In the second adder block, $2k$ of the remaining primary inputs will be consumed and again $k+1$ new outputs will be generated. In total, $2(n-2k) + 2k + 1 = 2n - 2k + 1$ wires will go from the second stage of the pipeline to the third stage. Similarly, there will be $2(n-pk) + pk + 1 = 2n - pk + 1$ wires going from the $p$-th adder block to the $(p+1)$-st adder block, where $p < m$.

Inputs that are used in the $p$-th stage need to go through NDROs in the repeat band preceding this stage. For primary inputs that are not used in the $p$-th block, DRO DFFs in all of the preceding repeat bands are sufficient, and they do not need a garbage collection AND gate in the mask band of this block either. Final outputs (sum bits) which are generated at the $p$-th block need garbage collection AND gates only in the mask band following the $p$-th block. In the next mask bands, these final outputs only need DRO DFFs, and they do not need any NDROs in the next repeat bands.

Based on the above explanations, for creating an $n$-bit adder using $m$ smaller $k$-bit adders, $m(k+1)$ AND gates and $m(2k+1)$ NDRO DFFs are required.
Regarding the DRO DFF count, $m(2n) - 2k(1 + 2 + 3 + \dots + m) = 2mn - km(m+1)$ are needed in the repeat bands for feedforward wires of type primary input. Also, $0 + k + 2k + \dots + (m-1)k = m(m-1)\frac{k}{2}$ DRO DFFs are needed for feedforward wires of type primary output generated in the blocks preceding the last block. Therefore, in total, $2mn - m(m+3)\frac{k}{2}$ DRO DFFs are needed.

Each of these NDRO DFFs and AND gates receives a fast and a slow clock, while each DRO DFF only receives a slow clock. To deliver clock signals to $z$ gates, we need to use $z-1$ clock splitters. Thus, the total number of splitters needed to deliver slow clock signals to gates in the repeat and mask bands is

$$m(k+1) + m(2k+1) + 2mn - m(m+3)\frac{k}{2} - 1 = 2mn + 2m + \frac{3}{2}mk - \frac{m^2}{2}k - 1.$$

Moreover, $m(k+1) + m(2k+1) - 1 = 3mk + 2m - 1$ splitters are required for the delivery of the fast clock to these gates. Therefore, the total number of gates, DFFs, and splitters used in the mask and repeat bands is:

$$4mn + (6m - m^2)k + 6m - 2 \qquad (5.2)$$

Example 10 For $n = 16$, $k = 4$, and $m = 4$, the overhead of repeat and mask bands is as follows: AND gates count: $4 \times 5 = 20$, NDRO DFFs count: $4 \times 9 = 36$, DRO DFFs count: $2 \times 4 \times 16 - 4 \times 7 \times 2 = 72$, total node count including splitters: $4 \times 4 \times 16 + (6 \times 4 - 4^2) \times 4 + 6 \times 4 - 2 = 310$.

Example 11 For $n = 32$, $k = 16$, and $m = 2$, the total number of gates, DFFs, and splitters used in the mask and repeat bands is 394. Using the library of gates in [45], the FPB method uses 2,534 and 1,005 nodes for the 32-bit and 16-bit KSAs, respectively. If no path balancing DFF is used, the total gate count for the 16-bit adder will be 565.

Figure 5.8: Design of a 4-bit KSA by using two 2-bit KSAs and following the presented DCM and architecture shown in Figure 5.3b.
Therefore, if we use the above method for creating a 32-bit adder using two 16-bit adders, the total node count will be $394 + 2 \times 565 = 1{,}524$, which shows a $(2534 - 1524)/2534 \times 100 \approx 40\%$ reduction in the total node count compared to the standard FPB method.

Example 12 For $n = 4$, $k = 2$, and $m = 2$, the FPB method and NPB-based DCM consume 145 and 130 nodes, respectively, for generating a 4-bit KSA. A detailed design of the circuit generated by DCM for this example is shown in Figure 5.8.

Latency, Logical Depth, and Throughput: Suppose that the latency, logical depth, and throughput of a $k$-bit adder generated by the FPB method are $L_k$, $D_k$, and $T_k$, respectively. The latency of the $n$-bit adder presented in this subsection will be more than the latency of the standard FPB method and its throughput will be less. More accurately, if the latency, logical depth, and throughput of the $n$-bit adder designed based on the DCM are denoted by $L'_n$, $D'_n$, and $T'_n$, respectively, we will have:

$$L'_n = m \times (L_k + \tau_{AND2} + \tau_{NDRO}) \qquad (5.3)$$

$$D'_n = m \times (D_k + 2) \qquad (5.4)$$

$$T'_n = \frac{T_k}{\alpha + 1} \qquad (5.5)$$

where $\tau_{AND2}$ and $\tau_{NDRO}$ are the intrinsic delays of an AND2 gate and an NDRO DFF, respectively, and $\alpha$ is the maximum imbalance factor.

Example 13 Using the gates in [45], the latency, logical depth, and throughput for KSA32 and KSA16 generated by the FPB method are: $L_{32} = 93.4$ ps, $D_{32} = 12$, $T_{32} = 27.5$ GHz, and $L_{16} = 76.9$ ps, $D_{16} = 10$, $T_{16} = 35.7$ GHz. For the 32-bit adder in Example 11, we will have: $L'_{32} = 182.2$ ps, $D'_{32} = 24$, $T'_{32} = 3.25$ GHz. This throughput is $7.46\times$ lower than the throughput of a 32-bit adder generated by the FPB method. The PPB-based DCM decreases this throughput gap. For example, given $\alpha = 3$, the throughput will be 8.93 GHz, which is $2.07\times$ less than that of the 32-bit adder generated by the FPB method. The total node count in the PPB-based DCM will be 1,876 for $\alpha = 3$, which shows a 25% increase compared to NPB-based DCM.
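The counting argument of this subsection can be checked mechanically. The sketch below is an illustration only: the function names and the dictionary keys are made up here, and the throughput helper assumes that the divisor in Eq. 5.5 is the maximum imbalance factor plus one (with the depth of the block playing that role in the NPB case, as the numbers of Example 13 suggest).

```python
def dcm_band_overhead(n, k, m):
    """Node counts of the mask and repeat bands for an n-bit adder built
    from m k-bit adders, following the per-component counts in the text."""
    if n != m * k:
        raise ValueError("the analysis assumes n = m * k")
    and_gates = m * (k + 1)
    ndros = m * (2 * k + 1)
    # m * (m + 3) is always even, so the integer division is exact.
    dros = 2 * m * n - (m * (m + 3) * k) // 2
    slow_splitters = and_gates + ndros + dros - 1  # z gates need z - 1 splitters
    fast_splitters = and_gates + ndros - 1         # DRO DFFs get no fast clock
    total = and_gates + ndros + dros + slow_splitters + fast_splitters
    return {"and": and_gates, "ndro": ndros, "dro": dros,
            "splitters": slow_splitters + fast_splitters, "total": total}


def closed_form_total(n, k, m):
    """Closed-form total of Eq. 5.2."""
    return 4 * m * n + (6 * m - m * m) * k + 6 * m - 2


def dcm_depth(m, d_k):
    """Logical depth of Eq. 5.4."""
    return m * (d_k + 2)


def dcm_throughput(t_k, alpha):
    """Throughput of Eq. 5.5 (assumed divisor: alpha + 1)."""
    return t_k / (alpha + 1)


if __name__ == "__main__":
    print(dcm_band_overhead(16, 4, 4))                     # Example 10
    overhead_32 = dcm_band_overhead(32, 16, 2)["total"]    # Example 11
    total_32 = overhead_32 + 2 * 565                       # two unbalanced KSA16 blocks
    print(total_32, round(100 * (2534 - total_32) / 2534, 1))
    print(dcm_depth(2, 10), round(dcm_throughput(35.7, 10), 2))
```

Summing the component counts and evaluating Eq. 5.2 directly give the same totals for Examples 10 and 11, which is a useful sanity check on the algebra.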
5.1.3 Simulation and Experimental Results

To verify the correct operation of circuits generated by our proposed method, we simulated a KSA4 circuit using the Josephson simulator (JSIM) [71]. The architecture of this circuit is shown in Figure 5.8. Figure 5.9 shows waveforms for inputs, outputs, and signals for some internal nodes. The value of the imbalance factor $\alpha$ in this design is four; hence, the frequency of the slow clock is 1/5 of that of the fast clock. Four random values for inputs are used: $a_0 = 0101$, $a_1 = 1011$, $a_2 = 0010$, $a_3 = 1001$, $b_0 = 0110$, $b_1 = 1000$, $b_2 = 1011$, $b_3 = 0110$, and $c_{in} = 1010$. For each of these inputs, the following correct values for outputs are observed, which can be seen in Figure 5.9: $S_0 = 1001$, $S_1 = 0101$, $S_2 = 0011$, $S_3 = 0101$, and $C_{out} = 1010$.

Embedded NDROs in the repeat bands repeat each set of these inputs four times (for a total of five presentations of the same data values). For example, after $a_0$ passes the corresponding NDRO, $a'_0$ is generated, with the waveform shown in Figure 5.9. After passing the outputs of the NDROs connected to inputs $a_0$, $a_1$, $b_0$, and $b_1$ to the first KSA2 block, pulses for $S'_0$, $S'_1$, and $C'_{out}$, which are outputs of this KSA2, are generated. $S'_0$ is shown in Figure 5.9 as an example. These pulses will be filtered in the following mask band, and at the fifth cycle of the fast clock, the correct values of these signals will be captured and given to the next stage (look at the waveform of $S''_0$).

Figure 5.9: Simulation results for a 4-bit KSA (KSA4) generated by our dual clocking method (Figure 5.8). Inputs, outputs, and clock pulses are shown. Four sets of random inputs are applied to this circuit: $a_0 = 0101$, $a_1 = 1011$, $a_2 = 0010$, $a_3 = 1001$, $b_0 = 0110$, $b_1 = 1000$, $b_2 = 1011$, $b_3 = 0110$, and $c_{in} = 1010$. The correct outputs $S_0 = 1001$, $S_1 = 0101$, $S_2 = 0011$, $S_3 = 0101$, and $C_{out} = 1010$ are generated.
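The input and output bit streams quoted for the KSA4 simulation can be cross-checked against plain binary addition. The sketch below is only a functional check, not a model of the SFQ circuit or of JSIM; it assumes that each four-character string lists the values of one signal for the first through fourth input set, read left to right, and the names `INPUTS`, `REPORTED`, and `expected_outputs` are introduced here for illustration.

```python
# Bit streams as quoted in the text; position t holds the value for input set t + 1.
INPUTS = {
    "a0": "0101", "a1": "1011", "a2": "0010", "a3": "1001",
    "b0": "0110", "b1": "1000", "b2": "1011", "b3": "0110",
    "cin": "1010",
}
REPORTED = {"S0": "1001", "S1": "0101", "S2": "0011", "S3": "0101", "Cout": "1010"}


def expected_outputs(streams, width=4, num_sets=4):
    """Recompute sum and carry-out streams by ordinary integer addition."""
    out = {f"S{i}": "" for i in range(width)}
    out["Cout"] = ""
    for t in range(num_sets):
        a = sum(int(streams[f"a{i}"][t]) << i for i in range(width))
        b = sum(int(streams[f"b{i}"][t]) << i for i in range(width))
        total = a + b + int(streams["cin"][t])
        for i in range(width):
            out[f"S{i}"] += str((total >> i) & 1)
        out["Cout"] += str((total >> width) & 1)
    return out


if __name__ == "__main__":
    print(expected_outputs(INPUTS) == REPORTED)
```

Under the stated reading of the bit strings, the recomputed streams agree with the reported outputs for all four input sets.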
$S''_0$ and $S''_1$ do not need to have any NDROs in the second repeat band nor any AND gate in the last mask band. As shown in Figure 5.8, for timing synchronization of these signals, DRO DFFs are used in the last mask band.

Different design metrics including total Josephson junction count (#JJs), total area, DRO and NDRO DFF counts, and total node count including gates, DFFs, and splitters are extracted for 15 benchmark circuits. These circuits are in the ISCAS [59] and EPFL [37] benchmark suites, or they are some arithmetic circuits. The arithmetic circuits are: KSA32, KSA4, an 8-bit array multiplier (Mult8), and an 8-bit integer divider (IntDiv8). The SFQ cell library that we used [45] contains the following gates: and2 with 12 JJs, or2 with 8 JJs, xor2 with 8 JJs, not with 9 JJs, DRO DFF with 7 JJs, NDRO DFF with 11 JJs, splitter with 3 JJs, and JTL with 2 JJs.

Figure 5.10: Total Josephson junction count for different benchmark circuits (series: NPB, PPB, FPB). For better exhibition purposes, data for voter, priority, and sin benchmark circuits is scaled down by a factor of 10.

Table 5.1 and Figures 5.10, 5.11, and 5.12 show the experimental results for the aforementioned circuits. In these results, it is assumed that there is only one block in the NPB- and PPB-based DCM. Therefore, the latency of circuits generated by NPB- and PPB-based DCM will be more than the latency of counterpart circuits generated by the FPB method by an amount equal to the sum of the delay of an AND2 gate and the delay of an NDRO DFF. The results for PPB are extracted for an imbalance factor of $\alpha = 4$. Therefore, the throughput for circuits generated by PPB is 1/5 of that of the FPB method. The throughput of circuits generated by NPB-based DCM is $1/d'$ of the throughput of the same circuit generated by FPB, where $d'$ is the depth of the circuit. The cut-based technology mapping algorithm (command map) of ABC [38] is used in all three methods.
Having only one original block, the NDRO DFF count for circuits generated by the FPB method and the DRO DFF count for circuits generated by the NPB method are 0; hence, the corresponding fields are removed from Table 5.1.

On average for 15 benchmark circuits, NPB-based DCM reduces #JJs, #Nodes, and area by $2.23\times$, $1.85\times$, and $2.0\times$, respectively, compared with the FPB method. The amount of improvement is larger for those circuits with a higher ratio of DFF count to gate count. For example, NPB-based DCM has $14.0\times$ less area for the priority circuit, and $2.93\times$ fewer total nodes for IntDiv8 compared with the FPB method. This is because these circuits need a huge number of path balancing DFFs in the standard FPB method. This is the reason behind seeing a local minimum for the priority circuit in the curve of NPB shown in Figure 5.12. The average improvements on #JJs, #Nodes, and area that PPB-based DCM provides are 12.06%, 9.62%, and 15.38%, respectively, compared to FPB. Also, PPB-based DCM does not require any path balancing DFFs for KSA4, and it reduces the total DRO DFF count by an average of 46.34% compared to the FPB method for the other 14 benchmarks.

Table 5.1: Experimental results for FPB, PPB, and NPB. #PI, #PO, and #DFFs stand for PI count, PO count, and DFF count, respectively. DFF count is split into NDROs and DROs.

circuits     #PI    #PO   Depth   #DFFs(NDRO)      #DFFs(DRO)
                                  PPB     NPB      FPB      PPB
c1908         33     25      20     33      33      683      473
c432          36      7      24     36      36      655      447
c499          41     32      13     41      41      442      218
c7552        207    108      21    207     207     2790     1912
c3540         50     22      32     50      50     1220      795
c5315        178    123      27    178     178     3421     2554
i2c          147    142      17    147     147     2461     1544
int2float     11      7      16     11      11      277      139
priority     128      8     249    128     128    22480    21115
sin           24     25     186     24      24    13717    13581
voter       1001      1      71   1001    1001    11144     6716
KSA32         65     33      12     65      65      532      167
KSA4           9      5       6      9       9       25        0
Mult8         16     16      40     16      16      734      655
IntDiv8       16     16      86     16      16     2098     1995
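The #JJs metric used in these results is a weighted sum of cell counts, with the per-cell Josephson junction counts of the library [45] as weights. The sketch below illustrates the bookkeeping; the names `JJ_PER_CELL` and `total_jj` are invented for this illustration, and the sample cell counts are simply the mask and repeat band overhead of Example 10 (20 AND gates, 36 NDRO DFFs, 72 DRO DFFs, and the remaining 182 of the 310 nodes as splitters), not data from Table 5.1.

```python
# JJ count of each cell in the SFQ library used in the experiments [45].
JJ_PER_CELL = {
    "and2": 12, "or2": 8, "xor2": 8, "not": 9,
    "dro": 7, "ndro": 11, "splitter": 3, "jtl": 2,
}


def total_jj(cell_counts):
    """Total Josephson junction count of a netlist given its cell histogram."""
    return sum(JJ_PER_CELL[cell] * count for cell, count in cell_counts.items())


if __name__ == "__main__":
    # Band overhead of Example 10 (n = 16, k = 4, m = 4).
    example_10_bands = {"and2": 20, "ndro": 36, "dro": 72, "splitter": 182}
    print(sum(example_10_bands.values()), total_jj(example_10_bands))
```

Because DRO and NDRO DFFs carry 7 and 11 junctions each, removing path balancing DFFs affects #JJs more strongly than it affects the raw node count, which is in line with the larger average #JJs improvement reported above.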
In the conventional RSFQ circuit design flow, designers use the clock-follow-data clocking method [49] for delivering clock pulses to gates inside each block of a given linear pipeline circuit as in Figure 5.3, while using the concurrent clocking method [49] at the boundaries of blocks. This way, the path balancing DFFs inside each block can be eliminated. This conventional method does not require insertion of input repeating NDROs in repeat bands and garbage collecting AND gates in the mask bands, but it requires adding asynchronous delay elements to guarantee that a clock pulse arrives at a gate only after its inputs have arrived and been processed, which makes the conventional design method more challenging and vulnerable to issues such as timing jitter and process variations.

Figure 5.11: Total area in mm$^2$ for different benchmark circuits (series: NPB, PPB, FPB). For better exhibition purposes, data for voter, priority, and sin benchmark circuits is scaled down by a factor of 10.

Example 14 Implementing KSA4 using the clock-follow-data scheme: Using the map command of ABC and the library of cells in [45], the mapped circuit will be as shown in Figure 5.13. The worst stage delay after synthesis is around 15.7 ps, which is determined by an xor2 gate with four fanouts. This sets the clock cycle time of the circuit to 15.7 ps. The delay of a single JTL is around 5.0 ps (peak to peak delay using simulations with JSIM), which means that if we want to delay the clock pulse of a gate by one clock cycle, at least four series JTLs should be used. For example, consider the xor2 gate at level 2 (the gate with Id of 16 in Figure 5.13); the clock pulse of this gate should be delayed by inserting four JTLs. Similarly, the clock pulse of a gate at level 3 should go through $2 \times 4$ JTLs, and, in general, the clock pulse of a gate at level $i$ should go through $(i-1) \times 4$ JTLs.
Gates at level 1 should be designed using the concurrent-flow clocking method, which means that no delay elements are needed for delivering clock pulses to these gates. As shown in Figure 5.13, we will need $5 \times 4 = 20$ total JTLs and five splitters for splitting the clock pulse between different levels. Please note that this analysis is for post synthesis. If post place-and-route is considered, even more JTLs will be needed to guarantee that the input pulses of a gate, which go through some interconnects, arrive before its clock pulse (because of the extra delay of interconnects). The total JJ count for this method and our DCM for implementing KSA4 are almost the same. However, our partial path balancing method makes it possible to limit the difference between the lengths of the shortest and longest paths from any PI to a gate in the circuit, which makes it possible to send waves of input pulses to the circuit (i.e., there can be several waves of data traveling in the circuit at the same time). This clearly has throughput advantages over the above-mentioned standard method.

Figure 5.12: Total node count including gates, DFFs, and splitters for different benchmark circuits (series: NPB, PPB, FPB). For better exhibition purposes, data for voter, priority, and sin benchmark circuits is scaled down by a factor of 10.

Figure 5.13: A 4-bit Kogge-Stone adder circuit implemented using the clock-follow-data clocking scheme [49]. The worst stage delay is 15.7 ps, which determines the minimum clock period (for post synthesis). Critical nodes with stage delay of 15.7 ps are highlighted, and gates at different logic levels are shown. To generate a delay value equal to or larger than the clock period of this circuit, at least four series JTLs should be used. Therefore, to comply with the clock-follow-data clocking scheme, the clock pulse of a gate at level $i$ needs to go through $(i-1) \times 4$ series JTLs. Only splitters which are needed for splitting clock pulses between different levels are shown; other splitters are not shown in this graph. Note that in the above circuit only one clock, i.e., the slow clock, is needed.

The last thing that we want to compare in this section is the experimental results of two different designs for KSA32. The first design is what is presented in Example 13, which uses two KSA16 blocks based on PPB DCM. Let's call this adder Adder1. The second design is an adder created using PPB-based DCM with a single block of KSA32 having the same value of the imbalance factor $\alpha$ as in Adder1. Let's call this design Adder2. Adder1 requires 1876 total nodes with a latency of 182.2 ps given $\alpha = 3$. Adder2 requires 2171 total nodes and its latency is 107.6 ps. The throughput of both adders will be the same. Therefore, if a lower node count is more important in a design, Adder1 is a better choice, while if latency has a higher priority, Adder2 will be a better option.

Note: In the most general case, when there is more than one individual block in the given linear pipeline architecture (Figure 5.3a) and at least one of these blocks (except for the first block) has an imbalance factor that is equal to the maximum imbalance factor in the whole circuit, the $\alpha + 1$ factor mentioned in Section 5.1.2.4 is better changed to $\alpha + 2$. Even though this results in more degradation in the peak throughput ($1/(\alpha + 2)$ instead of $1/(\alpha + 1)$), it will make it much easier to meet the timing requirements for correct operation of the circuit that is realized using DCM. When there is only one individual block in the given circuit (which is the case for the experimental results presented in Section 5.1.3: Figures 5.10-5.12, Table 5.1), using $\alpha + 1$ is adequate. Also, to make sure that the timing requirements will be met more easily, it is better to use repeating NDROs and garbage collecting AND gates for feedforward wires rather than using DROs.
5.1.4 Conclusions

In this section, two dual clocking methods for realization of Single Flux Quantum (SFQ) circuits, based on the No Path Balancing (NPB) and Partial Path Balancing (PPB) methods, are presented. In these methods, a micro clock is used as the input clock signal for gates in the original blocks of the given circuit, and a macro clock is used for sampling the correct output values of these blocks. Some NDRO DFFs are employed to repeat the inputs of each original block, and some garbage collecting AND gates are used to sample the valid outputs of a block. The proposed method helps reduce the total Josephson junction count, total area, and total node count of SFQ circuits by orders of magnitude compared to the standard full path balancing method. Our approach increases the similarity between the realization of SFQ circuits and CMOS circuits at the gate level, RTL, and higher abstraction levels; hence, it opens doors for employing mature techniques developed for CMOS circuits in their SFQ counterparts.

5.2 Depth-Bounded Levelized Graph Partitioning Algorithm

Graph partitioning (GP) consists of dividing the nodes of a graph into smaller (typically, equal size) parts to minimize a cost function subject to some constraints. GP plays an important role in many different applications including parallel processing, processing complex networks, image processing, and VLSI design [18]. The application of GP in parallel processing is to divide a task evenly among $k$ processors to ensure load balance and to minimize the amount of communication among them. As an example of complex networks, GP is used in social networks for solving the community detection problem [115], where it is shown that widely-used maximum likelihood methods for community detection in social networks can be translated into a min-cut graph partitioning problem. In image processing, GP can be used for image segmentation, which is a fundamental task in computer vision [116].
5.2.1 Introduction

In SFQ circuits, most circuit elements (including all logic cells) are clocked elements, i.e., each logic cell becomes a pipeline stage. For correct operation of an SFQ logic cell, it is necessary for the different inputs of the cell to arrive at the cell input in the same clock cycle. Hence, when different inputs take different path lengths to traverse, they must be explicitly path balanced using DFFs. This is known as the Full Path Balancing (FPB) method and usually requires insertion of many DFFs into the circuit, which imposes significant power, area, and even delay overheads (see Figure 5.2).

In this section, we present Depth-bounded Levelized Graph Partitioning (DLGP) as a new Graph Partitioning (GP) problem and provide an optimal solution for it using a dynamic programming algorithm. Thanks to DLGP and our new dual clocking method, the problem of FPB overheads is addressed. In particular, using our method for realization of SFQ circuits, the total static power consumption, which is the main source of power consumption in SFQ circuits [111], is reduced significantly compared to the FPB method. The main contributions of this section are as follows:

• Introducing a new problem in GP (DLGP) and providing an optimal solution for it.
• Presenting a new SFQ logic gate called the pulse-repeating gate.
• Employing a novel dual clocking method for realization of SFQ circuits (introduced in [119, 122]), which removes the need for the expensive full path balancing step in design and implementation of SFQ circuits.
• Realizing SFQ circuits by utilizing the said DLGP algorithm and the dual clocking method, resulting in a significant reduction in power consumption, area, and other important metrics for evaluation of SFQ circuits.

The rest of this section is organized as follows: Section 2 gives some background knowledge on superconducting SFQ logic and on graph partitioning. Section 5.2.2 reviews some of the previous graph partitioning works.
Section 5.2.3 presents the DLGP problem formulation and our algorithm for solving it. Section 5.2.4 explains the dual clocking method for realization of SFQ circuits. In Section 5.2.5 experimental results are given, and Section 5.2.6 concludes the manuscript.

5.2.2 Related Work

The balanced min-cut $k$-way partitioning problem is NP-complete [51, 66]. It remains NP-complete even for $k = 2$ and with identical vertex size and unit edge weight [51, 93]. There are exact and approximate algorithms for solving GPP. For example, an exact algorithm based on branch and bound for solving edge weighted GPP with a lower bound and an upper bound on the size of each set is presented in [58]. The worst case computational complexity of this algorithm is $O(2^n)$, where $n$ is the number of nodes in the graph. It is shown in [6] that there is no constant-factor approximation for the perfectly balanced min-cut $k$-way partitioning problem. However, if the balance constraint is relaxed by a factor $\epsilon$ with $0 < \epsilon \le 1$, an $O(\log^2 n)$ factor approximation is available [39]. If $\epsilon > 1$, an approximation factor of $O(\log n)$ can be achieved. For the min-cut $k$-way partitioning problem without any constraints on balancing, Goldschmidt et al. proved that an optimal solution with complexity of $O(n^{k^2})$ can be found [52]. However, if $k$ is not part of the input, the problem is NP-complete [52].

The first well-known heuristic for solving the 2-way balanced partitioning problem is the Kernighan-Lin heuristic (K-L) [79]. In K-L, a pairwise interchange process is performed by exchanging vertex pairs that yield the largest decrease in cut-size. The exchanged vertices are locked and the process continues until all vertices are fixed. A well-known modification of the K-L heuristic is presented by Fiduccia and Mattheyses in [42]. The Fiduccia and Mattheyses heuristic (F-M) handles unbalanced partitions, supports hyper-edges, and has a run-time that is linear in the total number of circuit pins.
After F-M, many other papers presented effective partitioning heuristics using simulated annealing [70], a multilevel approach [2, 62, 73], network flow [97, 98], a spectral method [57, 101], or a unified approach combining a few methods [87, 97]. In [90], a module labeling and clustering algorithm is presented which is known as Lawler's clustering algorithm. In Lawler's clustering algorithm, an optimal solution for tree clustering to minimize delay with constraints on the size and the maximum pin number of each cluster is given. A unit delay model is used for each cluster and it is assumed that delays of interconnects are zero. Lawler's algorithm provides an optimal solution for DAGs if replication of modules is allowed. Rajmohan et al. presented a clustering algorithm for minimizing the delay in combinational circuits [128]. Unlike Lawler's algorithm, in [128] a general delay model is used and an optimal solution subject to area capacity constraints is given.

More recently, Dominique et al. presented a shared-memory $k$-way partitioning method to increase the speed of partitioning [88]. In this paper, the partitioning engine is refined in order to use parallelism, and unlike the previous methods, the quality of partitioning is not degraded while a speedup of a few times is achieved [88]. In [102], a label propagation technique is used to partition very large graphs with up to three billion nodes. By introducing a sizing constraint, the authors in [102] were able to apply the label propagation technique in both the coarsening and refinement phases of multilevel graph partitioning. In [138], a new graph partitioning problem called doubly balanced connected graph partitioning (DBCP) is introduced and studied. In DBCP, a weighting function $p : V \to \{-1, +1\}$ is applied to the nodes of a given graph $G = (V, E)$ and assigns a weight of $-1$ or $+1$ to each node subject to $p(V) = \sum_{v_i \in V} p(v_i) = 0$.
The objective is to partition $V$ into two subsets $V_1, V_2 \subset V$ such that the sub-graphs corresponding to these subsets of nodes are connected, $|p(V_1)|, |p(V_2)| \le c_p$, and $\max\{ |V_1|/|V_2|, |V_2|/|V_1| \} \le c_s$ for some constant values $c_p$ and $c_s$. The application of DBCP is in power grid islanding [141]. In [3, 18, 26, 61], good surveys on GP and its applications are given.

Different from other GPPs, in DLGP there is a constraint on the depth of each partition, while balancing the partitions in terms of total size is not of interest. Under the standard definition, DLGP is considered unbalanced partitioning. However, since in DLGP the depth of the partitions should be the same, it can be called depth-balanced or depth-limited partitioning. In this context, the standard balanced partitioning problem can be called the size-balanced partitioning problem.

5.2.3 Proposed Algorithms

In this section, first, the DLGP problem is defined, then a method to convert it to a chain graph partitioning problem is presented, and an optimal solution to the latter problem is given. Finally, a method of generating a solution for the DLGP problem using solutions of the latter problem is also presented. A directed acyclic graph as shown in Figure 5.14 is used as a running example to better explain concepts and algorithms.

DLGP problem definition: Given a directed acyclic graph $G = (V, E)$, a mapping function $\lambda$ which specifies the level of each node in $V$ to be between 1 and a maximum value of $L$ (i.e., the depth of $G$), and a positive integer $p$, partition set $V$ into $K$ parts $V_1, V_2, \dots, V_K$, each of which gives rise to an induced sub-graph $G_1, G_2, \dots, G_K$, such that (i) the depth of each sub-graph is no more than $p$, and (ii) if a node of level $l$ is included in some part $V_k$, then all nodes of level $l$ must also belong to this part.
Furthermore, the total cut size (TCS) as defined below is minimized:

$$TCS = \sum_{l \in V'} cut\_size(\langle Pre(l), Post(l) \rangle) \qquad (5.6)$$

$$V' = \{\, l_i : l_i = \max \lambda(v_j) \;\; \forall v_j \in V_i,\; 1 \le i < K \,\}$$

where $\lambda(\cdot)$ is the function that gives the level of a node, $Pre(l)$ denotes all nodes $i$ in $G$ such that $\lambda(i) \le l$, and $Post(l)$ denotes all nodes $j$ in $G$ such that $\lambda(j) > l$. $\langle Pre(l), Post(l) \rangle$ denotes the edge separator between levels $l$ and $l+1$ in $G$, i.e., the set of edges in $G$ which originate at any node $i$ in $Pre(l)$ and terminate at any node $j$ in $Post(l)$. $cut\_size(\langle Pre(l), Post(l) \rangle)$ is the number of edges in $G$ that exist between the $Pre(l)$ and $Post(l)$ sets.

Figure 5.14: An example directed acyclic graph used in explaining DLGP and DCGP problems and the proposed algorithms for graph partitioning. This graph has 10 internal nodes ($v_1$ to $v_{10}$) and the levels of these nodes range from 1 to 5. Triangular shapes at the bottom of the graph represent primary inputs and those at the top show primary outputs. Color-coding is used for easy identification of hyper-edges. 2-pin nets are shown in black color.

Having $p = 2$ and $K = 3$, a partitioning solution for the graph in Figure 5.14 is as follows. There are three parts $V_1 = \{v_1, v_2, v_3, v_4\}$, $V_2 = \{v_5, v_6\}$, and $V_3 = \{v_7, v_8, v_9, v_{10}\}$ with induced sub-graphs $G_1$, $G_2$, and $G_3$ as shown in Figure 5.15. $Pre(2)$ includes nodes $v_1$, $v_2$, $v_3$, and $v_4$, whereas $Post(2)$ has the rest of the nodes. $Post(3)$ includes $v_7$ through $v_{10}$, while $Pre(3)$ has the remaining nodes. $cut\_size(\langle Pre(2), Post(2) \rangle)$ is equal to four and $cut\_size(\langle Pre(3), Post(3) \rangle)$ has a value of four as well.

Figure 5.15: The induced sub-graphs achieved in partitioning the graph in Figure 5.14 into three parts: $V_1 = \{v_1, v_2, v_3, v_4\}$, $V_2 = \{v_5, v_6\}$, and $V_3 = \{v_7, v_8, v_9, v_{10}\}$, having $p = 2$ and $K = 3$. (a) $G_1$, (b) $G_2$, and (c) $G_3$.

Next, we define a weighted directed chain graph $C = (U, F, w)$ with nodes labeled $1 \dots L$, where $L$ is equal to the depth of graph $G = (V, E)$.
Each such node represents all nodes of $G$ which are at the same level and gets that common level as its label. There will be a directed edge $uv$ in $F$ between nodes labeled $u$ and $v$ only when $v = u + 1$. The weight of the incoming edge to node $v$ accounts for the edges of $G$ that cross from the levels up to $u$ into the levels from $v$ onward. More precisely, if $v = u + 1$, the weight $w_{uv}$ of the $uv$ edge is defined as the number of directed edges in $G$ that connect any node in $Pre(u)$ to any node in $Post(u)$. If $v \ne u + 1$, $w_{uv} = 0$.

Note that the above definitions and weight assignment function work equally well for hyper-graphs and hyper-edges (simply add "hyper" before any occurrence of "graph" or "edge"). A directed hyper-edge is one with a distinguished connected node called a source node, thereby establishing a clear sense of directionality between the source node and all other connected sink nodes. Also, in calculating weights, a hyper-edge is counted only once when it is cut by a cutline such as $C_1$ during the partitioning process. For the running example, $cut\_size(\langle Pre(2), Post(2) \rangle)$ is equal to three and $cut\_size(\langle Pre(3), Post(3) \rangle)$ has a value of two in case of using hyper-edges.

Depth-bounded Chain Graph Partitioning (DCGP) problem definition: Given a weighted chain graph $C = (U, F, w)$, a mapping function $\lambda$ which assigns a label to each node in $U$ to be between 1 and a maximum value of $L$ (i.e., the depth of the graph), and a positive integer $p$, partition set $U$ into $K$ parts $U_1, U_2, \dots, U_K$ such that (i) the depth of each part is no more than $p$, and (ii) the total cut weight as defined below is minimized:

$$TCW = \sum_{k \in V''} cut\_weight(\langle Pre(k), Post(k) \rangle) \qquad (5.7)$$

$$V'' = \{\, k_i : k_i = \max \lambda(u_j) \;\; \forall u_j \in U_i,\; 1 \le i < K \,\}$$

where $cut\_weight(\langle Pre(k), Post(k) \rangle)$ is calculated as the sum of the edge weights $w_{ij}$ of every distinct pair of nodes $i \in Pre(k)$ and $j \in Post(k)$. Based on the way that the directed weighted chain graph is constructed, this sum will be equal to the weight of the edge connecting node $k$ to node $k+1$.
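The construction of the chain graph weights can be sketched in a few lines. The example below is an illustration under stated assumptions: the function name `chain_weights` and the four-node DAG are hypothetical (the edge list of Figure 5.14 is not reproduced here), only regular two-pin edges between internal nodes are counted, and edges from primary inputs or to primary outputs as well as hyper-edges are left out of the sketch.

```python
def chain_weights(level, edges):
    """Return {l: w} where w is the weight of the chain edge from l to l + 1,
    i.e. the number of edges of the DAG going from Pre(l) to Post(l).

    level: dict mapping each node to its level (1 .. L)
    edges: iterable of (source, sink) pairs with level[source] < level[sink]
    """
    depth = max(level.values())
    weights = {l: 0 for l in range(1, depth)}
    for src, dst in edges:
        # An edge from level a to level b crosses every cut l with a <= l < b.
        for l in range(level[src], level[dst]):
            weights[l] += 1
    return weights


if __name__ == "__main__":
    # Hypothetical three-level DAG used only to exercise the function.
    level = {"a": 1, "b": 1, "c": 2, "d": 3}
    edges = [("a", "c"), ("b", "c"), ("a", "d"), ("c", "d")]
    print(chain_weights(level, edges))
```

Note that an edge spanning several levels, such as the one from level 1 to level 3 in the hypothetical graph, contributes to every chain edge it passes over, which is what the Pre/Post based definition of the weights implies.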
The directed weighted chain graphs for the running example using regular edges and hyper-edges are depicted in Figure 5.16b and Figure 5.16c, respectively. These figures also show the optimal parts which minimize TCW. With regular edges, $\mathrm{cut\_weight}(\langle \mathrm{Pre}(2), \mathrm{Post}(2) \rangle)$ is equal to $w_{23}=4$, and $\mathrm{cut\_weight}(\langle \mathrm{Pre}(3), \mathrm{Post}(3) \rangle)$ is $w_{34}=4$. With hyper-edges, $\mathrm{cut\_weight}(\langle \mathrm{Pre}(2), \mathrm{Post}(2) \rangle)$ is three and $\mathrm{cut\_weight}(\langle \mathrm{Pre}(3), \mathrm{Post}(3) \rangle)$ is two. As shown in Figure 5.16, the cuts that realize the optimal parts minimizing TCW are $C_1$ and $C_3$ when regular edges are used, and $C_1$ and $C_2$ when hyper-edges are used.

Figure 5.16: The optimal cuts for the graph in Figure 5.14 and the corresponding directed weighted chain graphs. (a) The optimal cuts of the graph, which give the optimal parts minimizing TCS. The directed weighted chain graph with optimal parts is shown using (b) regular edges ($w_{12}=7$, $w_{23}=4$, $w_{34}=4$, $w_{45}=3$) and (c) hyper-edges ($w_{12}=6$, $w_{23}=3$, $w_{34}=2$, $w_{45}=3$) for calculating weights. $\mathrm{TCW}=7$ in (b) and $\mathrm{TCW}=5$ in (c).

Lemma 5.2.1 The DCGP problem can be solved optimally using a dynamic programming algorithm.

We use dynamic programming to find the optimal solution of the DCGP problem. The $i$-th instance of the optimal solution, $O(i)$, is defined as the partitioning solution for a sub-graph $G_i$ of the original graph $G$ that minimizes the total cut weight as defined by Eq. 5.7. Notice that $G_i$ is an induced graph obtained from $G$ by including all nodes of $G$ with levels less than or equal to $i$. $O(L)$ denotes the optimal partitioning solution for the whole graph. The value of the optimal solution for the $i$-th instance of the problem is denoted by $\mathrm{OPT}(i)$. We initialize $\mathrm{OPT}(i)=0$ for all $1 \le i \le p$, where $p$ is the given depth limit in the DCGP problem.
Next, the value of the optimal solution, $\mathrm{OPT}(i)$, which is defined as the minimum value of TCW for the induced graph $G_i$, is calculated recursively as follows:

$$\mathrm{OPT}(i) = \min_{1 \le q \le p} \{\, \mathrm{OPT}(i-q) + \mathrm{cut\_weight}(\langle \mathrm{Pre}(i-q), \mathrm{Post}(i-q) \rangle) \,\} \qquad (5.8)$$

Proof of optimality: If a problem satisfies the principle of optimality, dynamic programming yields the optimal solution for this problem. We therefore need to show that the optimal solution $O(i)$ of a problem instance is built from optimal solutions of its sub-problems. For this purpose, we use proof by contradiction as follows. Suppose that the $i$-th instance of the problem, with optimal solution $O(i)$, has a sub-problem $i-q$ with optimal solution $O(i-q)$ and optimal value $\mathrm{OPT}(i-q)=M$. Suppose that $O(i)$ is built from a solution for the $(i-q)$-th sub-problem with value $M' > M$. Let us call this solution $O'(i-q)$. Now, we can generate another solution for the $i$-th instance of the problem by replacing $O'(i-q)$ with $O(i-q)$. Since $M < M'$, the new solution for the $i$-th instance of the problem is better than the first one, which is a contradiction because the first solution was supposed to be optimal. Therefore, the optimal solution for the $i$-th instance of the problem is built from the optimal solutions for its sub-problems.

Lemma 5.2.2 DCGP and DLGP are equivalent problems, i.e., solving the DCGP problem yields the solution of the DLGP problem and vice versa.

Proof: It is enough to show that a solution to the DCGP problem can be transformed into a valid solution for the DLGP problem and vice versa. We use proof by contradiction for this purpose. Let $G=(V,E)$ be the given directed acyclic graph, and let the two positive integers $p$ and $K$ be the maximum depth of an induced sub-graph representing a part and the number of parts, respectively. Let $C=(U,F,w)$ be the equivalent directed weighted chain graph obtained by following the procedure presented in this section.
Suppose that there is a solution for the DCGP problem on graph $C$ that partitions $U$ into $U_1, U_2, \ldots, U_K$ and minimizes TCW; let us call this minimum value $\mathrm{TCW}_1$. A solution for the DLGP problem on graph $G$ that partitions $V$ into $V_1, V_2, \ldots, V_K$ can be generated as follows. Add to $V_1$ all nodes $v \in V$ with level smaller than or equal to the largest label of nodes in $U_1$; let us call this largest label $L_1$. Add to $V_2$ all nodes $v \in V$ with level larger than $L_1$ and smaller than or equal to the largest label of nodes in $U_2$. Similarly, $V_3, \ldots, V_K$ can be constructed. This solution minimizes TCS. For the sake of contradiction, assume that there is another partitioning solution for graph $G$ that partitions set $V$ into $V'_1, V'_2, \ldots, V'_K$ and gives a lower value for TCS. This solution gives rise to a partitioning solution for the equivalent directed weighted chain graph $C'=(U',F',w')$ with parts $U'_1, U'_2, \ldots, U'_K$. Following the definition of the directed weighted chain graph and the way its weights are calculated, it is easily seen that this new partitioning solution has a value of TCW which is smaller than $\mathrm{TCW}_1$. This is a contradiction, because it was assumed that the first partitioning solution, with total cut weight $\mathrm{TCW}_1$, is the optimal solution with the minimum possible total cut weight. Therefore, a solution for the DCGP problem that minimizes TCW leads to a solution for the DLGP problem that minimizes TCS. Similarly, it can be shown that a partitioning solution for the DLGP problem minimizing TCS leads to a solution for the DCGP problem minimizing TCW.

Note that when $p$ is equal to or larger than the depth of graph $G$, there is only one part, equal to $G$ itself, and the problem is trivial.

Theorem 5.2.3 The DLGP problem can be solved optimally in polynomial time.

Proof: Based on Lemma 5.2.1, the DCGP problem can be solved optimally using the dynamic programming algorithm.
Also, based on Lemma 5.2.2, DCGP and DLGP are equivalent problems, meaning that an optimal solution for the DCGP problem leads to an optimal solution for the DLGP problem and vice versa. Therefore, the DLGP problem can be solved optimally, and since dynamic programming is used, it has polynomial time complexity.

After finding the optimal solution, parts can be generated by tracing the $O(L)$ solution back as follows. Generate an empty set of selected levels $N_{sel}$. Add the indices of the sub-problems of $O(L)$ to $N_{sel}$, i.e., if $j=L-q$ yields the minimum value for $\mathrm{OPT}(L)$ in Eq. 5.8, add $j$ to $N_{sel}$. Repeat these steps for $O(j)$, and trace all the way back until a boundary sub-problem is reached. At the end, we have $N_{sel}=\{m_1, m_2, \ldots, m_{K-1}\}$. Given $N_{sel}$, the $i$-th subset of nodes, $V_i$, corresponding to the $i$-th part in the DLGP problem is obtained using the following equation:

$$V_i = \begin{cases} \{v \in V \mid 0 < \mathrm{level}(v) \le m_1\} & : i = 1 \\ \{v \in V \mid m_{i-1} < \mathrm{level}(v) \le m_i\} & : 1 < i < K \\ \{v \in V \mid \mathrm{level}(v) > m_{K-1}\} & : i = K \end{cases} \qquad (5.9)$$

in which $\mathrm{level}(v)$ returns the level of node $v$. Algorithm 6 shows the pseudo code of the DLGP algorithm.

Algorithm 6: DLGP algorithm
Input: $G=(V,E)$; $p$: constraint on depth; a mapping function which returns the level of a node in $G$.
Output: An optimal set $P=\{V_1, V_2, \ldots, V_K\}$ of parts.
1  L = Compute_Graph_Depth(G)
2  if p ≥ L then
3      return P = {V}
4  Generate the weighted directed chain graph C = (U, F, w)
5  for i = 1; i ≤ p; i++ do
6      OPT(i) = 0
7  for i = p+1; i ≤ L; i++ do
8      Find O(i) and calculate its value, OPT(i), using Eq. 5.8
9  Find N_sel = {m_1, m_2, ..., m_{K-1}} by tracing back from the O(L) solution
10 for i = 1; i ≤ K; i++ do
11     Find V_i using Eq. 5.9
12 return P = {V_1, V_2, ..., V_K}
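The recurrence of Eq. 5.8, the trace-back, and the part construction of Eq. 5.9 can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the implementation inside ABC: the function names are hypothetical, the chain graph is passed as a plain list of weights, and ties in the minimization are broken in favor of the larger $q$ (fewer parts), which is a choice made here because it reproduces the selected levels reported for the examples in this section.

```python
def dlgp_cuts(weights, p):
    """Solve DCGP on a chain graph by dynamic programming (Eq. 5.8).

    weights[u - 1] holds w_(u,u+1) for u = 1 .. L-1, so L = len(weights) + 1.
    Returns (TCW, N_sel), where N_sel is the sorted list of selected levels.
    """
    L = len(weights) + 1
    if p >= L:                      # a single part suffices, nothing is cut
        return 0, []
    opt = [0] * (L + 1)             # opt[i] = OPT(i); OPT(1..p) = 0
    choice = [0] * (L + 1)          # choice[i] = j = i - q that minimizes OPT(i)
    for i in range(p + 1, L + 1):
        best, best_j = None, None
        for q in range(1, p + 1):
            j = i - q
            cand = opt[j] + weights[j - 1]     # cut between levels j and j+1
            if best is None or cand <= best:   # "<=": prefer larger q on ties
                best, best_j = cand, j
        opt[i], choice[i] = best, best_j
    cuts, i = [], L
    while i > p:                    # trace back to a boundary sub-problem
        i = choice[i]
        cuts.append(i)
    return opt[L], sorted(cuts)


def parts_from_cuts(levels, cuts):
    """Group nodes into parts according to Eq. 5.9.

    levels: dict mapping node -> level; cuts: sorted selected levels N_sel.
    """
    bounds = list(cuts) + [float("inf")]
    parts = [[] for _ in bounds]
    for node, lv in levels.items():
        for k, upper in enumerate(bounds):
            if lv <= upper:
                parts[k].append(node)
                break
    return parts


# Chain weights of Figure 5.16 with p = 2.
print(dlgp_cuts([7, 4, 4, 3], 2))   # regular edges: (7, [2, 4])
print(dlgp_cuts([6, 3, 2, 3], 2))   # hyper-edges:   (5, [2, 3])
```

For the two chain graphs of Figure 5.16 the sketch returns the same TCW values (7 and 5) and selected levels ({2, 4} and {2, 3}) as stated in the text. The two nested loops take O(pL) time, and the grouping step visits each node once.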
In the DLGP algorithm, the complexities of line 1 and line 4 are $O(m+n)$ each, where $n$ is the node count and $m$ is the edge count. The complexity of lines 7-8 is $O(pL)$, because the for loop runs for at most $L$ iterations and, for each iteration, $p$ different values should be considered in Eq. 5.8. The complexity of lines 10-11 is $O(n)$, because we visit each node only once and put it in a part based on Eq. 5.9. Therefore, the overall complexity of the DLGP algorithm is $O(m+n+pL)$, which is $O(m+n)$ when $p$ is treated as a constant.

For the graph shown in Figure 5.14, when regular edges are used, the selected levels are $N_{sel}=\{2, 4\}$, and the subsets of nodes corresponding to the optimal parts are $V_1=\{v_1,v_2,v_3,v_4\}$, $V_2=\{v_5,v_6,v_7,v_8,v_9\}$, and $V_3=\{v_{10}\}$. However, when hyper-edges are used, the selected levels are $N_{sel}=\{2, 3\}$, and the subsets of nodes corresponding to the optimal parts are $V_1=\{v_1,v_2,v_3,v_4\}$, $V_2=\{v_5,v_6\}$, and $V_3=\{v_7,v_8,v_9,v_{10}\}$.

Figure 5.17: (a) A graph used in Example 15, corresponding to a 4-bit mapped adder with inputs $a_0$ to $a_3$, $b_0$ to $b_3$, and $c_{in}$, and outputs $sum_0$ to $sum_3$ and $c_{out}$. Color-coding is used to show hyper-edges. 2-pin nets are shown in black. (b) The corresponding weighted chain graph ($w_{12}=9$, $w_{23}=16$, $w_{34}=10$, $w_{45}=10$, $w_{56}=7$) with $p=2$ and $K=3$. Hyper-edges are used for calculating weights and therefore for finding the optimal cuts ($C_1$, $C_2$) and parts. For the shown partitioning solution, $\mathrm{TCW}=26$.

Example 15 Figure 5.17a shows a graph with 31 internal nodes and a depth of six ($L=6$).
This graph is an implementation of a 4-bit adder; nodes are labeled from $v_1$ to $v_{31}$, and each node also has a logic gate functionality, i.e., and2, or2, or xor2. For $p=2$, $K=3$, and using hyper-edges for calculating weights, the corresponding directed weighted chain graph is as shown in Figure 5.17b. Using the DLGP algorithm, the selected levels are $N_{sel}=\{2, 4\}$, and the subsets of nodes corresponding to the optimal parts are $V_1=\{v_1, v_2, \ldots, v_{15}, v_{16}\}$, $V_2=\{v_{17}, v_{18}, \ldots, v_{25}, v_{26}\}$, and $V_3=\{v_{27}, v_{28}, v_{29}, v_{30}, v_{31}\}$. The induced sub-graphs for these parts are shown in Figure 5.18.

Figure 5.18: Induced sub-graphs for the graph shown in Figure 5.17a corresponding to the partitioning solution with $p=2$ and $K=3$ shown in Figure 5.17b. (a) $G_1$, (b) $G_2$, and (c) $G_3$.

5.2.4 Realization of SFQ Circuits using Dual Clocks

Evaluation of SFQ gates is destructive with respect to any internal state (loop current state) of the gate and any incoming input pulses. In other words, after an SFQ gate receives a clock pulse to produce its output, any stored internal state of the gate is destroyed and the input pulse is consumed. For example, if a 2-input AND gate receives a "1" pulse on its $in_1$ input before the clock signal arrives, it stores this input pulse as a persistent current in one of its internal loops. Next, if the gate receives a second "1" pulse on its $in_2$ input, again before the clock arrives, it stores this input value in a second internal loop as a persistent current.
Finally, when the clock input to the gate arrives, both loop currents reset (revert back to the other direction of current flow) and an output pulse is produced to signify the "1" output of the AND gate.

Now consider a situation in which $in_1$ arrives in clock cycle 1 whereas $in_2$ arrives in clock cycle 2. One would expect the AND gate to produce a 0 at the end of clock cycle 1 and a 1 at the end of clock cycle 2. However, this is not the case. In SFQ logic, the AND gate receives and consumes the input pulse on $in_1$ during clock cycle 1, producing a 0 output, and then receives and consumes the input pulse on $in_2$ in cycle 2, again producing a 0 output. This is precisely why the full path balancing method is employed to make sure that the AND gate produces the correct "1" pulse output at the end of clock cycle 2.

Our key observation is that, to avoid path balancing DFFs, all we have to do is make sure that the producer of the "1" pulse on $in_1$ produces that same pulse in both clock cycles 1 and 2. The AND gate then produces a wrong value of 0 at the end of clock cycle 1 but the correct value of 1 at the end of clock cycle 2. So, as long as new data is initiated toward the AND gate with a slow clock whose frequency is half of the fast clock frequency used to clock the AND gate, the AND gate produces the correct output at multiples of the slow clock period.

This observation was first reported in [119], and is presented as an architectural-level solution for the design and implementation of SFQ circuits in [122]. This method effectively eliminates path balancing DFFs inside an SFQ logic circuit, although it comes at the expense of using two different clocks (micro and macro clocks) and a number of Non-Destructive Read Out (NDRO) DFFs (output-replicating or repeating). These NDRO DFFs are read by the micro clock and are written by the macro clock.
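The destructive evaluation described above can be illustrated with a purely behavioral Python sketch. It is not a circuit-level model: the class name, the two boolean "loop" flags, and the per-cycle schedule format are assumptions introduced here for illustration only.

```python
class SfqAnd2:
    """Behavioral sketch of a clocked 2-input SFQ AND gate."""

    def __init__(self):
        self.loop1 = False  # persistent current stored by a pulse on in1
        self.loop2 = False  # persistent current stored by a pulse on in2

    def pulse(self, port):
        """Register an input pulse on port 1 or 2 before the clock arrives."""
        if port == 1:
            self.loop1 = True
        elif port == 2:
            self.loop2 = True
        else:
            raise ValueError("port must be 1 or 2")

    def clock(self):
        """Evaluate the gate; both stored states are destroyed by the read."""
        out = int(self.loop1 and self.loop2)
        self.loop1 = self.loop2 = False
        return out


def run(schedule):
    """schedule[c] lists the ports that receive a pulse in clock cycle c."""
    gate = SfqAnd2()
    outputs = []
    for ports in schedule:
        for port in ports:
            gate.pulse(port)
        outputs.append(gate.clock())
    return outputs


print(run([[1, 2]]))       # both pulses in the same cycle      -> [1]
print(run([[1], [2]]))     # in1 in cycle 1, in2 in cycle 2     -> [0, 0]
print(run([[1], [1, 2]]))  # in1 repeated in cycles 1 and 2     -> [0, 1]
```

The last schedule mirrors the key observation: when the producer of $in_1$ repeats its pulse in the second cycle, the first output is invalid but the second one carries the correct value.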
Since the NDRO DFFs are read by the micro clock, the correct pulses are re-generated in each cycle of the micro clock and put on the primary inputs of each part. Notice that Destructive Read Out (DRO) DFFs cannot be used here: because the read operation in DRO DFFs is destructive, they cannot re-generate the correct pulses the $p+1$ times that are needed here.

As explained above and in Section 2.1.3, full path balancing is required for correct operation of SFQ circuits. One way of addressing the path balancing problem is to add as many DFFs as required to remove any differences among the levels of the inputs to any SFQ gate (FPB method). In [75], it is suggested to apply the standard retiming algorithm [92] after a heuristic FPB algorithm to minimize the number of path-balancing DFFs (called FPB+retiming). Even with this algorithm, our experience shows that FPB+retiming adds a large number of DFFs to the circuit, which can dominate the original gate count in the network even considering the fact that the area cost of a DRO DFF is somewhat less than the area cost of, say, a 2-input SFQ AND gate [45]. We will show how the DLGP algorithm helps solve this problem.

As hinted earlier, we propose to use a fast micro clock and a slow macro clock, and to use the DLGP algorithm to minimize the aforementioned overheads of FPB. Thanks to the DLGP algorithm, it is possible to divide the graph corresponding to a given SFQ circuit into a few depth-bounded parts and add NDRO DFFs only on the hyper-edges which are cut by the part boundaries. These DFFs pass values that go from one part to the next with the macro clock, while gates inside each part operate with the micro clock. Since the DLGP algorithm guarantees the minimum total cut weight, the number of NDRO DFFs inserted in the circuit is minimized.
Furthermore, since the NDRO DFFs which are placed at the inputs of each part are also clocked by the micro clock to continuously reproduce their outputs, there is no need to add any path balancing DFFs inside each part. The resulting SFQ circuit is thus functionally pipelined, allowing $K$ data instances to exist in the circuit at the same time, each being operated on in the corresponding part of the $K$-part circuit. The number of parts thus affects the total operational throughput of the circuit (when there are no pipeline stalls).

Figure 5.19: An example of (a) full path balancing (FPB), and (b) DLGP-based DCM (with $p=6$) for SFQ circuits. FPB uses DRO DFFs whereas the DCM uses NDRO DFFs. FPB requires 9 DRO DFFs and DCM requires 5 NDRO DFFs.

Figure 5.19 shows FPB and the DLGP-based Dual Clocking Method (DCM) for an example circuit. As seen, the DLGP-based DCM requires four fewer DFFs than FPB. Note that although NDRO DFFs are more expensive than DRO DFFs, the reduction in total DFF count far outweighs this difference in element cost (see the experimental results).

In our DLGP-based DCM, the depth of each part is at most $p$. Since gates in each part are evaluated by the micro clock, and the micro clock is $p+2$ times faster than the macro clock, several different invalid values hit the inputs of the DFFs at the inter-part boundaries between two consecutive edges of the macro clock. If every such value reached the NDRO DFFs, wrong values would be stored in the NDRO DFFs and passed to the next part. Instead, we want only the value which is generated in the last cycle of the micro clock to be written into the NDRO DFFs, because only this value is guaranteed to be valid.
This issue can easily be addressed using the following pulse-repeating gate and by ensuring that the macro clock has a certain phase relationship with the micro clock and a clock frequency which is $p+2$ times slower ($T_{Fast} = \frac{1}{p+2} T_{Slow}$).

We have invented an SFQ pulse-repeating gate, shown in Figure 5.20, to address the aforesaid challenges in the DLGP-based DCM. In this gate, an AND gate is added to the input of the NDRO DFF. One of the inputs of this AND gate is connected to the macro clock, which is "0" while the correct value has not yet been generated on the "in" port. Therefore, the AND gate passes a "0" to the input of the NDRO DFF, as desired. Only in the last cycle of the micro clock is the AND gate transparent, passing the valid value to the NDRO DFF. In addition, due to the use of the NDRO DFF, the inputs of each part are re-generated (repeated) at each cycle of the micro clock.

Figure 5.20: Pulse-repeating gate: an SFQ gate consisting of an NDRO DFF, an AND gate, and two splitters that deliver the macro and micro clocks to these gates.

The DLGP algorithm can be used to find the optimum places for inserting pulse-repeating gates in the dual clocking method presented above. In the DLGP-based dual clocking method, circuits are first mapped using ABC's technology mapping engine [38]. After that, these circuits are passed to the DLGP algorithm to find the optimum places for inserting pulse-repeating gates. After this step, splitter insertion and balancing of Primary Outputs (POs) are performed. Algorithm 7 shows the pseudo code for the DLGP-based DCM.

Algorithm 7: DLGP-based Dual Clocking Method
Input: a graph $G=(V,E)$ corresponding to the input circuit; $p$: constraint on depth.
Output: A timing-correct circuit represented by graph $G'=(V',E')$.
1  Gm = Technology_Mapping(G)
2  Parts = {V_1, V_2, ..., V_K} = DLGP(Gm, p)
3  Pulse_Repeating(Gm, Parts)
4  PO_Balancing(Gm)
5  G' = Splitter_Insertion(Gm)
6  return G'
In line 1, the given circuit is mapped using the technology mapping engine of ABC. In line 2, the optimal parts are determined. In line 3, the pulse-repeating gates are inserted on the hyper-edges which are cut by the DLGP algorithm. Line 4 takes care of balancing POs, and finally, line 5 inserts the required splitters.

5.2.5 Experimental Results

We implemented the DLGP algorithm inside an open source logic synthesis and verification tool called ABC [38]. An SFQ library of gates as in [45] is used. This library consists of the following gates: and2 with 12 JJs, or2 with 8 JJs, xor2 with 8 JJs, DFF with 7 JJs, splitter with 3 JJs, JTL with 2 JJs, and not with 9 JJs. Several benchmark circuits from ISCAS [59] and EPFL [37], and some arithmetic circuits, are used to test the effectiveness of our proposed algorithm in reducing the overhead of FPB. The complexity of these circuits ranges from the voter circuit with 1002 I/Os, 13758 nodes, 27516 edges, and 13758 cubes to 9sym with 10 I/Os, 1 node, 9 edges, and 87 cubes.

Table 5.2 and Figures 5.21 and 5.22 show experimental results for the DLGP-based DCM with two values of $p$, 10 and 5. For a better comparison, two baselines are considered: Baseline1 is FPB, and Baseline2 is the FPB+retiming algorithm.

Figure 5.21: Total static power consumption (mW) of different benchmark circuits generated by DLGP-based DCM (DLGP(10) and DLGP(5)), Baseline1 (FPB), and Baseline2 (FPB + retiming). For better exhibition purposes, data for priority, voter, i10, and s13207 is scaled down by a factor of 10.

Figure 5.22: Area (mm$^2$) of different benchmark circuits generated by DLGP-based DCM (DLGP(10) and DLGP(5)), Baseline1 (FPB), and Baseline2 (FPB + retiming). For better exhibition purposes, data for priority, voter, i10, and s13207 is scaled down by a factor of 10.

As seen, DLGP-based DCM provides substantial
savings in total static power consumption, total number of Josephson junctions (#JJs), area, run-time, and DFF count (#DFFs). Area and #JJs are for gates, DFFs, and splitters. The overheads of the AND gates (in pulse-repeating gates) and of the second clock in DLGP-based DCM are included in the experimental results of DLGP. #DFFs for DLGP includes the NDRO DFFs inserted at the boundaries of parts and the DRO DFFs used for PO balancing. In DLGP-based DCM and both baselines, the cut-based technology mapping of ABC (command "map"), which involves a delay optimization pass followed by a few area optimization passes, is employed. Other than what is mentioned so far, no other optimization function is used.

For the priority benchmark circuit with $p=10$, DLGP-based DCM saves 1.7× and 51% on #JJs compared with Baseline1 and Baseline2, respectively. For the same circuit, DLGP-based DCM provides 2.2× and 73% reduction in total area, and 8.9× and 4.3× reduction in #DFFs, compared with Baseline1 and Baseline2, respectively. Also, DLGP-based DCM reduces the total static power consumption of the priority circuit by 4.36× and 1.96× compared with Baseline1 and Baseline2, respectively. For $p=5$, the improvements are smaller. For example, for the same circuit, the savings in total static power consumption, #JJs, area, and #DFFs compared with Baseline1 are reduced to 64%, 69%, 63%, and 3.9×, respectively. For some benchmark circuits such as mult8 and voter, the #JJs and area of DLGP-based DCM are higher than those of Baseline2 when $p=5$. This shows that for these circuits the overhead of DCM is more than the saving it provides on path balancing DFFs. However, for all tested benchmark circuits, DLGP-based DCM with $p=10$ reduces total static power consumption, #JJs, and area considerably compared with both Baseline1 and Baseline2.
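As a worked check of the arithmetic, the #JJs and #DFFs savings quoted above for the priority circuit appear to be consistent with a saving measured relative to the DLGP result, i.e., (baseline − DLGP) / DLGP, applied to the entries of Table 5.2. The short Python sketch below recomputes them under that assumption; the formula is inferred from the numbers rather than stated in the text, and the helper name is hypothetical.

```python
def relative_saving(baseline, dlgp):
    """Saving of DLGP over a baseline, relative to the DLGP value (assumed convention)."""
    return (baseline - dlgp) / dlgp


# Entries for the priority circuit, copied from Table 5.2.
priority = {
    "#DFFs": {"Baseline1": 41771, "Baseline2": 22286, "DLGP(5)": 8562, "DLGP(10)": 4225},
    "#JJs": {"Baseline1": 433581, "Baseline2": 238731, "DLGP(5)": 257252, "DLGP(10)": 158568},
}

for metric, row in priority.items():
    for p_label in ("DLGP(10)", "DLGP(5)"):
        for base in ("Baseline1", "Baseline2"):
            s = relative_saving(row[base], row[p_label])
            # Printed both as a multiple and as a percentage of the DLGP value.
            print(f"{metric:6s} {p_label:8s} vs {base}: {s:.2f}x ({100 * s:.0f}%)")
```

Under this convention the script gives about 1.73× and 51% for #JJs and about 8.89× and 4.27× for #DFFs at $p=10$, and about 69% (#JJs) and 3.88× (#DFFs) against Baseline1 at $p=5$, which round to the values quoted in the text.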
On average over all 15 benchmark circuits, the savings in total static power consumption, area, #JJs, and #DFFs for DLGP-based DCM with $p=10$ are 1.33×, 89%, 81%, and 8.6×, respectively, over Baseline1, and 71%, 31%, 35%, and 4.9×, respectively, over Baseline2. The large saving in #DFFs does not translate fully into total area because of the overhead of the second clock and of the AND gates used in the pulse-repeating gates. DLGP-based DCM also decreases the run-time significantly. For example, for the c432 benchmark circuit, the run-time is decreased by 2.1× and 5.6× when $p=5$ compared with Baseline1 and Baseline2, respectively. The main reason behind the larger run-time values of the baselines is the need to insert many DFFs and to perform retiming, both of which are slow processes, especially for large benchmark circuits.

The above savings come at a cost: the peak throughput of circuits generated by the DLGP-based DCM is $p+2$ times lower than that of circuits generated by the FPB method. This is because the effective frequency of these circuits is set by the macro (slow) clock, whereas circuits generated by FPB operate at the frequency of the micro (fast) clock. Note that the actual throughput is typically much less than the peak throughput (due to instruction data dependencies, program branches, etc.), so some peak throughput loss may be acceptable. In addition, due to the following property of SFQ circuits, the throughput loss after place-and-route is less than the aforementioned value: in SFQ circuits, the delays of interconnects are typically larger than the delays of gates; hence, the longest interconnect usually determines the worst-case delay. Therefore, since DLGP-based DCM reduces the total gate count significantly, it also reduces the length of the longest interconnect, resulting in faster local clock frequencies compared to the baseline. This helps recover some of the lost peak throughput of the DLGP-based DCM. For example, the throughput loss for the ISCAS c432 circuit generated by DLGP-based DCM ($p=5$) after place-and-route is reduced from 7× to 6.3× compared with the FPB method. A more advanced wire-routing method such as the one presented in [85] can help reduce this throughput gap further. Figure 5.23 shows the post place-and-route layout of the ISCAS c432 benchmark circuit synthesized by our proposed DCM.

Figure 5.23: Post place-and-route results of the ISCAS c432 benchmark circuit synthesized by our proposed DCM. Dimensions of the chip are 5150 µm × 4330 µm. Note that synthesis and place-and-route of the slow clock tree is not done here (this will not have a large impact on the reported dimensions since the number of sink nodes for the slow clock is very small for this circuit).

5.2.6 Conclusions

This section introduces a new graph partitioning problem called Depth-bounded Levelized Graph Partitioning (DLGP). In DLGP, there is a depth constraint on the sub-graph induced by each part. We showed that by transforming the DLGP problem into a Depth-bounded Chain Graph Partitioning (DCGP) problem, an optimal solution which minimizes the total cut size is obtained using dynamic programming. It is shown that the DLGP algorithm together with our new dual clocking method can significantly reduce power consumption, total area, and other important evaluation metrics of superconducting single flux quantum circuits. For example, experimental results show that if the depth constraint is equal to five, the reductions in static power consumption, JJ count, and DFF count are as high as 64%, 69%, and 3.9×, respectively.

Table 5.2: Experimental results for DLGP-based DCM, Baseline1 (FPB), and Baseline2 (FPB + retiming). Run-time is in seconds. For the DLGP-based dual clocking method (DCM), two cases of $p=10$ and $p=5$ are considered.
| Circuit | #DFFs Baseline1 | #DFFs Baseline2 | #DFFs DLGP(5) | #DFFs DLGP(10) | #JJs Baseline1 | #JJs Baseline2 | #JJs DLGP(5) | #JJs DLGP(10) | Run-time Baseline1 | Run-time Baseline2 | Run-time DLGP(5) | Run-time DLGP(10) |
| i10 | 12832 | 8382 | 3219 | 1872 | 170126 | 125626 | 124138 | 81713 | 3.66 | 7.24 | 0.248 | 0.227 |
| 9sym | 359 | 151 | 80 | 19 | 7445 | 5365 | 6805 | 4390 | 0.028 | 0.049 | 0.014 | 0.014 |
| c1908 | 1004 | 683 | 282 | 144 | 14783 | 11575 | 13169 | 8739 | 0.038 | 0.109 | 0.013 | 0.013 |
| c1355 | 614 | 442 | 193 | 119 | 9434 | 7718 | 8739 | 6149 | 0.041 | 0.057 | 0.033 | 0.033 |
| c432 | 847 | 655 | 224 | 118 | 12288 | 10368 | 10734 | 7124 | 0.0248 | 0.053 | 0.008 | 0.008 |
| c880 | 1345 | 772 | 362 | 187 | 18561 | 12831 | 14658 | 9483 | 0.045 | 0.0135 | 0.008 | 0.008 |
| c3540 | 2848 | 1220 | 776 | 282 | 47253 | 30937 | 43437 | 26897 | 0.302 | 0.549 | 0.028 | 0.028 |
| s13207 | 6165 | 4087 | 1795 | 571 | 114337 | 85862 | 106346 | 60766 | 1.356 | 2.28 | 0.331 | 0.252 |
| s5378 | 2124 | 1543 | 645 | 255 | 47955 | 40355 | 50766 | 34053 | 0.39 | 0.553 | 0.083 | 0.082 |
| s382 | 170 | 166 | 56 | 9 | 4500 | 4250 | 4448 | 2750 | 0.011 | 0.011 | 0.009 | 0.009 |
| voter | 18491 | 11114 | 7204 | 3732 | 496238 | 373559 | 447044 | 355144 | 28.42 | 33.67 | 0.785 | 0.755 |
| int2float | 539 | 277 | 117 | 39 | 9539 | 6919 | 7770 | 5140 | 0.014 | 0.034 | 0.006 | 0.006 |
| priority | 41771 | 22286 | 8562 | 4225 | 433581 | 238731 | 257252 | 158568 | 18.78 | 338.03 | 0.154 | 0.134 |
| add16 | 235 | 200 | 104 | 50 | 5411 | 5061 | 5726 | 3911 | 0.012 | 0.0016 | 0.0093 | 0.0094 |
| mult8 | 2301 | 734 | 513 | 235 | 28159 | 12489 | 19208 | 11117 | 0.100 | 0.412 | 0.023 | 0.023 |

Chapter 6

Synthesizing Sequential Circuits in SFQ Technology

This is the first dissertation that provides full algorithmic and tool support for synthesizing sequential circuits in superconducting SFQ technology. Synthesizing sequential circuits in this technology is a very challenging task which involves leveling directed cyclic graphs, handling feedback loops, and ensuring and maintaining the full path balancing property throughout the whole process. This enables synthesizing interesting circuits such as finite state machines and pipeline circuits.
We present a solid definition for the level of a node in a directed cyclic graph which captures the longest path length from leaf nodes to the target node (which is needed for ensuring the full path balancing property) and guarantees that we will not be trapped inside infinite loops. Then, we present a polynomial time algorithm for level assignment of nodes in these graphs based on the new level definition and show how it can be used for synthesizing sequential SFQ circuits. As a case study, we present the synthesis of a counter circuit using state-of-the-art SFQ technology and verify its correct functionality in simulations. The power consumption of a 3-bit version of this counter is 69.11 µW and 2.21 µW using RSFQ and ERSFQ cells, respectively. Also, its local clock frequency is 54.95 GHz (throughput of 9.16 GHz), which is several times higher than typical CMOS clock frequencies, e.g., 4 GHz. Finally, more experimental results on sequential circuits, e.g., ISCAS89, will be presented.

6.1 Introduction

Design and implementation of SFQ circuits have been mainly manual, with limited automation, mostly by using some CMOS Computer-Aided Design (CAD) tools with minimal changes. Recently, a good amount of research has been done as part of a US government funded program, called the SuperTools program, to make this process fully automated [47,67]. For example, an SFQ specific placement tool which works based on cell grouping and super-cell placement [133], an SFQ routing tool which offers via count minimization [96], automatic static timing analysis [161], clocking solutions [143], test [153], and verification [41] techniques for SFQ circuits have been presented.

For the logic synthesis of SFQ circuits, a few papers have also been published recently. In [21], the authors presented a majority-based logic synthesis tool for synthesizing Adiabatic Quantum Flux Parametron (AQFP) [145] circuits, which are a family of SFQ devices.
In [120, 121, 123, 125], SFQ specific technology-independent optimization functions and path balancing technology mapping algorithms and software tools are presented for synthesizing and mapping SFQ circuits efficiently. These synthesis algorithms and tools are focused on combinational circuits and do not provide much support for synthesizing sequential circuits, especially when there is a feedback loop in the circuit. The synthesis flow in these papers is as follows: first, technology-independent optimizations including balanced factorization and rewriting [120] are performed on the input circuit; then the circuit is mapped and fully path balanced [121,123,125], and a standard retiming algorithm similar to [92] is employed to further reduce the total memory element count. Next, the memory elements, e.g., latches, are replaced with SFQ Destructive Read Out (DRO) DFFs, and finally, splitters are inserted at the outputs of gates with more than one fanout.

In [122], the authors presented a Dual Clocking Method (DCM) for the realization of SFQ circuits. Given a linear pipeline architecture, DCM sandwiches each individual combinational block between two logic bands, namely the repeat and garbage collecting bands. A repeat band consists of SFQ Non-Destructive Read Out (NDRO) DFFs, and a garbage collecting band is built of 2-input AND gates. DCM can save a lot of area by removing the need for full path balancing. However, although DCM can realize pipeline circuits which have memory elements, it can handle neither sequential circuits without nice linear pipeline architectures nor circuits with feedback loops.

In the current work, we provide the needed algorithmic and tool support for synthesizing all different kinds of sequential circuits in SFQ technology, including those with feedback loops, circuits with a linear pipeline architecture, and circuits with memory elements in random places of the circuit.
This requires a new level definition and a level assignment algorithm for defining and calculating the levels of nodes in a directed cyclic graph; it also requires making the lengths of all feedback loops in the circuit equal in terms of the clocked circuit element count. Both are done in this chapter.

The rest of this chapter is organized as follows: Section 6.2 presents leveling of directed cyclic graphs. Section 6.3 makes use of the level assignment presented in Section 6.2 to provide full support for synthesizing sequential circuits in SFQ technology. Section 6.4 gives experimental results, and finally, Section 6.5 concludes the chapter.

6.2 Leveling Directed Cyclic Graphs

6.2.1 Terminology, Notation, and Basic Definitions

$G=(V,E)$: A directed graph (digraph) with vertex set $V$ and edge set $E$.
Node or Vertex: A member of set $V$. Node and vertex are used interchangeably in this chapter.
Edge: A member of set $E$.
$n(V)$: Cardinality of set $V$.
Leaf node: A node with in-degree of 0. The set of all leaf nodes of $G$ is denoted by $L(G)$.
Root node: A node with out-degree of 0. The set of all root nodes of $G$ is denoted by $R(G)$.
Internal node: A node that is not a root or leaf node.
Level of a node: A non-negative integer assigned to the node. It is denoted by $l(v_i)$ for node $v_i$.
Height of a digraph: The largest level among all nodes in the digraph.
Width of a digraph: The maximum number of nodes with the same level in the digraph.
Immediate fanin of $v_i$: Node $v_j$ is an immediate fanin of node $v_i$ when there is a directed edge $e=(v_j, v_i)$ in $E$ connecting $v_j$ to $v_i$.
Immediate fanout of $v_i$: Node $v_j$ is an immediate fanout of node $v_i$ when there is a directed edge $e=(v_i, v_j)$ in $E$ connecting $v_i$ to $v_j$.
Transitive Fanin Cone of $v_i$ ($TFI(v_i)$): The set of nodes $v_j$ which are reachable by walking in the opposite direction of edges, starting from node $v_i$, until leaf nodes are reached.
Transitive Fanout Cone of $v_i$ ($\mathrm{TFO}(v_i)$): The set of nodes $v_j$ which are reachable by walking in the direction of edges starting from node $v_i$ until root nodes are reached.
Directed path (path for short) $(V'_{s \to t}, s, t, G)$: A directed path with vertex set $V'_{s \to t}$ from source node $s$ to target node $t$ ($s \neq t$) in digraph $G = (V, E)$ is defined as the subset $V'_{s \to t} = \{v_1 = s, v_2, \ldots, v_k\}$ ($k \geq 1$) of set $V$ where there are no vertex repetitions, there exists an edge $e = (v_i, v_{i+1})$ in $E$ connecting $v_i$ to $v_{i+1}$ for $1 \leq i \leq k-1$, and $v_k$ is an immediate fanin of $t$. Note that when $t$ is an immediate fanout of $s$, $V'_{s \to t} = \{s\}$.
Length of a path $(V'_{s \to t}, s, t, G)$: $n(V'_{s \to t})$.
Feedback loop (loop for short): A cyclic directed path $(V'_{s \to s}, s, G)$ containing at least two vertices such that the first vertex $v_1$ is $s$ and the last vertex $v_k$ is an immediate fanin of $s$. Note that we do not consider self-loops, i.e., if $n(V'_{s \to s}) \leq 1$, then $(V'_{s \to s}, s, G) = \emptyset$.
Connected graph: A graph in which, by starting from any node and walking on the edges of the graph (without considering edge directions), any other node can be reached.

6.2.2 Prior Work

In [9], a complexity analysis for leveling cyclic digraphs and algorithms for level assignment in these graphs are presented. In the complexity analysis part, a few decision problems, including the three problems below, are defined.
Problem 1: Let $k \in \mathbb{N}$. Does there exist a leveling of $G$ with height at most $k$?
Problem 2: Let $w \in \mathbb{N}$. Does there exist a leveling of $G$ with width at most $w$?
Problem 3: Let $w, k \in \mathbb{N}$. Does there exist a leveling of $G$ with height at most $k$ and width at most $w$?
As in [9], the first two problems are easy, because the first one can be solved using the longest path search algorithm [78] and the second one is trivially solved by setting $w = 1$ and using indices of nodes as their levels.
The third problem in acyclic digraphs is, however, shown to be NP-hard because the known NP-hard precedence constraint scheduling problem can be reduced to this leveling problem [69]. For given height and width values $k, w \in \mathbb{N}$, the authors call $\phi: V \to \{1, 2, \ldots, k\}$ a level assignment function and $G = (V, E, \phi)$ a cyclic $k$-level graph. They present three heuristic level assignment techniques based on Breadth First Search (BFS), Minimum Spanning Tree (MST), and Force Based methods. We explain the first one, which is the most relevant to our work. In the BFS-based level assignment algorithm, first, an arbitrary node $v_i$ is selected and its level is set to 1 ($\phi(v_i) = 1$). Next, a directed BFS is performed from node $v_i$ to reach a node $v_j$, and its level is set to $\phi(v_j) = \mathrm{next}(\phi(v_i))$, where $\mathrm{next}(l) = (l \bmod k) + 1$; if there are already $w$ nodes with levels equal to $\mathrm{next}(\phi(v_i))$, then the next level (+1) with fewer than $w$ nodes is assigned to $v_j$. In this algorithm, having an edge $e = (v_i, v_j)$ does not impose any constraints on the leveling of nodes $v_i$ and $v_j$. Therefore, node $v_j$ may have a smaller, larger, or even the same level as node $v_i$.

6.2.3 Our Work

In graphs that represent Very Large Scale Integrated (VLSI) circuits, it is desirable to assign to a target node a level that represents the length of the longest path from any leaf node to the target node. For example, the logic level in combinational circuits with no feedback loops is defined as the length of the longest path, in terms of the gate count, from any primary input to the target gate [123]. In fact, the level of a node is not typically lower than the levels of nodes in its transitive fanin cone. This property is absent in the level assignment algorithms presented in [9]. Moreover, in defining and calculating levels for nodes in a cyclic graph, getting trapped in a loop should be avoided. We provide below a new definition for the level of a node in a cyclic digraph which satisfies all of these desirable features.
First, we need to define the notion of a feedback node in a cyclic digraph.

Definition 1 (feedback node): Let $V''_{L(G) \to v_i} = \{V'_1, V'_2, \ldots, V'_m\}$ denote the set of vertex sets of all paths from all leaf nodes to a node $v_i$. Node $v_i$ is a feedback node when $V'_j \cap \mathrm{TFO}(v_i) \neq \emptyset$, $\forall\, 1 \leq j \leq m$. In other words, in every path from every leaf node to $v_i$, there is at least one node that is reachable from $v_i$ by starting at $v_i$ and walking in the direction of edges.

Definition 2 (level of a node in cyclic digraphs): Let $G = (V, E)$ be a cyclic digraph and $U$ denote the set of all feedback nodes in $G$. Let $V''_{L(G) \to v_i} = \{V'_1, V'_2, \ldots, V'_m\}$ be the set of vertex sets of all paths from leaf nodes in $G$ to a node $v_i$. If $v_i$ is not in $U$, then the $V'_j$'s must additionally satisfy the condition that $V'_j \cap U = \emptyset$, $\forall\, 1 \leq j \leq m$. We call $V'''_i \subseteq V''_{L(G) \to v_i}$ the set of vertex sets of all valid paths from all leaf nodes to $v_i$. The level of $v_i$ is then defined as the maximum of the lengths of its valid paths (i.e., the maximum value of $n(V'_j)$ over all $V'_j$ in $V'''_i$).

Corollary 6.2.0.1 The level of all leaf nodes in $G$ is 0.

Example 16 Consider the cyclic digraph shown in Figure 6.1. Levels of leaf nodes a, b, and c are 0. Vertex sets of all paths from leaf nodes to each internal node are as follows:
$V''_{L(G) \to v_1}$: $\{\{b\}\}$
$V''_{L(G) \to v_2}$: $\{\{c\}, \{b, v_1, v_3, v_4, v_5\}, \{a, v_4, v_5\}\}$
$V''_{L(G) \to v_3}$: $\{\{b, v_1\}, \{c, v_2\}, \{a, v_4, v_5, v_2\}\}$
$V''_{L(G) \to v_4}$: $\{\{a\}, \{b, v_1, v_3\}, \{c, v_2, v_3\}\}$
$V''_{L(G) \to v_5}$: $\{\{a, v_4\}, \{b, v_1, v_3, v_4\}, \{c, v_2, v_3, v_4\}\}$

Figure 6.1: The cyclic digraph in Example 16. The triangular shapes on the left labeled a, b, and c are leaf nodes, and the one on the right side is the root node of the graph. Levels of internal nodes in this graph following Definition 2 are: $v_1$: 1, $v_2$: 1, $v_3$: 2, $v_4$: 3, $v_5$: 4.
The TFOs of internal nodes are:
$\mathrm{TFO}(v_1) = \{v_3, v_4, v_5, v_2\}$
$\mathrm{TFO}(v_2) = \{v_3, v_4, v_5\}$
$\mathrm{TFO}(v_3) = \{v_4, v_5, v_2\}$
$\mathrm{TFO}(v_4) = \{v_5, v_2, v_3\}$
$\mathrm{TFO}(v_5) = \{v_2, v_3, v_4\}$
It is seen that among all nodes, only for node $v_5$ is the intersection between its TFO set and the vertex set of any path from any leaf node to $v_5$ never an empty set. Therefore, only $v_5$ is a feedback node in this graph. For nodes $v_1$ to $v_4$, the valid path sets based on Definition 2 are:
$V'''_1 = \{\{b\}\}$
$V'''_2 = \{\{c\}\}$
$V'''_3 = \{\{b, v_1\}, \{c, v_2\}\}$
$V'''_4 = \{\{a\}, \{b, v_1, v_3\}, \{c, v_2, v_3\}\}$
Since $v_5$ is a feedback node, its valid path set is the same as the set of all paths from leaf nodes to this node, i.e., $V'''_5 = V''_{L(G) \to v_5} = \{\{a, v_4\}, \{b, v_1, v_3, v_4\}, \{c, v_2, v_3, v_4\}\}$.
Therefore, based on Definition 2, levels of internal nodes in this graph are as follows: $l(v_1)$: 1, $l(v_2)$: 1, $l(v_3)$: 2, $l(v_4)$: 3, $l(v_5)$: 4.

A naive algorithm for calculating levels of all nodes in $G$ based on Definition 2 is as follows: (i) find the set of all feedback nodes in $G$, (ii) find all paths from leaf nodes to a target node $v_i$ that do not go through a feedback node, (iii) find the maximum length of all paths found in step (ii), and (iv) repeat steps (ii) and (iii) for all nodes $v_i$ in $G$. This algorithm has an exponential time complexity; assuming that there are $k$ leaf nodes, $n$ internal nodes, and the maximum out-degree in the graph is $p$, the complexity will be $O(k p^n)$.

To resolve the exponential time complexity problem of the above algorithm, we present below a dynamic programming based algorithm that requires a topological sorting of all nodes in $G$. It is well known that standard topological sorting algorithms do not work for cyclic graphs. In fact, one of the applications of standard topological sorting is to detect cycles in graphs, and the algorithm stops when the first cycle is detected.
Therefore, we must also revise the standard topological sorting algorithm such that it can sort nodes of a given cyclic digraph and order them in the way required by our leveling algorithm. In the revised topological sorting algorithm, first, nodes with in-degrees of 0 are visited. Then, the visited nodes are removed (from a copy of the graph) and the in-degrees of their immediate fanout nodes are updated (i.e., reduced by 1). As a result, if $v_i$ is an immediate fanin of $v_j$, then $v_i$ will be visited first, and therefore it will come before $v_j$ in the topological ordering. However, different from the standard algorithm, in our topological sorting algorithm, if nodes $v_i$ and $v_j$ have the same in-degrees, the one that has at least one already visited immediate fanin node will be visited first.

Algorithm 8: Topological sorting of cyclic digraphs
Input: $G = (V, E)$, a cyclic digraph
Output: $V_{toposort}$
 1  $V_{toposort} = \{\}$
 2  $V_{unvisited} = V$
 3  $Q_0 = \{\}$
 4  Calculate in-degrees of all nodes in $V$.
 5  Enqueue($Q_0$, all nodes with in-degree 0)
 6  while $V_{unvisited} \neq \emptyset$ do
 7      if $Q_0 \neq \emptyset$ then
 8          $v_i$ = Dequeue($Q_0$)
 9          Append($V_{toposort}$, $v_i$)
10      else  // enter here only in case of a loop.
11          Temp = $\{\}$
12          for $v_j$ in $V_{unvisited}$ do
13              if at least one immed. fanin of $v_j$ is in $V_{toposort}$ then
14                  Append(Temp, $v_j$)
15          for $v_k$ in Temp do
16              if no immed. fanin of $v_k$ is in Temp then
17                  Append($V_{toposort}$, $v_k$)
18                  $v_i = v_k$
19                  break
20      Remove $v_i$ from $V_{unvisited}$
21      Update in-degrees of immed. fanouts of $v_i$
22      Enqueue($Q_0$, all nodes with in-degree 0)
23  return $V_{toposort}$

Algorithm 8 shows the pseudo code of our topological sorting algorithm. In lines 4 and 5, in-degrees of nodes are calculated and nodes with in-degrees of 0 are placed in a FIFO queue. The while loop in line 6 goes through nodes which have not been visited yet. We enter the else in line 10 in case of encountering a loop; in lines 12-14, those nodes with at least one immediate fanin that is already visited are added to the Temp set. In the for loop in line 15, a node that does not have an immediate fanin in Temp is selected and added to $V_{toposort}$. Given that the in-degree of nodes in VLSI circuits is usually less than five (constant), and by using hash tables for implementing $V_{toposort}$ and Temp, the complexities of the for loops in lines 12 and 15 will be $O(n)$. This makes the overall complexity of the while loop, and therefore the complexity of the whole sorting algorithm, $O(n^2)$.

Example 17 In sorting nodes of the graph shown in Figure 6.1, first, leaf nodes a, b, and c are visited and removed from (a copy of) the graph. This makes the in-degree of node $v_1$ equal to 0; therefore, it will be the next node to be visited. Then, there will be nodes $v_2$ to $v_5$, each with an in-degree of 1, forming a feedback loop. The for loop in line 12 of the pseudo code will add $v_2$ to $v_4$ to Temp. Notice that node $v_5$ will not be added to Temp because none of its immediate fanins have already been visited (i.e., none of its immediate fanins is already in $V_{toposort}$). Next, the for loop in line 15 will pick node $v_2$ and add it to $V_{toposort}$ because it is the (only) node without any of its immediate fanins in Temp. This will remove $v_2$ from the graph and make the in-degree of $v_3$ equal to 0, which makes it the next node to be visited. Subsequently, nodes $v_3$, $v_4$, and finally $v_5$ will be visited. Therefore, the topological order of internal nodes of this graph is: $v_1$, $v_2$, $v_3$, $v_4$, $v_5$.

To guarantee that Algorithm 8 will always be able to give a topological ordering for nodes of a cyclic digraph, we require that each feedback loop has at least one feedback node, which is the case in latch-based VLSI circuits (note that in this chapter we use the terms latch and flip-flop interchangeably). More precisely, there always exists a feedback latch in every loop in a VLSI circuit.
This latch serves as the (at least one) feedback node in any loop.

Lemma 6.2.1 Algorithm 8 finds a topological sorting of nodes of any connected directed graph $G = (V, E)$ which contains at least one leaf node.

Proof: The proof for the case of an acyclic digraph is easy: nodes with in-degree of 0 are visited first; removing these nodes always generates new nodes with in-degree of 0 until all nodes are visited. Notice that for an acyclic digraph we never enter the else in line 10 of the pseudo code. For a cyclic digraph, the worst case is that after a considered leaf node in the graph is visited and removed from the graph, the sorting algorithm hits a feedback loop and cannot decide which node to visit next. Let us assume that $\mathcal{L}$ is the first encountered feedback loop (i.e., it is closest to the leaf nodes of the graph). There will be at least one node $v_1 \in \mathcal{L}$ which either has a leaf node as one of its immediate fanins or can reach a leaf node by walking back in the opposite direction of edges without getting trapped in a feedback loop; this is because the graph is connected and it was assumed that $\mathcal{L}$ is the first feedback loop. Therefore, by the time we reach $v_1$, at least one of its immediate fanins is already visited. This makes $v_1$ eligible to be visited, thereby breaking the loop and allowing other nodes of the loop to be visited. With similar reasoning it can be shown that other feedback loops and the rest of the nodes will be visited without getting trapped in any loops.

Figure 6.2: A cyclic digraph used in the proof of Lemma 6.2.1.

Now assume that there are other nodes like $v_2, v_3 \in \mathcal{L}$ that also have one of their immediate fanins visited (Figure 6.2). The question is whether the algorithm can decide on the order of visiting these nodes. The for loop in line 12 will add $v_1$ to $v_3$ to Temp.
Consider a case where $v_1$, $v_2$, and $v_3$ are connected in a way that each of them has one other as its immediate fanin and is itself an immediate fanin of the third node (Figure 6.2 without $v_4$). This means that none of these nodes will be eligible to be picked in the for loop in line 15. However, this case cannot happen because of the existence of at least one feedback node in any feedback loop, including $\mathcal{L}$. A feedback node will never be added to Temp because for this to happen at least one of its immediate fanins must have been visited, and visiting that fanin, which has to be part of the feedback loop, will break the feedback loop, i.e., we do not enter the else in line 10. The feedback node not being in Temp makes its immediate fanout node ($v_1$ in Figure 6.2) eligible to be visited (line 16). Visiting $v_1$ will break the feedback loop and make it possible for the algorithm to visit other nodes of the loop.

Algorithm 9: Leveling cyclic digraphs
Input: $G = (V, E)$, a cyclic digraph
Output: A network pNtk representing graph $G$ with correct level assignments
 1  pNtk = Load $G$ into a network data structure.
 2  vNodes = Pointer to vector of nodes in pNtk.
 3  vNodes = Do topological sorting of nodes in vNodes using Algorithm 8.
 4  Set levels of leaf nodes in vNodes to 0.
 5  for each non-leaf node $v_i$ in vNodes do
 6      $L_{max} = 0$
 7      for each $v_j$ in immed. fanins of $v_i$ do
 8          if $v_j$.level > $L_{max}$ then
 9              $L_{max}$ = $v_j$.level
10      $v_i$.level = $L_{max} + 1$
11  return pNtk

Having a topological sorting of nodes in a given cyclic digraph, the following polynomial time algorithm can be used to assign levels to each node of the graph: (i) set levels of all leaf nodes to 0, (ii) visit nodes in the topological order, and (iii) set the level of a node equal to the maximum level among its immediate fanins plus one. This is similar to the well-known leveling algorithms used for leveling acyclic digraphs [38].
The main difference is the use of a topological sorting algorithm for cyclic digraphs in our method, which is absent from previous works. Algorithm 9 shows the pseudo code of the presented leveling algorithm. In lines 1 and 2, the given graph is loaded into a network structure and a pointer to its nodes is stored in vNodes. In line 3, the presented topological sorting algorithm for cyclic digraphs is used for sorting nodes of the given cyclic digraph. Line 4 sets levels of all leaf nodes of the graph to 0, and the for loop in line 5 goes through nodes of the graph in topological order and sets the level of each node equal to the maximum level among its immediate fanins plus one. The level values are stored inside the data structure of nodes; therefore, the returned pointer to the network has the information about correct level assignments. The complexity of lines 5-10 is linear time, and therefore the overall complexity of the algorithm is determined by line 3 to be $O(n^2)$.

Lemma 6.2.2 Let nodes of a cyclic digraph $G = (V, E)$ be topologically sorted by using Algorithm 8 and visited in that order. An immediate fanin $v_j$ of node $v_i$ will be visited before $v_i$ unless $v_j$ is a feedback node of some feedback loop in $G$.

Proof: If $v_i$ and $v_j$ are not part of a feedback loop, it is easily seen from the definition of topological sorting that $v_i$ will be visited only after its immediate fanins, including $v_j$, have been visited. If $v_i$ and $v_j$ are part of a feedback loop, line 16 of Algorithm 8 guarantees that $v_i$ will not be visited as long as $v_j$ has not been visited. Notice that if $v_j$ is a feedback node in a loop, then $v_i$ can be visited before $v_j$. For example, in Figure 6.2, node $v_1$ will be visited before $v_4$.

Lemma 6.2.3 The level of node $v_i$ is $L'_{max} + 1$, where $L'_{max}$ is the maximum of the level values of the immediate fanin nodes of $v_i$.

Proof (by contradiction): Based on Definition 2, the length of the longest path from any leaf node to the immediate fanin nodes of $v_i$ will be $L'_{max}$.
Consider that there is a path from a leaf node to $v_i$ with length $L'_{max} + \delta$ where $\delta \geq 2$. This means that there will be at least one path with length at least $L'_{max} + 1$ from a leaf node to some immediate fanin of $v_i$. Therefore, the maximum level among immediate fanin nodes will be at least $L'_{max} + 1 > L'_{max}$, which contradicts the assumption that $L'_{max}$ is the maximum level of the immediate fanins of $v_i$.

Theorem 6.2.4 Algorithm 9 gives the correct level values (based on Definition 2) for nodes of a connected cyclic digraph with at least one leaf node.

Proof: Algorithm 9 utilizes Algorithm 8 for sorting nodes and therefore for visiting them in the level assignment. Based on Lemma 6.2.1, Algorithm 8 will always be able to provide a topological sort for the graph in this theorem, and based on Lemma 6.2.2, immediate fanins of a node $v_i$ will be visited before the node itself and therefore they will have correct level values once $v_i$ is being visited. Lemma 6.2.3 guarantees this will give correct level values for node $v_i$. Therefore, the theorem is proven.

Figure 6.3: Three different categories of sequential circuits to be synthesized in SFQ technology. Cloud-like shapes are combinational logic, squares denote individual memory elements (latches), and rectangles are columns of memory elements. (a) A nice linear pipeline architecture with three combinational logic blocks, (b) a circuit with memory elements in random places, and (c) a circuit that has (linear) pipeline latches, latches placed in random locations in the circuit, and latches in feedback loops.
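The sorting and leveling procedures of Section 6.2 (Algorithms 8 and 9) can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this work: the function names, the dictionary-of-fanins graph encoding, and the choice to treat a fanin that has not been leveled yet (a feedback node) as level 0 are assumptions made for the sketch. The example graph is the one from Figure 6.1 with its edges read off the path sets of Example 16; the root node is omitted because its driver is not stated in the text.

```python
from collections import deque

def cyclic_toposort(fanins):
    """Order the nodes of a cyclic digraph following the steps of Algorithm 8.

    fanins maps every node to the list of its immediate fanins.
    """
    fanouts = {v: [] for v in fanins}
    for v, ins in fanins.items():
        for u in ins:
            fanouts[u].append(v)
    indeg = {v: len(ins) for v, ins in fanins.items()}
    order, visited = [], set()
    unvisited = list(fanins)                  # list keeps a deterministic order
    queue = deque(v for v in unvisited if indeg[v] == 0)
    queued = set(queue)
    while unvisited:
        if queue:
            v = queue.popleft()
        else:                                 # reached only when stuck in a loop
            temp = [x for x in unvisited
                    if any(u in visited for u in fanins[x])]
            in_temp = set(temp)
            v = next((x for x in temp
                      if not any(u in in_temp for u in fanins[x])), None)
            if v is None:
                raise ValueError("a loop without a feedback node was found")
        order.append(v)
        visited.add(v)
        unvisited.remove(v)
        for w in fanouts[v]:                  # update in-degrees of fanouts
            indeg[w] -= 1
            if indeg[w] == 0 and w not in visited and w not in queued:
                queue.append(w)
                queued.add(w)
    return order

def assign_levels(fanins):
    """Assign levels in topological order, as in Algorithm 9."""
    level = {}
    for v in cyclic_toposort(fanins):
        if not fanins[v]:
            level[v] = 0                      # leaf node
        else:
            # Assumption: a fanin without a level yet is a feedback node and
            # contributes 0, mirroring L_max starting at 0 in Algorithm 9.
            level[v] = max(level.get(u, 0) for u in fanins[v]) + 1
    return level

# Graph of Figure 6.1 (root node omitted, see the note above).
fig_6_1 = {
    "a": [], "b": [], "c": [],
    "v1": ["b"], "v2": ["c", "v5"], "v3": ["v1", "v2"],
    "v4": ["a", "v3"], "v5": ["v4"],
}
order = cyclic_toposort(fig_6_1)
levels = assign_levels(fig_6_1)
print(order)
print(levels)
```

On this graph the sketch visits the internal nodes in the order given in Example 17 and reproduces the levels listed in Example 16. The error raised in the loop branch corresponds to the requirement that every feedback loop contains at least one feedback node.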
6.3 Synthesizing Sequential Circuits in SFQ Technology

The general flow for synthesizing sequential SFQ circuits is similar to those presented in [121, 123], which target SFQ circuits with an underlying directed acyclic graph representation and comprise the steps of performing technology-independent optimizations, path balancing technology mapping, adding full path balancing latches, performing standard retiming [92] to minimize the latch count, and finally replacing latches with SFQ DRO DFFs and inserting splitters. The main differences are in the full path balancing step and its preceding level assignment step; we use the level assignment algorithm presented in Section 6.2 and a combination of full path balancing and partial path balancing methods [122], as well as a new method for balancing lengths of feedback loops. We classify sequential circuits into three categories and discuss each in a separate section below. Figure 6.3 shows an example circuit for each of these categories.

6.3.1 Pipeline Circuits

These circuits have at least two combinational blocks (two stages of a linear pipeline) separated by a column of memory elements, which we call architectural pipeline latches; any wire going from one stage to the other has to go through one of these pipeline latches. There are two methods for synthesizing these circuits in the SFQ technology.

6.3.1.1 Dual Clocking Method

In this case, one micro (fast) clock and one macro (slow) clock are used. Original gates and path balancing DFFs get the fast clock, and latches at the pipeline border (which will be replaced with SFQ DRO DFFs) get the slow clock. There are two ways of supporting the dual clocking architecture: full path balancing and partial path balancing. In the former case, each stage (individual block) is modeled as a big combinational logic block and is fully path balanced similar to prior art references [123, 125].
Next, the sequential depth of each individual block (i.e., the maximum number of clocked SFQ cells on any input-output path for the block), which also represents the highest level in that block, is determined. Let $D_{max}$ denote the maximum depth of all individual blocks in the circuit. For a stage $i$ with depth $D_i$, $D_{max} - D_i$ path balancing SFQ DRO DFFs should be inserted at each of its primary outputs. This operation makes the depths of all individual blocks equal to $D_{max}$. The last step can be skipped if a clock gating method similar to that presented in [114] is used to gate clock pulses of individual blocks with shorter depths once the valid data is generated at their outputs. Note that it takes $D_i$ fast clock cycles for an individual block of depth $D_i$ to have valid data at its outputs, and because of clock gating in block $i$, no spurious outputs will be generated and fed into block $i + 1$ during the final $D_{max} - D_i$ clock cycles. In the case of partial path balancing, input repetition is required for correct operation of the SFQ circuit. For this purpose, pipeline latches should be replaced by NDROs and a garbage collecting SFQ AND gate should be inserted at each output of an individual block (see Figure 1 in [122]).

In the dual clocking method presented above, the local (fast) clock may be hidden so that at the system level only the slow clock is observed, which makes the SFQ circuit look similar to an architecturally pipelined CMOS circuit. Also, the dual clocking method can save a lot of area by eliminating many path balancing DFFs; however, it can degrade the peak throughput of the circuit (the interested reader may refer to [122] for details).

6.3.1.2 Single Clocking Method

In this method, only a fast clock is used and the circuit is fully path balanced, which resolves the peak throughput degradation issue associated with the dual clocking method at the expense of the large area overhead of the path balancing DFFs.
The single clocking method can also be done using two different approaches: removing architectural pipeline latches, or keeping architectural pipeline latches. In the first approach, after removing the said pipeline latches, the circuit turns into a large combinational block with primary inputs being the inputs of the first stage of the pipeline and primary outputs being the outputs of the last pipeline stage. Next, this combinational block is fully path balanced. In the second approach, architectural pipeline latches are kept and get a logic level assignment similar to any other clocked logic cell in the SFQ circuit. Subsequently, the whole circuit is fully path balanced. Keeping the pipeline latches might have some benefits in circuit verification because it preserves the overall pipeline architecture of the circuit as envisioned by the system designers. It will, however, potentially increase the input-to-output latency of the full circuit because of the redundant pipeline latches that have been preserved.

6.3.2 Circuits with Arbitrary Latch Placement

For these circuits, first, the level assignment algorithms presented in Section 6.2 may be used to assign correct levels to all cells (both clocked logic cells and latches) in the circuit. Next, the second approach presented in Section 6.3.1.2, i.e., keeping architectural pipeline latches, can be used to fully path balance the circuit.

6.3.3 Circuits with Feedback Loops

Similar to the previous category, we should use the algorithms presented in Section 6.2 for level assignment. Also, for correct functionality, lengths of all feedback loops in the circuit should be the same. Therefore, if there is a feedback loop with a length smaller than the maximum loop length in the circuit, some latches should be inserted into the shorter loops for equalization. The challenge is how this should be done such that it does not result in violating the full path balancing property for the rest of the circuit.
As mentioned in Section 6.2, in each feedback loop in a VLSI circuit there is a feedback latch. A feedback latch always has one fanin, and every feedback loop has to go through a feedback latch. These properties are used in the leveling of these circuits and for equalizing lengths of feedback loops.

To assign levels to circuits of the third category while equalizing lengths of feedback loops in these circuits, we present the following algorithm: (i) use Algorithm 9 for assigning levels to all nodes, (ii) temporarily set levels of feedback latches to 0 and only update levels of feedback nodes accordingly, (iii) perform full path balancing, which means that if the levels of node $v_i$ and one of its fanin nodes $v_j$ are $L_i$ and $L_j$, respectively, insert $(L_i - L_j - 1)$ latches between $v_j$ and $v_i$, (iv) reset levels of feedback latches to the values calculated by Algorithm 9, and finally, (v) find the maximum level among feedback latches ($L_{max}$), and insert $L_{max} - L_i$ latches between a feedback latch with level $L_i$ and its fanin node. The third step equalizes lengths of all loops going through a particular feedback latch; this length will be the same as the level of the feedback latch. The fifth step equalizes lengths of all loops in the circuit. We refer to the whole process as balancing a VLSI circuit with feedback loops.

Algorithm 10: Balancing a VLSI circuit with feedback loops
Input: $G = (V, E)$, a cyclic digraph representing a VLSI circuit
Output: A network pNtk representing graph $G$ with correct level values and equalized loops
 1  pNtk = Load $G$ into a network data structure.
 2  pNtk = Use Algorithm 9 to assign levels to nodes.
 3  fLatches = Find feedback latches and set their levels to 0.
 4  Update levels of feedback nodes (excluding feedback latches).
 5  vNodes = Pointer to vector of nodes in pNtk.
    // full path balancing:
 6  for each node $v_i$ in vNodes do
 7      for each fanin node $v_j$ do
 8          if $v_j$.level + 1 < $v_i$.level then
 9              Add ($v_i$.level - $v_j$.level - 1) latches on the edge between $v_j$ and $v_i$.
10  Reset levels of feedback latches back to values calculated in line 2.
    // equalizing lengths of all loops going through different feedback latches:
11  $L_{max}$ = Find the maximum level of all feedback latches.
12  for each latch $v_i$ in fLatches do
13      if $v_i$.level < $L_{max}$ then
14          Insert ($L_{max}$ - $v_i$.level) latches on the edge between $v_i$ and its fanin node.
15  return pNtk

Algorithm 10 shows the pseudo code for balancing a VLSI circuit with feedback loops. In line 1, the cyclic digraph representing the given VLSI circuit is loaded into a network structure. In line 2, levels of all internal nodes are calculated using Algorithm 9; as seen before, this has a complexity of $O(n^2)$. Line 3 finds all feedback latches and temporarily sets their levels to 0; the complexity of this operation is $O(n^2)$, because it takes $O(n)$ to go through all nodes, and for each node, checking whether it is in a loop requires a BFS or Depth First Search (DFS) (which have linear time complexities). In line 4, Algorithm 9 (without sorting) should be used to update the levels of feedback nodes after setting levels of feedback latches to 0; the complexity of this operation is $O(n)$. In lines 6-9, a full path balancing operation is performed, which has linear time complexity (note that because the maximum in-degree of logic cells is fixed and known a priori, the for loop in line 7 is done in constant time). After this step, lengths of all feedback loops going through the same feedback latch are the same. Next, lines 10-14 equalize lengths of all feedback loops in the circuit. Complexities of lines 10, 11, and 12 are all linear in the circuit size. Therefore, the total complexity of Algorithm 10 is quadratic, i.e., $O(n^2)$.

Lemma 6.3.1 Algorithm 10 equalizes lengths of all loops going through the same feedback latch.
Proof: For simplicity and without loss of generality, as shown in Figure 6.4, let us assume that there are three loops going through a feedback latch named $v_7$. Since all these loops go through $v_7$ and because $v_7$ has only one fanin, these loops have to eventually meet at a node like $v_6$. Before that, the first and the second loops meet at a node like $v_5$. Now, since in line 3 of Algorithm 10 the level of $v_7$ is set to 0, and because of the property of full path balancing, an equal number of gates and/or path balancing latches is seen in both of these loops up to node $v_5$. Since the rest of the path for these two loops until reaching $v_7$ is the same, these two loops have the same length. With similar reasoning and considering the fact that all loops eventually meet at a single node that fans into the feedback latch ($v_6$), lengths of all loops will be the same.

Figure 6.4: Part of a VLSI circuit showing all feedback loops going through the feedback latch $v_7$. As Lemma 6.3.1 states, after applying Algorithm 10, lengths of all of these loops will be the same.

Example 18 In the circuit shown in Figure 6.5, there are two feedback loops, which go through the feedback latch with Id of 7. These loops are (for now, please ignore the two added squares that serve as balancing latches): $7 \to 3 \to 2 \to 4 \to 5$ and $7 \to 6 \to 5$. Algorithm 10 adds the said balancing latches to the circuit. This makes the lengths of both loops equal to five.

Figure 6.5: A VLSI circuit with one feedback latch and two feedback loops ($7 \to 3 \to 2 \to 4 \to 5$ and $7 \to 6 \to 5$). The green squares are balancing latches which are added by Algorithm 10, resulting in equalizing lengths of both loops. This is part of a real circuit. [Node labels in the figure: 1 not, 2 or2, 3 not, 4 and2, 5 and2, 6 or2, 7 Latch; input en; output Count[0].]
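Steps (ii) to (v) of the balancing procedure can be sketched in Python on the two loops of Example 18. This is a simplified illustration under stated assumptions, not the implementation used in this work: the function name and graph encoding are hypothetical, only the loop edges listed in Example 18 are modeled (node 1 and the primary inputs are left out because their connections are not given in the text), and the level of a feedback latch is taken as the level of its only fanin plus one, which stands in for a full run of Algorithm 9.

```python
def balance_with_feedback(fanins, feedback_latches):
    """Count the balancing latches that the steps of Algorithm 10 would add.

    fanins maps each node to its immediate fanins; every feedback latch is
    assumed to have exactly one fanin and every loop to contain a feedback
    latch. Returns (levels, latches_per_edge).
    """
    level = {}

    def level_of(v):
        # Step (ii): a feedback latch temporarily counts as level 0. This
        # also cuts every loop, so the recursion terminates.
        if v in feedback_latches or not fanins[v]:
            return 0
        if v not in level:
            level[v] = max(level_of(u) for u in fanins[v]) + 1
        return level[v]

    inserted = {}
    # Step (iii): full path balancing on every edge into a non-latch node.
    for v in fanins:
        if v in feedback_latches:
            continue
        for u in fanins[v]:
            gap = level_of(v) - level_of(u) - 1
            if gap > 0:
                inserted[(u, v)] = gap
    # Step (iv), simplified: a feedback latch sits one level above its fanin.
    latch_level = {f: level_of(fanins[f][0]) + 1 for f in feedback_latches}
    # Step (v): pad loops through lower latches up to the maximum latch level.
    l_max = max(latch_level.values())
    for f, lv in latch_level.items():
        if lv < l_max:
            edge = (fanins[f][0], f)
            inserted[edge] = inserted.get(edge, 0) + (l_max - lv)
    level.update(latch_level)
    return level, inserted

# Loop edges of Figure 6.5: 7 -> 3 -> 2 -> 4 -> 5 -> 7 and 7 -> 6 -> 5 -> 7.
fig_6_5 = {7: [5], 3: [7], 2: [3], 4: [2], 5: [4, 6], 6: [7]}
levels, inserted = balance_with_feedback(fig_6_5, {7})
print(levels)
print(inserted)
```

For this circuit the sketch places two balancing latches on the edge from node 6 to node 5 and none elsewhere, so both loops contain five clocked elements, in line with Example 18.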
Lemma 6.3.2 After applying Algorithm 10 to a circuit, lengths of all loops going through a feedback latch $v_i$ will be the same as the level of this latch.

Proof: Let the fanin node of $v_i$ be $v_j$ with level $l(v_j) = L_j$. Consider a loop $\mathcal{L}$. This loop starts from latch $v_i$, goes through different nodes, and eventually arrives at $v_j$. Based on Definition 2, considering the property of full path balancing, and since in line 3 of Algorithm 10 the level of $v_i$ is temporarily set to 0, we should have seen $L_j$ nodes (including $v_i$ and excluding $v_j$) and/or balancing latches in this loop. Counting $v_j$ itself, the length will be $L_j + 1$, which is equal to the level of the feedback latch $v_i$. According to Lemma 6.3.1, this applies to all loops going through $v_i$, and therefore the lengths of all of them will be the same as the level of $v_i$.

Example 19 After applying Algorithm 10 to the circuit shown in Figure 6.5, levels of nodes and the feedback latch will be: $l(v_1)$: 1, $l(v_2)$: 2, $l(v_3)$: 1, $l(v_4)$: 3, $l(v_5)$: 4, $l(v_6)$: 1, $l(v_7)$: 5. As seen in Example 18, lengths of both loops in this circuit will be equal to five, which is the same as the level of the feedback latch.

Theorem 6.3.3 Algorithm 10 equalizes lengths of all loops in the circuit, where the loop length is the maximum level among all feedback latches in the circuit.

Proof: Based on Lemmas 6.3.1 and 6.3.2, lengths of all loops going through a particular feedback latch will be the same as the level of this latch. The only remaining thing that needs proof is that lengths of loops going through different feedback latches will also be the same. Since in line 14 of Algorithm 10 the balancing latches are inserted between a feedback latch $v_i$ and its only fanin node, all loops that were going through this feedback latch will go through the added balancing latches as well.
This means that the lengths of all of these loops will increase by $L_{max} - L_i$, in which $L_{max}$ is the maximum level among all feedback latches and $L_i$ is the level of the feedback latch $v_i$. Therefore, the lengths of all loops going through $v_i$ will be the same as the maximum loop length in the circuit. The same argument applies to the loops going through every other feedback latch. Therefore, Algorithm 10 makes the lengths of all loops equal, and based on Lemma 6.3.2 that common value is $L_{max}$.

Once the process of balancing the circuit is completed, either one of the following clocking methods may be used:

6.3.3.1 Single Clocking Method

All gates and DFFs receive a fast clock. New signals are applied to the primary input ports of the circuit only every $L_{max}$ cycles, and outputs are sampled at the same rate as the input initiation rate.

6.3.3.2 Dual Clocking Method

Feedback latches receive a slow clock, and garbage collecting AND gates are inserted between feedback latches and their only immediate fanin nodes. These AND gates get the slow clock as one of their inputs. The ratio of the frequencies of the slow and fast clocks is set to $\frac{1}{L_{max}+1}$. If partial path balancing is the design approach to be used, then the feedback latches are replaced by NDROs.

Figure 6.6: The Verilog code for the counter circuit, yosys-counter, used in Section 6.4.1.

6.4 Experimental Results

After parsing an input circuit, balanced factorization and rewriting algorithms are applied [120] and then path balancing technology mapping [121, 123, 125] is done. For comparison, we also used supergates of level three [106] for generating technology mapped results. Using supergates, it is possible to look deeper into the circuit and therefore provide better mapping solutions, especially in terms of the maximum loop length for an input circuit with feedback loops.
After performing the technology mapping, the circuit is given to Algorithm 10 for level assignment, path balancing, and feedback loop equalization. A user can choose to perform standard retiming [92] in order to further reduce the total latch count; in this case some of the original latches (such as pipeline latches) may be removed during the retiming process. Finally, latches are replaced with SFQ DRO DFFs, and splitters are added to the outputs of gates with more than one fanout. In the following, we discuss the synthesis of a counter circuit with detailed results, followed by experimental results on the ISCAS89 [17] sequential benchmark circuits.

Figure 6.7: A 1-bit version of the counter in Figure 6.6 after applying technology-independent optimizations, technology mapping, and Algorithm 10. This circuit has only one feedback loop of length three: 4 → 1 → 3. Note that the worst stage delay in this implementation is determined by the intrinsic delay of an and2 gate, which is 8.4 ps.

Figure 6.8: Waveforms showing correct functionality of the 1-bit counter in Figure 6.7. At first, since the reset signal (rst) is 1, the counter is reset and count becomes 0. Then, the enable signal (en) becomes 1 and causes count to get a value of 1 after three clock cycles. Note that for this implementation we should apply inputs and sample outputs every three fast clock cycles. For each pulse on the en pin (one every three clock cycles), the value on the count output is inverted, i.e., if it is 0 it becomes 1 and vice versa. This is how a 1-bit counter is supposed to work. The counter preserves its previous counted value when there is no pulse on en.

6.4.1 Case Study: A Counter Circuit

We have used a counter design with an enable input, en, a reset input, rst, and a clock input, clk. The Verilog code for this counter circuit, which is called yosys-counter¹, is shown in Figure 6.6.
By default, it is a 3-bit counter; however, in this section we present synthesis results for 1-4-bit versions of this counter. The operation of this counter in CMOS technology is as follows: at the positive edge of the clock, if the reset signal is 1, the counter's output gets a value of 0. Otherwise, as long as the enable signal is 1, the counter counts up by 1 at each clock cycle until it overflows, and then it starts over. The counter preserves its already counted value if the enable input pin is 0 at the positive edge of the clock. The operation of this counter in superconducting SFQ technology is similar, but with some differences, which are explained in the following: once the clock pulse arrives, the output is reset to 0 if there was a pulse on the reset pin before the arrival of this clock pulse; otherwise, the counter counts up by 1 if a pulse is present on the enable pin before the arrival of the clock pulse; if no pulse arrives on the enable pin before the arrival of the clock pulse, the counter preserves its current counted value. Also, similar to CMOS, the counter counts up by 1 until it reaches the maximum value, and then it overflows and starts over.

¹ https://github.com/YosysHQ/yosys/blob/master/examples/cmos/counter.v

Figure 6.9: A 2-bit version of the counter in Figure 6.6 after applying technology-independent optimizations, technology mapping, and Algorithm 10. The green squares (path balancing DFFs) are added by Algorithm 10 for balancing paths and for equalizing the lengths of feedback loops. This circuit has four feedback loops, with a maximum length of five for the first one: 7 → 3 → 2 → 4 → 5, 7 → 6 → 5, 15 → 14 → 11 → 13, and 15 → 10 → 12 → 13.
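The pulse-based behavior of the counter described above can be summarized in a small behavioral model. This is only a sketch for illustration, not part of the synthesis flow: the class name and its interface are hypothetical, and each call to `step` stands for one application of inputs (not one fast clock cycle).

```python
class SfqCounterModel:
    """Behavioral model of the n-bit counter (hypothetical helper)."""

    def __init__(self, width):
        self.width = width
        self.count = 0

    def step(self, rst_pulse, en_pulse):
        """Apply one set of inputs and return the new count.

        rst_pulse / en_pulse tell whether a pulse arrived on the pin
        before the clock pulse; reset has priority over enable.
        """
        if rst_pulse:
            self.count = 0
        elif en_pulse:
            # Count up and wrap around on overflow.
            self.count = (self.count + 1) % (1 << self.width)
        # With no pulse on either pin the counted value is preserved.
        return self.count


counter = SfqCounterModel(width=2)
stimulus = [(True, False), (False, True), (False, True),
            (False, False), (False, True), (False, True)]
trace = [counter.step(rst, en) for rst, en in stimulus]
print(trace)  # [0, 1, 2, 2, 3, 0]
```

The trace shows the reset, three increments with one hold in between, and the overflow from 3 back to 0 for a 2-bit counter.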
As shown in this section, the implementation of this counter in SFQ technology requires that inputs be applied once every few clock cycles and that outputs be collected at the same rate.

We have used an SFQ cell library [45] that contains the following gates: and2 with 16 JJs, or2 with 14 JJs, xor2 with 18 JJs, not with 12 JJs, DRO DFF with 11 JJs, NDRO DFF with 18 JJs, and splitter with 6 JJs. These cells are available in both Rapid Single Flux Quantum (RSFQ) [95] and Energy-efficient RSFQ (ERSFQ) [84] technologies.

Figure 6.10: Waveforms showing correct functionality of the 2-bit counter in Figure 6.9. At first, since the reset signal (rst) is 1, the counter is reset and both count[0] and count[1] become 0. Then, the enable signal (en) becomes 1 and causes count[0] to get a value of 1 after five clock cycles. Note that for this implementation we should apply inputs and sample outputs every five fast clock cycles. The counter counts up by 1 for every new pulse on en until it reaches 11, and then it overflows to 00 and starts over. The counter preserves its previous counted value when there is no pulse on en.

After applying technology-independent optimization functions, performing technology mapping, and applying Algorithm 10, the 1-bit, 2-bit, and 3-bit versions of this counter will be as in Figure 6.7, Figure 6.9, and Figure 6.11, respectively. The 1-bit counter has only one feedback loop of length three (4 → 1 → 3) and, unlike the 2-bit and 3-bit versions, does not require any loop length equalization or path balancing. The green squares in Figures 6.9 and 6.11 are path balancing latches added by Algorithm 10. The 2-bit counter has four feedback loops, two of which go through the feedback latch 7, while the other two go through the feedback latch 15. The 3-bit counter has six feedback loops: two of them go through the feedback latch 7, two go through the feedback latch 15, and the other two go through the feedback latch 23.
The maximum loop lengths are five and six for the 2-bit and 3-bit versions, respectively. In the 2-bit version, after equalizing the lengths of feedback loops that go through the same feedback latch (lines 1-9 of Algorithm 10), there is no need to equalize the lengths of loops that go through different feedback latches, because they are all already at the maximum value. However, in the 3-bit version, a latch is added between node 5 and latch 7 (Figure 6.11) to bring the lengths of the loops going through feedback latch 7 up to the maximum loop length in the circuit, which is six. The 4-bit version of the counter has four feedback latches and eight feedback loops, with a maximum feedback loop length of seven. Figures 6.8, 6.10, 6.12, and 6.13 show waveforms verifying the correct functionality of the 1-4-bit versions of the counter. Note that for the 1-bit, 2-bit, and 3-bit implementations of the counter shown in Figures 6.7, 6.9, and 6.11, inputs should be applied every 3, 5, and 6 fast clock cycles, respectively, and the correct outputs are collected at the same rate. This factor is 7 for the 4-bit counter.

If supergates are used in the technology mapping step, the 1-4-bit versions of the counter will be as in Figures 6.14, 6.15, 6.17, and 6.19. The 1-bit version, Figure 6.14, still has only one feedback latch and one feedback loop, with a length of three. It does not need any path balancing and/or loop length equalization. Comparing the implementations in Figure 6.7 and Figure 6.14, the worst stage delay is lower in the latter implementation. This is because the worst-case stage delay is determined by an and2 gate with an intrinsic delay of 8.4 ps in the former implementation, and by an or2 gate with an intrinsic delay of 6.1 ps in the latter. Given that both circuits have the same maximum level for feedback latches, the throughput of the second implementation is better.
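Returning to the regular-mapper 3-bit counter discussed above, the final equalization step across different feedback latches amounts to inserting $L_{max} - L_i$ latches in front of each feedback latch $v_i$. The sketch below illustrates only that arithmetic. The levels used for latches 7 and 23 (five and six) are read from the text and from the loop lengths given for Figure 6.11; latch 15 is left out because its level is not stated explicitly, and the function name is hypothetical.

```python
def equalizing_latches(latch_levels):
    """Return L_max and the latches to insert before each feedback latch."""
    l_max = max(latch_levels.values())
    extra = {latch: l_max - level for latch, level in latch_levels.items()}
    return l_max, extra


# Feedback latch Id -> level after per-latch loop balancing (assumed values).
levels_3bit = {7: 5, 23: 6}
l_max, extra = equalizing_latches(levels_3bit)
print(l_max, extra)  # 6 {7: 1, 23: 0}
```

One latch in front of feedback latch 7 matches the latch that is added between node 5 and latch 7, while the latch that already sits at the maximum level needs none.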
Comparing Figure 6.9 with Figure 6.15 and Figure 6.11 with Figure 6.17, it is seen that using supergates considerably reduces the total node count and the total number of DFFs, and, more importantly, reduces the maximum level among feedback latches and hence the maximum loop lengths in the latter circuits. This means that the circuits generated by using supergates have better maximum throughput as well. Using supergates also decreased the number of feedback loops in the 2-4-bit versions of the counter. For example, the 2-bit version has only two feedback loops (4 → 1 → 3 and 8 → 6 → 7) instead of the four feedback loops of the previous version shown in Figure 6.9. For the 3-bit version, the number of feedback loops decreased from six to three, as shown in Figure 6.17: 3 → 1 → 2, 9 → 6 → 8, and 13 → 11 → 12. For the 4-bit version, the number of feedback loops decreased from eight to five, as shown in Figure 6.19: 3 → 1 → 2, 9 → 6 → 8, 13 → 11 → 12, 22 → 21 → 20 → 19, and 22 → 18 → 17 → 20 → 19. Figures 6.16, 6.18, and 6.20 show the waveforms verifying the correct functionality of the 2-4-bit versions of the counter generated by using supergates. Notice that in these waveforms, inputs are applied at a faster rate compared with those of Figures 6.10, 6.12, and 6.13 (outputs are also collected at similarly faster rates). Note that the waveforms for the 1-bit counter in Figure 6.14 are the same as the waveforms shown in Figure 6.8.

Table 6.1: Experimental results for 1-4-bit versions of the counter circuit presented in Section 6.4.1, generated by applying technology-independent optimizations, technology mapping, and Algorithm 10. $L_{max}$ is the maximum level among feedback latches, which is the same as the maximum loop length in the circuit. The 1-3-bit versions are shown in Figures 6.7, 6.9, and 6.11.

Circuit         Area (µm²)   #JJs   Power (µW)        Freq. (GHz)              L_max
                                    RSFQ     ERSFQ    local clk   throughput
1-bit counter   19500        65     6.50     0.21     75.76       25.25        3
2-bit counter   104000       307    37.39    1.19     75.76       15.15        5
3-bit counter   190000       541    69.11    2.21     54.95       9.16         6
4-bit counter   305000       866    11.38    3.56     54.35       7.76         7

Table 6.2: Experimental results for 1-4-bit versions of the counter circuit presented in Section 6.4.1, generated by applying technology-independent optimizations, technology mapping (using supergates), and Algorithm 10. $L_{max}$ is the maximum level among feedback latches, which is the same as the maximum loop length in the circuit. These circuits are shown in Figures 6.14, 6.15, 6.17, and 6.19.

Circuit         Area (µm²)   #JJs   Power (µW)        Freq. (GHz)              L_max
                                    RSFQ     ERSFQ    local clk   throughput
1-bit counter   21000        72     7.31     0.23     163.93      54.64        3
2-bit counter   66500        217    24.39    0.78     119.04      29.76        4
3-bit counter   122000       397    44.72    1.43     54.95       10.99        5
4-bit counter   198000       642    73.17    2.34     54.35       10.87        5

Table 6.1 shows the total area (including the area of gates, path balancing DFFs, and splitters), Josephson junction count (#JJs), maximum local clock frequency, and throughput for the 1-4-bit versions of the counter presented in this section. Table 6.2 shows the same for the circuits generated by using supergates. The local clock frequency is determined by the largest post-synthesis stage delay, which is equal to the intrinsic delay of a gate plus the delay of the splitter tree added to its output. As mentioned in Section 6.3.3.1, outputs should be collected every $L_{max}$ fast clock cycles, where $L_{max}$ is the maximum level among feedback latches. This determines the throughput of the circuit. In other words, the throughput is calculated by dividing the maximum local clock frequency by $L_{max}$.
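As a quick check of the relation between local clock frequency, $L_{max}$, and throughput, the following sketch recomputes the throughput columns from the frequency and $L_{max}$ values reported in Tables 6.1 and 6.2. The function and variable names are hypothetical; the numbers are copied from the two tables.

```python
def throughput_ghz(local_clk_ghz, l_max):
    """Throughput when new inputs are applied once every l_max fast clock cycles."""
    return local_clk_ghz / l_max


# (maximum local clock frequency in GHz, L_max), copied from Tables 6.1 and 6.2.
REGULAR = {"1-bit": (75.76, 3), "2-bit": (75.76, 5),
           "3-bit": (54.95, 6), "4-bit": (54.35, 7)}
SUPERGATES = {"1-bit": (163.93, 3), "2-bit": (119.04, 4),
              "3-bit": (54.95, 5), "4-bit": (54.35, 5)}

for bits in REGULAR:
    regular = throughput_ghz(*REGULAR[bits])
    supergates = throughput_ghz(*SUPERGATES[bits])
    print(f"{bits}: regular {regular:.2f} GHz, supergates {supergates:.2f} GHz")
```

For the 3-bit and 4-bit counters the local clock frequency is the same with both mappers, so the throughput gain there comes entirely from the smaller $L_{max}$.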
As seen, the circuits that are generated by using supergates have much better maximum throughput values, because they have smaller values for the maximum level among feedback latches. As shown in Table 6.2, the 4-bit counter can operate with a high throughput of 10.87 GHz. To better compare the area, #JJs, and power consumption of the counters generated by the regular mapper with those generated using supergates, we depict some of the data of Tables 6.1 and 6.2 in Fig. 6.21 and Fig. 6.22.

6.4.2 ISCAS89 Benchmark Circuits

Table 6.3: Experimental results for the first 15 benchmark circuits in the ISCAS89 [17] suite. $L_{max}$ is the maximum loop length in the circuit.

Circuit   Area (µm²)   #JJs    Power (µW)          Freq. (GHz)              L_max
                               RSFQ      ERSFQ     local clk   throughput
s27       188000       527     66.66     2.13      62.11       8.87         7
s298      1344000      3698    491.05    15.7      35.46       3.94         9
s344      2054000      5475    748.77    23.95     35.46       3.55         10
s349      2054000      5475    748.77    23.95     35.46       3.55         10
s382      1782000      4934    650.4     20.8      35.46       3.55         10
s386      1587500      4580    582.11    18.62     35.46       3.94         9
s400      1862000      5144    678.86    21.71     35.46       3.55         10
s420.1    2007500      5661    727.64    23.27     42.74       4.27         10
s444      1854000      5148    676.42    21.63     35.46       3.55         10
s510      2434000      7085    890.24    28.47     35.46       3.22         11
s526      2281500      6365    834.14    26.68     35.46       3.94         9
s641      4560500      11462   1651.2    52.81     35.46       2.09         17
s713      4538500      11421   1643.07   52.55     35.46       2.09         17
s820      3455000      9797    1264.22   40.43     35.46       2.96         12
s832      3385500      9641    1239.01   39.63     35.46       2.96         12
s838.1    4549500      12710   1652.02   52.83     42.74       3.89         11

We have experimented on the sequential benchmark circuits of ISCAS89 [17]. Using the same RSFQ and ERSFQ cells as in Section 6.4.1, the experimental results for the first 15 benchmark circuits are listed in Table 6.3. As seen, these results include total area (gates, path balancing DFFs, and splitters), Josephson junction count (#JJs), maximum local clock frequency, and throughput. The results range from 188000 to 4560500 µm² for area and from 527 to 12710 for #JJs.
The power consumption ranges from 66.66 to 1652.02 µW when RSFQ cells are used and from 2.13 to 52.83 µW when ERSFQ cells are used. Also, the local clock frequency and throughput values are as high as 62.11 GHz and 8.87 GHz, respectively, for the s27 benchmark circuit. Each element in the last column of Table 6.3 shows the length of the longest feedback loop in the corresponding benchmark circuit, which is the same as the maximum level among feedback latches in the circuit; after applying Algorithm 10, the lengths of all loops are equal to this value.

6.5 Conclusions

In this chapter we presented an algorithm for sorting the nodes of a directed cyclic graph and then used it to provide a polynomial time algorithm for level assignment in these graphs. This allowed us to provide full algorithmic and software tool support for synthesizing sequential circuits in the superconducting Single Flux Quantum (SFQ) technology. For the first time, we could synthesize a finite state machine, a counter circuit, in SFQ technology with post-synthesis power consumption as low as 2.21 µW, a clock frequency of 54.95 GHz, and a throughput of 9.16 GHz. Correct functionality of the 1-4-bit versions of this counter circuit is verified in exhaustive simulations. Finally, more experimental results on ISCAS89 sequential benchmark circuits were presented.

Figure 6.11: A 3-bit version of the counter in Figure 6.6 after applying technology-independent optimizations, technology mapping, and Algorithm 10. The green squares (path balancing DFFs) are added by Algorithm 10 for balancing paths and for equalizing the lengths of feedback loops. This circuit has six feedback loops, with a maximum length of six for the last one: 7 → 3 → 2 → 4 → 5, 7 → 6 → 5, 15 → 14 → 10 → 12 → 13, 15 → 11 → 13, 23 → 20 → 18, and 23 → 22 → 21 → 19 → 17 → 18.

Figure 6.12: Waveforms showing correct functionality of the 3-bit counter shown in Figure 6.11. Since the maximum level among feedback latches is six in this implementation, new inputs should be applied every six clock cycles, and every six clock cycles a new correct output is generated. The counter counts up by 1 for every pulse on en until it reaches 111, and then it overflows to 000 and starts over. The counter preserves its previous counted value when there is no pulse on the en pin.

Figure 6.13: Waveforms showing correct functionality of the 4-bit counter generated by applying technology-independent optimizations, regular technology mapping (without using supergates), and Algorithm 10. New inputs should be applied every seven clock cycles, and every seven clock cycles a new correct output is generated. The counter counts up by 1 for every pulse on en until it reaches 1111, and then it overflows to 0000 and starts over. The counter preserves its previous counted value when there is no pulse on the en pin.

Figure 6.14: A 1-bit version of the counter in Figure 6.6 after applying technology-independent optimizations, technology mapping (using supergates), and Algorithm 10. This circuit has only one feedback loop of length three: 4 → 1 → 3. Note that the worst stage delay in this implementation is determined by the intrinsic delay of an or2 gate, which is 6.1 ps, while in the implementation shown in Figure 6.7 it is determined by an and2 gate, which is 8.4 ps. Since both have the same maximum level for feedback latches, this implementation has better throughput (see Tables 6.1 and 6.2).
Figure 6.15: A 2-bit version of the counter in Figure 6.6 after applying technology-independent optimizations, technology mapping (using supergates), and Algorithm 10. The green squares (path balancing DFFs) are added by Algorithm 10 for balancing paths and for equalizing the lengths of feedback loops. Note that compared to the implementation shown in Figure 6.9, this implementation has a better throughput (see Tables 6.1 and 6.2) because its maximum level for feedback latches is smaller by one. This circuit has two feedback loops as follows: 7 → 4 → 1 → 3, and 8 → 6 → 4.

Figure 6.16: Waveforms showing correct functionality of the 2-bit counter in Figure 6.15. At first, since the reset signal (rst) is 1, the counter is reset and both count[0] and count[1] become 0. Then, the enable signal (en) becomes 1 and causes count[0] to get a value of 1 after four clock cycles. Note that we should apply inputs and sample outputs every four fast clock cycles. The counter counts up by 1 for every new pulse on en until it reaches 11, and then it overflows to 00 and starts over. The counter preserves its previous counted value when there is no pulse on en.

Figure 6.17: A 3-bit version of the counter in Figure 6.6 after applying technology-independent optimizations, technology mapping (using supergates), and Algorithm 10. The green squares (path balancing DFFs) are added by Algorithm 10 for balancing paths and for equalizing the lengths of feedback loops. Note that compared to the implementation shown in Figure 6.11, this implementation has a better throughput (see Tables 6.1 and 6.2) because its maximum level for feedback latches is smaller by one. This circuit has three feedback loops as follows: 3 → 1 → 2, 9 → 6 → 8, and 13 → 11 → 12.

Figure 6.18: Waveforms showing correct functionality of the 3-bit counter shown in Figure 6.17. Since the maximum level among feedback latches is five in this implementation, new inputs should be applied every five clock cycles, and every five clock cycles a new correct output is generated. The counter counts up by 1 for every pulse on en until it reaches 111, and then it overflows to 000 and starts over. The counter preserves its previous counted value when there is no pulse on the en pin.

Figure 6.19: A 4-bit version of the counter in Figure 6.6 after applying technology-independent optimizations, technology mapping (using supergates), and Algorithm 10. The green squares (path balancing DFFs) are added by Algorithm 10 for balancing paths and for equalizing the lengths of feedback loops. This circuit has five feedback loops as follows: 3 → 1 → 2, 9 → 6 → 8, 13 → 11 → 12, 22 → 21 → 20 → 19, and 22 → 18 → 17 → 20 → 19.

Figure 6.20: Waveforms showing correct functionality of the 4-bit counter shown in Figure 6.19. New inputs should be applied every five clock cycles, and every five clock cycles a new correct output is generated. The counter counts up by 1 for every pulse on en until it reaches 1111, and then it overflows to 0000 and starts over. The counter preserves its previous counted value when there is no pulse on the en pin.

Figure 6.21: Area and #JJs results for 1-4-bit counters with (super) and without (map) using supergates during technology mapping. Area results belong to the left Y-axis, which is in logarithmic scale, and #JJs belong to the right Y-axis, which is in linear scale.
Figure 6.22: Static power consumption results for 1-4-bit counters with (super) and without (map) using supergates in two technologies: RSFQ and ERSFQ. The power results for RSFQ belong to the left Y-axis, while the power results for ERSFQ belong to the right Y-axis.

Chapter 7

qSyn: A Synthesis Tool for Superconducting SFQ Logic Circuits

qSyn is a behavioral and logic synthesis tool developed for synthesizing superconducting SFQ circuits. The tool is mainly written in the C programming language, but C++ and Python are also employed. qSyn is divided into three sub-directories: qYosys, for parsing input Verilog/BLIF circuits and performing behavioral synthesis; qABC, for performing SFQ specific logic synthesis as well as logic equivalence checking (qEC); and qView, for viewing small synthesized circuits, which contain splitters and path balancing DFFs. qSyn is part of a bigger tool called qPALACE (Physical and Logical Aware Compiler Engine for Single Flux Quantum Logic), which receives a high-level design in Verilog/BLIF format and maps it to an SFQ chip. In the following sections, each of these three sub-directories is explained in more detail, and several examples with the corresponding expected results are provided for demonstration purposes.

7.1 qYosys

qYosys is based on Yosys, an open source framework for Verilog RTL synthesis [154]. Yosys has extensive Verilog-2005 support and a basic set of behavioral synthesis algorithms. We have made changes to Yosys to make it compatible with our SFQ specific logic synthesis tool, qABC. In addition, as part of qYosys, we have developed a mini synthesis process for synthesizing DFFs with asynchronous resets in the SFQ technology. In the following, we explain some of the behavioral commands of Yosys that are used for synthesizing SFQ circuits. We also give more details on our mini synthesis process.
7.1.1 Behavioral Synthesis Commands

In this section, the following commands are explained: hierarchy, proc, memory, opt, and techmap; the reference manual of Yosys is used for this purpose. For more details on these commands, please refer to the tool's manual.

• hierarchy: This command checks the design hierarchy and, if necessary, instantiates parametrized versions of the modules in the design. If there is only one module in the design, this command does nothing. However, as a general rule, a synthesis script should always contain this command as the first command after reading the input files. We usually use it with the -top switch followed by the name of the top module. This removes modules that are not used in the specified top module.

• proc: This command converts processes, which are Yosys' internal representations of Verilog always- and initial-blocks, to circuits of multiplexers and storage elements such as DFFs.

• memory: This command transforms Yosys' internal representations of arrays and array accesses to multi-port block memories, and then maps these block memories to address decoders and DFFs, unless the -nomap switch is used. If the -nomap switch is used, the multi-port block memories stay in the design and can then be mapped to architecture-specific memory primitives using other commands.

• opt: This command is Yosys' built-in optimizer. It can perform some simple optimizations such as const-folding and removing unconnected (dangling) parts of the design. It is common practice to call opt after each major step in the synthesis procedure. In cases where too much optimization is not desired (for example, when analyzing a design), it is recommended to call clean instead of opt.

• techmap: This command turns a high-level circuit with coarse grain cells such as wide adders and multipliers into a fine-grain circuit of simple logic primitives and single-bit storage elements.
The command does that by substituting the complex cells with circuits of simpler cells. It is possible to provide a custom set of rules for this process in the form of a Verilog source file.

7.1.2 Mini Synthesis Process

The mini synthesis process that is part of qYosys is developed to synthesize Yosys' sub-circuits that implement DFFs with asynchronous resets. The mini synthesis process is developed in the Python programming language and should be invoked when all behavioral synthesis and optimization steps of Yosys are finished. There are eight different types of DFFs with asynchronous resets generated by Yosys, as shown in Figure 7.1. Since there is no edge sensitivity in SFQ circuits, we actually have four different types (the code should still handle all eight types of these DFFs because its input is a CMOS style circuit generated by Yosys). Among all of these types of DFFs, only those with a reset level (RstLvl) of 0 and a reset value (RstVal) of 0 (the two cases in Figure 7.1 that are mapped to the top left circuit shown in Figure 7.2) can be handled using SFQ asynchronous (dynamic) AND gates. For the other cases, we have to use inverters either at the inputs of this AND gate or at its output (Figure 7.2). Since such an inverter receives clock pulses, the reset operation is not asynchronous anymore. Therefore, our mini synthesis engine can currently synthesize only two out of the eight types of DFFs shown in Figure 7.1; for the rest of them it does not fail, but it converts a DFF with asynchronous reset into a DFF with synchronous reset. For handling the other types, we will need asynchronous (dynamic) inverter gates, which will be discussed as part of the future work of this dissertation.

Figure 7.1: Eight different sub-circuits generated by Yosys for DFFs with asynchronous resets. RstLvl is the reset level, i.e., the value that should be put on the reset pin in order to activate the reset operation. RstVal is the reset value to which the circuit is reset.
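The grouping of the eight Yosys cell types into four SFQ cases can be sketched as follows. This is not the code of the mini synthesis process; it is a minimal illustration that assumes the cell names listed in Figure 7.2 ($_DFF_NN0, $_DFF_PN0, and so on), in which the three trailing characters encode the clock edge, the reset level (N for RstLvl 0, P for RstLvl 1), and the reset value. The function name and the returned dictionary are hypothetical.

```python
def classify_async_reset_dff(cell_name):
    """Decode a Yosys async-reset DFF cell name such as '$_DFF_PN0'."""
    name = cell_name.rstrip("_")  # tolerate a trailing underscore
    code = name[-3:]
    if (not name.startswith("$_DFF_") or len(name) != len("$_DFF_") + 3
            or code[0] not in "NP" or code[1] not in "NP" or code[2] not in "01"):
        raise ValueError(f"not an async-reset DFF cell: {cell_name}")
    rst_lvl = 0 if code[1] == "N" else 1
    rst_val = int(code[2])
    return {
        "clk_edge": code[0],             # ignored in SFQ (no edge sensitivity)
        "sfq_case": (rst_lvl, rst_val),  # four cases remain after dropping the edge
        # Only RstLvl = 0 with RstVal = 0 keeps a truly asynchronous reset.
        "asynchronous": rst_lvl == 0 and rst_val == 0,
    }


CELLS = ["$_DFF_NN0", "$_DFF_PN0", "$_DFF_NP0", "$_DFF_PP0",
         "$_DFF_NP1", "$_DFF_PP1", "$_DFF_NN1", "$_DFF_PN1"]
async_cells = [c for c in CELLS if classify_async_reset_dff(c)["asynchronous"]]
cases = {classify_async_reset_dff(c)["sfq_case"] for c in CELLS}
print(async_cells, len(cases))  # ['$_DFF_NN0', '$_DFF_PN0'] 4
```

The output reflects the statement above: eight cell types collapse into four SFQ cases, and only two of the eight keep an asynchronous reset.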
Figure 7.2 shows the four different mapping solutions generated by our mini synthesis engine, which maps Yosys' sub-circuits that implement DFFs with asynchronous resets into asynchronous and/or synchronous circuits. In these mapping solutions, SFQ DRO DFFs, SFQ inverters, and SFQ 2-input asynchronous AND gates are employed. In the top left and the bottom right circuits shown in Figure 7.2, the reset operation is active low; this means that in the normal operation modes of the circuit, when the reset operation is not supposed to happen, there have to be continuous pulses (representing "logic-1") on the reset pin. This might not be desired by the circuit and/or RTL designers, and it may increase the dynamic activity and hence the corresponding power consumption. Therefore, using De Morgan's law, we convert (using SFQ OR gates and inverters) these two circuits into ones that do not require continuous pulses on the reset pins in normal operation modes.

Figure 7.2: Four different mappings generated by our mini synthesis process for synthesizing Yosys' sub-circuits that implement DFFs with asynchronous resets. The names under each circuit ($_DFF_NN0, $_DFF_PN0; $_DFF_NP0, $_DFF_PP0; $_DFF_NP1, $_DFF_PP1; $_DFF_NN1, $_DFF_PN1) match those listed in the last column of the table shown in Figure 7.1.

7.2 qABC

qABC is our logic synthesis engine for synthesizing SFQ circuits. It is developed inside UC Berkeley's latest open source logic synthesis and verification tool, called ABC [38]. We have added the following sub-directories to the src directory of ABC: qIndep, qMap, qSeq, qUtils, qMisc, qBox, and qEC.

Figure 7.3: Options for balanced factorization and rewriting commands in qABC.

• qIndep: This directory has two sub-directories, BalancedRefactor and BalancedRewrite, which contain the source code implementing our balanced factorization (brf command) and balanced rewriting (brw command) algorithms.
Figure 7.3 shows the different options for these two technology-independent optimization commands.

• qMap: This directory has the following sub-directories: PBMap, BalancedMap, addLatchRetime, adjustFanouts, adjustFanoutsPI, CorrectLevels, BalanceFLatch, ChangeName, repLatches, InsertSplitters, printLevels, and Utils.

PBMap and BalancedMap contain the source code implementing our path balancing technology mappers (commands PBMap or qmap, and BM). Figure 7.4 shows the different options for these commands.

Figure 7.4: Options for path balancing technology mapping commands (qmap and BM) in qABC.

addLatchRetime (addLr command) adds path balancing latches to the mapped circuit and performs standard retiming if desired by the user. It supports both full and partial path balancing modes, and it can also perform graph partitioning using our depth-bounded levelized graph partitioning (DLGP) algorithm. To keep the original latches in the circuit (for verification purposes, if desired by a user), we have disabled the retiming operation when the input circuit has at least one latch. Also, the retiming operation can corrupt the circuit when the dual clocking method (DCM) is used for implementing the circuit. A user can still choose to perform retiming by using the -F switch, but in this case the correct structure and functionality of the circuit are no longer guaranteed. Figure 7.5 shows the different options for this command.

The adjustFanouts command adjusts the fanout count of gates and DFFs and guarantees that the maximum fanout count of any node is limited to a value given by the user. For this purpose, node duplication is performed for those nodes that have more than the given maximum number of fanouts. For example, if the maximum allowed number of fanouts for a node is four and there is a node with nine fanouts, then two copies of this node are added to the network and each of them gets four of the nine fanouts of the original node.
This way, the original node will have only one fanout left, which is below the given maximum value. Node duplication can propagate the fanout count violation to the fanins of the target node and eventually to the primary inputs, and therefore create a new problem. To solve this new problem, the adjustFanoutsPI command should be used. If there is a primary input with more than the given maximum number of fanouts, this command adds a set of parallel DFFs and distributes the fanouts of this primary input among them. To make sure that the full path balancing property is still maintained after this operation, one DFF must be added to the output of every other primary input. Figure 7.6 shows the options for these two fanout adjusting commands.

Figure 7.5: Options for addLr command in qABC that adds full/partial path balancing latches and performs standard retiming. In addition, this command implements our depth-bounded graph partitioning algorithm (DLGP) as well as our dual clocking method (DCM) for realizing SFQ circuits.

Figure 7.6: Options for adj and adjPI commands in qABC that adjust fanout counts of nodes, latches and primary inputs and guarantee limiting them to a maximum value given by the user.

The CorrectLevels (cl) command assigns correct level values to gates, latches, primary inputs, and primary outputs. It supports combinational circuits, sequential circuits, and circuits with feedback loops. If the user requests it, this command removes all latches before assigning levels. It can also remove buffers between two consecutive latches: ABC normally adds these buffers when there is no gate between two consecutive latches, but they are not needed in SFQ circuits, so the cl command removes them if the "-B" switch is selected. Figure 7.7 shows the options for this command.

Figure 7.7: Options for CorrectLevels or cl command in qABC that assigns correct levels to gates, latches, primary inputs, and primary outputs.
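The node duplication arithmetic behind adjustFanouts can be illustrated with a short Python sketch. It is only an illustration under assumed names: split_fanouts and the fanout labels are hypothetical and are not part of qABC, and the sketch does not model the extra load that each duplicate places on the fanins of the original node.

```python
def split_fanouts(fanouts, max_fanout):
    """Distribute the fanouts of one node over duplicated copies.

    Hypothetical helper, not qABC code. Full groups of `max_fanout`
    fanouts go to duplicates and the remainder stays on the original
    node, mirroring the nine-fanout example in the text.
    Returns (fanouts_of_original, list_of_fanouts_per_duplicate).
    """
    if max_fanout < 1:
        raise ValueError("max_fanout must be at least 1")
    if len(fanouts) <= max_fanout:
        return list(fanouts), []  # no violation, nothing to duplicate

    n_full, remainder = divmod(len(fanouts), max_fanout)
    if remainder == 0:
        # Every group is full: keep one full group on the original node.
        n_copies, keep = n_full - 1, max_fanout
    else:
        n_copies, keep = n_full, remainder

    original = list(fanouts[:keep])
    copies = [
        list(fanouts[keep + i * max_fanout: keep + (i + 1) * max_fanout])
        for i in range(n_copies)
    ]
    return original, copies


# A node driving nine gates, with a maximum allowed fanout of four.
fanouts = [f"g{i}" for i in range(9)]
original, copies = split_fanouts(fanouts, max_fanout=4)
print(len(original), [len(c) for c in copies])  # 1 [4, 4]
```

Because every duplicate reads the same inputs as the original node, each fanin of that node gains one fanout per duplicate, which is how a violation can move toward the primary inputs.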
The BalanceFLatch command adds latches to the inputs of feedback latches whose level values are smaller than the maximum level among all feedback latches in the circuit. It can also add garbage collecting SFQ 2-input AND gates to the inputs of the feedback latches; this is required if the dual clocking method with input repetition is used for realizing an SFQ circuit with feedback loops. Figure 7.8 shows the options for this command.

Figure 7.8: Options for BalanceFLatch or b command in qABC that balances feedback latches in an SFQ circuit with feedback loops. It can also add garbage collecting AND gates if the "-a" switch is selected.

The ChangeName command changes the names of nodes, primary inputs, and primary outputs to standard names. This is useful when some nodes have long names that are hard to track, especially when the graph of the circuit needs to be displayed. Figure 7.9 shows the options for this command.

Figure 7.9: Options for ChangeName command in qABC that changes names of internal nodes, latches, primary inputs, and primary outputs to some standard names.

The repLatches (repL) command replaces latches with SFQ DRO DFFs. This command does not accept any arguments.

The InsertSplitters (InsS) command inserts splitters at the outputs of gates with more than one fanout. If the fanout count of a gate is more than two, a balanced or linear tree of splitters is inserted at the output of the gate. This command also generates a ".levels" file containing the names of nodes together with their logic levels. The ".levels" file is used in later steps of the design flow for SFQ circuits, including the placement step. Figure 7.10 shows the options for this command.

The Utils sub-directory contains utility functions that are used in other places.

• qSeq: This directory contains three sub-directories, IsSeq, qCut, and SPORTloop, which are designed to operate on sequential circuits.

IsSeq is a command that detects whether a given circuit is sequential.
In other words, it detects whether a given circuit has any original latches. To this end, it prints the number of latches in the given circuit; if the printed number is greater than 0, the circuit is sequential.

Figure 7.10: Options for InsS command in qABC that inserts splitter trees at outputs of gates with more than one fanout.

The qCut command creates a two-stage pipeline circuit out of a given combinational circuit. It cuts the circuit by adding latches at the level given by the user. Figure 7.11 shows the options for this command.

The SPORTloop sub-directory contains commands and source code for detecting all feedback loops in a given circuit, finding the maximum length among these loops, and equalizing the lengths of all loops by adding SFQ DRO DFFs to the shorter loops.

• qUtils: This directory has two sub-directories: misc and io. The first contains utility functions that are used throughout the logic synthesis and technology mapping processes. The second contains the source code for the qWrite_blif (qwb) command, which writes a synthesized, mapped, and path balanced SFQ circuit with inserted splitters in BLIF format. It supports circuits containing DRO DFFs with a fast clock, DRO DFFs with a slow clock, and NDRO DFFs. If the user selects the "-s" switch, it uses the real names of splitters (as opposed to the older command, which uses the DFF name for splitters). Figure 7.13 shows the options for this command.

• qMisc: This is the miscellaneous directory of qABC. Among other things, it contains functions for printing statistics about the circuit. These statistics include: gate count, DRO DFF count, NDRO DFF count, splitter count, maximum level, maximum fanout count, total node count, total area, total Josephson junction count, worst stage delay, and static power consumption.
The command's name is SPORTPrintStats, and its variations are sprss, which prints a brief report, and sprsl, which prints a more detailed report. Figure 7.12 shows the options for this command.

Figure 7.11: Options for qCut command in qABC that creates a two-stage pipeline circuit out of a given combinational circuit.

Figure 7.12: Options for SPORTPrintStats command and its variations sprss and sprsl in qABC. This command is designed to print some statistics about the circuit.

• qBox: This directory is for synthesizing circuits with black and white boxes. Currently it has two sub-directories: const and black. The first implements the const command, which replaces zero gates and one gates (ABC's implementations of constant-0 and constant-1, respectively) with primary inputs. The second implements the black command, which handles black boxes: it models the black boxes of the circuit as a single unified black box and generates a description file containing the inputs and outputs of this box and its estimated horizontal and vertical sizes. This command should be called at the end of logic synthesis and technology mapping, once all steps including full path balancing and splitter insertion are done. Figures 7.14 and 7.15 show the options for these two commands.

• qEC: This directory contains the source code for performing formal verification of SFQ circuits. Currently, it only supports verifying combinational circuits. For more details on qEC, please see [41].

Figure 7.13: Options for qWrite_blif or qwb command in qABC. This command is designed to write a synthesized, mapped, path balanced SFQ circuit with inserted splitters into a BLIF format.

Figure 7.14: Options for const command in qABC. This command is designed to replace gate zeros and gate ones (ABC's implementations for constant-0 and constant-1, respectively) with primary inputs.
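The two splitter tree shapes that InsS can build, as described earlier in this section, can be compared with a small Python sketch. It assumes that each splitter cell has one input and two outputs, which is consistent with a tree being needed only for fanout counts above two; the function name splitter_tree is hypothetical and the code is not taken from qABC.

```python
def splitter_tree(fanout, balanced=True):
    """Return (splitter_count, depths) for a tree of 1-to-2 splitters.

    `depths` lists, for every sink, how many splitters lie between the
    driving gate and that sink. Hypothetical sketch, not the InsS code.
    """
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    if fanout == 1:
        return 0, [0]  # a single sink needs no splitter
    if balanced:
        # Split the sinks into two halves of nearly equal size and recurse.
        left = fanout // 2
        right = fanout - left
        left_count, left_depths = splitter_tree(left, True)
        right_count, right_depths = splitter_tree(right, True)
        depths = [d + 1 for d in left_depths + right_depths]
        return 1 + left_count + right_count, depths
    # Linear chain: each splitter feeds one sink and the next splitter,
    # except the last one, which feeds two sinks.
    depths = [i + 1 for i in range(fanout - 1)]
    depths.append(fanout - 1)
    return fanout - 1, depths


print(splitter_tree(5, balanced=True))   # (4, [2, 2, 2, 3, 3])
print(splitter_tree(5, balanced=False))  # (4, [1, 2, 3, 4, 4])
```

Under this assumption both shapes use the same number of splitters (fanout minus one); they differ in depth, which grows logarithmically with the fanout count for the balanced tree and linearly for the chain.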
Figure 7.15: Options for black command in qABC. This command is designed to handle black boxes.

7.3 qView

qView generates graphical representations of SFQ circuits that have been synthesized, mapped, and path balanced, and that have splitters inserted at the outputs of gates with more than one fanout. Generating a graphical view of such a circuit requires a ".blif" file and the corresponding ".levels" file, both generated by qABC. With these files in place, the following command should be invoked:

python3 qView.py circuit-name.blif

Examples of how to view graphs of SFQ circuits are provided in the next section (e.g., see Example 20 and Figure 7.19).

7.4 Demonstrative Examples

In this section, six examples walk the reader through synthesizing different circuits in SFQ technology using our tool qSyn.

Example 20 This example synthesizes and maps a 4-bit Kogge-Stone adder (KSA4). We assume that the user has a Verilog description of this adder (KSA4.v). An SFQ library of gates in the standard genlib format [130] (SFQ.genlib) will also be used. In the rest of this example, we assume that KSA4.v and SFQ.genlib are located in the main directory of qABC.

First, open a terminal, type ./qABC, and press Enter. To read the library of gates, type: read_library SFQ.genlib. To read the input circuit, type: read_verilog KSA4.v. To perform SFQ-specific technology-independent optimizations such as balanced factorization and rewriting, use one of our optimization scripts, such as blnc_syn2. Then use qmap to perform SFQ-specific technology mapping, and addLr -L -R -v to add path balancing latches and perform retiming. To limit the fanout count of each gate after technology mapping, use adj -C num, where num is the fanout count limit.
The circuits generated after qmap and after addLr -L -R can be displayed using the show -g and show -s commands, respectively. They are shown in Figure 7.16 and Figure 7.17, respectively. Finally, the latches must be replaced with SFQ DRO DFFs, and splitters must be inserted at the outputs of gates or primary inputs with more than one fanout. This is done using the repL and InsS -B commands, respectively. The latter command also takes the name of the circuit (KSA4) as an argument. The "-B" switch of InsS requests a balanced splitter tree; to get a linear tree instead, omit "-B" when calling this command. To print statistics about the circuit, use the sprsl command; these statistics include logical depth, gate and DFF count, splitter count with and without clock splitters, area, and total JJ count. After invoking the commands discussed so far, you will see the messages shown in Figure 7.18. To write the final circuit in BLIF format, use the following command: write_blif KSA4.blif.

Figure 7.16: Synthesized and mapped circuit for KSA4 by using qmap command. Logic gates, and primary inputs/outputs are shown by circles, and triangles, respectively.

To see the final mapped circuit with splitters and path balancing DFFs, go back to the qSyn directory and run the following command: python3 qView.py qABC/KSA4.blif. This generates the circuit shown in Figure 7.19. The last argument of this command is the path of the final circuit created earlier (KSA4.blif).

Example 21 For the circuit whose Verilog code is shown in Figure 7.20, invoking the blnc_syn2 and qmap commands produces the circuit shown in Figure 7.21a. The level correction command (cl -B) transforms this circuit into the one shown in Figure 7.21b. Then, the full path balancing command turns it into the circuit shown in Figure 7.21c.
Finally, after replacing latches with SFQ DRO DFFs (repL command), inserting splitters (InsS -B regtest), and writing the circuit in BLIF format, the final circuit can be viewed using the qView package. The final circuit displayed by qView is shown in Figure 7.22.

Example 22 Yosys-counter: This example uses a Verilog simulation tool called qVsim and some converters from the qPALACE package. The Verilog code for this counter is shown in Figure 6.6 in Chapter 6. In the default setting it is a 3-bit counter; in this example, however, we present the synthesis flow together with simulation results for 2-bit, 3-bit, and 4-bit counters. In the following, we go through the synthesis and simulation flow for the 2-bit counter (counter2). The flow for the 3-bit and 4-bit versions is similar.

Figure 7.17: Synthesized, mapped, and fully path balanced circuit for KSA4 by using qmap and addLr -L -R commands. Logic gates, path balancing latches, and primary inputs/outputs are shown by circles, rectangles, and triangles, respectively.

First, counter.v should be read by qYosys and converted into a format that qABC can parse. For this purpose, the following commands should be used in the qYosys environment:

read_verilog counter.v
proc; opt; techmap;
write_blif counter2.blif

Now, counter2.blif should be copied to the qABC directory. After entering the qABC environment and reading the library and the counter2 circuit, the following commands should be invoked:

blnc_syn2; qmap; cl -B -v; b -b -v; addLr -L -R; repL; InsS -B counter2; qWrite_blif counter2.blif

The circuits generated after the cl -B -v and addLr -L -R commands are shown in Figure 7.23 and Figure 7.24, respectively. To see the graph representation of a circuit with latches in the qABC environment, type: show -s.

Figure 7.18: Messages printed to the command prompt in the qABC environment when synthesizing the KSA4 benchmark circuit (Example 20).
Now, to simulate the final generated netlist for counter2 post-synthesis using qVsim, proceed as follows. First, copy counter2.blif and counter2.levels to the following directory: qPALACE/converters/blif2bookshelf/src. Then, invoke the following command:

./blif2bookshelf counter2.blif

Next, copy all files with the prefix "counter2" to the directory below and invoke the command below:

Directory: qPALACE/converters/bookshelf2def
Command: ./bookshelf2def counter2.auxlef ../../qLib/1.4.1/lef/lef_2_metals.lef

Now copy all files with the prefix counter2 to the qPALACE/qVsim directory and invoke the following command:

python3 qVsim.py counter2LOGIC -LIB ../../../qLib/1.4.1

The last argument in the above command is the path of the library (library version 1.4.1 is used in this example). The command runs qVsim on the synthesized 2-bit counter circuit and performs post-synthesis simulations. Currently, qVsim generates a testbench with random values for the inputs of the circuit. To change the testbench generated by qVsim (to use customized test patterns), comment out the line in qVsim.py that invokes TestBenchGen.py and manually modify the counter2_TB.v testbench file to reflect the customized test pattern. Finally, invoke the following command again:

python3 qVsim.py counter2LOGIC -LIB ../../../qLib/1.4.1

Figure 7.19: Synthesized, mapped, and fully path balanced circuit for KSA4. Logic gates and DFFs, and primary inputs/outputs are shown by circles, and triangles, respectively. In this graph, the splitters are also inserted and shown. This graph is generated by using qView.py script in the qSyn directory.
We could generate the waveform shown in Figure 6.10, which verifies the correct functionality of this 2-bit counter, as explained below.

In the waveform shown in Figure 6.10, since the reset signal (rst) is set to 1, the counter is reset and both the count[0] and count[1] bits become 0. Then the enable signal (en) becomes 1 and causes the count[0] bit to take a value of 1 after 5 clock cycles. Note that inputs should be applied and outputs sampled once every 5 clock cycles. This factor can be obtained from the log file of qSyn (it is printed to the screen if you are entering commands one by one in the qABC environment). Look for a message such as the following one:

The maximum level among feedback latches is : 5

Every 5 fast clock cycles, a new set of inputs should be applied to the circuit. After 5 clock cycles, the enable signal becomes 0; therefore, the counter should preserve the previous value (01), which is the case for our synthesized circuit according to the waveform. Then, every 5 clock cycles (en is 1 every 5 clock cycles from this point on), the counter counts up by 1 until it overflows to 00, and then it starts over.

Figure 7.20: Verilog code of the circuit used in Example 21.

Figure 7.21: Test circuit with Verilog shown in Figure 7.20. (a) after applying blnc_syn2 and qmap commands, (b) after applying cl -B command, (c) after applying addLr -L -R command.

Figure 7.22: Final mapped circuit for Example 21 generated using qView.

Figure 7.23: A 2-bit counter circuit generated by invoking the following commands in qABC: blnc_syn2; qmap; cl -B -v;. To see the graph representation of a circuit with latches in the qABC environment, type show -s.

Figure 7.24: A 2-bit counter circuit generated by invoking the following commands: blnc_syn2; qmap; cl -B -v; b -b -v; addLr -L -R;. To see the graph representation of a circuit with latches in the qABC environment, type show -s.
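The expected behavior described above can be captured in a small Python reference model, which is convenient for checking sampled outputs against a customized test pattern. This is a sketch under stated assumptions: the function name simulate_counter and the stimulus are hypothetical (the stimulus is not the one in Figure 6.10), reset is assumed to take priority over enable, and one input set is applied every `period` fast clock cycles.

```python
def simulate_counter(stimulus, width=2, period=5):
    """Reference model of an up-counter with reset and enable.

    `stimulus` is a list of (rst, en) pairs, one pair per `period`
    fast clock cycles. Returns a list of (fast_cycle, count) samples.
    Assumption: reset takes priority over enable.
    """
    count = 0
    samples = []
    for step, (rst, en) in enumerate(stimulus):
        if rst:
            count = 0
        elif en:
            count = (count + 1) % (1 << width)  # wraps to 0 on overflow
        # The updated value is sampled one period after the inputs
        # are applied.
        samples.append(((step + 1) * period, count))
    return samples


# Hypothetical test pattern: reset, count, hold, count three times, hold.
stimulus = [(1, 0), (0, 1), (0, 0), (0, 1), (0, 1), (0, 1), (0, 0)]
for cycle, count in simulate_counter(stimulus):
    print(f"fast cycle {cycle:2d}: count = {count:02b}")
```

Setting `period` to the value reported in the "maximum level among feedback latches" message keeps the model aligned with the rate at which inputs are applied to the synthesized circuit.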
In the last 10 clock cycles, the counter preserves its last counted value because it did not receive a pulse on the enable pin, which is the correct behavior according to the Yosys counter Verilog code (Figure 6.6).

Figures 7.25, 7.26, and 6.12 show the synthesized and mapped circuits for the 3-bit counter and its simulation waveform. Figures 7.27, 7.28, and 6.13 show the same for the 4-bit counter. These waveforms also verify the functionality of the 3-bit and 4-bit counter circuits synthesized by qSyn.

Figure 7.25: A 3-bit counter circuit generated by invoking the following commands in qABC: blnc_syn2; qmap; cl -B -v;.

Figure 7.26: A 3-bit counter circuit generated by invoking the following commands in qABC: blnc_syn2; qmap; cl -B -v; b -b -v; addLr -L -R;.

Figure 7.27: A 4-bit counter circuit generated by invoking the following commands in qABC: blnc_syn2; qmap; cl -B -v;.

Figure 7.28: A 4-bit counter circuit generated by invoking the following commands in qABC: blnc_syn2; qmap; cl -B -v; b -b -v; addLr -L -R;.

Example 23 This example is for synthesizing sequential pipeline circuits. There are multiple methods for synthesizing them. The assumption in this example is that the input circuit does not have any feedback loops. Circuits with feedback loops are covered in Example 22.

1. Removing all latches: If all non-feedback latches are removed and all nodes are then assigned correct level values, the full path balancing that follows this removal guarantees correct circuit operation. In this case, there is only one clock. The following commands can be used: blnc_syn2; qmap; adj -C 8; cl -B -d; b -b; addLr -L -R -n 0. Note that for this category the cl command has to be used with "-d". Also, the last command can be replaced with addLr -L -R. Figures 7.29a and 7.29b show a 4-input pipelined XOR circuit and the circuit that results from applying the above commands.

2. Keeping pipeline latches: In this case, unlike the previous one, latches that are not in a feedback loop are also kept. For this purpose, the following commands should be invoked: blnc_syn2; qmap; adj -C 8; cl -B; b -b; addLr -L -R -n 1. Given the circuit in Figure 7.29a as an input, this generates a circuit such as the one in Figure 7.29c.

3. Dual clocking method with keeping pipeline latches: In this case, the pipeline latches get a slow clock and all other gates and DFFs get a fast clock. The frequency of the slow clock is 1/D_max of that of the fast clock, where D_max is the maximum depth among the blocks of the given linear pipeline circuit. For this method to work properly, the depth of all blocks should be made the same, i.e., equal to D_max. You may use the following commands: blnc_syn2; qmap; adj -C 8; cl -B; b -b; addLr -L -R -n 2. This turns the circuit in Figure 7.29a into the circuit shown in Figure 7.29d.

4. Partial path balancing: For this purpose, our Dual Clocking Method (DCM) should be used. NDROs are inserted at each primary input, and pipeline latches are also replaced with NDROs. The reset pin of these NDROs receives a slow clock signal, and their set pin is driven by a garbage collecting AND2 gate (except for the NDROs that are connected to the primary inputs of the circuit). Garbage collecting AND2 gates also need to be inserted before each primary output of the circuit. Then, depending on the value of the imbalance factor, each block is partially or fully path balanced. You may use the following commands: blnc_syn2; qmap; adj -C 8; cl -B; b -b; addLr -L -R -i 1 -p. Note that the imbalance factor is given by "-i num" (num=1 here); by default, the imbalance factor is assumed to be 0. This creates the circuit shown in Figure 7.30.

Example 24 This example is for finding optimal parts for a given combinational circuit and realizing it using our Dual Clocking Method.
For this purpose, our Depth-Bounded Levelized Graph Partitioning (DLGP) algorithm should be used. Given a 2-bit adder (KSA2), the following commands should be used: blnc_syn2; qmap; adj -C 8; cl -B; b -b; addLr -L -R -g 3;. The number that follows "-g" determines the depth of each part in the DLGP algorithm. Given the synthesized and mapped circuit for KSA2 (after applying the blnc_syn2; qmap; commands) shown in Figure 7.31, the DLGP algorithm generates the circuit shown in Figure 7.32.

Example 25 To show the ability of our qView tool to generate graphs for larger circuits, we present this example for a 4-bit array multiplier. First, the BLIF description of this multiplier (arrmult4.blif) should be copied to the qABC main directory. Then, the following commands should be used:

blnc_syn2; qmap; adj -C 8; addLr -L -R -v; repL; InsS -B arrm4; qwb arrm4;

Then, go back to the qSyn directory and run the following command:

python3 qView.py qABC/arrm4.blif

This generates the graph shown in Figure 7.33.

Figure 7.29: (a) A pipelined 4-input XOR circuit with two pipeline stages. The rectangles are latches and the circles are logic gates. The first pipeline stage of this circuit consists of only one XOR gate (15) and the second stage consists of two XOR gates (16 and 17). (b) After applying the following commands: blnc_syn2; qmap; adj -C 8; cl -d -B; b -b; addLr -L -R -n 0;. (c) After applying the following commands: blnc_syn2; qmap; adj -C 8; cl -B; b -b; addLr -L -R -n 1;. (d) After applying the following commands: blnc_syn2; qmap; adj -C 8; cl -B; b -b; addLr -L -R -n 2;.

Figure 7.30: A pipelined 4-input XOR circuit after applying the following commands: blnc_syn2; qmap; adj -C 8; cl -B; b -b; addLr -L -R -i 1 -p;.

Figure 7.31: A 2-bit adder circuit after applying the following commands: blnc_syn2; qmap;.
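The role of the depth bound given with "-g" in Example 24 can be illustrated with a deliberately simplified Python sketch. It is not the DLGP algorithm, which searches for optimal parts; it only shows what it means for each part to span a bounded number of levels while nodes of the same level stay in the same part. The function name level_bands, the node names, and the levels are all hypothetical.

```python
from collections import defaultdict


def level_bands(levels, depth):
    """Group nodes into parts that each span at most `depth` levels.

    `levels` maps a node name to its logic level (1-based). Nodes with
    the same level always land in the same part. Simplified
    illustration only, not the DLGP algorithm.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    parts = defaultdict(list)
    # Visit nodes in level order so each part lists nodes by level.
    for node, level in sorted(levels.items(), key=lambda kv: (kv[1], kv[0])):
        parts[(level - 1) // depth].append(node)
    return [parts[k] for k in sorted(parts)]


# Hypothetical levelized circuit with five levels and a depth bound of 3.
levels = {"a": 1, "b": 1, "c": 2, "d": 3, "e": 4, "f": 5}
print(level_bands(levels, depth=3))  # [['a', 'b', 'c', 'd'], ['e', 'f']]
```

In this toy setting the parts are fixed bands of consecutive levels; an optimizing partitioner is free to place the part boundaries differently as long as the depth bound holds.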
Figure 7.32: A 2-bit adder circuit after applying the following commands: blnc_syn2; qmap; adj -C 8; cl -B; b -b; addLr -L -R -i 1 -g 3;.

Figure 7.33: Synthesized, mapped, and fully path balanced circuit for a 4-bit array multiplier. Logic gates and DFFs, and primary inputs/outputs are shown by circles, and triangles, respectively. In this graph, the splitters are also inserted and shown. This graph is generated by using qView.py script in the qSyn directory.

Chapter 8 Conclusions and future works

In this dissertation, after providing preliminary knowledge on Single Flux Quantum (SFQ) logic and on basic terminology and concepts in logic synthesis and graph partitioning, we presented the algorithms that we developed for efficient synthesis and mapping of SFQ logic circuits. These include two technology-independent optimization functions, namely balanced factorization and rewriting, which are designed to take the balancing requirement of SFQ circuits into account and hence to generate better solutions for these circuits in the technology-independent phase of logic synthesis. Then, we presented our path balancing technology mapping algorithms for minimizing the total path balancing DFF count, the total area, and the product of the worst stage delay and the depth of the circuit. These technology mappers provide provably optimal solutions for mapping trees, and they are very effective in generating optimized mapping solutions for general Directed Acyclic Graphs. Next, we presented a dual clocking method and a depth-bounded levelized graph partitioning algorithm for the realization of SFQ circuits, which eliminate the need for the expensive full path balancing step in the standard approach to the design and implementation of SFQ circuits. Afterwards, we presented the required theoretical bases and algorithmic support for synthesizing sequential SFQ circuits, especially those with feedback loops.
For each of these algorithms and methods, we presented extensive experimental results that show the superiority of our methods for the realization of SFQ circuits compared with the baseline approaches. We also presented exhaustive simulation results that verify the correct functionality of the circuits generated by our methods. Finally, we presented our software tool for synthesizing SFQ circuits, qSyn, together with many demonstrative examples and their expected results.

As future work, we suggest the following:

• Designing technology-independent optimization functions for synthesizing SFQ circuits (other than balanced factorization and rewriting) that take the important balancing property of these circuits into account and generate more balanced technology-independent networks. Specifically, we suggest modifying and specializing the "don't care" optimization functions designed for CMOS circuits so that they can be used for synthesizing SFQ circuits. We believe that this will further improve the quality of the results that we already have.

• Designing a depth-bounded graph partitioning algorithm that, unlike the DLGP algorithm, does not have the level constraint (putting nodes with the same level into the same parts). We suggest using a combination of network flow and dynamic programming algorithms for this purpose. We believe that this will improve on the best results already achieved by the DLGP algorithm.

• Designing asynchronous SFQ NOT gates and incorporating them in the mini synthesis process that was presented for synthesizing Yosys' sub-circuits that implement DFFs with asynchronous resets. This will make it possible to synthesize all different types of such sub-circuits and map them into fully asynchronous circuits.

• Designing more complex logic gates such as AOI21, AOI22, OAI21, and OAI22 in SFQ technology and incorporating them in the technology mapping process.
This will help improve many important metrics such as total node count, total path balancing DFF count, total area, depth of the circuit, and peak throughput. Also, an SFQ NAND gate and a better SFQ NOT gate with fewer Josephson junctions and less area could considerably improve these evaluation metrics.

Acronyms

AIG : And Inverter Graph
AQFP : Adiabatic Quantum Flux Parametron
BLIF : Berkeley Logic Interchange Format
BM : Balanced Map
BMD : Balanced Map Delay
CAD : Computer Aided Design
CMOS : Complementary Metal Oxide Semiconductor
DAG : Directed Acyclic Graph
DBCP : Doubly Balanced Connected graph Partitioning
DCG : Directed Cyclic Graph
DCGP : Depth-Bounded Chain Graph Partitioning
DCM : Dual Clocking Method
DFF : D-Flip-Flop
DLGP : Depth-Bounded Levelized Graph Partitioning
DP : Dynamic Programming
DRO : Destructive Read Out
eSFQ : energy-efficient Single Flux Quantum
EDA : Electronic Design Automation
ERSFQ : Energy-efficient Rapid Single Flux Quantum
FinFET : Fin shape Field Effect Transistor
FPB : Full Path Balancing
FPGA : Field Programmable Gate Array
GP : Graph Partitioning
GPP : Graph Partitioning Problem
IVC : I-V Characteristic
JJ : Josephson Junction
JTL : Josephson Transmission Line
KSA : Kogge-Stone Adder
K-L : Kernighan-Lin
LUT : Look Up Table
NDRO : Non-Destructive Read Out
NP : Nondeterministic Polynomial
NPB : No Path Balancing
PBMap : Path Balancing Mapper
PI : Primary Input
PO : Primary Output
POS : Product Of Sums
PPB : Partial Path Balancing
PSD : Product of the worst-case Stage delay with the logical Depth of the circuit
PTL : Passive Transmission Line
RQL : Reciprocal Quantum Logic
RSFQ : Rapid Single Flux Quantum
RTL : Register Transfer Level
SCE : SuperConducting Electronics
SFQ : Single Flux Quantum
SOC : System-on-Chip
SOI : Silicon On Insulator
SOP : Sum Of Products
TCAD : Technology Computer Aided Design
TFF : T-Flip-Flop
TFI : Transitive FanIn cone
TFO : Transitive FanOut cone
VLSI : Very Large Scale Integration system

Reference List

[1] Kanak Agarwal, Kevin Nowka, Harmander Deogun, and Dennis Sylvester. Power gating with multiple sleep modes. In Proceedings of the 7th International Symposium on Quality Electronic Design, pages 633-637. IEEE Computer Society, 2006.
[2] Charles J Alpert, Jen-Hsin Huang, and Andrew B Kahng. Multilevel circuit partitioning. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 17(8):655-667, 1998.
[3] Charles J Alpert and Andrew B Kahng. Recent directions in netlist partitioning: a survey. Integration, the VLSI journal, 19(1-2):1-81, 1995.
[4] Luca Amarù, Pierre-Emmanuel Gaillardon, and Giovanni De Micheli. BDS-MAJ: A BDD-based logic synthesis tool exploiting majority logic decomposition. In Proceedings of the 50th Annual Design Automation Conference, page 47. ACM, 2013.
[5] Luca Amarù, Pierre-Emmanuel Gaillardon, and Giovanni De Micheli. Majority-inverter graph: A new paradigm for logic optimization. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 35(5):806-819, 2016.
[6] Konstantin Andreev and Harald Räcke. Balanced graph partitioning. Theory of Computing Systems, 39(6):929-939, 2006.
[7] Robert L Ashenhurst. The decomposition of switching functions. In Proceedings of an International Symposium on the Theory of Switching, April 1957, 1957.
[8] Christine Axnix, G Bayer, H Böhm, Joachim von Buttlar, Mark S Farrell, L Cranton Heller, Jeffrey P Kubala, SE Lederer, Raymond Mansell, A Nunez Mencias, et al. IBM z13 firmware innovations for simultaneous multithreading and I/O virtualization. IBM Journal of Research and Development, 59(4/5):11-1, 2015.
[9] Christian Bachmaier, Franz J Brandenburg, Wolfgang Brunner, and Gergő Lovász. Cyclic leveling of directed graphs. In International Symposium on Graph Drawing, pages 348-359. Springer, 2008.
[10] Ricardo A Baeza-Yates. Text-retrieval: Theory and practice.
In IFIP Congress (1), volume 12, pages 465-476, 1992.
[11] Luca Benini and Giovanni De Micheli. A survey of Boolean matching techniques for library binding. ACM Transactions on Design Automation of Electronic Systems (TODAES), 2(3):193-226, 1997.
[12] Per Bjesse and Arne Borälv. DAG-aware circuit compression for formal verification. In Proceedings of the 2004 IEEE/ACM International Conference on Computer-Aided Design, pages 42-49. IEEE Computer Society, 2004.
[13] Robert K Brayton. Factoring logic functions. IBM Journal of Research and Development, 31(2):187-198, 1987.
[14] Robert K Brayton, Gary D Hachtel, Curt McMullen, and Alberto Sangiovanni-Vincentelli. Logic minimization algorithms for VLSI synthesis, volume 2. Springer Science & Business Media, 1984.
[15] Robert K Brayton, Gary D Hachtel, and Alberto L Sangiovanni-Vincentelli. Multilevel logic synthesis. Proceedings of the IEEE, 78(2):264-300, 1990.
[16] Robert K Brayton, Richard Rudell, Alberto Sangiovanni-Vincentelli, and Albert R Wang. MIS: A multiple-level logic optimization system. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 6(6):1062-1081, 1987.
[17] Franc Brglez, David Bryan, and Krzysztof Kozminski. Combinational profiles of sequential benchmark circuits. In IEEE International Symposium on Circuits and Systems, pages 1929-1934. IEEE, 1989.
[18] Aydın Buluç, Henning Meyerhenke, Ilya Safro, Peter Sanders, and Christian Schulz. Recent advances in graph partitioning. In Algorithm Engineering, pages 117-158. Springer, 2016.
[19] PI Bunyk, A Oliva, VK Semenov, M Bhushan, KK Likharev, JE Lukens, MB Ketchen, and WH Mallison. High-speed single-flux-quantum circuit using planarized niobium-trilayer Josephson junction technology. Applied Physics Letters, 66(5):646-648, 1995.
[20] Wayne P Burleson, Maciej Ciesielski, Fabian Klass, and Wentai Liu. Wave-pipelining: a tutorial and research survey.
IEEE Transactions on very large scale integration (vlsi) systems, 6(3):464{474, 1998. [21] Ruizhe Cai, Olivia Chen, Ao Ren, Ning Liu, Caiwen Ding, Nobuyuki Yoshikawa, and Yanzhi Wang. A majority logic synthesis framework for adiabatic quantum- ux-parametron super- conducting circuits. In Proceedings of the 2019 on Great Lakes Symposium on VLSI, pages 189{194. ACM, 2019. [22] K. Chaudhary and M. Pedram. Computing the area versus delay trade-o curves in tech- nology mapping. IEEE Trans. on Computer Aided Design, 14(12):1480{1489, 1995. [23] Kamal Chaudhary and Massoud Pedram. A near optimal algorithm for technology mapping minimizing area under delay constraints. In Proceedings of the 29th DAC, pages 492{498. IEEE Computer Society Press, 1992. [24] Kamal Chaudhary and Massoud Pedram. Computing the area versus delay trade-o curves in technology mapping. IEEE Transactions on Computer-Aided Design of Integrated Cir- cuits and Systems, 14(12):1480{1489, 1995. [25] Deming Chen and Jason Cong. Daomap: A depth-optimal area optimization mapping algorithm for fpga designs. In Proceedings of the 2004 IEEE/ACM International conference on Computer-aided design, pages 752{759. IEEE Computer Society, 2004. [26] Sao-Jie Chen and Chung-Kuan Cheng. Tutorial on vlsi partitioning. VLSI design, 11(3):175{218, 2000. [27] Shuang Chen, Yanzhi Wang, Xue Lin, Qing Xie, and Massoud Pedram. Performance predic- tion for multiple-threshold 7nm-nfet-based circuits operating in multiple voltage regimes using a cross-layer simulation framework. In 2014 SOI-3D-Subthreshold Microelectronics Technology Unied Conference (S3S), pages 1{2. IEEE, 2014. 209 [28] W Chen, AV Rylyakov, Vijay Patel, JE Lukens, and KK Likharev. Rapid single ux quan- tum t- ip op operating up to 770 ghz. IEEE Transactions on Applied Superconductivity, 9(2):3212{3215, 1999. [29] Jason Cong and Yuzheng Ding. Flowmap: An optimal technology mapping algorithm for delay optimization in lookup-table based fpga designs. 
IEEE Transactions on Computer- Aided Design of Integrated Circuits and Systems, 13(1):1{12, 1994. [30] Jason Cong and Yuzheng Ding. An optimal technology mapping algorithm for delay op- timization in lookup-table based fpga designs. In International Conference On Computer Aided Design (ICCAD), 2003. [31] L Cotton. Maximum rate pipelining systems. In Procs. AFIPS Spring Joint Computer Conference, 1969, 1969. [32] Mikhail Dorojevets, Christopher L Ayala, Nobuyuki Yoshikawa, and Akira Fujimaki. 16-bit wave-pipelined sparse-tree rsfq adder. IEEE Transactions on Applied Superconductivity, 23(3):1700605{1700605, 2013. [33] Mohammad Javad Dousti, Majid Ghasemi-Gol, Mahdi Nazemi, and Massoud Pedram. Thermtap: An online power analyzer and thermal simulator for android devices. In 2015 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), pages 341{346. IEEE, 2015. [34] Ronald G Dreslinski, Michael Wieckowski, David Blaauw, Dennis Sylvester, and Trevor Mudge. Near-threshold computing: Reclaiming moore's law through energy ecient inte- grated circuits. Proceedings of the IEEE, 98(2):253{266, 2010. [35] Niklas E en and Niklas S orensson. An extensible SAT-solver. In International conference on theory and applications of satisability testing, pages 502{518. Springer, 2003. [36] Hadi Esmaeilzadeh, Emily Blem, Renee St Amant, Karthikeyan Sankaralingam, and Doug Burger. Dark silicon and the end of multicore scaling. In Computer Architecture (ISCA), 2011 38th Annual International Symposium on, pages 365{376. IEEE, 2011. [37] Amaru et. al. The ep combinational benchmark suite, 2017. [38] Mishchenko et. al. ABC : A system for sequential synthesis and verication. Berkeley Logic Synthesis and Verication Group, 2019. [39] Guy Even, Joseph Naor, Satish Rao, and Baruch Schieber. Fast approximate graph parti- tioning algorithms. SIAM Journal on Computing, 28(6):2187{2214, 1999. [40] Amir H Farrahi and Majid Sarrafzadeh. 
Complexity of the lookup-table minimization prob- lem for fpga technology mapping. IEEE TCAD, 13(11):1319{1332, 1994. [41] Arash Fayyazi, Shahin Nazarian, and Massoud Pedram. qEC: A logical equivalence checking framework targeting SFQ superconducting circuits. In 2019 IEEE International Supercon- ductive Electronics Conference (ISEC), pages 1{3. IEEE, 2019. [42] Charles M Fiduccia and Robert M Mattheyses. A linear-time heuristic for improving net- work partitions. In Papers on Twenty-ve years of electronic design automation, pages 241{247. ACM, 1988. [43] T Filippov, M Dorojevets, A Sahu, A Kirichenko, C Ayala, and O Mukhanov. 8-bit asyn- chronous wave-pipelined rsfq arithmetic-logic unit. IEEE Transactions on Applied Super- conductivity, 21(3):847{851, 2011. 210 [44] Timur V Filippov, Anubhav Sahu, Alex F Kirichenko, Igor V Vernik, Mikhail Dorojevets, Christopher L Ayala, and Oleg A Mukhanov. 20 ghz operation of an asynchronous wave- pipelined rsfq arithmetic-logic unit. Physics Procedia, 36:59{65, 2012. [45] C. Fourie. Rsfq cell library, 2018. [46] Coenrad J Fourie. Digital superconducting electronics design tools|status and roadmap. IEEE Transactions on Applied Superconductivity, 28(5):1{12, 2018. [47] Coenrad Johann Fourie, Kyle Jackman, Matthys M Botha, Sasan Razmkhah, Pascal Febvre, Christopher Lawrence Ayala, Qiuyun Xu, Nobuyuki Yoshikawa, Erin Patrick, Mark Law, et al. Cold ux superconducting EDA and TCAD tools project: Overview and progress. IEEE Transactions on Applied Superconductivity, 29(5):1{7, 2019. [48] Eby G Friedman. Clock distribution networks in synchronous digital integrated circuits. Proceedings of the IEEE, 89(5):665{692, 2001. [49] Kris Gaj, Eby G Friedman, and Marc J Feldman. Timing of multi-gigahertz rapid single ux quantum digital circuits. Journal of VLSI signal processing systems for signal, image and video technology, 16(2-3):247{276, 1997. 
[50] Kris Gaj, Quentin P Herr, Victor Adler, Andy Krasniewski, Eby G Friedman, and Marc J Feldman. Tools for the computer-aided design of multigigahertz superconducting digital circuits. IEEE transactions on applied superconductivity, 9(1):18{38, 1999. [51] Michael R Garey, David S Johnson, and Larry Stockmeyer. Some simplied np-complete problems. In Proceedings of the sixth annual ACM symposium on Theory of computing, pages 47{63. ACM, 1974. [52] Olivier Goldschmidt and Dorit S Hochbaum. A polynomial algorithm for the k-cut problem for xed k. Mathematics of operations research, 19(1):24{37, 1994. [53] Rudolf Gross, Achim Marx, and Frank Deppe. Applied superconductivity: Josephson eect and superconducting electronics. De Gruyter, 2015. [54] MVSIS Group et al. MVSIS: Multi-valued logic synthesis system. [55] Winston Haaswijk, Mathias Soeken, Luca Amar u, Pierre-Emmanuel Gaillardon, and Gio- vanni De Micheli. A novel basis for logic rewriting. In 2017 22nd Asia and South Pacic Design Automation Conference (ASP-DAC), pages 151{156. Ieee, 2017. [56] Gary D Hachtel and Fabio Somenzi. Logic synthesis and verication algorithms. Springer Science & Business Media, 2006. [57] Lars Hagen and Andrew B Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE transactions on computer-aided design of integrated circuits and systems, 11(9):1074{1085, 1992. [58] William W Hager, Dzung T Phan, and Hongchao Zhang. An exact algorithm for graph partitioning. Mathematical Programming, 137(1-2):531{556, 2013. [59] Mark C Hansen, Hakan Yalcin, and John P Hayes. Unveiling the iscas-85 benchmarks: A case study in reverse engineering. IEEE Design & Test of Computers, 16(3):72{80, 1999. [60] Yoshihito Hashimoto, Shinichi Yorozu, Yoshio Kameda, and Vasili K Semenov. A design approach to passive interconnects for single ux quantum logic circuits. IEEE transactions on applied superconductivity, 13(2):535{538, 2003. 211 [61] Scott Hauck and Gaetano Borriello. 
An evaluation of bipartitioning techniques. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 16(8):849{866, 1997. [62] Bruce Hendrickson and Robert W Leland. A multi-level algorithm for partitioning graphs. SC, 95(28):1{14, 1995. [63] Quentin P Herr, Anna Y Herr, Oliver T Oberg, and Alexander G Ioannidis. Ultra-low-power superconductor logic. Journal of applied physics, 109(10):103903, 2011. [64] Timothy M Hollis. Invited talk: Early estimation of on-chip clock jitter accumulation|a brief tutorial. In Microelectronics And Electron Devices (WMED), 2014 IEEE Workshop On, pages 1{1. IEEE, 2014. [65] D Scott Holmes, Andrew L Ripple, and Marc A Manheimer. Energy-ecient superconduct- ing computing-power budgets and requirements. IEEE Transactions on Applied Supercon- ductivity, 23(3):1701610{1701610, 2013. [66] Laurent Hyal and Ronald L Rivest. Graph partitioning and constructing optimal decision trees are polynomial complete problems. IRIA. Laboratoire de Recherche en Informatique et Automatique, 1973. [67] IARPA. SuperTools program, 2018. [68] Sasan Iman and Massoud Pedram. Logic extraction and factorization for low power. In Proceedings of the 32nd annual ACM/IEEE Design Automation Conference, pages 248{ 253. ACM, 1995. [69] David S Johnson. The NP-completeness column: an ongoing guide. Journal of Algorithms, 6(1):145{159, 1985. [70] David S Johnson, Cecilia R Aragon, Lyle A McGeoch, and Catherine Schevon. Optimization by simulated annealing: an experimental evaluation; part i, graph partitioning. Operations research, 37(6):865{892, 1989. [71] JSIM. The berkeley superconducting spice simulator, 2017. [72] M Jurczak, N Collaert, A Veloso, T Homann, and S Biesemans. Review of nfet technology. In 2009 Ieee International Soi Conference, pages 1{4. IEEE, 2009. [73] George Karypis, Rajat Aggarwal, Vipin Kumar, and Shashi Shekhar. Multilevel hyper- graph partitioning: applications in vlsi domain. 
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 7(1):69{79, 1999. [74] N. Katam. Sport lab sfq logic circuit benchmark suite, 2017. [75] Naveen Katam, Alireza Shafaei, and Massoud Pedram. Design of complex rapid single- ux- quantum cells with application to logic synthesis. In 16th International Superconductive Electronics Conference, ISEC 2017. [76] Naveen Katam, Alireza Shafaei, and Massoud Pedram. Design of multiple fanout clock distribution network for rapid single ux quantum technology. In 22nd Asia and South Pacic Design Automation Conference (ASP-DAC), pages 384{389. IEEE, 2017. [77] Naveen Kumar Katam and Massoud Pedram. Logic optimization, complex cell design and retiming of single ux quantum circuits. IEEE Transactions on Applied Superconductivity, 2018. 212 [78] Michael Kaufmann and Dorothea Wagner. Drawing graphs: methods and models, volume 2025. Springer, 2003. [79] Brian W Kernighan and Shen Lin. An ecient heuristic procedure for partitioning graphs. The Bell system technical journal, 49(2):291{307, 1970. [80] Kurt Keutzer. DAGON: technology binding and local optimization by dag matching. In 24th Conference on Design Automation, pages 341{347. IEEE, 1987. [81] S Kim, J Kim, and S-Y Hwang. New path balancing algorithm for glitch power reduction. IEE Proceedings-Circuits, Devices and Systems, 148(3):151{156, 2001. [82] Alex F Kirichenko, Saad Sarwana, and Igor V Vernik. Ersfq-zero static power dissipa- tion single ux quantum logic. In Government Microcircuit Appl. and Critical Techn. Conf.(GOMACTech-12), pages 319{322. [83] Alex F Kirichenko, Igor V Vernik, John A Vivalda, Rick T Hunt, and Daniel T Yohannes. Ersfq 8-bit parallel adders as a process benchmark. IEEE Trans. Appl. Supercond, 25(3):1, 2015. [84] DE Kirichenko, Saad Sarwana, and AF Kirichenko. Zero static power dissipation biasing of rsfq circuits. IEEE Transactions on Applied Superconductivity, 21(3):776, 2011. [85] Nobutaka Kito, Kazuyoshi Takagi, and Naofumi Takagi. 
A fast wire-routing method and an automatic layout tool for rsfq digital circuits considering wire-length matching. IEEE Transactions on Applied Superconductivity, 28(4):1{5, 2018. [86] V Koshelets, K Likharev, V Migulin, O Mukhanov, G Ovsyannikov, V Semenov, I Ser- puchenko, and A Vystavkin. Experimental realization of a resistive single ux quantum logic circuit. IEEE Transactions on Magnetics, 23(2):755{758, 1987. [87] Stephane Lafon and Ann B Lee. Diusion maps and coarse-graining: A unied framework for dimensionality reduction, graph partitioning, and data set parameterization. IEEE transactions on pattern analysis and machine intelligence, 28(9):1393{1403, 2006. [88] Dominique LaSalle and George Karypis. A parallel hill-climbing renement algorithm for graph partitioning. In Parallel Processing (ICPP), 2016 45th International Conference on, pages 236{241. IEEE, 2016. [89] Eugene L Lawler. An approach to multilevel boolean minimization. Journal of the ACM (JACM), 11(3):283{295, 1964. [90] Eugene L Lawler, Karl N Levitt, and James Turner. Module clustering to minimize delay in digital networks. IEEE Transactions on Computers, 100(1):47{57, 1969. [91] Eric Lehman, Yosinori Watanabe, Joel Grodstein, and Heather Harkness. Logic decompo- sition during technology mapping. IEEE TCAD, 16(8):813{834, 1997. [92] Charles E Leiserson and James B Saxe. Retiming synchronous circuitry. Algorithmica, 6(1):5{35, 1991. [93] Harry R Lewis. Computers and intractability. a guide to the theory of np-completeness, 1983. [94] KK Likharev, OA Mukhanov, and VK Semenov. Resistive single ux quantum logic for the josephson-junction digital technology. In SQUID'85, pages 1103{1108. Walter de Gruyter, Berlin, 1985. 213 [95] KK Likharev and VK Semenov. RSFQ logic/memory family: A new josephson-junction technology for sub-terahertz-clock-frequency digital systems. IEEE Transactions on Applied Superconductivity, 50(1), 1991. [96] Ting-Ru Lin, Tim Edwards, and Massoud Pedram. 
qGDR: A Via minimization oriented routing tool for large-scale superconductive single ux quantum circuits. IEEE Transactions on Applied Superconductivity, 2019. [97] Huiqun Liu and DF Wong. Network ow based circuit partitioning for time-multiplexed fpgas. In Proceedings of the 1998 IEEE/ACM international conference on Computer-aided design, pages 497{504. ACM, 1998. [98] Huiqun Liu, Kai Zhu, and DF Wong. Circuit partitioning with complex resource constraints in fpgas. In Proceedings of the 1998 ACM/SIGDA sixth international symposium on Field programmable gate arrays, pages 77{84. ACM, 1998. [99] Fr ed eric Mailhot and Giovanni De Micheli. Technology mapping using boolean matching and don't care sets. In Design Automation Conference, 1990., EDAC. Proceedings of the European, pages 212{216. IEEE, 1990. [100] Valavan Manohararajah, Stephen Dean Brown, and Zvonko G Vranesic. Heuristics for area minimization in lut-based fpga technology mapping. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 25(11):2331{2340, 2006. [101] Frank McSherry. Spectral partitioning of random graphs. In Foundations of Computer Science, 2001. Proceedings. 42nd IEEE Symposium on, pages 529{537. IEEE, 2001. [102] Henning Meyerhenke, Peter Sanders, and Christian Schulz. Parallel graph partitioning for complex networks. IEEE Transactions on Parallel and Distributed Systems, 28(9):2625{ 2638, 2017. [103] Alan Mishchenko, Robert Brayton, Stephen Jang, and Victor Kravets. Delay optimization using sop balancing. In Computer-Aided Design (ICCAD), 2011 IEEE/ACM International Conference on, pages 375{382. IEEE, 2011. [104] Alan Mishchenko, Satrajit Chatterjee, and Robert Brayton. Integrating logic synthesis, technology mapping, and retiming. In Proc. IWLS'05. Citeseer, 2006. [105] Alan Mishchenko, Satrajit Chatterjee, and Robert Brayton. DAG-aware AIG rewriting a fresh look at combinational logic synthesis. 
In Proceedings of the 43rd annual Design Automation Conference, pages 532{535. ACM, 2006. [106] Alan Mishchenko, Satrajit Chatterjee, Robert Brayton, Xinning Wang, and Timothy Kam. Technology mapping with boolean matching, supergates and choices. 2005. [107] Alan Mishchenko, Sungmin Cho, Satrajit Chatterjee, and Robert Brayton. Combinational and sequential mapping with priority cuts. In ICCAD, pages 354{361, 2007. [108] G. E. Moore. Cramming more components onto integrated circuits. Electronics, 38(8):114{ 117, 1965. [109] Matthew W Moskewicz, Conor F Madigan, Ying Zhao, Lintao Zhang, and Sharad Ma- lik. Cha: Engineering an ecient SAT solver. In Proceedings of the 38th annual Design Automation Conference, pages 530{535. ACM, 2001. [110] O Mukhanov, V Semenov, and K Likharev. Ultimate performance of the rsfq logic circuits. IEEE Transactions on Magnetics, 23(2):759{762, 1987. 214 [111] Oleg A Mukhanov. Energy-ecient single ux quantum technology. IEEE Transactions on Applied Superconductivity, 21(3):760{769, 2011. [112] Saburo Muroga. Logic design and switching theory. John Wiley & Sons, 1979. [113] T Narama. Study of large fan-out splitter and yield evaluation circuit for large-scale adia- batic quantum ux parametron circuit. PhD thesis, master thesis, March, 2016. [114] Shahin Nazarian, Arash Fayyazi, and Massoud Pedram. qCG: A low-power multi-domain SFQ logic design and verication framework. In 2019 IEEE 37th International Conference on Computer Design (ICCD), pages 446{449. IEEE, 2019. [115] Mark EJ Newman. Community detection and graph partitioning. EPL (Europhysics Let- ters), 103(2):28003, 2013. [116] Nikhil R Pal and Sankar K Pal. A review on image segmentation techniques. Pattern recognition, 26(9):1277{1294, 1993. [117] Peichen Pan and Chih-Chang Lin. A new retiming-based technology mapping algorithm for lut-based fpgas. In Proceedings of the sixth international symposium on Field programmable gate arrays, pages 35{42. ACM, 1998. 
[118] Ghasem Pasandi and Sied Mehdi Fakhraie. A 256-kb 9t near-threshold sram with 1k cells per bitline and enhanced write and read operations. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 23(11):2438{2446, 2014. [119] Ghasem Pasandi and Massoud Pedram. A graph partitioning algorithm with application in synthesizing single ux quantum logic circuits. arXiv preprint arXiv:1810.00134, 2018. [120] Ghasem Pasandi and Massoud Pedram. Balanced factorization and rewriting algorithms for synthesizing single ux quantum logic circuits. In Proceedings of the 2019 on Great Lakes Symposium on VLSI (GLSVLSI), pages 183{188, 2019. [121] Ghasem Pasandi and Massoud Pedram. A dynamic programming-based path balancing technology mapping algorithm targeting area minimization. In Proc. IEEE/ACM Int. Conf. Comput. Aided Des. (ICCAD), 2019. [122] Ghasem Pasandi and Massoud Pedram. An ecient pipelined architecture for superconduct- ing single ux quantum logic circuits utilizing dual clocks. IEEE Transactions on Applied Superconductivity, 30(2):1{12, 2019. [123] Ghasem Pasandi and Massoud Pedram. PBMap: A path balancing technology mapping algorithm for single ux quantum logic circuits. IEEE Transactions on Applied Supercon- ductivity, 29(4):1{14, 2019. [124] Ghasem Pasandi, Mackenzie Peterson, Moises Herrera, Shahin Nazarian, and Massoud Pe- dram. Deep-PowerX: A Deep learning-based framework for low-power approximate logic synthesis. In 2020 ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED '20), Boston, MA, USA. DOI: 10.1145/3370748.3406555, 2020. [125] Ghasem Pasandi, Alireza Shafaei, and Massoud Pedram. SFQmap: A technology mapping tool for single ux quantum logic circuits. In International Symposium on Circuits and Systems (ISCAS). IEEE, May 27, 2018. [126] Massoud Pedram and Jan M Rabaey. Power aware design methodologies. Springer Science & Business Media, 2002. 215 [127] SV Polonsky, VK Semenov, and DF Schneider. 
Transmission of single- ux-quantum pulses along superconducting microstrip lines. IEEE transactions on applied superconductivity, 3(1):2598{2600, 1993. [128] Rajmohan Rajaraman and DF Wong. Optimum clustering for delay minimization. IEEE transactions on computer-aided design of integrated circuits and systems, 14(12):1490{1495, 1995. [129] Sumit Roy, Harm Arts, and Prithviraj Banerjee. Powershake: A low power driven clustering and factoring methodology for boolean expressions. In Proceedings of the conference on Design, automation and test in Europe, pages 967{968. IEEE Computer Society, 1998. [130] Richard Rudell. Genlib: Combinational gate specication. Berkeley University, http://www. eecs. berkeley. edu/ alanmi/publications/other/SIS paper genlib. pdf. [131] Ellen M Sentovich, Kanwar Jit Singh, Luciano Lavagno, Cho Moon, Rajeev Murgai, Alexan- der Saldanha, Hamid Savoj, Paul R Stephan, Robert K Brayton, and Alberto Sangiovanni- Vincentelli. SIS: A system for sequential circuit synthesis. 1992. [132] Soheil Nazar Shahsavani, Ting-Ru Lin, Alireza Shafaei, Coenrad J Fourie, and Massoud Pedram. An integrated row-based cell placement and interconnect synthesis tool for large sfq logic circuits. IEEE Transactions on Applied Superconductivity, 27(4):1{8, 2017. [133] Soheil Nazar Shahsavani, Alireza Shafaei, and Massoud Pedram. A placement algorithm for superconducting logic circuits based on cell grouping and super-cell placement. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018, pages 1465{1468. IEEE, 2018. [134] Soheil Nazar Shahsavani, Bo Zhang, and Massoud Pedram. Accurate margin calculation for single ux quantum logic cells. 2018. [135] Gary Smith. Updates of the ITRS design cost and power models. In 2014 IEEE 32nd International Conference on Computer Design (ICCD), pages 161{165. IEEE, 2014. [136] Mathias Soeken, Luca Gaetano Amar u, Pierre-Emmanuel Gaillardon, and Giovanni De Micheli. 
Exact synthesis of majority-inverter graphs and its applications. IEEE Trans- actions on Computer-Aided Design of Integrated Circuits and Systems, 36(11):1842{1855, 2017. [137] Mathias Soeken and Michael Kirkedal Thomsen. White dots do matter: rewriting reversible logic circuits. In International Conference on Reversible Computation, pages 196{208. Springer, 2013. [138] Saleh Soltan, Mihalis Yannakakis, and Gil Zussman. Doubly balanced connected graph par- titioning. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1939{1950. Society for Industrial and Applied Mathematics, 2017. [139] Vassos Soteriou, Noel Eisley, and Li-Shiuan Peh. Software-directed power-aware intercon- nection networks. ACM Transactions on Architecture and Code Optimization (TACO), 4(1):5, 2007. [140] DB Strukov, A Mishchenko, and R Brayton. Maximum throughput logic synthesis for stateful logic: A case study. In Reed-Muller 2013 Workshop, 2013. 216 [141] Kai Sun, Qianchuan Zhao, Da-Zhong Zheng, Jin Ma, and Qiang Lu. A two-phase method based on obdd for searching for splitting strategies of large-scale power systems. In Power System Technology, 2002. Proceedings. PowerCon 2002. International Conference on, vol- ume 2, pages 834{838. IEEE, 2002. [142] Ramy N Tadros and Peter A Beerel. A robust and self-adaptive clocking technique for SFQ circuits. IEEE Transactions on Applied Superconductivity, 28(7):1{11, 2018. [143] Ramy N Tadros and Peter A Beerel. Optimizing (HC) 2 LC, a robust clock distribution network for SFQ circuits. IEEE Transactions on Applied Superconductivity, 30(1):1{11, 2019. [144] Ramy N Tadros and Peter A Beerel. A robust and tree-free hybrid clocking technique for RSFQ circuits{CSR application. In 16th International Superconductive Electronics Confer- ence, Sorento, Italy, June 2017. [145] Naoki Takeuchi, Dan Ozawa, Yuki Yamanashi, and Nobuyuki Yoshikawa. An adiabatic quantum ux parametron as an ultra-low-power logic device. 
Superconductor Science and Technology, 26(3):035010, 2013. [146] Naoki Takeuchi, Yuki Yamanashi, and Nobuyuki Yoshikawa. Energy eciency of adiabatic superconductor logic. Superconductor Science and Technology, 28(1):015003, 2014. [147] T. N. Theis and H. S. P. Wong. The end of moore's law: A new beginning for information technology. Computing in Science Engineering, 19(2):41{50, Mar 2017. [148] Vivek Tiwari, Pranav Ashar, and Sharad Malik. Technology mapping for low power. In Design Automation, 1993. 30th Conference on, pages 74{79. IEEE, 1993. [149] Vivek Tiwari, Pranav Ashar, and Sharad Malik. Technology mapping for low power in logic synthesis. Integration, the VLSI Journal, 20(3):243{268, 1996. [150] Chi-Ying Tsui, Massoud Pedram, and Alvin M Despain. Technology decomposition and mapping targeting low power dissipation. In Proceedings of the 30th international Design Automation Conference, pages 68{73. ACM, 1993. [151] Hirendu Vaishnav and Massoud Pedram. Delay optimal partitioning targeting low power vlsi circuits. In Proceedings of the 1995 IEEE/ACM international conference on Computer- aided design, pages 638{643. IEEE Computer Society, 1995. [152] Mark H Volkmann, Anubhav Sahu, Coenrad J Fourie, and Oleg A Mukhanov. Experimental investigation of energy-ecient digital circuits based on esfq logic. IEEE Trans. Appl. Supercond, 23(3):1301505, 2013. [153] Fangzhou Wang and Sandeep Gupta. Automatic test pattern generation for timing verica- tion and delay testing of RSFQ circuits. In 2019 IEEE 37th VLSI Test Symposium (VTS), pages 1{6. IEEE, 2019. [154] Cliord Wolf. Yosys open synthesis suite, 2016. [155] Qiuyun Xu, Christopher L Ayala, Naoki Takeuchi, Yuki Murai, Yuki Yamanashi, and Nobuyuki Yoshikawa. Synthesis ow for cell-based adiabatic quantum- ux-parametron structural circuit generation with hdl back-end verication. IEEE Transactions on Applied Superconductivity, 27(4):1{5, 2017. 
217 [156] Shigeru Yamashita, Katsunori Tanaka, Hideyuki Takada, Koji Obata, and Kazuyoshi Tak- agi. A transduction-based framework to synthesize rsfq circuits. In Proceedings of the 2006 Asia and South Pacic Design Automation Conference (ASP-DAC), pages 266{272. IEEE Press, 2006. [157] Saeyang Yang. Logic synthesis and optimization benchmarks user guide: version 3.0. Mi- croelectronics Center of North Carolina (MCNC), 1991. [158] Chingwei Yeh, Chin-Chao Chang, and Jinn-Shyan Wang. Technology mapping for low power. In Design Automation Conference, 1999. Proceedings of the ASP-DAC'99. Asia and South Pacic, pages 145{148. IEEE, 1999. [159] S Yorozu, Y Kameda, H Terai, A Fujimaki, T Yamada, and S Tahara. A single ux quantum standard logic cell library. Physica C: Superconductivity, 378:1471{1474, 2002. [160] Nobuyuki Yoshikawa and Junichi Koshiyama. Top-down rsfq logic design based on a binary decision diagram. IEEE transactions on applied superconductivity, 11(1):1098{1101, 2001. [161] Bo Zhang, Fangzhou Wang, Sandeep Gupta, and Massoud Pedram. A statistical static timing analysis tool for superconducting single- ux-quantum circuits. In 2019 IEEE Inter- national Superconductive Electronics Conference (ISEC), pages 1{5. IEEE, 2019. [162] Jin S Zhang, Subarna Sinha, Alan Mishchenko, Robert K Brayton, and Malgorzata Chrzanowska-Jeske. Simulation and satisability in logic synthesis. computing, 7:14, 2005. 218
Abstract
Superconducting Single Flux Quantum (SFQ) devices are strong candidates for high-speed computing in the post-silicon, post-Complementary Metal Oxide Semiconductor (CMOS) era, as scaling of the minimum feature size comes to an end. These niobium-based devices are also extremely low-power: even with their cryo-cooling overhead, they consume significantly less power than state-of-the-art silicon-based devices. SFQ circuits, operating at liquid-helium temperature (≈4 K), have a switching delay of around 1 ps and a switching energy of about 10⁻¹⁹ J. Frequencies as high as 770 GHz for a T-Flip-Flop and 750 GHz for a digital frequency divider have been reported in this technology.

Design and implementation of SFQ circuits have been mainly manual, with limited automation that mostly relies on CMOS Computer-Aided Design (CAD) tools with minimal changes. Recently, a considerable amount of research has been done as part of a project funded by the U.S. government (the SuperTools program) to develop a superconducting circuit design flow with a comprehensive set of Electronic Design Automation (EDA) and Technology Computer Aided Design (TCAD) tools for Very-Large-Scale Integration (VLSI) design of Superconducting Electronics (SCE). The work presented in this dissertation is part of that effort.

SFQ logic circuits have several unique properties and specific requirements that must be taken into account when CAD tools are developed to automate their design flow. For logic synthesis, new technology-independent optimization and technology mapping algorithms are needed to improve the important evaluation metrics of SFQ circuits: total Josephson junction count, total area, total power consumption, peak throughput, and local clock frequency.

To this end, we present a technology-independent optimization flow for SFQ circuits that combines several optimization functions, including our proposed balanced factorization and rewriting algorithms. This flow is followed by our novel path balancing technology mapping algorithm, which maps SFQ logic circuits into networks of logic gates. The path balancing technology mapping algorithm has three main modes: (i) minimizing the total number of path balancing D-Flip-Flops (DFFs), (ii) minimizing the total area, including the area of logic gates and path balancing DFFs, and (iii) minimizing the product of the worst stage delay and the depth of the circuit.

Next, we present a Dual Clocking Method (DCM) for the realization of SFQ circuits that removes the need for the expensive full path balancing step in the standard design and implementation process of SFQ circuits. The proposed DCM is accompanied by a new depth-bounded levelized graph partitioning algorithm, which is one of the contributions of this dissertation.
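To make the cost of full path balancing concrete, the following is a minimal sketch, not the dissertation's algorithm. It assumes that each node is placed at its longest-path level from the primary inputs, that every edge spanning more than one level receives its own chain of DFFs (no sharing between fanouts), and that all primary outputs are padded to the circuit depth. The function name `count_balancing_dffs` and the netlist encoding are hypothetical.

```python
def count_balancing_dffs(fanins, outputs):
    """Count DFFs needed to fully path-balance a combinational DAG.

    fanins:  dict mapping each gate to the list of nodes driving it;
             primary inputs simply do not appear as keys.
    outputs: list of primary output nodes.
    Assumption: one private DFF chain per edge (no chain sharing).
    """
    level = {}

    def node_level(n):
        # Longest-path level: primary inputs are at level 0.
        if n not in level:
            ins = fanins.get(n, [])
            level[n] = 0 if not ins else 1 + max(node_level(i) for i in ins)
        return level[n]

    dffs = 0
    # An edge that skips k levels needs k DFFs to delay its signal.
    for gate, ins in fanins.items():
        for i in ins:
            dffs += node_level(gate) - node_level(i) - 1

    # Pad every primary output up to the overall circuit depth.
    depth = max(node_level(o) for o in outputs)
    for o in outputs:
        dffs += depth - node_level(o)
    return dffs


# Hypothetical netlist: g1 = AND(a, b), g2 = OR(g1, c).
example = {"g1": ["a", "b"], "g2": ["g1", "c"]}
print(count_balancing_dffs(example, ["g2"]))  # prints 1 (input c skips one level)
```

Under these assumptions, a mapping that reduces level differences between connected gates directly reduces the DFF count, which is the quantity targeted by mode (i) above.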
Asset Metadata
Creator: Pasandi, Ghasem (author)
Core Title: Designing efficient algorithms and developing suitable software tools to support logic synthesis of superconducting single flux quantum circuits
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 09/15/2020
Defense Date: 08/10/2020
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: adiabatic quantum flux parametron (AQFP), algebraic expression, Boolean expression, DAG, dual clocking method, energy-efficient rapid single flux quantum (ERSFQ), energy-efficient SFQ, graph partitioning, heuristic, high-speed, logic synthesis, low-power, macro clock, micro clock, Moore’s law, OAI-PMH Harvest, optimal solution, reciprocal quantum logic (RQL), refactoring, rewriting, single flux quantum (SFQ), superconducting electronics (SCE), technology mapping
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Pedram, Massoud (committee chair), Gupta, Sandeep (committee member), Savla, Ketan (committee member)
Creator Email: ghassempasandi@gmail.com, pasandi@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-370813
Unique identifier: UC11666280
Identifier: etd-PasandiGha-8958.pdf (filename), usctheses-c89-370813 (legacy record id)
Legacy Identifier: etd-PasandiGha-8958.pdf
Dmrecord: 370813
Document Type: Dissertation
Rights: Pasandi, Ghasem
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA