Automatic conversion from flip-flop to 3-phase latch-based designs
(USC Thesis Other)
AUTOMATIC CONVERSION FROM FLIP-FLOP TO 3-PHASE LATCH-BASED DESIGNS

by Huimei Cheng

A Dissertation Presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, In Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

August 2020

Copyright 2020 Huimei Cheng

Dedication

This dissertation is dedicated to my beloved family, my father Guiming Xu, my mother Jijin Cheng, and my grandmother Xiaoyun Zhang, for their continuous love and support.

Acknowledgements

First and foremost, I am deeply indebted to Professor Peter A. Beerel, my doctoral advisor. He has given me endless support since my first day in graduate school. I especially appreciate his enthusiastic dedication to my research and his availability even during weekends and while traveling. Professor Beerel always believes in me and makes me feel confident in my skills. Without his guidance and support, I could not have become the person I am today.

Besides my advisor, I would also like to extend my sincere thanks to the rest of my dissertation committee members, Professor Sandeep Gupta and Professor Aiichiro Nakano, for their time, interest, and invaluable comments. I am also grateful to the other two members of my qualifying committee, Professor Massoud Pedram and Professor Pierluigi Nuzzo, for their time and insightful questions. Their crucial remarks shaped my final dissertation.

I am extremely grateful to my senior Hsin-Ho Huang, who brought me into the asynchronous field and encouraged me to conduct research. In addition, I would like to thank my seniors Dylan Hand, Georgios Dimou, Yang Zhang, Tiansong Cui, Ji Li, and Fangzhou Wang. They have provided insightful feedback and invaluable support to my work. Moreover, I wish to thank my teammates in our research group, Sourya Dey, Souvik Kundu, Dake Chen, Moises Herrerabuitrago, and Hsiao-Lun Wang, who have helped me develop my ideas.
Many ideas in the thesis are the results of hours of rewarding discussions with them. In particular, Xi Li, Yichen Gu, and Zhiyong Yang implemented the conventional master-slave designs and helped me with the experimental results. I would also like to thank my colleagues Ting-Ru Lin, Bo Zhang, Mutian Zhu, and Qisheng Fu for the inspiring discussions related to this dissertation and for making my life at USC joyful and exciting.

I would also like to express my deepest gratitude to Professor Shahin Nazarian, who introduced me to the VLSI/CAD world and provided endless technology support. I would like to extend my gratitude to our IT assistant Soowang Park, who always helped me with tool installation and license extension. I also appreciate the help of the USC Ming Hsieh Department of Electrical and Computer Engineering staff, Annie Yu and Diane Demetras, for their support.

Finally, I am deeply thankful to my parents, my family, and my boyfriend for all their love, support, and encouragement. I specially dedicate this dissertation to the memory of my mother Jijin Cheng, whose role in my life was, and remains, immense.

Huimei Cheng
Los Angeles, California
June 2020

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
1.1 FF-Based and Latch-Based Designs
1.2 Conventional Approaches to Latch Conversion
1.2.1 Pulsed Latch Approach
1.2.2 Master-Slave Latch Approach
1.3 Contributions of Thesis
1.4 Organization
Chapter 2: 3-Phase Latch Timing
2.1 Review of Multi-Phase Latch Timing Model
2.2 Reasonable Constraints for Latch Conversion
2.3 Special Case of Linear Pipelines
2.4 Sufficiency of 3-Phase Timing
2.5 Optimality of 3-Phase Timing
Chapter 3: Converting Designs to 3-Phase Latch-Based Designs
3.1 Conversion Algorithms
3.1.1 3-Phase Post-Latch Approach
3.1.2 3-Phase Pre-Latch Approach
3.2 The Design Flow
3.3 Modified Retiming
3.3.1 Work-Around Retiming for Master-Slave Latch-Based Designs
3.3.2 Modified Retiming for 3-Phase Latch-Based Designs
3.4 Clock Gating (CG)
3.4.1 Review of Clock Gating Techniques
3.4.2 Look Ahead Clock Gating
3.4.3 Data-Driven Clock Gating (DDCG)
3.4.4 Special Flip-Flops and Reset
3.5 Experimental Results
3.6 Summary
Chapter 4: Latch-Based Retiming
4.1 Introduction
4.2 Review of Graph-Based Min-Area Retiming
4.3 Retiming 3-Phase Latch-Based Designs
4.3.1 The Problem Statement
4.3.2 Retiming in 3-Phase Latch-Based Designs
4.4 Retiming Considering Clock-Gating
4.4.1 Retiming 3-phase post latch designs
4.4.2 Retiming 3-phase pre latch designs
4.5 Experimental Results
4.6 Summary and Conclusions
Chapter 5: Summary
5.1 Summary
5.2 Conclusions and Possible Next Steps
Bibliography

List of Tables

3.1 Number of registers (FFs or latches) in the original flip-flop (FF), converted master-slave latch (M-S), and proposed 3-phase (3-P) latch-based designs with $p_2$ latches placed post- and pre-$p_1/p_3$ latches
3.2 Total area (μm²) in the original flip-flop (FF), converted master-slave latch (M-S), and proposed 3-phase (3-P) latch-based designs with $p_2$ latches placed post- and pre-$p_1/p_3$ latches
3.3 Power dissipation (mW) based on simulation of flip-flop (FF), master-slave latch (M-S), and 3-phase latch-based designs with $p_2$ latches placed post- and pre-$p_1/p_3$ latches. "Best" is the higher saving in either post- or pre-latch approach. We run the ISCAS designs on pseudo-random input streams, CEP designs on the open-source provided self-check programs, Plasma on "pi", RISC-V on "rv32ui-v-simple", and ARM-M0 on the "hello world" program.
3.4 Run-time comparison for FF-, master-slave, 3-phase pre-latch and 3-phase post-latch-based designs
4.1 The statistics of the # of enable nets for potential $p_2$ latches in the 3-phase latch-based designs
4.2 The statistics of the # of enable nets for potential $p_2$ latches in the 3-phase pre latch-based designs
4.3 Comparisons of the number of registers among the original FF, conventional master-slave latch (M-S), the original 3-phase post latch (O3 Post), the original 3-phase pre latch (O3 Pre), the GR 3-phase post latch (GR3 Post), and the GR 3-phase pre latch-based designs.
The savings (%) for GR3 Post latch-based designs are compared with the FF, M-S, and O3 Post designs, and the savings (%) for GR3 Pre latch-based designs are compared with the FF, M-S, and O3 Pre designs.
4.4 The statistics of enable nodes after new retiming on the GR 3-phase post latch (GR3 Post) and the GR 3-phase pre latch-based designs
4.5 Area decomposition (μm²) of flip-flop (FF), master-slave latch (M-S), original 3-phase post (O3Post), GR 3-phase post, original 3-phase pre (O3Pre) and GR 3-phase pre latch-based designs
4.6 Comparisons of area (μm²) among the original FF, conventional master-slave latch (M-S), the original 3-phase post latch (O3 Post), the original 3-phase pre latch (O3 Pre), the GR 3-phase post latch (GR3 Post), and the GR 3-phase pre latch-based designs. The savings (%) for GR3 Post latch-based designs are compared with the FF, M-S, and O3 Post designs, and the savings (%) for GR3 Pre latch-based designs are compared with the FF, M-S, and O3 Pre designs.
4.7 Power dissipation (mW) based on back-annotated gate-level simulation of flip-flop (FF), master-slave latch (M-S), original 3-phase post (O3Post), GR 3-phase post, original 3-phase pre (O3Pre) and GR 3-phase pre latch-based designs.
4.8 Comparisons of power dissipation (mW) among the original FF, conventional master-slave latch (M-S), the original 3-phase post latch (O3 Post), the original 3-phase pre latch (O3 Pre), the GR 3-phase post latch (GR3 Post), and the GR 3-phase pre latch-based designs. The savings (%) for GR3 Post latch-based designs are compared with the FF, M-S, and O3 Post designs, and the savings (%) for GR3 Pre latch-based designs are compared with the FF, M-S, and O3 Pre designs.
4.9 Power savings (%) for original 3-phase post (O3Post), GR 3-phase post, original 3-phase pre (O3Pre) and GR 3-phase pre latch-based designs, compared to flip-flop (FF) and master-slave latch (M-S) designs.
4.10 Area savings (%) in combinational logic and sequential elements w.r.t. flip-flop (FF) designs achieved by master-slave latch (M-S), original 3-phase post (O3Post), GR 3-phase post, original 3-phase pre (O3Pre) and GR 3-phase pre latch-based designs
4.11 Power savings (%) in combinational logic and sequential elements w.r.t. flip-flop (FF) designs achieved by master-slave latch (M-S), original 3-phase post (O3Post), GR 3-phase post, original 3-phase pre (O3Pre) and GR 3-phase pre latch-based designs
4.12 Run-time (sec) comparison for FF-, master-slave, the original 3-phase post latch, and GR 3-phase post latch-based designs
4.13 Run-time comparison for FF-, master-slave, the original 3-phase pre latch, and GR 3-phase pre latch-based designs

List of Figures

Figure 1.1 A typical latch
Figure 1.2 A typical flip-flop is composed of a pair of latches.
Figure 1.3 A tightly (a) and loosely (b) coupled pulsed latch
Figure 1.4 (a) Conversion from flip-flop to master-slave latches. clk_m represents the master clock and clk_s is the slave clock. (b) A 3-stage pipeline in FF-based designs. (c) The converted master-slave latch-based designs. (d) Retime slave latches to take advantage of the time borrowing
Figure 2.1 Clock phase and its local time zone
Figure 2.2 Hold and setup timing constraints
Figure 2.3 Converting a linear FF-based pipeline (a) to a 2-phase latch-based pipeline (b) and to a 3-phase latch-based pipeline (c)
Figure 2.4 Example non-linear pipeline requiring 4-phase clocking to support the minimum number of extra latches
Figure 3.1 An example of (a) a 3-stage FF-based pipeline with a feedback loop and (b) the converted 3-phase latch-based design
Figure 3.2 (a) An illustrative FF-based example. (b) Post-latch configuration requires at least two $p_2$ latches. (c) Pre-latch configuration needs only one $p_2$ latch. The positions at which latches are inserted are indicated by red arrows.
Figure 3.3 The design flow
Figure 3.4 Enabled (a) to gated clock (b) transformation
Figure 3.5 Duplicated clock gating logic for phase conversion
Figure 3.6 (a) A flip-flop based design. (b) Replace FFs with two FFs and retime to achieve a clock period of T/2. (c) Replace the retimed FFs with latches at clock period of T
Figure 3.7 The 3-phase clocks are mapped to clk and clkbar for modified retiming
Figure 3.8 (a) A flip-flop based design. (b) Replace each back-to-back latch group with two FFs and each single latch group with one FF. (c) Convert the retimed FF-based design to 3-phase latch-based design
Figure 3.9 (a) Clock gate $p_2$ latches whose leading latches share a common enable EN (3-phase post). (b) Timing diagram of EN (3-phase post). (c) Clock gate $p_2$ latches whose successor latches share a common enable EN (3-phase pre). (d) Timing diagram of EN (3-phase pre).
Figure 3.10 The original and modified clock gating cells (a-d) and the situations to apply these modifications in (e1) 3-phase post-latch and (e2) 3-phase pre-latch-based designs
Figure 3.11 (a) Simplified clock gating cells cause (b) hazards to FF-based designs. (c) Modified CG cells (M2) are safe in 3-phase latch-based designs. (d) EN stays stable after reaching CG and is thus hazard free.
Figure 3.12 (a) Single and (b) multi bit data-driven clock gating (DDCG)
Figure 3.13 Power dissipation (mW) of RISC-V and Arm-M0 running Dhrystone and Coremark with flip-flop, converted master-slave latch, and proposed 3-phase latch with different $p_2$ insertion positions (Post and Pre)
Figure 4.1 Sharing of fanout and fanin registers: (a) fanout register sharing, (b) fanin register sharing
Figure 4.2 An example of retiming
Figure 4.3 Example where 3-phase latch retiming has a larger search space than traditional retiming. (a) Traditional retiming cannot retime the $p_2$ latches through the first AND gate. (b) In 3-phase retiming, the min-cut is found at the output of the rightmost AND gate. (c) The retimed graph has fewer latches than the original design.
Figure 4.4 An illustrative circuit
Figure 4.5 Reduced retiming graph with added edges and nodes that represent the impact of clock-gating for the circuit shown in Fig. 4.4
Figure 4.6 An illustrative 3-phase pre circuit
Figure 4.7 Reduced retiming graph with added edges and nodes that represent the impact of clock-gating for the circuit shown in Fig. 4.6
Figure 4.8 Power dissipation (mW) of RISC-V and Arm-M0 running Dhrystone and Coremark in flip-flop (FF), converted master-slave latch (M-S), original 3-phase post latch (O3Post), the proposed graph-based retiming 3-phase post latch (GR3Post), original 3-phase pre latch (O3Pre), and the proposed graph-based retiming 3-phase pre latch (GR3Pre) based designs

Abstract

The growing use of portable/wireless electronic systems and Internet-of-Things (IoT) applications motivates the desire for small energy-efficient designs. However, with the ending of Moore's law, the market must adapt to a relatively fixed technology base, which will make improvements in area and energy efficiency more challenging. This thesis presents a novel way to reduce power consumption and increase energy efficiency by revisiting a decision made in the early days of Very Large Scale Integration (VLSI).

As VLSI design emerged, two devices, edge-triggered flip-flops (FFs) and level-sensitive latches, were identified as viable means of synchronization and state storage. Compared to FFs, latches have the advantages of time borrowing, skew and jitter tolerance, smaller cell area, and lower capacitance. Latch-based designs can thus consume lower power and area than FF-based designs, particularly when process variation is considered. However, the VLSI community gravitated to using FFs because they more easily support the synchronous paradigm captured in most RTL specifications. In fact, over the past four decades, the computer-aided-design industry focused on building sophisticated software tools that compiled these RTL specifications to a mixture of combinational logic and FFs, with limited support for latch-based alternatives. This thesis shows how these flows can be easily extended to yield latch-based designs that significantly reduce power consumption with no loss in performance and no increase in area.
The approach taken involves the introduction of a novel 3-phase clocking scheme that requires a remarkably small number of latches as storage elements to achieve the same performance as FF-based equivalents. To enable the use of this scheme for both new and legacy designs, this thesis develops a novel conversion algorithm that takes any traditional FF-based design and converts it to a more efficient 3-phase latch-based design. Moreover, our proposed flow for 3-phase latch-based designs is based on commercial synthesis and physical design tools supplemented with a few custom optimization functions, making the adoption of the proposed approach much easier. We performed extensive experiments to evaluate the approach, including synthesis and place-and-route in a modern technology node. Our resulting latch-based designs save an average of 21.0% and 21.4% power compared to more traditional FF- and master-slave-based alternatives, respectively, across a broad range of benchmarks that include three CPU designs.

More specifically, this thesis addresses the following topics:

1. Development of an automated conversion flow that converts any synchronous RTL specification with a single clock domain to a 3-phase latch-based design.
2. Development of optimization strategies for latch insertion, clock duty cycle, retiming, and clock-gating.
3. Implementation of placement, clock tree synthesis, and routing physical design flows for latch-based designs using commercial tools.
4. Evaluation of the benefits of the 3-phase latch-based designs, including their impact on the number of registers, area, and power dissipation.

Chapter 1
Introduction

The growing use of portable/wireless electronic systems and Internet-of-Things (IoT) applications motivates the desire for small energy-efficient designs.
With the high demand for portable operation, where mobile phones need to pack complex signal processing modules and units without consuming much power, VLSI chip designers must meet strict limits on power dissipation. In addition, reducing power consumption is important for minimizing the size and weight allocated to batteries and the cost of packaging and cooling. During the past five decades, the growth of VLSI has been predicted by Moore's law, an empirical observation that the number of transistors in a dense integrated circuit (IC) doubles about every two years. However, with the ending of Moore's law, the market must adapt to a relatively fixed technology base, which will make improvements in area and energy efficiency more challenging [1]. This thesis presents a novel way to reduce power consumption and increase energy efficiency by revisiting a decision made in the early days of Very Large Scale Integration.

The clock networks and sequential elements are the most power-consuming components of a digital VLSI chip [2]. Thus, reductions in clock and sequential power consumption have a deep impact on the total power consumption. One of two devices, edge-triggered flip-flops (FFs) or level-sensitive latches, is typically used for synchronization and state storage. Compared to FFs, latches have the advantages of time borrowing, skew and jitter tolerance, smaller cell area, and lower capacitance. Latch-based designs can thus consume lower power and area than FF-based designs [3-5], particularly when process variation is considered [6]. The following sections review the characteristics of FF- and latch-based designs and explain the challenges of adopting latch-based designs, before outlining the solutions proposed in this thesis.

1.1 FF-Based and Latch-Based Designs

Latches have a transparent state during which the output follows the input value; at the end of the transparent state the latch holds that value.
A typical latch is illustrated in Figure 1.1.

Figure 1.1: A typical latch

Latches are also called level-sensitive because their output follows their input as long as they are transparent, usually when their clock input is at a logic high level. The transparent state also functions as a time borrowing window, and this time borrowing characteristic gives latches relaxed timing constraints and more flexibility in position and size. For example, if data arrives at a latch before its closing edge, as early as when the latch opens, it propagates through immediately, which gives the downstream stages more time.

Figure 1.2: A typical flip-flop is composed of a pair of latches.

Flip-flops, on the other hand, store the input data value at the rising (or falling) edge of a clock signal and hold the value at all other times. Figure 1.2 shows a positive edge-triggered flip-flop in which two latches are connected in series and a clock signal clk is connected to the clock inputs of the latches, one directly and one through an inverter. When clk is low, the first latch (also called the master latch and labeled LTm in the figure) is transparent and its output follows the primary input D. When clk changes to 1, the master latch is disabled but the second latch, named the slave latch and labeled LTs in the figure, is transparent. At that time, the data at the output of the master latch is transferred through the slave latch to the output. The slave latch is transparent during the high period of clk, but its output changes only at the edge where clk changes from 0 to 1. This is because the master latch is disabled, keeping the input to the slave latch fixed. When clk is low, the output of the slave latch does not change, since the slave latch is disabled. Therefore, the output of a positive edge-triggered flip-flop captures the input data only at a rising edge of the clock and holds the data at other times.
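The difference between the two storage elements can be seen in a small behavioral model. The following is a minimal sketch, not taken from the thesis: the class names `Latch` and `FlipFlop`, the discrete time steps, and the stimulus vectors are all illustrative assumptions. It models a latch that is transparent while its clock is high and builds a positive edge-triggered flip-flop from a master latch (clocked by the inverted clock) and a slave latch, as described for Figure 1.2.

```python
class Latch:
    """Level-sensitive D latch (hypothetical model): transparent while clk == 1."""

    def __init__(self, q=0):
        self.q = q

    def step(self, clk, d):
        if clk:            # transparent: output follows input
            self.q = d
        return self.q      # opaque: output holds the last captured value


class FlipFlop:
    """Positive edge-triggered FF built from a master and a slave latch."""

    def __init__(self):
        self.master = Latch()
        self.slave = Latch()

    def step(self, clk, d):
        m = self.master.step(1 - clk, d)   # master is transparent while clk is low
        return self.slave.step(clk, m)     # slave is transparent while clk is high


# Illustrative stimulus: two clock cycles, with D changing while clk is high.
clk = [0, 0, 1, 1, 0, 0, 1, 1]
d   = [1, 1, 1, 0, 0, 0, 0, 1]

latch, ff = Latch(), FlipFlop()
latch_q = [latch.step(c, x) for c, x in zip(clk, d)]
ff_q = [ff.step(c, x) for c, x in zip(clk, d)]

print("latch:", latch_q)
print("ff:   ", ff_q)
```

In this sketch the latch output changes whenever D changes during the high phase, while the flip-flop output changes only at the steps where clk goes from 0 to 1.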
Since latches generally have smaller cell area and lower capacitance than FFs, latch-based designs can lead to lower power and area than FF-based designs. Latch-based designs also have the benefits of fewer glitches and relaxed timing constraints thanks to time borrowing [4], especially when process, voltage, and temperature (PVT) variations are considered. In a classical FF-based design, hold time must be ensured on all short paths in the fastest corner (super-threshold, fast process). However, it is often difficult to optimize paths in both very slow and very fast corners [5]. A latch-based design, particularly with non-overlapping clock phases, is therefore suitable for achieving wide supply range (WSR) without the penalty of buffer insertion and/or increased hold guard bands. Latch-based designs are also critical for architecturally-agnostic timing resilient designs [7, 8], which can remove unnecessary margins associated with PVT variations and make near-threshold computing more practical.

A basic challenge to adopting any form of latch-based design is that most RTL specifications are designed using edge-sensitive FFs. Approaches to automatically converting an FF-based design to a latch-based design are thus attractive.

1.2 Conventional Approaches to Latch Conversion

There are two popular conversion approaches that have been explored in the literature.

1.2.1 Pulsed Latch Approach

One approach is to adopt pulsed latches by simply replacing every flip-flop with a pulsed latch. As an intermediate between latches and edge-triggered flip-flops, a pulsed latch consists of a latch with a short-duration transparency window produced by a pulse generator [9, 10].

Figure 1.3: A tightly (a) and loosely (b) coupled pulsed latch

To minimize energy overhead, multi-bit pulsed-latch schemes have been proposed that share pulse generators among several latch cells [11].
Pulsed latches, however, must be used carefully because they are subject to hold problems and pulse width variations that are challenging to predict, control, and mitigate (see e.g., [12]). To avoid buffer insertion for hold fixing and the challenges of PVT variations, pulsed latch approaches are not considered in this thesis.

1.2.2 Master-Slave Latch Approach

In the second approach, every flip-flop is replaced with master-slave latches controlled by two clock phases [5], as illustrated in Figure 1.4, or with low-active latches and high-active latches that have the same clock phase [13]. In Figure 1.4(a), every flip-flop is replaced with a pair of back-to-back level-clocked latches. Figure 1.4(c) shows the resulting latch-based design converted from the FF-based design in Figure 1.4(b). The timing diagrams are shown at the bottom of each subfigure.

Optimization of two-phase latch-based designs has also been given some attention in the literature. For example, [4, 13] take advantage of time borrowing to boost performance and/or reduce area and power consumption. Figure 1.4(d) illustrates an optimized latch-based design, where latches of a specific phase are relocated to balance the delay of pipeline stages and thus benefit from time borrowing. The locations of the master latches are fixed at the locations of the original FFs to simplify formal equivalence checking and avoid challenges associated with changing the circuit's initial state [14]. [4] also explores the opportunity of using a mix of master-slave latches and FFs, by converting back-to-back connected latches back to flip-flops.

Figure 1.4: (a) Conversion from flip-flop to master-slave latches. clk_m represents the master clock and clk_s is the slave clock. (b) A 3-stage pipeline in FF-based designs. (c) The converted master-slave latch-based designs. (d) Retime slave latches to take advantage of the time borrowing
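The baseline master-slave replacement of Figure 1.4(a), before any retiming, can be sketched as a simple netlist rewrite. This is only an illustration under assumed conventions: the tuple layout `(name, d_net, q_net)`, the function name `to_master_slave`, and the `_m`, `_s`, and `_mid` naming are hypothetical and not part of any tool or of the thesis flow. Only the clock names clk_m and clk_s follow the figure caption.

```python
def to_master_slave(flip_flops):
    """Replace each FF with a back-to-back master/slave latch pair.

    flip_flops: list of (name, d_net, q_net) tuples (hypothetical netlist format).
    Returns a list of (latch_name, clock, d_net, q_net) tuples.
    """
    latches = []
    for name, d_net, q_net in flip_flops:
        mid = f"{name}_mid"                                  # new net between the two latches
        latches.append((f"{name}_m", "clk_m", d_net, mid))   # master stays at the FF position
        latches.append((f"{name}_s", "clk_s", mid, q_net))   # slave may later be retimed
    return latches


ffs = [("r0", "a", "b"), ("r1", "c", "d")]
converted = to_master_slave(ffs)
for latch in converted:
    print(latch)
print(len(ffs), "FFs ->", len(converted), "latches")
```

The sketch makes the cost of this approach explicit: before retiming, the number of sequential elements is exactly twice the number of original FFs.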
The advantages of converting to two-phase latch-based designs also carry over to asynchronous and resilient designs by using bundled-data asynchronous controllers to generate the two clock phases. The converted designs have been demonstrated to have similar (if not higher) performance, low power dissipation, robustness to PVT variations, and average-case performance [8, 15-18]. Moreover, retiming algorithms for timing-resilient latch-based designs have been developed to consider not only the number of latches required but also the impact of the amount of needed error-detecting logic [19].

Two-phase designs are inherently more robust than pulsed-latch designs but, as this thesis shows, can be overly restrictive compared to multi-phase latch-based designs [20].

1.3 Contributions of Thesis

This work provides an automated conversion flow that converts FF-based designs into robust multi-phase latch-based designs requiring fewer latches than two-phase designs with the same performance. In particular, an FF-based design is converted to a 3-phase latch-based design using a novel Integer Linear Program (ILP) that minimizes the number of required latches. The flow includes optimization strategies for latch insertion, clock duty cycle, retiming, and clock-gating. Post place-and-route results demonstrate that the resulting 3-phase latch-based designs can save an average of 21% and 21.4% power on a variety of ISCAS, CEP, and CPU benchmark circuits, compared to their more traditional FF- and master-slave-based alternatives, respectively.

Our contributions are:

• Generalize reasonable constraints for conversions of FF-based to latch-based designs. The constraints are designed to make the application (e.g., reset states, verification, and testing) of latch-based designs easier, and to ensure the same throughput as FF-based designs.
• Create an RTL-to-GDSII flow for 3-phase latch-based designs based on a commercial physical design flow.
The scheme also ensures that the performance of the resulting latch-based design is the same as that of the FF-based design.
• Design a variety of ILPs to minimize the number of latches required in the conversion and explore multiple design alternatives for latch insertion, clock duty cycle, retiming, and clock-gating.
• Evaluate the potential for savings in the number of latches, area, and power on a suite of benchmark circuits, including ISCAS89 benchmark circuits, CEP sub-modules, and three CPU designs: a 3-stage MIPS Open-Core CPU Plasma, a RISC-V Rocket Core, and an ARM-M0 core.

1.4 Organization

The remainder of the thesis is organized as follows. Chapter 2 describes the design constraints we adopt in our conversion algorithm and the area-performance trade-offs they represent. Chapter 3 presents our ILP-based conversion algorithm and describes an automated conversion design flow, including the optimization of latch insertion, clock duty cycle, retiming, and clock gating for latch-based designs. Chapter 4 takes advantage of a special property of 3-phase latch-based designs to propose an improved graph-based retiming algorithm, and extends this algorithm to consider both the number of latches and the impact of clock gating. Chapter 5 concludes the work, reviewing the thesis's contributions and describing possible future work.

Chapter 2
3-Phase Latch Timing

This chapter first reviews the timing model of multi-phase latch-based designs. Section 2.2 generalizes the minimal constraints for an FF-to-latch conversion. Based on these constraints, a 3-phase clock system is proved to be optimal for linear pipelines in Section 2.3. Lastly, this chapter discusses a case where the 3-phase clock is not optimal. We assume, without loss of generality, that all flip-flops are rising-edge triggered and latches are transparent when the clock is high.
Figure 2.1: Clock phase and its local time zone

In latch-based designs, each phase has two states in a clock cycle $T_c$: a transparent state of duration $T_i$, and an opaque state lasting the rest of the cycle, $T_c - T_i$. In Figure 2.1, for example, the opaque state starts at time 0, the transparent state starts at $T_c - T_i$, and the closing edge of the transparent state occurs at $T_c$. The transmission of data from input to output is only enabled in the transparent state, from $T_c - T_i$ to $T_c$. In the opaque state, from 0 to $T_c - T_i$, transitions are disabled and the outputs remain unchanged. A latch-based design with $k$ phases of clocking is called a $k$-phase latch-based design, with the phases ordered in a predefined sequence.

2.1 Review of Multi-Phase Latch Timing Model

The Sakallah, Mudge, and Olukotun (SMO) model [20] defines an optimal framework for multi-phase latch-based designs. It defines a $k$-phase clock as a collection of $k$ periodic signals with a common cycle time and associated timing constraints, called the General System Timing Constraints (GSTC). The phases ($p_1, p_2, \ldots, p_k$) are ordered in a global time reference: $e_{i-1} \le e_i$ and $e_k = T_c$, where $e_i$ is the closing time of phase $p_i$. $E_{ij}$ is the forward phase shift from phase $p_i$ to phase $p_j$, defined as

$$E_{ij} = \begin{cases} e_j - e_i, & i < j \\ T_c + e_j - e_i, & i \ge j \end{cases} \tag{2.1}$$

The D-type latches have three terminals: data input, data output, and clock input. Each latch $i$ is characterized by the following five parameters:

• $p_i$: clock phase used to control latch $i$.
• $S_i$: setup time of latch $i$ relative to the latching edge of $p_i$.
• $H_i$: hold time of latch $i$ relative to the latching edge of $p_i$.
• $\delta_i$, $\Delta_i$: minimal and maximal propagation delay of latch $i$.

An event is the interval during which a signal is switching between its old and new values, as illustrated in Figure 2.2.

Figure 2.2: Hold and setup timing constraints
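As a quick numerical illustration of Eq. (2.1), the forward phase shift can be computed directly from the closing times. The following is a minimal sketch under assumed values: the function name `phase_shift`, the cycle time of 9 time units, and the evenly spaced closing times are illustrative choices and are not taken from the thesis.

```python
def phase_shift(e, i, j, Tc):
    """Forward phase shift E_ij from phase p_i to phase p_j, following Eq. (2.1).

    e maps a phase index to its closing time e_i in the global time reference.
    """
    if i < j:
        return e[j] - e[i]
    return Tc + e[j] - e[i]          # i >= j: wrap around into the next cycle


# Assumed example: 3 phases with closing times 3, 6, and 9 in a cycle of Tc = 9.
Tc = 9
e = {1: 3, 2: 6, 3: 9}

shifts = {(i, j): phase_shift(e, i, j, Tc) for i in e for j in e}
for (i, j), value in sorted(shifts.items()):
    print(f"E_{i}{j} = {value}")
```

Note that the shift from a phase to itself is a full cycle, $E_{ii} = T_c$, which is the case that matters for a single latch with combinational feedback.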
The arrival of the signal at the data input of latch $i$ is described by $a_i$ and $A_i$, the early and late arrival times, respectively. Correspondingly, $d_i$ ($D_i$) represents the earliest (latest) signal departure time, i.e., the amount of time after the last $e_i$ at which the next data starts to propagate through latch $i$.

\[
d_i = \max(a_i,\; T_c - T_{p_i}), \qquad D_i = \max(A_i,\; T_c - T_{p_i})
\tag{2.2}
\]

For correct operation, the arriving data signal must satisfy the following hold and setup time requirements:

\[
\text{Hold: } a_i \ge H_i, \qquad \text{Setup: } A_i \le T_c - S_i
\tag{2.3}
\]

The propagation delay through the combinational logic between latches $i$ and $j$ is characterized by two values:
• $\delta_{ij}$: minimal propagation delay from the data output of latch $i$ through combinational logic to the data input of latch $j$; $\delta_{ij} = \infty$ when $i$ and $j$ are not directly connected by combinational logic.
• $\Delta_{ij}$: maximal propagation delay from the data output of latch $i$ through combinational logic to the data input of latch $j$; $\Delta_{ij} = -\infty$ when $i$ and $j$ are not directly connected by combinational logic.

According to the definition of combinational delays, the arrival times of an event can be defined by the following two equations:

\[
a_i = \min_j \left( d_j + \delta_j + \delta_{ji} - E_{p_j p_i} \right), \qquad
A_i = \max_j \left( D_j + \Delta_j + \Delta_{ji} - E_{p_j p_i} \right)
\tag{2.4}
\]

To conclude, the worst-case setup and hold constraints for each phase are defined as follows.

\[
\text{Hold: } H_i \le d_j + \delta_j + \delta_{ji} - E_{p_j p_i}, \qquad
\text{Setup: } T_c - S_i \ge D_j + \Delta_j + \Delta_{ji} - E_{p_j p_i}
\tag{2.5}
\]

2.2 Reasonable Constraints for Latch Conversion

We adopt two constraints that are designed to make the application of latch-based designs easier.
C1: the original position of all FFs must be latched;
C2: neighboring latches, connected by combinational logic, must not be simultaneously transparent.

Constraint C1 is designed to make logical equivalence checking between the latch-based and FF-based designs easier. In particular, we will convert every FF to a latch and only add extra latches where necessary to meet these constraints.
During logical equivalence checking, the fixed latches can be viewed as FFs and the extra latches can be treated as transparent. Ensuring latches are present at the same positions as the original FFs also guarantees the ability to reset the circuit to the same state [14].

Constraint C2 is designed to avoid min-delay problems. In particular, even with min-delay paths equal to 0 ($\delta_i = \delta_{ij} = 0$), the hold constraint is satisfied with zero hold times ($H_i = 0$) (see Footnote 1). This constraint is particularly important when considering an FF with combinational feedback. If no extra latch is added during conversion, the converted circuit would have a single latch $i$ with combinational feedback, which violates C2. This configuration is dangerous because the transparent phase of the latch must be shorter than the minimum delay of the combinational feedback $\delta_{ii}$ to avoid a hold violation. More precisely, the constraint can be formalized as:

\[
\delta_i + \delta_{ii} \ge H_i + T_p.
\]

The key point is that constraint C2 guarantees this configuration is not allowed. In particular, any solution that satisfies C2 will break such combinational feedback with at least two latches that have non-overlapping clocks.

A well-known but non-optimal solution is to convert every FF into two latches, a master and a slave latch, as in [4,15], and retime the slave latches. This master-slave approach satisfies both constraints C1 and C2 but at the cost of doubling the number of sequential elements. That is, before retiming, the number of extra latches added is exactly equal to the number of FFs.

Footnote 1: This follows because constraint C2 means $E_{p_j p_i} \le T_c - T_p$, and the signal can start to propagate through a latch only after it opens, $d_j \ge T_c - T_p$.

2.3 Special Case of Linear Pipelines

It is interesting to consider the special case of a linear pipeline because it has no FFs with combinational feedback that must be considered.
Such a pipeline is illustrated in Figure 2.3(a), and its cycle time $T_c$ is no shorter than $\Delta_1 + \Delta_{11} + S$, where $\Delta_1$ represents the FF's clk-to-q delay, $\Delta_{11}$ represents the longest data-path delay, and $S$ stands for the FF's setup time.

Such linear pipelines can be converted to a latch-based design without adding extra latches by clocking alternating pipeline stages with alternating phases of a two-phase non-overlapping clock, as illustrated in Figure 2.3(b). The problem with this solution is that if each combinational logic stage is critical, the time separation between each phase of the clock must be equal to the original cycle time, i.e., $E_{ij} = T_c$, where $T_c$ represents the original cycle time. Letting $T_c^{2P}$ denote the cycle time of the two-phase non-overlapping clocks and assuming $E_{12} = E_{21} = T_c$, Equation 2.1 implies $e_2 - e_1 = T_c^{2P} + e_1 - e_2 = T_c$ and thus $T_c^{2P} = 2T_c$. In other words, the frequency of the two-phase clocks must be half that of the original FF-based design.

This analysis highlights the fact that there is a trade-off between the number of extra latches added and the performance of the resulting circuit. To avoid this trivial solution in our formulation, we adopt a third constraint:
C3: the converted latch-based design after retiming must have the same throughput as the FF-based design, assuming the combinational logic is already critical.

Figure 2.3: Converting a linear FF-based pipeline (a) to a 2-phase latch-based pipeline (b) and to a 3-phase latch-based pipeline (c)

2.4 Sufficiency of 3-Phase Timing

The conversion problem is generalized as below. To convert any arbitrarily complex FF-based pipeline with a single clock domain to a latch-based pipeline, the following Constraints C1-C3 are required to be satisfied.
C1: the original position of all FFs must be latched;
C2: neighboring latches, connected by combinational logic, must not be simultaneously transparent;
C3: the converted latch-based design after retiming must have the same throughput as the FF-based design, assuming the combinational logic is already critical.

Theorem I: Any FF-based design with a single clock domain can be converted to a 3-phase latch-based design while satisfying C1-C3.

A well-known but non-optimal solution is to convert every FF into two latches, a master and a slave latch, and retime the slave latches. For 3-phase latch-based designs, the master latches can always be replaced with latches clocked by $p_3$, and the slave latches can be replaced with latches clocked by $p_2$. Thus, there is always a solution that converts any FF-based design to a 3-phase latch-based design. Note that the slave latches need to be retimed to meet the performance target.

Theorem II: Adding one latch every two stages in a linear pipeline is sufficient to meet all Constraints C1-C3.

Using a 3-phase clock system, we can achieve a latch-based design that meets all Constraints C1-C3 by adding exactly one extra latch stage for every other original pipeline stage, as illustrated in Figure 2.3(c). Notice that, as desired, after retiming this solution has the same throughput as the original pipeline, with phases $p_1$ and $p_3$ opening and closing their respective latches at the rising edge of the FF-based clock. We rely on time borrowing in the $p_3$ latches to properly capture near-critical combinational paths. The retimed $p_2$ latches inserted between the $p_3$ and $p_1$ latches prevent data latched by $p_3$ from violating the setup times of the subsequent latches.

2.5 Optimality of 3-Phase Timing

A natural question to ask is whether 3-phase clocking guarantees optimality in terms of the number of required extra latches in the resulting designs.
This section proves that it is optimal for linear pipelines but does not guarantee optimality for more general non-linear pipelines (see Footnote 2).

Footnote 2: Our conversion approach is a two-step approach in which we first restrict the insertion of extra latches to before/after existing latches and then retime the extra latches to ensure the desired performance. Our notion of optimality is focused on the first stage of the conversion algorithm and the minimum number of extra latches required before retiming. Even though retiming may change the number of latches in the final design, we argue that the optimality of the first step of the algorithm is still a reasonable notion of optimality.

Theorem III: To meet Constraints C1-C3, adding at least one latch stage between any 3 consecutive stages of a linear pipeline is necessary.

Proof by contradiction: Assume there exist three consecutive stages of a linear pipeline for which no extra latch stage is inserted within the combinational logic between stage 1 and stage 2 or between stage 2 and stage 3.

Let time 0 represent the rising edge of the stage 1 clock. According to Constraints C2 and C3, the stage 2 clock can only go high during the time window $(T_p,\; T_c - T_p)$ and must go low no later than $T_c$.

Case 1: Assume stage 1 data is valid at time 0. Since there is no latch between stage 1 and stage 2, the stage 2 clock captures data no earlier than $T_c$. The stage 2 clock should then be high during the time period $(T_c - T_p,\; T_c)$. According to Constraints C2 and C3, stage 3 can only go low during the period $(T_c + T_p,\; 2T_c - T_p)$. This means that stage 3 has to capture data before time $2T_c - T_p$. Because there is no extra latch inserted between stage 2 and stage 3, stage 3 must capture the data no earlier than $2T_c$. This, however, contradicts the fact that stage 3 must go low before $2T_c - T_p$.

Case 2: Assume the data leaves stage 1 at time $t$ ($0 < t \le T_p$). Then stage 2 needs to sample the data no earlier than time $t + T_c$.
This contradicts the fact that stage 2 goes low no later than $T_c$.

Next, we present Figure 2.4, which illustrates an example in which 4-phase clocking is needed to achieve an optimal latch configuration. In particular, Figure 2.4(a) illustrates an original FF-based design where the combinational connections are abstracted to wires for simplicity. The optimal 3-phase clocking solution requires at least four extra latches, labeled "2" in Figure 2.4(b). However, 4-phase clocking yields a latch-based design that requires only three extra latches (labeled "2" and "4" in Figure 2.4(c)). The $p_4$ latch that breaks two self-loops from $p_3$ latches requires the fourth phase to avoid hold-time violations from $p_2$. To the best of our knowledge, it is an open question as to whether there are optimal latch-based designs that require more than four clock phases.

Figure 2.4: Example non-linear pipeline requiring 4-phase clocking to support the minimum number of extra latches

Despite this example, the remainder of this thesis presents a conversion algorithm that produces three-phase latch-based designs. The algorithm is thus not guaranteed to be optimal because it does not support more than three clock phases. It is also not optimal because it considers the restrictive case of adding extra latches only directly after required latches. More specifically, we rely on retiming to position these extra latches within the combinational logic and satisfy constraints C1-C3. The separation of these two steps can lead to non-optimal results. Extending our algorithm to support four or more phases and additional latch locations is more complex and is left as future work.

Chapter 3
Converting Designs to 3-Phase Latch-Based Designs

Chapter 2 generalized three constraints for latch conversion and theoretically proved the optimality of a 3-phase clock system in linear pipelines.
This chapter designs an Integer Linear Programming (ILP) based automatic flow that converts any RTL design with a single clock domain to a 3-phase latch-based design. Latches tied to a particular clock phase ($p_2$) are optimized with retiming and clock gating techniques, which are introduced in Section 3.3 and Section 3.4. The benefits associated with latch conversion are evaluated in Section 3.5.

3.1 Conversion Algorithms

This section explores the 3-phase opportunities in more general non-linear pipelines. The phases ($p_1$, $p_2$, and $p_3$) are ordered in a global time reference where the rising edge of $p_1$ is the rising edge of the clock in the FF-based design. The original position of each FF is replaced with a latch controlled by either $p_1$ or $p_3$. To satisfy the conversion constraints in Section 2.2, paths from $p_1$ to $p_1$, $p_3$ to $p_3$, and $p_3$ to $p_1$ latches are broken by inserting a $p_2$ latch. Note that because considering all possible locations for inserting $p_2$ latches is computationally challenging, our solution is to initially constrain where latches are added, either after or before the sequential elements, and then subsequently retime the added latches through the combinational logic.

Our conversion approach is to automatically decompose the FFs into two groups: those that will be converted into a single latch and those that will be converted to back-to-back connected latches. The first group of FFs are converted to a single latch and assigned to clock phase $p_1$ or $p_3$. The remaining FFs are in the second group and converted to latches clocked by either $p_1$ or $p_3$. In addition, for each latch in the second group, a $p_2$ latch is inserted at one of two alternative positions. In our 3-phase post-latch designs, a $p_2$ latch is inserted at the output of each latch. In contrast, in our 3-phase pre-latch designs, a $p_2$ latch is inserted at the input of each latch.
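The path-breaking rule stated in this section (paths from $p_1$ to $p_1$, $p_3$ to $p_3$, and $p_3$ to $p_1$ need a $p_2$ latch, so only $p_1$ to $p_3$ paths are left unbroken) can be sketched as a small check. The helper names and the five-stage example pipeline are hypothetical and serve only to illustrate the rule; the sketch counts paths to break, not the latches that either ILP finally inserts.

```python
def needs_p2_latch(src_phase, dst_phase):
    """True if a path from a src_phase latch to a dst_phase latch must be
    broken by a p2 latch. Phases are given as 1 (p1) or 3 (p3); only
    p1 -> p3 paths may stay unbroken."""
    return not (src_phase == 1 and dst_phase == 3)


def count_paths_to_break(phases, edges):
    """Count the latch-to-latch paths that require a p2 latch."""
    return sum(needs_p2_latch(phases[u], phases[v]) for u, v in edges)


# Hypothetical 5-stage linear pipeline with alternating p1/p3 latches.
phases = {"s1": 1, "s2": 3, "s3": 1, "s4": 3, "s5": 1}
edges = [("s1", "s2"), ("s2", "s3"), ("s3", "s4"), ("s4", "s5")]

paths_to_break = count_paths_to_break(phases, edges)
```

For this assumed pipeline only the two $p_3$ to $p_1$ transitions need a $p_2$ latch, which matches the one-latch-every-two-stages pattern of Theorem II.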
It may be interesting to note that we also considered a third approach that considers both possible $p_2$ latch insertion points simultaneously and has the ILP choose the insertion point on a latch-by-latch basis. This approach, however, has three challenges and thus is not explored in further detail. First, it requires extra constraints so that no more than one $p_2$ latch is inserted on every path. Second, retiming (to be discussed in Section 3.3) will be restricted by the different initial positions of the $p_2$ latches. Third, the preferred clock duty cycles, as discussed in Section 3.4, for inserting $p_2$ latches before and after a latch are different. Using a single clock duty cycle for both options will hinder the optimization.

Our conversion algorithms are always able to find solutions for arbitrarily complex conversions. One trivial solution is to put all FFs into the second group, i.e., the back-to-back latch configuration. Note that considering latch insertion within the combinational logic may generate more latch savings. However, the run-time grows exponentially with the size of the circuit, and it would be extremely costly to search among all combinational and sequential gates. Our conversion algorithm achieves this flexibility by separating the step of inserting latches from the step of retiming them through the combinational logic. The separation may generate sub-optimal solutions but makes the run-time quite manageable.

3.1.1 3-Phase Post-Latch Approach

In this approach, the FFs in the single-latch group are clocked by phase $p_1$, and each latch in the back-to-back group has an additional latch clocked by $p_2$ inserted at its output. This means that, by construction, there is no direct data path from $p_3$ to $p_1$ latches. Min-delay related hold problems are avoided by allowing an FF to be assigned to phase $p_1$ and converted to a single latch only if none of its fanout FFs are also assigned to $p_1$.
The above conditions are captured in an ILP that minimizes the number of latches. Each FF is treated as a node and its fanout set $FO$ is the set of FFs that can be reached from the FF via only combinational logic. Every node $u$ has two binary parameters, $G_a(u)$ and $K(u)$. $G_a(u)$ determines which group of latches the node $u$ is assigned to: the back-to-back latch group when $G_a(u)$ is 1, or the single-latch group when $G_a(u)$ is 0. The subscript $a$ stands for inserting a $p_2$ latch after the node. $K(u)$ determines the clock phase of node $u$: 1 means $u$ is clocked by $p_1$ and 0 means $u$ is clocked by $p_3$. All inserted latches are driven by $p_2$. Our ILP automatically performs this assignment, minimizing the number of back-to-back latches, as follows:

\[
\text{Minimize} \sum_{u \in V \cup PI} G_a(u)
\]

Subject to: $\forall u \in V \cup PI$:

\[
G_a(u) =
\begin{cases}
1, & K(u) = 0 \\
1, & K(u) = 1 \wedge \exists v \in FO(u) : K(v) = 1 \\
0, & \text{otherwise}
\end{cases}
\qquad
K(u)
\begin{cases}
= 1, & \forall u \in PI \\
\in \{0, 1\}, & \forall u \in V
\end{cases}
\]

Here, $PI$ stands for the set of all primary input ports and the set $V$ contains all FFs in the circuit. To provide consistency at the interface of the design, we treat all primary input ports (PIs) as if they were clocked by $p_1$. To make the ILP compatible with Gurobi [21], we convert the conditional equations into inequalities:

\[
\text{Minimize} \sum_{u \in V} G_a(u)
\]

Subject to:

\[
\begin{cases}
G_a(u) + K(u) \ge 1, & \forall u \in V \\
G_a(u) \ge K(u) + K(v) - 1, & \forall u \in V,\ \forall v \in FO(u) \\
G_a(u) \ge K(v), & \forall u \in PI,\ \forall v \in FO(u)
\end{cases}
\qquad
\begin{cases}
G_a(u), K(u) \in \{0, 1\}, & \forall u \in V \\
K(u) = 1,\ G_a(u) \in \{0, 1\}, & \forall u \in PI
\end{cases}
\]

The first constraint implies that if $K(u)$ is 0, then $G_a(u) \ge 1$ must hold. In other words, a $p_3$ latch is always in the back-to-back latch group. The second constraint makes sure we insert a latch between consecutive $p_1$ latches by forcing $G_a(u) \ge 1$ when both $K(u)$ and $K(v)$ of any of its fanouts are 1. The first two constraints should be satisfied for all nodes $u \in PI \cup V$ but are not applied to the PIs.
This is because PIs must also be clocked by $p_1$, i.e., $K(u) = 1$, as ensured by the last constraint. By substituting $K(u) = 1$ for $u \in PI$, the first constraint is always satisfied, and the second constraint reduces to the third constraint. Note that flops converted to latches clocked by $p_3$ automatically require an extra latch, and thus the program tries to assign flops to $p_3$ only when necessary.

Because every FF is converted to a latch that is clocked by either $p_3$ or $p_1$, these constraints imply that any FF that has a fanout assigned to the same phase will be converted to back-to-back latches. Thus neighboring latches are assigned to different phases, as desired. In particular, FFs with self-loops (that are neighbors with themselves) must be mapped to back-to-back latches, and thus loops have at least two latches.

Note that the ILP for linear pipelines will generate the optimal solution that inserts one latch every two latch stages. In particular, the $p_2$ latches are inserted between the $p_3$ and $p_1$ latch stages, and no latches are inserted along paths from $p_1$ to $p_3$ latch stages.

As an example, Figure 3.1(a) is a 3-stage pipeline with a feedback loop fanning into the second stage. We label the three FFs as $n1$, $n2$, and $n3$, respectively. The ILP for this design can be written as:

\[
\text{Minimize } G_a(n1) + G_a(n2) + G_a(n3)
\]

Subject to:

\[
\begin{array}{lll}
\text{Stage 1:} & \text{Stage 2:} & \text{Stage 3:} \\
G_a(n1) + K(n1) \ge 1 & G_a(n2) + K(n2) \ge 1 & G_a(n3) + K(n3) \ge 1 \\
G_a(n1) \ge K(n1) + K(n2) - 1 & G_a(n2) \ge K(n2) + K(n3) - 1 & G_a(n2) \ge K(n2) + K(n2) - 1 \\
G_a(n1) \ge K(n2) & G_a(n2) \ge K(n3) & G_a(n2) \ge K(n2) \\
G_a(n1), K(n1) \in \{0, 1\} & G_a(n2), K(n2) \in \{0, 1\} & G_a(n3), K(n3) \in \{0, 1\}
\end{array}
\]

Figure 3.1: An example of (a) a 3-stage FF-based pipeline with a feedback loop and (b) the converted 3-phase latch-based design

An optimal solution is to assign $n1$ and $n3$ to the single-latch group and make $n2$ a pair of back-to-back connected latches clocked by $p_3$ and $p_2$, namely $G_a(n1) = 0$, $G_a(n2) = 1$, $G_a(n3) = 0$, $K(n1) = 1$, $K(n2) = 0$, $K(n3) = 1$.
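For a design this small, the linearized post-latch ILP can be checked by exhaustive enumeration instead of a solver. The sketch below is not the thesis's Gurobi formulation: it enforces only the two constraint families stated for nodes in $V$, omits primary inputs, and assumes the topology n1 → n2 → n3 with n3 feeding back into n2, since the figure itself is not reproduced here. All function and variable names are illustrative.

```python
from itertools import product

# Assumed topology (the figure is not available in this transcript):
# n1 -> n2 -> n3, with n3 feeding back into n2.
FANOUT = {"n1": ["n2"], "n2": ["n3"], "n3": ["n2"]}
NODES = list(FANOUT)


def feasible(G, K):
    """Check the linearized post-latch constraints for every node in V."""
    for u in NODES:
        if G[u] + K[u] < 1:  # a p3 latch (K = 0) needs a p2 latch after it
            return False
        for v in FANOUT[u]:
            if G[u] < K[u] + K[v] - 1:  # break consecutive p1 latches
                return False
    return True


def solve():
    """Enumerate all 0/1 assignments and keep those with minimal sum of G."""
    best_cost, best = None, []
    n = len(NODES)
    for bits in product((0, 1), repeat=2 * n):
        G = dict(zip(NODES, bits[:n]))
        K = dict(zip(NODES, bits[n:]))
        if not feasible(G, K):
            continue
        cost = sum(G.values())
        if best_cost is None or cost < best_cost:
            best_cost, best = cost, [(G, K)]
        elif cost == best_cost:
            best.append((G, K))
    return best_cost, best


best_cost, optima = solve()
```

Under these assumptions the enumeration finds a single optimum with one extra latch, which agrees with the assignment stated in the text. Enumeration is exponential in the number of FFs and is only meant as a sanity check on toy examples.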
The converted 3-phase latch-based design is drawn in Figure 3.1(b). The timing diagrams are drawn at the bottom of each sub-figure. The first instruction starts from the rising edge of Stage 1 and propagates to Stage 2, where it is captured by the falling edge of $p_3$. In the next cycle, instruction 1 goes through the inserted $p_2$ latch and is captured by a $p_1$ latch in Stage 3. In the same cycle, instruction 2 is initiated from the rising edge of $p_1$ in Stage 1, and so on.

3.1.2 3-Phase Pre-Latch Approach

Inserting $p_2$ latches before, rather than after, $p_1$ and $p_3$ latches sometimes generates additional latch savings. As an example, consider the FF-based circuit illustrated in Figure 3.2(a). Two $p_2$ latches are required for the post-latch configuration, as illustrated in Figure 3.2(b). In contrast, the pre-latch configuration requires only one $p_2$ latch, as illustrated in Figure 3.2(c).

Figure 3.2: (a) An illustrated FF-based example. (b) The post-latch configuration requires at least two $p_2$ latches. (c) The pre-latch configuration needs only one $p_2$ latch. The positions at which latches are inserted are indicated by red arrows.

The 3-phase pre-latch approach is implemented by a different ILP. In contrast to the ILP for 3-phase post-latch designs, this ILP replaces the binary parameter $G_a$ with $G_b$ to designate whether the FF should be in the back-to-back or single-latch group. The subscript $b$ is short for "before", meaning the latch insertion, if necessary, is before the original latch. If $G_b(v)$ is 1, the FF is in the back-to-back group and a $p_2$ latch is inserted before it. If $G_b(v)$ is 0, on the other hand, the FF is in the single-latch group and requires no extra $p_2$ latch.

Any FF that has a fanin assigned to the same phase will be converted to back-to-back latches. In particular, if a latch $v$ and its fanin $u$ are both clocked by $p_1$, a $p_2$ latch is inserted before $v$, i.e., $G_b(v) = 1$. To ensure $p_3$ latches are followed by an additional $p_2$ latch, all fanouts of $p_3$ latches are forced to be in the back-to-back group, i.e., $G_b(v) = 1$.
Our ILP that captures the above constraints while minimizing the number of inserted latches is as follows:

\[
\text{Minimize} \sum_{v \in V \cup PI} G_b(v)
\]

Subject to: $\forall e_{uv} \in E$:

\[
G_b(v) =
\begin{cases}
1, & K(u) = 0 \\
1, & K(u) = 1 \wedge K(v) = 1 \\
0, & \text{otherwise}
\end{cases}
\qquad
\begin{cases}
K(v) \in \{0, 1\}, & \forall v \in V \\
G_b(v) = 0,\ K(v) = 1, & \forall v \in PI
\end{cases}
\]

Similarly to the post-latch approach, we make the ILP compatible with Gurobi by converting the conditional equations to inequalities:

\[
\text{Minimize} \sum_{v \in V} G_b(v)
\]

Subject to: $\forall e_{uv} \in E$:

\[
\begin{cases}
G_b(v) \ge 1 - K(u) \\
G_b(v) \ge K(u) + K(v) - 1
\end{cases}
\qquad
\begin{cases}
G_b(v), K(v) \in \{0, 1\}, & \forall v \in V \\
K(v) = 1,\ G_b(v) = 0, & \forall v \in PI
\end{cases}
\]

The first constraint implies that if $K(u)$ is 0 then $G_b(v) \ge 1$, which means a $p_2$ latch is inserted before each node that is in the fanout of a $p_3$ latch. The second constraint forces an extra latch at the fanin of $v$ to break consecutive $p_1$ latch stages. We assume all PIs are clocked by $p_1$ and thus require $K(v) = 1$ for all PIs. Moreover, since no latches can be inserted before a primary input, we also set $G_b(v) = 0$ for all PIs. Thus $\sum_{v \in PI} G_b(v) = 0$, and the objective function is simplified to minimizing $\sum_{v \in V} G_b(v)$.

As with the post-latch ILP, for linear pipelines the new ILP will also generate the optimal solution that inserts one latch every two latch stages. In particular, the $p_2$ latches are inserted before the $p_1$ latch stages, and no latches are inserted along paths from $p_1$ to $p_3$ latch stages.

Notice that both the 3-phase post- and pre-latch approaches capture two constraints: one that every $p_3$ latch is followed by a $p_2$ latch, and one that a $p_2$ latch is inserted to break consecutive $p_1$ latch stages. If we assume the fraction of primary inputs is negligible, and let $|E|$ and $|V|$ represent the number of edges and nodes in the graph, we can easily analyze their relative ILP complexity. For the first constraint, the 3-phase post-latch ILP checks each node's phase and the number of constraints is $|V|$.
In contrast, the 3-phase pre-latch ILP walks through all edges and the number of constraints is $|E|$. The second constraint is the edge-related constraint in both approaches. The total number of constraints in the 3-phase pre-latch ILP is thus $2|E|$, compared to $|E| + |V|$ in the 3-phase post-latch ILP. As a result, the runtime of solving the pre-latch ILP is expected to be longer than that of solving the post-latch ILP. Indeed, this analysis is consistent with the actual ILP run-time differences reported in Section 3.5.

3.2 The Design Flow

The ILP described in the last section is the core step in a design flow that supports FF-based to 3-phase latch-based design conversion. Figure 3.3 describes the design flow from RTL to physical design.

Figure 3.3: The design flow

The first step of our design flow is to run standard synchronous synthesis on the given FF-based RTL design. Here, we take care to enable clock gating to minimize the number of FFs with self-loops, which would otherwise unduly constrain the optimization problem. To be specific, the gated clock, shown in Figure 3.4(b), is set to be the preferred clock gating style, as compared to the enabled clocks illustrated in Figure 3.4(a).

Figure 3.4: Enabled (a) to gated clock (b) transformation

Using Python and TCL scripts that interface a leading commercial logic synthesis tool to the Gurobi Integer Linear Program solver [21], we then take the resulting FF-based design, identify the connections between FFs, and formulate one of the ILPs described in Subsections 3.1.1 or 3.1.2. We run the chosen ILP and, using the results, create the equivalent latch-based design by defining the three-phase clocks and connecting them to their associated latches.

Figure 3.5: Duplicated clock gating logic for phase conversion

For each latch that is clock gated, we trace the clock signal back through the clock gating logic and replace the clock with $p_1$ or $p_3$.
In the case of latches belonging to the same clock gating register bank but driven by different clock phases, the clock gating logic is duplicated and connected to the two clock phases separately, as shown in Figure 3.5.

We then retime the newly added latches, as described in Section 3.3. Before the physical design step of running place-and-route, we also apply two clock gating techniques to the newly added latches, as discussed in Section 3.4. Both the converted latch-based designs and the original RTL designs are run through the testbench in simulation.

3.3 Modified Retiming

Retiming [13] is an optimization algorithm that moves sequential elements across combinational logic without changing the functionality of the circuit. To retime 3-phase latch-based designs, we re-position the added latches within the combinational logic, minimizing area while satisfying all latch constraints. We restrict our retiming to fixed-phase retiming, where latches clocked by $p_1$ and $p_3$ are fixed (thus the name fixed-phase) and only latches clocked by $p_2$ are relocated.

Unfortunately, many commercial tools have limited support for retiming latches. They do, however, have well-optimized support for the retiming of FFs. Using this fact, [22] proposed to retime latches by mapping the problem to an FF-based retiming problem. This section shows how this approach can be adapted to 3-phase latch-based retiming. A more sophisticated graph-based retiming approach for 3-phase designs is proposed in Chapter 4.

3.3.1 Work-Around Retiming for Master-Slave Latch-Based Designs

Given a synthesized design with clock period $T_c$, the authors of [22] replace each FF with two FFs and retime the entire design with a faster clock constraint of half the original period ($T_c/2$). After splitting the combinational logic, the FFs are converted into alternating transparent-low and transparent-high latches.

Figure 3.6: (a) A flip-flop based design. (b) Replace FFs with two FFs and retime to achieve a clock period of T/2. (c) Replace the retimed FFs with latches at clock period of T

Figure 3.6 describes the work-around retiming for master-slave latch-based designs. In Figure 3.6(a), there are two stages of FFs, colored in blue and green, respectively. Each FF is replaced with two FFs of the same color. During retiming, some combinational logic is moved through the FFs to meet the faster clock constraint ($clk_f$). As a result, the combinational logic belonging to the same stage is split into two parts. In Figure 3.6(b), FFs and combinational logic of the same color belong to the same stage. To generate retimed master-slave designs, Figure 3.6(c) replaces all FFs with latches driven alternately by the $clk_m$ and $clk_s$ signals, which are essentially clk and inverted clk in the original FF-based designs.

3.3.2 Modified Retiming for 3-Phase Latch-Based Designs

Figure 3.7: The 3-phase clocks are mapped to clk and clkbar for modified retiming

In this chapter, instead of halving the cycle, we keep the cycle time unchanged but use back-to-back FFs, where the first FF is controlled by clk and the second is clocked by inverted clk (clkbar), as illustrated in Figure 3.7. The group that is converted to a single latch is replaced with a single FF, also controlled by clk. The 3-phase clocks are mapped to clk and clkbar as shown in Figure 3.8. Phases $p_1$ and $p_3$ are mapped to clk and $p_2$ is tied to clkbar. We then retime the circuit, only allowing FFs tied to clkbar to move. This splits the combinational logic in the pipeline stages that require an extra latch into two parts, with each part being able to operate at twice the frequency (cycle time $T_c/2$).

Figure 3.8: (a) A flip-flop based design. (b) Replace the back-to-back latch group with two FFs and the single-latch group with one FF. (c) Convert the retimed FF-based design to a 3-phase latch-based design
After the relocation of FFs clocked by clkbar, all FFs can be converted back to latches with their designated 3-phase assignments. Further optimization is then triggered to optimize the sizes of gates in the retimed latch-based design.

Figure 3.8 describes modified retiming for a 3-phase latch-based design with two stages. In the first stage, the FF is in the single-latch group tied to $p_1$. During retiming, this FF is replaced with a single FF connected to clk. There is no clkbar-controlled FF in this stage, thus the combinational logic remains the same, highlighted in blue in Figure 3.8(b). The FF in the second stage, on the other hand, is assigned to the back-to-back latch group and tied to $p_3$ and $p_2$, respectively. During retiming, this FF is first converted to back-to-back FFs controlled by clk and clkbar. Then some part of the combinational logic is moved through the FFs that are clocked by clkbar. Figure 3.8(c) converts the FFs back to latches with their designated 3-phase assignments.

Notice that the optimal 3-phase solution in Fig. 2.4(b) can be achieved by retiming the 3-phase post-latch-based design, and the same number of latches can also be obtained using the 3-phase pre-latch approach but with different $p_2$ latch positions.

3.4 Clock Gating (CG)

Power dissipation is classified into two types: dynamic and static. Static power is caused by the leakage current of transistors, whereas dynamic power results from the switching activity of transistors. Dynamic power can be calculated as

\[
P_{dynamic} = \alpha \, C_L \, V_{dd}^2 \, f_{clk}
\tag{3.1}
\]

where $\alpha$ represents the switching activity, $C_L$ is the load capacitance, and $V_{dd}$ and $f_{clk}$ denote the supply voltage and clock frequency, respectively.

The clock nets operate at the highest switching frequency and drive a large capacitance, and thus consume a great amount of dynamic power. According to Equation 3.1, the dynamic power can be reduced by reducing the switching activity. A clock gate shuts off a branch of the clock tree when it is not needed and can thus save dynamic power. In this section, we apply clock gating to the newly added $p_2$ latches. For a fair comparison, no extra clock gating is applied to latches clocked by $p_1$ or $p_3$.

3.4.1 Review of Clock Gating Techniques

According to [23], the clock network normally accounts for a significant part of the power dissipation. In real designs, for example, the clock tree of the DEC Alpha 21064 consumes 40% of the chip power [24], and that of the Motorola MCORE micro-RISC processor accounts for 36% of the total power [25]. Considering that clock signals are not always needed, clock gating techniques save power by masking off (i.e., gating) the clock when circuits are idle.

Clock gating techniques have been explored at all levels: system architecture, block design, logic design, and gate-level design. Considering both logical and physical information can save more power. The authors of [26,27] built activation functions to exploit high-level information and generated clock tree routing constraints or topology structures. Given that our conversion algorithm starts from gate-level FF-based designs, we focus on gate-level clock gating only.

Gate-level clock gating starts with a netlist that is initially ungated or only partially gated; some registers are then selected for clock gating. An FF can be disabled in the next cycle if its current output and its present data input are the same. Based on this idea, two gating architectures are proposed in [28]: double-gated clock gating (DG-CG) and negative clocked CMOS based clock gating (NC$^2$MOS-CG). In DG-CG, the master and the slave latches are gated separately by comparing their D and Q. In contrast, NC$^2$MOS-CG cells use a gated clock signal to control both the master and the slave latches.

Clock gating does not come for free. Extra logic is required to generate the clock enabling signals, resulting in extra area and power.
Grouping several FFs to share a common enable signal reduces the overhead, but at the risk of lowering the gating effectiveness, since the clock will be enabled whenever any FF in the group is enabled. The work in [29] proposed a probabilistic model of the clock gating network to find the optimal fan-out of a clock gate in this trade-off. Another way to reduce the overhead is to maximize the use of internal combinational nodes that already exist in the circuit. In [30], the smallest possible gating function is explored by factored form matching, so that internal nodes can be directly used as enable signals and extra logic is only required for the functions that do not match.

Figure 3.9: (a) Clock gate $p_2$ latches whose leading latches share a common enable EN (3-phase post). (b) Timing diagram of EN (3-phase post). (c) Clock gate $p_2$ latches whose successor latches share a common enable EN (3-phase pre). (d) Timing diagram of EN (3-phase pre).

3.4.2 Look Ahead Clock Gating

Our first means of clock gating $p_2$ latches depends on where the latches are inserted.

For 3-phase post-latch designs, we gate those retimed $p_2$ latches whose upstream latches are also gated. In particular, we identify every individual $p_2$ latch whose fan-in latches share a common enable signal EN. We can then clock gate the $p_2$ latch using a separate CG cell that is also controlled by EN, which we refer to as a $p_2$ CG cell, highlighted in green in Fig. 3.9(a). Notice that in 3-phase post-latch designs, each $p_2$ latch is initially placed at the output of a $p_1$ or $p_3$ latch. Even though it is later retimed through combinational logic, its upstream latches often still have a common enable signal, referred to as EN and highlighted in blue in Fig. 3.9(b). EN is guaranteed to be stable when the upstream $p_1$ and/or $p_3$ latches open, so it can be used to control a $p_2$ CG cell, as highlighted in green.
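The selection rule for 3-phase post-latch designs can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the netlist is reduced to two hypothetical dictionaries (`fanin_latches` and `enable_of`), and all latch and signal names are made up for the example.

```python
def common_enable(latches, enable_of):
    """Return the enable shared by all given latches, or None if there is none."""
    enables = {enable_of.get(latch) for latch in latches}
    if len(enables) == 1:
        return next(iter(enables))  # may itself be None (all latches ungated)
    return None


def select_gated_p2(fanin_latches, enable_of):
    """Map each p2 latch that can be gated to the enable of its p2 CG cell."""
    gated = {}
    for p2_latch, upstream in fanin_latches.items():
        enable = common_enable(upstream, enable_of)
        if enable is not None:
            gated[p2_latch] = enable
    return gated


# Hypothetical netlist fragment: upstream p1/p3 latches and their enables.
enable_of = {"l1": "EN_a", "l2": "EN_a", "l3": "EN_b", "l4": None}
fanin_latches = {
    "p2_x": ["l1", "l2"],  # both upstream latches gated by EN_a -> gate with EN_a
    "p2_y": ["l1", "l3"],  # mixed enables -> left ungated
    "p2_z": ["l4"],        # upstream latch is not gated -> left ungated
}

gated_p2 = select_gated_p2(fanin_latches, enable_of)
print(gated_p2)
```

A $p_2$ latch that is left ungated by this rule is still a candidate for the data-driven clock gating of Section 3.4.3.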
In 3-phase pre-latch designs, we clock gate the subset of retimed $p_2$ latches whose downstream $p_1$/$p_3$ latches all share a common enable signal EN, as illustrated in Fig. 3.9(c). This difference is motivated by the fact that, in contrast to 3-phase post-latch designs, $p_2$ latches in pre-latch designs are inserted at the fan-ins of $p_1$/$p_3$ latches before being retimed. They are thus less likely to be clock-gated with a common enable signal from upstream latches. In particular, because of the difference in the initial insertion point, we have observed that they are more likely to drive latches that have the same enable domain. Therefore, we gate those $p_2$ latches using the common enable from downstream latches. Unlike in 3-phase post-latch designs, to ensure that the $p_2$ EN signal in 3-phase pre-latch designs is stable when the $p_2$ clock rises, we must add extra maximum delay constraints. This is illustrated in orange in Fig. 3.9(d). Notice that we increase the duty cycle of $p_1$ to 50% in order to relax the maximum delay requirements for EN paths. Further tuning the clock duty cycles might lead to further gains, but this optimization is left as future work.

Figure 3.10: The original and modified clock gating cells (a-d) and the situations to apply these modifications in (e1) 3-phase post-latch and (e2) 3-phase pre-latch-based designs

A typical CG cell, Fig. 3.10(a) for example, is composed of a latch, an inverter generating the inverted clock, and an AND gate gating the clock. While this conventional cell is generally applicable, we apply three modifications that are specific to three-phase designs to reduce overhead. The situations in which each modification applies are illustrated in Fig. 3.10(e1) for 3-phase post-latch and Fig. 3.10(e2) for 3-phase pre-latch designs.

Our first modification (M1) is to remove the inverter and use $p_3$ rather than the inverted $p_2$ in a $p_2$ CG cell in 3-phase post-latch designs. As illustrated in Fig. 3.10(b), a new pin $p_3$ is introduced to the $p_2$ CG cell, and the clock pin (CLK) is specifically tied to $p_2$. Interestingly, the timing constraints associated with the CG cell are easily met. Recall that the enable signal EN that clock gates the $p_2$ latch also clock gates its upstream latches. As highlighted in blue in Fig. 3.9(b), the EN signal is guaranteed to be stable when the upstream latches open. This means the EN that controls a $p_2$ CG cell, highlighted in green, becomes valid before the rising edge of $p_1$ and is thus safe to latch using $p_3$ (assuming only a small gap, if any, between $p_1$ rising and $p_3$ falling). The output of the latch in the CG cell, labeled ENLT in Fig. 3.10(b), is thus stable from the falling edge of $p_3$ until the next rising edge of $p_3$. This guarantees the stability of ENCLK during the high period of $p_2$. Note that Fig. 3.9(b) shows the EN paths starting from $p_2$ latches, but paths starting from $p_1$ latches can be analyzed similarly. Note also that there is no direct path from a $p_3$ latch to a CG cell because every $p_3$ latch is in the back-to-back latch group, as labeled "No Path" in Fig. 3.10(e1).

Our second modification (M2) is to selectively remove the latch in the CG cell that controls $p_1$ or $p_3$ latches, as illustrated in Fig. 3.10(c). The latch in the CG cell for traditional FF-based designs is important because it prevents glitches on the clock signal. Without the latch, as illustrated in Figure 3.11, a glitch in EN propagates to the clock pin of the downstream registers in FF-based designs. In 3-phase latch-based designs, on the other hand, the latch in the CG cell is redundant when its EN has no start point latched by the same phase as the CG cell, as listed in Fig. 3.10(e1). Consider a CG cell that controls a $p_1$ latch. If its EN is driven through paths starting from $p_2$ [footnote 1], then upon the opening of the $p_1$ latch, all $p_2$ latches have already closed. This means that EN will stabilize before the $p_1$ latches open and, once stabilized, will stay stable until the next rising edge of $p_3$. In other words, EN is guaranteed to stay stable when $p_1$ is high, and hazards are naturally avoided. In contrast, if a $p_1$ latch has a path to the EN that controls a $p_1$ CG cell, then the latch in the CG cell should not be removed because of the potential hazards when the $p_1$ latch is transparent. M2 is also applied to $p_2$ CG cells in 3-phase pre-latch designs, but with a maximum delay constraint that guarantees EN is stable before $p_2$ rises.

[Footnote 1: EN could not have paths starting from $p_3$ because all $p_3$ paths are captured by $p_2$.]

Figure 3.11: (a) Simplified clock gating cells cause (b) hazards in FF-based designs. (c) Modified CG cells (M2) are safe in 3-phase latch-based designs. (d) EN stays stable after reaching CG and is thus hazard free.

Our third modification (M3) removes the inverter and uses $p_2$ to control the latch in a $p_1$/$p_3$ CG cell, as illustrated in Fig. 3.10(e2). This modification is particularly useful because, if the original CG cell were used to clock gate $p_1$ where EN is driven by $p_3$, it would need to be preceded by a $p_2$ latch to avoid data mismatch. Integrating the $p_2$ latch into the optimized M3 CG cell removes this need, significantly reducing overhead. In particular, EN is captured by the $p_2$ latch closing and ENLT stays stable when $p_2$ is opaque, i.e., when $p_1$ and $p_3$ are high.

3.4.3 Data-Driven Clock Gating (DDCG)

Our second means of clock gating $p_2$ latches is data-driven clock gating (DDCG) [28]. In DDCG, the clock signal is gated when D and Q are the same and enabled only when D differs from Q. It saves dynamic power when the data pin has low switching activity [28]. In the structure of a single-bit DDCG, as shown in Figure 3.12(a), every individual latch requires an XOR and an AND gate.
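The behavior of a single-bit DDCG structure can be sketched with a cycle-level model. This is a simplified illustration under the assumption of one data sample per clock cycle; the function name and the input stream are hypothetical and it does not model latch transparency or timing.

```python
def ddcg_single_bit(data_stream, q_init=0):
    """Cycle-level model of one data-driven clock-gated storage element.

    Returns the stored value after every cycle and the number of clock
    pulses that actually reached the storage element.
    """
    q = q_init
    clock_pulses = 0
    outputs = []
    for d in data_stream:
        enable = d ^ q       # XOR comparator: 1 only when D differs from Q
        if enable:           # AND gate passes the clock pulse only when enabled
            clock_pulses += 1
            q = d
        outputs.append(q)
    return outputs, clock_pulses


# Hypothetical low-activity data stream: only two value changes in eight cycles.
data = [0, 0, 1, 1, 1, 0, 0, 0]
outputs, pulses = ddcg_single_bit(data)
print(outputs)
print(f"{pulses} of {len(data)} clock pulses delivered")
```

The stored values match the ungated behavior, while the clock pin toggles only in the cycles where the data changes. The cost is one comparator and one gating element per bit.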
To reduce this overhead, the multi-bit DDCG structure groups several latches to be driven by the same gated clock signal, generated by combining the comparison signals from the individual latches, as illustrated in Figure 3.12(b). Since the clock signal $p_2$ drives one AND gate to control a group of latches, the clock tree can consume less power. However, the gating efficiency may be reduced because a switch in one latch enables the whole group. Therefore, it is beneficial to group latches whose switching activities are low and highly correlated.

Figure 3.12: (a) Single- and (b) multi-bit data-driven clock gating (DDCG)

In this research, after gating based on common upstream enables, I used multi-bit DDCG to gate the remaining ungated $p_2$ latches. Latches are first grouped according to their toggle rates. DDCG is then added to those groups whose data pins have low switching activities (less than 1% of the clock frequency). Note that the optimal fanout of a CG cell depends on the average toggling statistics of the individual registers as well as the process technology and cell library. In this work, I chose 32 as the maximum CG fanout, and therefore groups with more latches were divided among multiple multi-bit DDCG structures.

Finally, note that in some libraries, FFs or latches are physically merged into a single multi-bit cell so that the clock signals are shared among multiple storage elements. Coupling multi-bit registers with multi-bit clock gating may yield more power savings [31]. This work is based on a 28-nm FDSOI CMOS cell library, and no multi-bit registers are found in this library. Thus, this coupling is not considered in my research.

3.4.4 Special Flip-Flops and Reset

Though each FF is treated as a node, it may have multiple input/output ports. For example, some FFs have a reset signal and some have both Q and QN (the inverted version of Q) ports. While identifying the connections, all these ports are taken into consideration.
After solving the ILP, $p_2$ latches are inserted to latch all related ports. For example, in 3-phase post-latch designs, two latches are inserted at the output ports Q and QN, while in 3-phase pre-latch designs, two latches are inserted at the input ports D and reset.

The first means of clock gating 3-phase post-latch designs is to clock-gate a $p_2$ latch by the common EN of its upstream latches. Consider the case in which the upstream latches have an active-high reset pin. When they are in the reset state, the $p_2$ latches should be transparent to properly propagate the reset value. We thus clock-gate $p_2$ latches by ORing the common EN with the common reset of their upstream latches.

3.5 Experimental Results

This section quantifies the benefits of the proposed conversion algorithm by comparing the resulting 3-phase post-latch-based and 3-phase pre-latch-based designs to the original FF-based as well as traditional master-slave latch-based designs. To produce master-slave designs, we follow some of the same steps as an open-source flow [32], involving synthesis, latch conversion [footnote 2], and retiming. The experiments rely on an industrial 28-nm FDSOI CMOS cell library and a range of circuits that includes ISCAS89 benchmark circuits [33], CEP submodules [34], and three CPU designs: a 3-stage MIPS Open Core Plasma [35], a RISC-V Rocket Core [36], and an ARM-M0 core [37]. The ISCAS89 benchmarks were run at 1 GHz. The other circuits were run at 500 MHz, except for RISC-V and ARM-M0, which operate at 333.3 MHz. We validated both master-slave and 3-phase latch-based circuits by streaming inputs to the FF-based and latch-based designs and comparing the output streams. For ISCAS designs, we used auto-generated pseudo-random input streams. For CEP and CPU designs, we used the testbenches provided with the open-source designs.
In particular, Plasma was simulated using the "pi" program, ARM-M0 using "hello world", RISC-V using "rv32ui-v-simple", and the CEP designs using the self-check programs provided with the open-source designs. These gate-level simulations were also used to determine the signal activity that drove data-driven clock gating and to measure the relative power consumption of our approach. All experiments were run on two Intel Xeon E5-2450 v2 CPUs with 128 GB of RAM.

[Footnote 2: Master latches inherit any existing clock-gating logic, while slave latches are inserted at the fanout of the master latches.]

Note that for a fair comparison, all variants of each design are run at the same frequency. With library-provided integrated clock gating cells, the tool decides the best clock gating technique for FF- and master-slave latch-based designs. The modified retiming strategy and relevant clock gating strategies described in Section 3.4 are also performed on the latch-based designs. Notice that the flow for master-slave designs [32] involves an extra synthesis step to restrict the flip-flop usage to one type of FF. This extra synthesis step, however, sometimes increases the power consumption of the combinational logic. In particular, in our original experiments [38], the master-slave MD5 consumes significantly higher power than its FF counterpart. In this work, we used an incremental compilation option for the MD5 benchmark to avoid this problem.

Table 3.1 summarizes the number of registers (FFs/latches) in the original FF-based, conventional master-slave latch-based, 3-phase post-latch-based, and 3-phase pre-latch-based designs. The savings in the number of registers are the percentage reductions in the number of latches in 3-phase designs compared to twice the number of FFs in FF-based designs and compared to the number of latches in master-slave designs.
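The two savings definitions can be checked with a short calculation. The sketch below applies them to the register counts of three ISCAS rows of Table 3.1 (s1196, s1423, and s1488); the function and variable names are illustrative only.

```python
def saving_percent(baseline, converted):
    """Percentage reduction of `converted` relative to `baseline`."""
    return 100.0 * (baseline - converted) / baseline


# (FF, master-slave, 3-phase post, 3-phase pre) register counts from Table 3.1.
register_counts = {
    "s1196": (18, 36, 26, 30),
    "s1423": (81, 158, 146, 154),
    "s1488": (6, 16, 12, 18),
}

savings = {}
for design, (ff, ms, post, pre) in register_counts.items():
    savings[design] = {
        "post_vs_2ff": round(saving_percent(2 * ff, post), 1),
        "post_vs_ms": round(saving_percent(ms, post), 1),
        "pre_vs_2ff": round(saving_percent(2 * ff, pre), 1),
        "pre_vs_ms": round(saving_percent(ms, pre), 1),
    }
    print(design, savings[design])
```

A negative value, as for the pre-latch variant of s1488, means the converted design uses more latches than the baseline it is compared against.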
Table 3.1: Number of registers (FFs or latches) in the original flip-flop (FF), converted master-slave latch (M-S), and proposed 3-phase (3-P) latch-based designs with $p_2$ latches placed post- and pre- $p_1$/$p_3$ latches

| Suite | Design | FF | M-S | 3-P Post | 3-P Pre | Post save vs 2*FF (%) | Post save vs M-S (%) | Pre save vs 2*FF (%) | Pre save vs M-S (%) |
|---|---|---|---|---|---|---|---|---|---|
| ISCAS | s1196 | 18 | 36 | 26 | 30 | 27.8 | 27.8 | 16.7 | 16.7 |
| ISCAS | s1238 | 18 | 36 | 26 | 30 | 27.8 | 27.8 | 16.7 | 16.7 |
| ISCAS | s1423 | 81 | 158 | 146 | 154 | 9.9 | 7.6 | 4.9 | 2.5 |
| ISCAS | s1488 | 6 | 16 | 12 | 18 | 0 | 25 | -50.0 | -12.5 |
| ISCAS | s5378 | 163 | 317 | 250 | 303 | 23.3 | 21.1 | 7.1 | 4.4 |
| ISCAS | s9234 | 140 | 278 | 225 | 219 | 19.6 | 19.1 | 21.8 | 21.2 |
| ISCAS | s13207 | 457 | 890 | 725 | 979 | 20.7 | 18.5 | -7.1 | -10.0 |
| ISCAS | s15850 | 454 | 904 | 747 | 776 | 17.7 | 17.4 | 14.5 | 14.2 |
| ISCAS | s35932 | 1728 | 3456 | 2737 | 2861 | 20.8 | 20.8 | 17.2 | 17.2 |
| ISCAS | s38417 | 1489 | 2751 | 2366 | 2563 | 20.6 | 14 | 13.9 | 6.8 |
| ISCAS | s38584 | 1319 | 2633 | 2422 | 2456 | 8.2 | 8 | 6.9 | 6.7 |
| ISCAS | Average | 534 | 1043 | 880 | 944.5 | 17.8 | 18.8 | 5.7 | 7.6 |
| CEP | AES | 9715 | 16829 | 12871 | 13122 | 33.8 | 23.5 | 32.5 | 22.0 |
| CEP | DES3 | 436 | 842 | 573 | 574 | 34.3 | 31.9 | 34.2 | 31.8 |
| CEP | SHA256 | 1574 | 3308 | 2523 | 2490 | 19.9 | 23.7 | 20.9 | 24.7 |
| CEP | MD5 | 804 | 1780 | 996 | 1007 | 38.1 | 44.0 | 37.4 | 43.4 |
| CEP | Average | 3132 | 5690 | 4241 | 4298 | 31.5 | 30.8 | 31.2 | 30.5 |
| CPU | Plasma | 1606 | 2357 | 2078 | 2134 | 35.3 | 11.8 | 33.6 | 9.5 |
| CPU | RISCV | 2795 | 5312 | 4084 | 3714 | 26.9 | 23.1 | 33.6 | 30.1 |
| CPU | ArmM0 | 1397 | 2713 | 2290 | 1834 | 18 | 15.6 | 34.4 | 32.4 |
| CPU | Average | 1933 | 3461 | 2817 | 2560.7 | 26.8 | 16.8 | 33.8 | 24.0 |
| All | Average | 1344 | 2479 | 1950 | 1959 | 22.4 | 21.2 | 16.1 | 15.4 |

Compared to the 3-phase post-latch designs, where the number of latches is reduced by an average of 22.4% and 21.2%, the 3-phase pre-latch results show relatively smaller savings, i.e., an average of 16.1% and 15.4%, respectively. This is because many of the FFs in our designs have an extra reset pin but only one output pin. The 3-phase pre-latch approach thus inserts two $p_2$ latches, one at D and one at reset, whereas 3-phase post-latch designs require only one $p_2$ latch at the output for every FF in the back-to-back latch group. These extra latches are often, but not always, merged during retiming.
Notice that both 3-phase algorithms have the least benefit on ISCAS89 circuits and, in particular, no benefit on s1488, compared to FF-based designs. According to [39], s1488 is re-synthesized from a controller, which may suggest that our algorithms bring limited benefits to control-dominated designs that have a predominance of FFs with combinational feedback.

Table 3.2: Total area ($\mu m^2$) in the original flip-flop (FF), converted master-slave latch (M-S), and proposed 3-phase (3-P) latch-based designs with $p_2$ latches placed post- and pre- $p_1$/$p_3$ latches

| Suite | Design | FF | M-S | 3-P Post | 3-P Pre | Post save vs FF (%) | Post save vs M-S (%) | Pre save vs FF (%) | Pre save vs M-S (%) |
|---|---|---|---|---|---|---|---|---|---|
| ISCAS | s1196 | 240 | 228 | 219 | 230 | 9 | 4.2 | 4.2 | -0.9 |
| ISCAS | s1238 | 238 | 229 | 215 | 229 | 9.7 | 6.1 | 3.7 | -0.1 |
| ISCAS | s1423 | 591 | 466 | 524 | 573 | 11.5 | -12.4 | 3.0 | -23.1 |
| ISCAS | s1488 | 217 | 232 | 239 | 245 | -10.2 | -3.1 | -13.0 | -5.7 |
| ISCAS | s5378 | 1164 | 930 | 914 | 1068 | 21.4 | 1.7 | 8.2 | -14.9 |
| ISCAS | s9234 | 902 | 752 | 741 | 728 | 17.8 | 1.5 | 19.3 | 3.1 |
| ISCAS | s13207 | 2675 | 2058 | 2056 | 2455 | 23.1 | 0.1 | 8.2 | -19.3 |
| ISCAS | s15850 | 2885 | 2565 | 2315 | 2468 | 19.7 | 9.7 | 14.4 | 3.8 |
| ISCAS | s35932 | 11770 | 9356 | 9054 | 8816 | 23.1 | 3.2 | 25.1 | 5.8 |
| ISCAS | s38417 | 9395 | 7272 | 7863 | 8123 | 16.3 | -8.1 | 13.5 | -11.7 |
| ISCAS | s38584 | 9355 | 7683 | 7961 | 7775 | 14.9 | -3.6 | 16.9 | -1.2 |
| ISCAS | Average | 3585 | 2888 | 2918 | 2974 | 14.2 | -0.1 | 9.4 | -5.8 |
| CEP | AES | 133115 | 121960 | 119174 | 121886 | 10.5 | 2.3 | 8.4 | 0.1 |
| CEP | DES3 | 2711 | 2738 | 2449 | 2369 | 9.7 | 10.6 | 12.6 | 13.5 |
| CEP | SHA256 | 9996 | 9461 | 8594 | 8370 | 14 | 9.2 | 16.3 | 11.5 |
| CEP | MD5 | 7023 | 6823 | 6947 | 5587 | 1.1 | -1.8 | 20.5 | 18.1 |
| CEP | Average | 38212 | 35246 | 34291 | 34553 | 8.8 | 5.1 | 14.4 | 10.8 |
| CPU | Plasma | 8944 | 7546 | 8029 | 7680 | 10.2 | -6.4 | 14.1 | -1.8 |
| CPU | RISCV | 14453 | 15268 | 14002 | 13416 | 3.1 | 8.3 | 7.2 | 12.1 |
| CPU | ArmM0 | 10690 | 11007 | 11514 | 10427 | -7.7 | -4.6 | 2.5 | 5.3 |
| CPU | Average | 11362 | 11274 | 11182 | 10508 | 1.9 | -0.90 | 7.92 | 5.21 |
| All | Average | 12576 | 11476 | 11267 | 11247 | 11 | 0.94 | 10.28 | -0.29 |

Table 3.2 summarizes the total area ($\mu m^2$) in the original FF-based, conventional master-slave latch-based, 3-phase post-latch-based, and 3-phase pre-latch-based designs.
In terms of the total area, 3-phase post-latch designs show average improvements of 11% and 0.9%, whereas 3-phase pre-latch designs have area savings of 10.3% and -0.3%, compared to FF and master-slave designs. The overall area averages are across individual benchmarks without weighting by benchmark size. The CPU benchmarks show a relatively low area saving compared to FF-based designs. This is because converting FF- to latch-based designs sometimes increases the combinational logic area depending on the results of retiming. In particular, compared to 3-phase designs, master-slave designs have more regularly placed latches, which facilitates retiming, possibly resulting in smaller logic area. The degree of logic area increase is clock frequency dependent, and re-running these experiments at lower frequencies reduces this impact.

Table 3.3: Power dissipation (mW) based on simulation of flip-flop (FF), master-slave latch (M-S), and 3-phase latch-based designs with $p_2$ latches placed post- and pre- $p_1$/$p_3$ latches. "Best" is the higher saving in either post- or pre-latch approach. We run the ISCAS designs on pseudo-random input streams, CEP designs on the open-source provided self-check programs, Plasma on "pi", RISC-V on "rv32ui-v-simple", and ARM-M0 on the "hello world" program.
| Suite | Design | FF Clock | FF Seq | FF Comb | FF Total | M-S Clock | M-S Seq | M-S Comb | M-S Total | Post Clock | Post Seq | Post Comb | Post Total | Pre Clock | Pre Seq | Pre Comb | Pre Total | Post save vs FF (%) | Post save vs M-S (%) | Pre save vs FF (%) | Pre save vs M-S (%) | Best save vs FF (%) | Best save vs M-S (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ISCAS | s1196 | 0.08 | 0.04 | 0.18 | 0.3 | 0.09 | 0.04 | 0.18 | 0.32 | 0.07 | 0.03 | 0.18 | 0.28 | 0.06 | 0.03 | 0.18 | 0.27 | 7.12 | 11.06 | 11.33 | 16.88 | 11.33 | 16.88 |
| ISCAS | s1238 | 0.08 | 0.04 | 0.17 | 0.29 | 0.1 | 0.04 | 0.18 | 0.32 | 0.07 | 0.03 | 0.17 | 0.27 | 0.06 | 0.03 | 0.17 | 0.26 | 6.48 | 14.19 | 11.69 | 19.97 | 11.69 | 19.97 |
| ISCAS | s1423 | 0.56 | 0.08 | 0.17 | 0.82 | 0.42 | 0.08 | 0.12 | 0.63 | 0.5 | 0.11 | 0.15 | 0.75 | 0.47 | 0.10 | 0.15 | 0.71 | 8.21 | -19.62 | 12.93 | -13.33 | 12.93 | -13.33 |
| ISCAS | s1488 | 0.03 | 0.01 | 0.13 | 0.17 | 0.04 | 0.02 | 0.13 | 0.19 | 0.03 | 0.01 | 0.12 | 0.17 | 0.03 | 0.01 | 0.12 | 0.16 | -0.06 | 10.61 | 4.18 | 14.26 | 4.18 | 14.26 |
| ISCAS | s5378 | 0.82 | 0.25 | 0.37 | 1.44 | 0.84 | 0.25 | 0.24 | 1.34 | 0.59 | 0.28 | 0.26 | 1.13 | 0.55 | 0.20 | 0.27 | 1.02 | 21.75 | 15.61 | 29.03 | 23.73 | 29.03 | 23.73 |
| ISCAS | s9234 | 0.69 | 0.1 | 0.1 | 0.89 | 0.62 | 0.11 | 0.05 | 0.78 | 0.55 | 0.1 | 0.08 | 0.73 | 0.52 | 0.09 | 0.06 | 0.67 | 17.72 | 6.73 | 24.61 | 13.97 | 24.61 | 13.97 |
| ISCAS | s13207 | 2.04 | 0.43 | 0.42 | 2.89 | 1.98 | 0.5 | 0.2 | 2.69 | 1.53 | 0.46 | 0.22 | 2.21 | 1.79 | 0.45 | 0.22 | 2.46 | 23.67 | 17.87 | 14.95 | 8.62 | 23.67 | 17.87 |
| ISCAS | s15850 | 2.13 | 0.31 | 0.53 | 2.98 | 2.14 | 0.3 | 0.44 | 2.87 | 1.81 | 0.3 | 0.35 | 2.47 | 1.78 | 0.31 | 0.34 | 2.43 | 17.10 | 14.1 | 18.52 | 15.40 | 18.52 | 15.40 |
| ISCAS | s35932 | 11.5 | 2.7 | 4.32 | 18.5 | 10.6 | 3.01 | 3.11 | 16.8 | 8.12 | 2.83 | 3.06 | 14 | 7.39 | 2.89 | 2.72 | 13.00 | 24.32 | 16.67 | 29.73 | 22.62 | 29.73 | 22.62 |
| ISCAS | s38417 | 6.34 | 0.88 | 2.05 | 9.26 | 6.27 | 0.96 | 1.4 | 8.62 | 4.81 | 0.96 | 1.47 | 7.24 | 4.82 | 0.90 | 1.44 | 7.16 | 21.83 | 16.03 | 22.71 | 16.97 | 22.71 | 16.97 |
| ISCAS | s38584 | 7.11 | 2.5 | 4.88 | 14.5 | 7.04 | 2.68 | 3.54 | 13.3 | 7.31 | 3.02 | 3.4 | 13.7 | 7.28 | 2.84 | 3.49 | 13.60 | 5.52 | -3.01 | 6.21 | -2.26 | 6.21 | -2.26 |
| ISCAS | Average | 2.85 | 0.67 | 1.21 | 4.73 | 2.74 | 0.73 | 0.87 | 4.35 | 2.31 | 0.74 | 0.86 | 3.9 | 2.25 | 0.71 | 0.83 | 3.79 | 13.97 | 9.11 | 16.90 | 12.44 | 17.69 | 13.28 |
| CEP | AES | 18.8 | 0.05 | 0.2 | 19.1 | 14.3 | 0.06 | 0.17 | 14.5 | 7.94 | 0.06 | 0.26 | 8.27 | 8.31 | 0.05 | 0.22 | 8.59 | 56.72 | 42.99 | 55.03 | 40.77 | 56.72 | 42.99 |
| CEP | DES3 | 0.26 | 0.14 | 0.51 | 0.91 | 0.21 | 0.12 | 0.41 | 0.74 | 0.2 | 0.1 | 0.41 | 0.72 | 0.20 | 0.11 | 0.41 | 0.71 | 21.42 | 3.18 | 21.46 | 3.42 | 21.46 | 3.42 |
| CEP | SHA256 | 0.13 | 0.05 | 0.13 | 0.31 | 0.27 | 0.06 | 0.09 | 0.42 | 0.13 | 0.05 | 0.13 | 0.3 | 0.11 | 0.05 | 0.13 | 0.29 | 0.82 | 27.21 | 7.87 | 32.00 | 7.87 | 32.00 |
| CEP | MD5 | 0.11 | 0.02 | 0.28 | 0.4 | 0.35 | 0.03 | 0.18 | 0.56 | 0.09 | 0.02 | 0.25 | 0.36 | 0.11 | 0.02 | 0.27 | 0.39 | 9.96 | 35.36 | 2.27 | 29.81 | 9.96 | 35.36 |
| CEP | Average | 4.82 | 0.06 | 0.28 | 5.18 | 3.78 | 0.07 | 0.21 | 4.05 | 2.09 | 0.06 | 0.26 | 2.41 | 2.18 | 0.06 | 0.26 | 2.50 | 22.23 | 27.18 | 21.66 | 26.50 | 24.00 | 28.44 |
| CPU | Plasma | 0.59 | 0.44 | 0.65 | 1.68 | 0.99 | 0.19 | 0.45 | 1.63 | 0.64 | 0.17 | 0.54 | 1.36 | 0.74 | 0.14 | 0.46 | 1.35 | 19.03 | 16.54 | 19.70 | 17.24 | 19.70 | 17.24 |
| CPU | RISC-V | 0.52 | 0.11 | 0.37 | 1.01 | 0.87 | 0.07 | 0.3 | 1.25 | 0.54 | 0.07 | 0.3 | 0.92 | 0.59 | 0.07 | 0.30 | 0.97 | 8.99 | 26.63 | 4.10 | 22.51 | 8.99 | 26.63 |
| CPU | ARM-M0 | 0.54 | 0.31 | 1.14 | 2 | 1.23 | 0.23 | 1.34 | 2.9 | 0.5 | 0.11 | 1.22 | 1.84 | 0.59 | 0.15 | 1.00 | 1.74 | 7.92 | 36.56 | 12.95 | 39.97 | 12.95 | 39.97 |
| CPU | Average | 0.55 | 0.29 | 0.72 | 1.56 | 1.03 | 0.16 | 0.7 | 1.92 | 0.56 | 0.12 | 0.69 | 1.37 | 0.64 | 0.12 | 0.59 | 1.35 | 11.98 | 26.58 | 12.25 | 26.57 | 13.88 | 27.94 |
| All | Average | 2.91 | 0.47 | 0.92 | 4.3 | 2.69 | 0.49 | 0.70 | 3.88 | 1.97 | 0.49 | 0.7 | 3.15 | 1.97 | 0.47 | 0.66 | 3.10 | 15.47 | 16.04 | 17.18 | 17.92 | 18.46 | 19.09 |

Table 3.3 reports the power dissipation based on back-annotated gate-level simulation. The total power is composed of three power groups, namely the clock network (Clock), register/sequential logic (Seq), and combinational logic (Comb). The 3-phase post-latch designs achieve an overall average of 15.5% and 16% power reductions compared to FF and master-slave designs. Note that the average power saving is calculated by taking the average across the benchmarks without weighting by their size. The combinational power savings drop from 15% to -4% when the comparison changes from FF to master-slave designs. This can be explained by the fact that latch-based designs allow time borrowing, which relaxes the normal FF edge-to-edge timing requirements, and often have less glitching than their FF-based counterparts.

Interestingly, the 3-phase pre-latch designs show slightly higher power savings, i.e., 17.2% and 17.9% compared to FF and master-slave designs. This represents an additional 1.7% total power savings on top of the 3-phase post-latch designs, and the savings are mostly contributed by the lower combinational power (3.9% lower than the combinational power in 3-phase post-latch designs). This suggests that our 3-phase pre-latch algorithm, although it requires more sequential elements than our 3-phase post-latch approach, adds more flexibility during retiming.

The rightmost two columns in Table 3.3 reflect the savings possible if both three-phase strategies were evaluated and the best of the two were chosen on a benchmark-by-benchmark basis. In particular, they report that an average of 18.5% and 19.1% power savings can be achieved compared to FF and master-slave designs.

Figure 3.13: Power dissipation (mW) of RISC-V and Arm-M0 running Dhrystone and Coremark with flip-flop, converted master-slave latch, and proposed 3-phase latch with different $p_2$ insertion positions (Post and Pre)

Coremark and Dhrystone are two standard benchmarks traditionally used to measure general processor (CPU) performance. For this reason, we also tested our generated placed-and-routed CPUs (RISC-V and Arm-M0) on both of these benchmarks. Fig. 3.13 shows the decomposition of power consumption in the middle and the total power consumption on the top of each bar. The 3-phase post-latch algorithm saves power by an average of 15.6% and 21.2% in RISC-V and 8.3% and 20.1% in Arm-M0 compared to FF and master-slave designs, respectively. Note that the obtained power savings are similar to those obtained using our simple CPU testbench programs, i.e., 9.0% and 26.6% for RISC-V and 7.9% and 36.6% for Arm-M0 compared to FF and master-slave designs. Similar savings are also seen in the 3-phase pre-latch designs, showing an average of 11.4% and 17.4% in RISC-V and 12.7% and 23.8% in Arm-M0 compared to FF and master-slave designs, respectively.
Table 3.4: Run-time comparison for FF-, master-slave, 3-phase pre-latch and 3-phase post-latch-based designs

| Suite | Design | FF Syn | FF PnR | FF Total | M-S Syn | M-S PnR | M-S Total | Post ILP | Post Syn | Post PnR | Post Total | Pre ILP | Pre Syn | Pre PnR | Pre Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ISCAS | s1196 | 0.4 m | 1.7 m | 2.1 m | 3.8 m | 2.8 m | 6.6 m | 4 s | 1.8 m | 2.8 m | 4.7 m | 4 s | 3.9 m | 6.9 m | 19.8 m |
| ISCAS | s1238 | 0.4 m | 1.5 m | 1.9 m | 3.8 m | 2.9 m | 6.7 m | 4 s | 1.8 m | 2.8 m | 4.6 m | 4 s | 3.9 m | 3.2 m | 7.4 m |
| ISCAS | s1423 | 0.4 m | 1.7 m | 2.1 m | 3.9 m | 3.6 m | 7.5 m | 4 s | 2.3 m | 4.0 m | 6.3 m | 4 s | 5.1 m | 8.2 m | 14.0 m |
| ISCAS | s1488 | 0.3 m | 1.5 m | 1.8 m | 3.9 m | 2.8 m | 6.8 m | 3 s | 1.6 m | 3.0 m | 4.7 m | 4 s | 3.9 m | 7.2 m | 11.5 m |
| ISCAS | s5378 | 2.9 m | 2.1 m | 4.9 m | 4.0 m | 3.7 m | 7.7 m | 3 s | 2.8 m | 4.2 m | 7.1 m | 4 s | 15.9 m | 5.3 m | 13.8 m |
| ISCAS | s9234 | 0.4 m | 2.2 m | 2.6 m | 3.8 m | 3.9 m | 7.7 m | 4 s | 2.4 m | 4.1 m | 6.6 m | 4 s | 12.9 m | 4.0 m | 6.8 m |
| ISCAS | s13207 | 0.5 m | 3.2 m | 3.7 m | 4.3 m | 5.9 m | 10.2 m | 4 s | 3.3 m | 6.3 m | 9.6 m | 5 s | 19.5 m | 8.7 m | 17.6 m |
| ISCAS | s15850 | 0.5 m | 3.5 m | 4.0 m | 4.8 m | 8.2 m | 13.0 m | 5 s | 4.2 m | 7.0 m | 11.3 m | 6 s | 23.2 m | 11.6 m | 37.3 m |
| ISCAS | s35932 | 1.2 m | 9.3 m | 10.4 m | 4.0 m | 19.0 m | 22.9 m | 4 s | 8.9 m | 17.9 m | 26.9 m | 5 s | 38.9 m | 29.5 m | 1.1 hr |
| ISCAS | s38417 | 1.0 m | 8.6 m | 9.5 m | 6.4 m | 15.8 m | 22.2 m | 5 s | 13.4 m | 17.8 m | 31.3 m | 8 s | 29.0 m | 42.1 m | 3.5 hr |
| ISCAS | s38584 | 1.0 m | 9.4 m | 10.4 m | 7.2 m | 18.9 m | 26.1 m | 4 s | 12.3 m | 20.0 m | 32.3 m | 6 s | 25.4 m | 24.5 m | 1.4 hr |
| ISCAS | Average | 0.8 m | 4.0 m | 4.9 m | 4.5 m | 7.9 m | 12.5 m | 4 s | 5.0 m | 8.2 m | 13.2 m | 5 s | 16.5 m | 13.7 m | 44.2 m |
| CEP | AES | 14.0 m | 3.0 hr | 3.3 hr | 1.8 hr | 4.8 hr | 6.6 hr | 8 s | 4.4 hr | 5.4 hr | 9.8 hr | 13 s | 2.6 hr | 5.4 hr | 8.3 hr |
| CEP | DES3 | 0.8 m | 4.3 m | 5.1 m | 5.1 m | 8.1 m | 13.2 m | 4 s | 5.3 m | 8.6 m | 13.9 m | 6 s | 5.4 m | 10.5 m | 18.3 m |
| CEP | SHA256 | 0.9 m | 10.2 m | 11.1 m | 9.5 m | 17.8 m | 27.3 m | 7 s | 16.6 m | 18.4 m | 35.1 m | 9 s | 10.2 m | 17.2 m | 34.7 m |
| CEP | MD5 | 2.0 m | 10.0 m | 12.0 m | 11.0 m | 16.7 m | 27.7 m | 8 s | 10.2 m | 25.1 m | 35.4 m | 12 s | 9.4 m | 1.9 hr | 2.1 hr |
| CEP | Average | 4.4 m | 51.9 m | 0.9 hr | 32.8 m | 83.0 m | 115.8 m | 7 s | 1.2 hr | 1.6 hr | 2.8 hr | 10 s | 44.9 m | 1.9 hr | 2.8 hr |
| CPU | Plasma | 3.7 m | 10.3 m | 13.9 m | 8.9 m | 20.4 m | 29.3 m | 16 s | 6.3 m | 21.8 m | 28.3 m | 35 s | 12.1 m | 18.6 m | 30.7 m |
| CPU | RISCV | 4.9 m | 16.4 m | 21.3 m | 19.6 m | 36.0 m | 55.7 m | 14 s | 57.3 m | 35.6 m | 1.6 hr | 22 s | 40.6 m | 34.4 m | 1.2 hr |
| CPU | ArmM0 | 5.1 m | 13.5 m | 18.6 m | 11.7 m | 27.2 m | 38.9 m | 27 s | 26.7 m | 52.9 m | 1.3 hr | 62 s | 27.8 m | 1.3 hr | 2.1 hr |
| CPU | Average | 4.5 m | 13.4 m | 17.9 m | 13.4 m | 27.9 m | 41.3 m | 19 s | 30.1 m | 36.7 m | 1.1 hr | 40 s | 26.9 m | 43.0 m | 1.3 hr |
| All | Average | 2.2 m | 16.2 m | 18.5 m | 12.3 m | 27.9 m | 40.2 m | 7 s | 24.6 m | 31.9 m | 56.6 m | 12 s | 24.5 m | 41.3 m | 1.3 hr |

The run-time comparison of all designs is reported in Table 3.4. The design flows for the 3-phase post-latch and pre-latch designs, including ILP, conversion, retiming, clock gating, and place-and-route, require an average of 2X and 3X longer run-times compared to FF designs. They require an average of 41% and 92% longer run-times compared to master-slave designs. Most of the increase in run-time occurs in place-and-route. In particular, the ILP solver consumes at most 27 seconds in 3-phase post-latch designs and 62 seconds in 3-phase pre-latch designs among all the benchmarks and is generally a tiny fraction (less than 1%) of the overall run-time. In contrast, with three clocks to be routed, the 3-phase designs generally take three times longer in clock tree synthesis and 35% longer in routing compared to FF-based designs. Nevertheless, these run-time increases are quite manageable, suggesting that our proposed approach is computationally practical for at least moderately-sized blocks.

In summary, our experiments suggest that while significant savings in area and power are possible with our proposed approach, the amount of savings is variable and likely depends on a combination of factors including 1) the percentage of FFs with combinational feedback, which limits the savings in the number of latches, and 2) the impact of retiming latch-based designs on the combinational logic.

3.6 Summary

This chapter presents algorithms to automatically convert a FF-based design into a 3-phase latch-based design that uses an ILP to minimize the number of required latches.
The post-PnR results on a broad range of benchmark circuits show that significant savings are possible in both area and power, with practical computational run-times, particularly for pipelined circuits such as multi-stage CPUs, when compared to both FF and master-slave latch-based designs.

Chapter 4

Latch-Based Retiming

Chapter 3 showed how traditional FF-based retiming can be modified to naively support 3-phase designs. This chapter first reviews the traditional graph-based retiming approach, then proposes modifying the graph to exploit the flexibility in 3-phase retiming and to consider the benefit of clock gating.

4.1 Introduction

Retiming is an optimization algorithm that moves sequential elements through combinational logic without changing the structure of the combinational logic. Limited support of latch-based retiming in commercial tools motivated the work in [38] to adopt the method in [22], which maps latch-based designs to FF-based designs for retiming. However, this method fails to consider the time-borrowing nature of latches, which relaxes the FF's edge-to-edge timing requirement, and thus may over-constrain latch-based designs.

Alternatively, the work in [40,41] formulates the retiming problem with a focus on minimizing the clock period or minimizing the number of sequential elements. The Leiserson-Saxe method can be applied to both FF- and latch-based designs and has also been extended to handle several variants, including targeting power reduction [42-44], retiming with interconnect and gate delay [45], and application to timing-resilient designs [19]. However, most retiming algorithms only consider the impact on the number of registers and ignore the costs of the clock tree and the potential power benefits of adopting clock gating after retiming.

Our proposed retiming approach makes two key contributions.
1) We identify and take advantage of a special property of 3-phase latch-based designs to propose an improved graph-based retiming algorithm for 3-phase designs. 2) We then extend this graph-based retiming algorithm to consider both the number of latches and the impact of clock gating. Our post place-and-route experimental results show an overall average of 20.8%, 21.3%, and 6.1% reduction in power compared to FF-based, master-slave latch-based, and original 3-phase latch-based designs, respectively, on ISCAS89 circuits [33], CEP submodules [34], and three CPU designs (i.e., a 3-stage MIPS CPU Plasma [35], a RISC-V Rocket Core [36], and an ARM Cortex-M0 core [37]).

4.2 Review of Graph-Based Min-Area Retiming

Retiming is a structural operation that relocates registers without changing the functionality of the circuit. There are two traditional objectives for retiming algorithms: minimizing the clock period of the circuit [46-48] and minimizing the number of registers in the circuit [46, 48, 49]. This subsection reviews min-area retiming, which minimizes the number of sequential elements.

The work in [40,41] characterized the retiming problem as an efficiently solvable minimum state problem targeting either minimizing the clock period or minimizing the number of sequential elements. As in [40], a sequential circuit can be described as a graph $G = \langle V, E, d, w \rangle$, where each vertex $v \in V$ represents a combinational gate and each directed edge $e_{u,v} \in E$ represents a connection from gate $u$ to gate $v$. The weight of the edge, $w(e_{u,v})$, represents the number of FFs or latches between the two gates. Each vertex has a fixed delay $d(v)$. If there is an edge from gate $u$ to gate $v$, $u$ is called a fanin of $v$ and $v$ is called a fanout of $u$. The set of fanouts of $u$ is denoted by $FO(u)$ and the set of fanins of $u$ by $FI(u)$. A special vertex, the host vertex, is introduced into the graph. Edges are added from the host vertex to all primary inputs of the circuit and from all primary outputs to the host vertex.
All edges directed into and out of the host vertex have a delay of 0 and a weight of 0. A retiming is a labeling of the vertices $r: V \to \mathbb{Z}$, i.e., every label is an integer. A positive (negative) retiming label $r(u)$ gives the number of registers moved from the outputs (inputs) of gate $u$ towards its inputs (outputs). A retiming label of 0 implies that no registers move across the gate. The weight of an edge $e_{u,v}$ after retiming, $w_r(e_{u,v})$, is calculated as

$$w_r(e_{u,v}) = r(v) + w(e_{u,v}) - r(u) \tag{4.1}$$

The retiming of a circuit is the assignment of retiming labels to all the gates in the circuit. For a path $p$ from gate $u$ to gate $v$ (written $u \leadsto v$), $w(p)$ represents the total number of latches along $p$, and $d(p)$ is the delay from $u$ to $v$ via $p$. Two additional path-related variables, $W(u,v)$ and $D(u,v)$, are defined as follows:

$$W(u,v) = \min_{\forall p:\, u \leadsto v} w(p) \tag{4.2}$$

$$D(u,v) = \max_{\forall p:\, u \leadsto v,\; w(p) = W(u,v)} d(p) \tag{4.3}$$

The path delay between any two sequential gates should be less than the cycle period $P$:

$$P = \max_{p \,\mid\, w(p) = 0} d(p) \tag{4.4}$$

The notations that are critical in the retiming model are defined below.

• $W(u,v) = \min_{\forall p:\, u \leadsto v} w(p)$: the minimum number of registers on any path from $u$ to $v$. We call a path $p$ such that $w(p) = W(u,v)$ a critical path from $u$ to $v$.
• $D(u,v)$: the maximum total propagation delay on any critical path from $u$ to $v$.
• $r(u)$: the number of FFs or latches that are retimed from the output of gate $u$ towards the inputs of gate $u$.
• $w_r(e_{u,v}) = w(e_{u,v}) - r(u) + r(v)$: the number of FFs or latches on edge $e_{u,v}$ after retiming.
• $P$: the clock period. The path delay between any two sequential gates should be less than $P$. We ignore setup time for simplicity.
• $\beta(e_{u,v})$: a breadth that represents the cost coefficient of a register on an edge $e_{u,v}$.
• $L_u$ and $U_u$: the lower and upper bounds of the number of latches or FFs that can be moved backward through gate $u$ without violating the timing constraints.
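The quantities defined above can be computed on a small example. The following is a minimal sketch, assuming every cycle in the graph holds at least one register and that $d(p)$ sums the gate delays of all vertices on the path including both endpoints; the three-gate circuit, its delays, and the function names are made up for illustration. It computes $W$ and $D$ with an all-pairs shortest-path pass over lexicographically ordered pairs $(w, -d)$, evaluates Equation 4.4, and applies Equation 4.1 for one retiming labeling.

```python
import itertools


def compute_W_D(vertices, edges, delay):
    """All-pairs W(u,v) and D(u,v) for the reachable vertex pairs.

    Each edge u->v is weighted with the pair (w(e), -d(u)); comparing pairs
    lexicographically minimises the register count first and, among paths
    with that count, maximises the accumulated delay.
    """
    dist = {(u, u): (0, 0) for u in vertices}
    for (u, v), w in edges.items():
        cand = (w, -delay[u])
        if (u, v) not in dist or cand < dist[(u, v)]:
            dist[(u, v)] = cand
    # Floyd-Warshall: the intermediate vertex k varies slowest.
    for k, i, j in itertools.product(vertices, repeat=3):
        if (i, k) in dist and (k, j) in dist:
            cand = (dist[(i, k)][0] + dist[(k, j)][0],
                    dist[(i, k)][1] + dist[(k, j)][1])
            if (i, j) not in dist or cand < dist[(i, j)]:
                dist[(i, j)] = cand
    W = {pair: x for pair, (x, _) in dist.items()}
    D = {(u, v): delay[v] - y for (u, v), (_, y) in dist.items()}
    return W, D


def clock_period(W, D):
    """Equation 4.4: the longest delay over register-free paths."""
    return max(D[pair] for pair in W if W[pair] == 0)


def retime(edges, r):
    """Equation 4.1: w_r(e_uv) = r(v) + w(e_uv) - r(u), checked for legality."""
    retimed = {(u, v): w + r[v] - r[u] for (u, v), w in edges.items()}
    if any(w < 0 for w in retimed.values()):
        raise ValueError("illegal retiming: negative edge weight")
    return retimed


# Hypothetical three-gate circuit: gate delays and register counts per edge.
vertices = ["a", "b", "c"]
delay = {"a": 1, "b": 2, "c": 3}
edges = {("a", "b"): 0, ("b", "c"): 1, ("a", "c"): 1, ("c", "a"): 1}

W, D = compute_W_D(vertices, edges, delay)
period_before = clock_period(W, D)

# r(b) = 1 moves the register from the output of b to its input.
retimed_edges = retime(edges, {"a": 0, "b": 1, "c": 0})
W_r, D_r = compute_W_D(vertices, retimed_edges, delay)
period_after = clock_period(W_r, D_r)
print(period_before, retimed_edges, period_after)
```

In this example the labeling keeps the register count unchanged but lengthens the register-free path from b to c, which is exactly the situation the period constraint of the retiming formulation is meant to rule out when a target clock period is given.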
The problem of minimizing the total number of registers after retiming should also consider the sharing of registers among the different fanouts of the same gate. As an example, instead of adding $k$ registers, one along each of the $k$ fanout edges of a node, we should add one register that fans out to each of the $k$ fanouts. To model this fanout sharing, for a gate $u$ with $k$ fanouts, a dummy node $\hat{u}$ is added to the graph and connected with all of the fanouts of $u$. All edges $e_{u,v}$ and $e_{v,\hat{u}}$ are assigned breadths of $\beta(e_{u,v}) = \beta(e_{v,\hat{u}}) = 1/k$, as illustrated in Fig. 4.1(a) [40].

Figure 4.1: Sharing of fanout and fanin registers: (a) fanout register sharing, (b) fanin register sharing

The objective function and constraints of the classic retiming algorithm can be described as follows:

$$\begin{aligned} \min\ & \sum_{v \in V} \Big[ \sum_{\forall u \in FI(v)} \beta(e_{u,v}) - \sum_{\forall u \in FO(v)} \beta(e_{v,u}) \Big]\, r(v) \\ \text{s.t.}\ & r(u) - r(v) \le w(e_{u,v}) \quad \forall e_{u,v} \in E \\ & r(u) - r(v) \le W(u,v) - 1 \quad \forall D(u,v) > P \end{aligned} \tag{4.5}$$

The first constraint, also referred to as the circuit constraint, ensures that the number of latches on edge $e_{u,v}$ after retiming is non-negative. The second constraint depends on the clock period and is referred to as the period constraint. Given that $P$ represents the worst-case delay between any consecutive sequential gates, any path whose delay is longer than $P$ must be broken by a register. Mathematically, for a retiming to be legal, every pair with $D(u,v) > P$ must keep at least one register on its critical path after retiming, i.e., $W(u,v) + r(v) - r(u) \ge 1$. Rearranging gives the second constraint in Equation 4.5: $r(u) - r(v) \le W(u,v) - 1,\ \forall D(u,v) > P$.

To simplify the constraints, [41] re-writes them as follows:

$$\begin{aligned} \min\ & \sum_{v \in V'} \Big[ \sum_{\forall u \in FI(v)} \beta(e_{u,v}) - \sum_{\forall u \in FO(v)} \beta(e_{v,u}) \Big]\, r(v) \\ \text{s.t.}\ & r(u) - r(v) \le c_{uv} \quad \forall (u,v) \in C' \\ & L_u \le r(u) \le U_u \quad \forall u \in V' \end{aligned} \tag{4.6}$$

Here, $L_u$ and $U_u$ are the lower and upper bounds on the number of latches or FFs that can be moved backward through gate $u$ without violating the timing constraints [41].
$C'$ is the set $\{(u,v) \in C \mid U_u - L_v > c_{uv}\}$, where $C$ is the set of constraints in Equation 4.5, one for each $e_{u,v}$ in $E$ and one for each path whose cumulative delay is larger than $P$. $V'$ is the reduced variable set $\{u \in V \mid U_u \ne L_u\}$. Equation 4.6 can be mapped to a min-cost network flow problem and solved with the network simplex method in polynomial time [40, 41].

4.3 Retiming 3-Phase Latch-Based Designs

In this section, we develop our min-area retiming to support the special nature of 3-phase latch-based designs and their clock-gating opportunities.

4.3.1 The Problem Statement

As mentioned earlier, [38] applies retiming only to the newly added $p_2$ latches. The locations of the $p_1$ and $p_3$ latches are fixed at the locations of the original FFs to simplify formal equivalence checking. To map a circuit to a graph $G(V, E)$, a node $v$ corresponds to a gate and an edge $e_{u,v}$ represents a signal flow from the output of gate $u$ to the input of gate $v$. Each edge has an edge cost $\beta(e_{u,v})$ to model fanout sharing, as illustrated in Fig. 4.1. A retiming label $r(v)$ for a node $v$ represents the number of $p_2$ latches moved from its outputs towards its inputs. The original positions of the $p_2$ latches before retiming are at the outputs of their associated $p_1$ or $p_3$ latches. The edge weight $w(e_{u,v})$ is 1 for edges that originally have a $p_2$ latch and 0 for all other edges. Therefore the retiming value $r(v)$ can only be -1 or 0, meaning that the latch is either moved from the input to the output of the gate or not moved. For a retiming to be valid, there must be exactly one $p_2$ latch along any path that originally has a $p_2$ latch. All nodes that represent primary inputs and fixed latches are grouped into the source nodes of the graph, and the fixed latches can also be the terminal nodes of the graph. The retiming problem is then mapped to finding a new cut in the combinational logic cloud at which to place the $p_2$ latches.
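The restriction of the labels to -1 or 0 and the one-latch-per-path requirement can be checked mechanically. The following is a minimal sketch under stated assumptions: the pseudo nodes S and T, the five-edge graph, and the function names are hypothetical, fanout sharing is ignored, and the graph is assumed to be acyclic.

```python
def retime(edges, r):
    """w_r(e_uv) = w(e_uv) + r(v) - r(u); unlabeled nodes default to 0."""
    return {(u, v): w + r.get(v, 0) - r.get(u, 0)
            for (u, v), w in edges.items()}


def is_valid_p2_cut(edges, r, source="S", sink="T"):
    """True if labels are in {-1, 0}, no edge weight goes negative, and
    every source-to-sink path holds exactly one p2 latch."""
    if any(label not in (-1, 0) for label in r.values()):
        return False
    w_r = retime(edges, r)
    if any(w < 0 for w in w_r.values()):
        return False
    fanouts = {}
    for (u, v) in w_r:
        fanouts.setdefault(u, []).append(v)

    def latch_counts(u):
        # Set of latch totals over all paths from u to the sink.
        if u == sink:
            return {0}
        totals = set()
        for v in fanouts.get(u, ()):
            totals |= {w_r[u, v] + c for c in latch_counts(v)}
        return totals

    return latch_counts(source) == {1}


# Hypothetical graph: two sources a and b feed gate g; the initial
# p2 latches sit on the edges leaving the pseudo source S.
edges = {("S", "a"): 1, ("S", "b"): 1,
         ("a", "g"): 0, ("b", "g"): 0, ("g", "T"): 0}

all_forward = {"a": -1, "b": -1, "g": -1}   # both latches pass through g
only_a = {"a": -1}                          # one latch moves past a
g_alone = {"g": -1}                         # g retimed without its fanins

latches_after = sum(retime(edges, all_forward).values())
```

Because the labels of S and T stay at 0, the latch total along any S-to-T path telescopes to its original value, so in this setting the validity check reduces to every retimed edge weight being non-negative. The `g_alone` case fails for exactly that reason: both inputs of g must carry a latch before one can be moved to its output.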
4.3.2 Retiming in 3-Phase Latch-Based Designs

A valid retiming requires that all inputs (or outputs) of the gate are latched or captured by a register. For example, Figure 4.2 moves the two FFs located in front of node $v$ across the combinational node $v$ to the output of $v$. The notation $r(v)$ is the number of sequential elements that move across node $v$ from the output side to the input side. The two registers in front of $v$ must both be present to allow the forward retiming.

Figure 4.2: An example of retiming

In contrast to traditional retiming problems, retiming of 3-phase designs allows the cut to be placed along paths that originally have no $p_2$ latches. Consider a path from a $p_1$ latch to a $p_3$ latch. Even though no $p_2$ latch is placed along this path, inserting one there does not change the circuit's functionality. In particular, the ILP chose not to put a $p_2$ latch after the $p_1$ latch in order to minimize the number of required $p_2$ latches, because it did not anticipate that such a latch would benefit retiming. Our proposed retiming algorithm incorporates this flexibility to address this limitation.

Figure 4.3: Example where 3-phase latch retiming has a larger search space than traditional retiming. (a) Traditional retiming cannot retime the $p_2$ latches through the first AND gate. (b) In 3-phase retiming, the min-cut is found at the output of the rightmost AND gate. (c) The retimed graph has fewer latches than the original design.

Fig. 4.3 illustrates this flexibility and how it is incorporated in our modification of graph-based retiming. In Fig. 4.3(a), only one of the input pins of the first AND gate is latched by $p_2$. In this case, traditional retiming cannot move $p_2$ latches through the AND gate. In contrast, our modification is to simply ignore edges from latches that are in the single latch group (because assigning them to the back-to-back latch group is also functionally correct), removing them from the circuit model, as highlighted in green in Fig. 4.3(b).
We can then apply traditional graph-based retiming to the modified graph and search for the min cut. By adding the ignored paths back, the functionality of the retimed graph remains the same and is guaranteed correct, as illustrated in Fig. 4.3(c).

To implement this approach, the new graph contains only paths through $p_2$ latches. Instead of loading all nodes that represent primary inputs and fixed latches, the source nodes can be restricted to nodes that are followed by a $p_2$ latch. The new graph is obtained by running breadth-first search (BFS) or depth-first search (DFS) from the source nodes and ignoring any circuit nodes not reached. The new retiming problem is now reduced to finding the minimal cut in the reduced graph, which by construction is guaranteed to be acyclic.

4.4 Retiming Considering Clock-Gating

A clock gate shuts off a branch of the clock tree when it is not in use to save dynamic power. In 3-phase post (pre) latch-based designs, the first means of clock-gating is to gate $p_2$ latches whose upstream (downstream) latches are also gated. Given that different gating strategies are applied to 3-phase post and 3-phase pre latch designs, this section discusses the two situations separately.

4.4.1 Retiming 3-phase post latch designs

In 3-phase post designs, all $p_2$ latches are initially placed at the output pins of $p_1$/$p_3$ latches or at primary inputs. Retiming moves $p_2$ latches across combinational logic, possibly increasing the number of fan-in latches of the $p_2$ latches. If there are un-gated fan-in latches, the $p_2$ latch cannot be gated and thus consumes clock tree power every cycle. If the fan-in latches are gated by different enable signals, extra logic that combines the enable signals is required to gate the $p_2$ latch. To provide consistency at the interface of the design, we assume all primary input ports are un-gated and clocked by $p_1$.

Our retiming algorithm incorporates the impact of clock-gating logic as described in Algorithm 4.1.
Here, $FI(u)$ and $FO(u)$ stand for the sets of fanins and fanouts of node $u$. Note that the enable node $n(u)$ is the combination of enable signals from all fan-in latches, including ignored latches in the reduced graph. The cost of adding the clock gate logic, $cg(n(u))$, is calculated once for each enable node $n(u)$ and is library and design dependent. Combining more enable signals takes more cell area to generate a larger OR tree and may decrease gating efficiency. In our experiments, we assume the cost of the OR tree is proportional to the number of enables and model it as "cg × #(enable signals)". The upper bound on merging enable domains is $U_{en}$, and we skip clock gating on nodes with more than $U_{en}$ enable domains. The cost of each clock gate is added no more than once because the combined enable node can be shared among multiple latches. This is guaranteed by adding both fanout and fanin sharing for these enable nodes, as described in Fig. 4.1.

Algorithm 4.1 Proposed graph-based retiming for 3-phase designs incorporating the impact of clock gating
Construct the reduced retiming graph, as described in subsection 4.3.2
for $v \in V$ in topological order do
    if $v$ can be clock-gated by combining # of enables $\le U_{en}$ then
        Define the enable node $n(v)$ as such combination
        Add $n(v)$ if it does not exist in the graph
        if $v$ is connected to $S$ then
            Add an edge $e_{S,n(v)}$ with cost $\beta(e_{S,n(v)}) = cg(n(v))$
        end if
        for $u$ in $FI(v)$ do
            if $n(v)$ differs from $n(u)$ then
                Add an edge $e_{n(u),v}$ with cost $\beta(e_{n(u),v}) = cg(n(u))$
            end if
        end for
        if $n(v) \ne n(u),\ \forall u \in FI(v)$ then
            Add an edge $e_{v,n(v)}$ with cost $\beta(e_{v,n(v)}) = cg(n(v))$
        end if
    else
        $\forall u \in FO(v)$, add an edge $e_{v,u}$ with cost $\beta(e_{v,u}) = ct$ and label it CLK
    end if
end for
Add fanout and fanin sharing for the enable nodes, as illustrated in Fig. 4.1
Compute the weight of each edge: $w(e_{S,u}) = \beta(e_{S,u})\ \forall u \in FO(S)$, and $w(e_{u,v}) = 0\ \forall u \notin FO(S),\ \forall v \in FO(u)$

An illustrative example is shown in Fig. 4.4 and the corresponding graph representation is illustrated in Fig. 4.5. The black edges and nodes in Figs. 4.4 and 4.5 represent the reduced retiming graph. Extra labels, edges, and nodes, highlighted in orange, are added to reflect the impact of the clock-gating logic.

Figure 4.4: An illustrative circuit

As the first step of constructing the reduced graph, the source nodes are the primary inputs and fixed latches that are followed by a $p_2$ latch. Suppose a, b, c, and d are all followed by a $p_2$ latch. We add a pseudo source node $S$ with directed edges from $S$ to each source node. The initial cut is placed on these edges, i.e., the edge weight is $w(e_{S,u}) = \beta(e_{S,u}),\ \forall u \in FO(S)$.

To construct the new retiming graph, nodes are divided into two groups: ones that could be clock-gated if a $p_2$ latch is placed there, and ones that cannot be gated. Starting from the source nodes, we label nodes that represent primary inputs and un-gated fixed latches as CLK, and label gated fixed latches with the enable signal that drives the clock gating (CG) logic. All fanouts of a node labeled CLK are also labeled CLK. The remaining nodes are labeled with the combination of their fanin enable signals. The extra edge weight of edges directed from $S$ represents the associated impact of the clock gating of the initial $p_2$ latches. When two edges share the same start and end points (e.g., $S$ and a), one is in the reduced retiming graph and the other is added to the new retiming graph.

edge (u,v) | $\beta(e_{u,v})$ | $w(e_{u,v})$
(S,a) | 1+ct | 1+ct
(a,f) | 1+ct | 0
(f,T) | 1+ct | 0
(S,b) | 1 | 1
(b,e) | 1 | 0
(S,c) | 1 | 1
(c,e) | 1 | 0
(e,f) | 1/2 | 0
(e,g) | 1/2 | 0
(f,$\hat{e}$) | 1/2 | 0
(g,$\hat{e}$) | 1/2 | 0
(g,T) | 1 | 0
(S,d) | 1 | 1
(d,g) | 1 | 0
(S,EN1) | cg | cg
(EN1,f) | cg/2 | 0
(EN1,g) | cg/2 | 0
(f,$\widehat{EN1}$) | cg/2 | 0
(g,$\widehat{EN1}$) | cg/2 | 0
(S,EN2) | cg | cg
(EN2,g) | cg | 0
(g,EN1&EN2) | 2cg | 0
(EN1&EN2,T) | 2cg | 0

Figure 4.5: Reduced retiming graph with added edges and nodes that represent the impact of clock-gating for the circuit shown in Fig. 4.4
In minimum cost flow algorithms, we can merge parallel edges into a single edge whose cost and weight equal the sums of the individual edges' costs and weights [50].

Notice that EN1 has two fanouts. To handle fanout sharing, a pseudo sink node $\widehat{EN1}$ is added to the graph with edges directed from the two fanouts. The breadth of the associated edges is $1/k$, where $k$ is the number of fanouts, i.e., $\beta(e_{EN1,f}) = \beta(e_{EN1,g}) = \beta(e_{f,\widehat{EN1}}) = \beta(e_{g,\widehat{EN1}}) = 1/2$. Even though not shown in this example, some enable nodes may have more than one fanin. Fanin sharing is handled by adding a pseudo source node with edges directed to the fanins, as illustrated in Fig. 4.1(b).

4.4.2 Retiming 3-phase pre latch designs

In 3-phase pre designs, all $p_2$ latches are initially placed in front of the input pins of $p_1$/$p_3$ latches or in front of primary outputs. Those $p_2$ latches could be gated by ORing the enable signals that control their fan-out latches.

Retiming moves $p_2$ latches backward through combinational logic, possibly increasing the number of fan-out latches reached from the $p_2$ latches. If there are un-gated fan-out latches, the $p_2$ latch cannot be gated and thus consumes clock tree power every cycle. If the fan-out latches are gated by different enable signals, extra logic that combines the enable signals is required to gate the $p_2$ latch. To provide consistency at the interface of the design, we assume all primary output ports are un-gated.

Our retiming algorithm for 3-phase pre latch designs incorporates the impact of clock-gating logic as described in Algorithm 4.2. Note that the enable node $n(u)$ is the combination of enable signals from all fan-out latches, including ignored latches in the reduced graph. The cost of adding the clock gate logic, $cg(n(u))$, and the upper bound $U_{en}$ on merging enable domains are the same as in Algorithm 4.1.
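The labeling rule used for post designs, and its mirror image for pre designs, can be sketched as a single pass over the graph in topological order. This is an illustrative sketch rather than the scripts used in this work: the function names, the seven-node circuit, and the enable assignments are hypothetical, and the default values of cg, ct, and the enable bound are the ones reported in the experimental section.

```python
from graphlib import TopologicalSorter  # Python 3.9+

CLK = "CLK"  # label of nodes whose p2 latch could not be clock-gated


def propagate_labels(preds, seeds):
    """Label every node with CLK or a frozenset of enable names.

    preds: {node: [nodes whose labels it depends on]}. For post designs
           this is the fanin map; for pre designs pass the fanout map.
    seeds: {fixed node: enable name, or None if un-gated}.
    """
    labels = {}
    for v in TopologicalSorter(preds).static_order():
        if v in seeds:
            enable = seeds[v]
            labels[v] = CLK if enable is None else frozenset([enable])
            continue
        incoming = [labels[u] for u in preds.get(v, ())]
        if any(label == CLK for label in incoming):
            labels[v] = CLK  # an un-gated neighbor forces CLK
        else:
            labels[v] = frozenset().union(*incoming)
    return labels


def gating_cost(label, cg=0.5, ct=0.5, u_en=4):
    """Cost model: ct if un-gated or too many enables, else cg * #enables."""
    if label == CLK or len(label) > u_en:
        return ct
    return cg * len(label)


# Hypothetical post-style circuit: a is un-gated, b and c are gated by
# EN1, d is gated by EN2; e, f, and g are combinational gates.
fanins = {"e": ["b", "c"], "f": ["a", "e"], "g": ["e", "d"]}
seeds = {"a": None, "b": "EN1", "c": "EN1", "d": "EN2"}
labels = propagate_labels(fanins, seeds)
```

For a pre design the same function can be called with the fanout map and with the terminal latches as seeds, which propagates the labels from the outputs back towards the inputs.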
Algorithm 4.2 Proposed graph-based retiming for 3-phase pre designs incorporating the impact of clock gating
Construct the reduced retiming graph, as described in subsection 4.3.2
for $v \in V$ in reverse topological order do
    if $v$ can be clock-gated by combining # of enables $\le U_{en}$ then
        Define the enable node $n(v)$ as such combination
        Add $n(v)$ if it does not exist in the graph
        if $v$ is connected to $S$ then
            Add an edge $e_{S,n(v)}$ with cost $\beta(e_{S,n(v)}) = cg(n(v))$
        end if
        for $u$ in $FI(v)$ do
            if $n(v)$ differs from $n(u)$ then
                Add an edge $e_{n(u),v}$ with cost $\beta(e_{n(u),v}) = cg(n(u))$
            end if
        end for
        if $n(v) \ne n(u),\ \forall u \in FI(v)$ then
            Add an edge $e_{v,n(v)}$ with cost $\beta(e_{v,n(v)}) = cg(n(v))$
        end if
    else
        $\forall u \in FO(v)$, add an edge $e_{v,u}$ with cost $\beta(e_{v,u}) = ct$ and label it CLK
    end if
end for
Add fanout and fanin sharing for the enable nodes, as illustrated in Fig. 4.1
Compute the weight of each edge: $w(e_{S,u}) = \beta(e_{S,u})\ \forall u \in FO(S)$, and $w(e_{u,v}) = 0\ \forall u \notin FO(S),\ \forall v \in FO(u)$

Enable node sharing is taken care of by adding fanout/fanin sharing for the enable nodes, as described in Fig. 4.1.

An illustrative example is shown in Fig. 4.6 and the corresponding graph representation is illustrated in Fig. 4.7. The black edges and nodes in Figs. 4.6 and 4.7 represent the reduced retiming graph. Extra labels, edges, and nodes, highlighted in orange, are added to reflect the impact of the clock-gating logic.

Figure 4.6: An illustrative 3-phase pre circuit

As the first step of constructing the reduced graph, the source nodes are the primary inputs and fixed latches that are followed by a $p_2$ latch. A pseudo source node $S$ is created with directed edges from $S$ to each source node, each with weight and delay 0. A pseudo terminal node $T$ is added with directed edges from each terminal node to $T$. The initial cut is placed on the terminal edges, i.e., the edge weight is $w(e_{u,T}) = \beta(e_{u,T}),\ \forall u \in FI(T)$.
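The first step, constructing the reduced graph by a search from the restricted source set and attaching the pseudo nodes, can be sketched as follows. The node names, the tiny netlist, and the function name are hypothetical; the sketch only illustrates the BFS described in subsection 4.3.2 and does not model edge costs or weights.

```python
from collections import deque


def reduced_graph(fanouts, sources, terminals):
    """Keep only nodes reachable from sources that are followed by a p2
    latch, then attach a pseudo source S and a pseudo terminal T."""
    reached = set(sources)
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in fanouts.get(u, ()):
            if v not in reached:
                reached.add(v)
                queue.append(v)
    # Every fanout of a reached node is itself reached, so these edges
    # stay inside the reduced graph.
    edges = [(u, v) for u in sorted(reached) for v in fanouts.get(u, ())]
    edges += [("S", s) for s in sources]
    edges += [(t, "T") for t in terminals if t in reached]
    return reached, edges


# Hypothetical netlist: l1 is followed by a p2 latch; l2 and l3 belong
# to the single latch group, so they are not used as sources.
fanouts = {
    "l1": ["and1"],
    "l2": ["and1"],
    "l3": ["and2"],
    "and1": ["and2"],
    "and2": ["l4"],
}
reached, edges = reduced_graph(fanouts, sources=["l1"], terminals=["l4"])
```

The edges from l2 and l3 are absent from the result, which is what lets the cut move through and1 and and2 even though only one input of each gate is latched by a p2 latch.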
To construct the new retiming graph, nodes are divided into two groups: ones that could be clock-gated if a $p_2$ latch is placed there, and ones that cannot be gated. Starting from the terminal nodes, we label nodes that represent primary outputs and un-gated fixed latches as CLK, and label gated fixed latches with the enable signal that drives the clock gating (CG) logic. All fanins of a node labeled CLK are also labeled CLK. The remaining nodes are labeled with the combination of their fanout enable signals. The extra edge weight of edges directed to $T$ represents the associated impact of the clock gating of the initial $p_2$ latches.

edge (u,v) | $\beta(e_{u,v})$ | $w(e_{u,v})$
(S,a) | 1 | 0
(a,f) | 1 | 0
(f,T) | 1 | 1
(S,b) | 1 | 0
(b,e) | 1 | 0
(S,c) | 1 | 0
(c,e) | 1 | 0
(e,f) | 1/2 | 0
(e,g) | 1/2 | 0
(f,$\hat{e}$) | 1/2 | 0
(g,$\hat{e}$) | 1/2 | 0
(g,T) | 1 | 1
(S,d) | 1 | 0
(d,g) | 1 | 0
(EN1,T) | cg | cg
(f,EN1) | cg | 0
(g,EN2) | cg | 0
(EN2,T) | cg | cg
(S,EN1&EN2) | 2cg | 0
(EN1&EN2,f) | cg | 0
(EN1&EN2,g) | cg | 0
(f,$\widehat{EN1\&EN2}$) | cg | 0
(f,$\widehat{EN1\&EN2}$) | cg | 0

Figure 4.7: Reduced retiming graph with added edges and nodes that represent the impact of clock-gating for the circuit shown in Fig. 4.6

Notice that EN1&EN2 has two fanouts. To handle fanout sharing, a pseudo sink node $\widehat{EN1\&EN2}$ is added to the graph with edges directed from the two fanouts. The breadth of the associated edges is $1/k$, where $k$ is the number of fanouts, i.e., $\beta(e_{EN1\&EN2,f}) = \beta(e_{EN1\&EN2,g}) = \beta(e_{f,\widehat{EN1\&EN2}}) = \beta(e_{g,\widehat{EN1\&EN2}}) = cg/2$. Even though not shown in this example, some enable nodes may have more than one fanin. Fanin sharing is handled by adding a pseudo source node with edges directed to the fanins, as illustrated in Fig. 4.1(b).

Linear Programming (LP)

The following linear program (LP) minimizes the combined area of $p_2$ latches and clock gating logic for both 3-phase post and pre latch designs.
$$\begin{aligned} \min\ & \sum_{v \in V \cup V'} \Big[ \sum_{\forall u \in FI(v)} \beta(e_{u,v}) - \sum_{\forall u \in FO(v)} \beta(e_{v,u}) \Big]\, r(v) \\ \text{s.t.}\ & r(u) - r(v) \le w(e_{u,v}), \quad \forall e_{u,v} \in E \cup E' \\ & L_u \le r(u) \le U_u, \quad \forall u \in V \cup V' \end{aligned} \tag{4.7}$$

Equation 4.7 is similar to traditional retiming in Equation 4.6 except for the extra node set $V'$ and the extra edge set $E'$. These extra nodes and edges reflect the impact of clock-gating, as described in Algorithm 4.1. If a $p_2$ latch is positioned on edge $e_{u,v} \in E \cup E'$ after retiming, the cost is $\beta(e_{u,v})$. If $e_{u,v}$ is in $E$, the cost counts $p_2$ latches, while an edge in $E'$ means that the cost involves clock gating. To retime a $p_2$ latch forward from the input to the output of a node $v$ ($r(v) = -1$), the cost is the difference between the total cost of its fanout edges and the total cost of its fanin edges.

The run-times to construct the new retiming graph and to formulate the LP are $O(|E|)$. The min-cut network flow problem (the LP) can be solved in polynomial time. Therefore our retiming problem has polynomial-time complexity.

4.5 Experimental Results

This section quantifies the benefits of the proposed conversion algorithm by comparing the proposed graph-based retiming 3-phase (GR 3-phase) designs to the original FF-based, traditional master-slave latch-based, and original 3-phase latch-based (Orig 3-phase) designs. In particular, Python and TCL scripts are used to interface a leading commercial synthesis tool with the Gurobi Optimization tool [21]. After initial synthesis using the commercial synthesis tool and latch conversion, we export the netlist of the un-retimed 3-phase latch-based circuit. Using this information, our Python code generates the min-cut network flow graph and calls the Gurobi Optimization tool [21] to solve the min-cut problem. Using another TCL script, the results containing the retiming and clock-gating information are imported back into the commercial tool for further logic optimization and PnR.
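To make the min-cut step concrete, the following sketch finds a minimum-cost latch cut on a toy graph. It is not the Gurobi-based flow described above: it swaps in a plain Edmonds-Karp max-flow written with the standard library, ignores fanout sharing and enable nodes, and uses the edge costs as capacities. The infinite-capacity reverse arcs are an assumption of this sketch (a common way to stop a path from crossing the cut more than once); the graph and the function name are hypothetical, and the source is assumed to have no incoming edges.

```python
import math
from collections import defaultdict, deque


def min_latch_cut(edges, source="S", sink="T"):
    """edges: list of (u, v, cost). Returns (total cost, cut edges)."""
    residual = defaultdict(lambda: defaultdict(float))
    for u, v, cost in edges:
        residual[u][v] += cost
        residual[v][u] += math.inf  # forbid crossing the cut backwards

    def search():
        # Breadth-first search over arcs with remaining capacity.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = search()
        if sink not in parent:
            break
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        if math.isinf(bottleneck):
            raise ValueError("no finite cut separates source and sink")
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck

    source_side = set(parent)  # nodes still reachable from the source
    cut = [(u, v) for u, v, _ in edges
           if u in source_side and v not in source_side]
    cost = sum(c for u, v, c in edges
               if u in source_side and v not in source_side)
    return cost, cut


# Hypothetical reduced graph: the initial p2 latches sit on (S,a), (S,b).
edges = [("S", "a", 1.0), ("S", "b", 1.0),
         ("a", "g", 1.0), ("b", "g", 1.0), ("g", "T", 1.0)]
cost, cut = min_latch_cut(edges)
```

On this toy graph the cheapest cut is the single edge at the output of g, so one latch replaces the two initial ones. Nodes on the source side of the cut correspond to labels r(v) = -1 and the remaining nodes to r(v) = 0.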
Our experiments rely on an industrial 28-nm FDSOI CMOS cell library and a range of circuits that includes ISCAS89 benchmark circuits [33], CEP submodules [34], and three CPU designs: a 3-stage MIPS Open Core Plasma [35], a RISC-V Rocket Core [36], and an ARM-M0 core [37]. The ISCAS89 benchmarks were run at 1 GHz. Other circuits were run at 500 MHz, except for RISC-V and ARM-M0, which operate at 333.3 MHz. We validated both master-slave and 3-phase latch-based circuits by streaming inputs to the FF-based and latch-based designs and comparing the output streams. For ISCAS designs we used auto-generated pseudo-random input streams. For CEP and CPU designs, we used the testbenches provided with the open-source designs. In particular, Plasma was simulated using the "pi" program, ARM-M0 was simulated using "hello world", RISC-V was simulated using "rv32ui-v-simple", and the CEP designs were simulated using the self-check programs provided with the open-source designs. These gate-level simulations were also used to determine the signal activity that drives data-driven clock gating and to measure the relative power consumption of our approach. All experiments were run on two Intel Xeon E5-2450 v2 CPUs with 128 GB of RAM.

Note that for a fair comparison, all variants of each design are run at the same frequency. With the library-provided integrated clock gating cells, the tool decides the best clock gating technique for FF and master-slave latch-based designs.¹ The traditional retiming strategy and enable-driven clock gating strategies are performed on the master-slave and the original 3-phase latch-based designs. The new retiming strategy is applied only to GR 3-phase latch-based designs, combined with the designated enable-driven clock gating approach. The data-driven clock gating strategy is later applied to all three latch-based designs with the same parameters as in [38].
¹ Notice that the flow for master-slave designs [32] involves an extra synthesis step to restrict the FF usage to one FF type. The extra synthesis step, however, sometimes increases the power consumption of the combinational logic. In particular, in [38], the master-slave MD5 consumes significantly higher power than its FF counterpart. In this work, we used an incremental compilation option for the MD5 benchmark to avoid this problem.

Table 4.1: The statistics of the # of enable nets for potential p2 latches in the 3-phase latch-based designs
Columns: number of enable nets (0, 1, 2, ..., 15, 16+); only non-empty cells are listed, in column order.
s1196 | 134
s1238 | 136
s1423 | 365 45 1 2
s1488 | 320
s5378 | 707
s9234 | 488 85
s13207 | 1200 290
s15850 | 1471 317 12
s35932 | 6278
s38417 | 5217 732 2
s38584 | 6042 1138 51 10 8 5 7 1 2
aes | 136223 13278
des3 | 2252 130
sha256 | 3492 2483 1194 193
md5 | 5972 365 421 2
plasma | 2916 2063 588 324 894 73 176 58 107 31 63 14 8 12 15 10 464
RISCV | 8193 2962 1392 454 37 30 34 18 6 10 4 7 6 1 8 4 611
arm-m0 | 7696 1466 224 516 115 16 2 90 6 93 1 1 22 1 2 39

Table 4.2: The statistics of the # of enable nets for potential p2 latches in the 3-phase pre latch-based designs
Columns: number of enable nets (0, 1, 2, ..., 15, 16+); only non-empty cells are listed, in column order.
s1196 | 28 139
s1238 | 25 139
s1423 | 257 189
s1488 | 115 111
s5378 | 370 226
s9234 | 402 148
s13207 | 785 877 8
s15850 | 1177 901 48
s35932 | 6084 607
s38417 | 5131 1011 16 435 16 1
s38584 | 4109 3603 226 3 2 3 2 4 2
aes | 19258 975
des3 | 1806 370 33
sha256 | 73 5398 1272 9
md5 | 90 1651 52 4957 4
plasma | 801 6400 55 297 50 94
RISCV | 942 8809 1277 117 12 8 6 2 4 4 2 4 3 5 2 3 1101
arm-m0 | 5156 5045 54 8 1 9 1 4 46 70 81 39 1 1

Table 4.1 and Table 4.2 show histograms of the number of enable signals that are required to clock-gate potential $p_2$ latches in GR 3-phase post latch and GR 3-phase pre latch-based designs, respectively. The histograms show that, in most designs, most of the potential locations of $p_2$ latches require no more than 4 enables, except for s38417, s38584, and the three CPU designs.
This motivated us to set the default value of the upper bound $U_{en}$ in our enable-driven clock-gating logic to 4. The cost of the clock gating logic also depends on the library. Here we choose $ct$ and $cg(n(v))$ as 0.5 and 0.5 × #(enable signals), respectively, where a unit cost represents the cost of one latch.

Table 4.3 summarizes the number of registers (FFs/latches) in the original FF-based, conventional master-slave latch-based, original converted 3-phase post latch-based, original converted 3-phase pre latch-based, GR 3-phase post latch-based, and GR 3-phase pre latch-based designs. Note that all average values are calculated across individual benchmarks without biasing by their size. The savings in the number of registers are the percentage reductions in the number of latches in GR 3-phase designs relative to twice the number of FFs in FF-based designs and relative to the number of latches in the conventional master-slave and Orig 3-phase latch-based designs. The results show average reductions of 24.5%, 23.3%, and 2.8%, respectively.

Table 4.3: Comparisons of the number of registers among the original FF, conventional master-slave latch (M-S), the original 3-phase post latch (O3 Post), the original 3-phase pre latch (O3 Pre), the GR 3-phase post latch (GR3 Post), and the GR 3-phase pre latch-based (GR3 Pre) designs. The savings (%) for GR3 Post latch-based designs are compared with the FF, M-S, and O3 Post designs, and the savings (%) for GR3 Pre latch-based designs are compared with the FF, M-S, and O3 Pre designs.
Design | # of registers: FF, M-S, O3 Post, O3 Pre, GR3 Post, GR3 Pre | GR3 Post save (%) vs FF, M-S, O3 Post | GR3 Pre save (%) vs FF, M-S, O3 Pre
ISCAS
s1196 | 18 36 26 30 24 25 | 33.3 33.3 7.7 | 30.6 30.6 -4.2
s1238 | 18 36 26 30 24 25 | 33.3 33.3 7.7 | 30.6 30.6 -4.2
s1423 | 81 158 146 154 151 144 | 6.8 4.4 -3.4 | 11.1 8.9 4.6
s1488 | 6 16 12 18 12 13 | 0 25 0 | -8.3 18.8 -8.3
s5378 | 163 317 250 303 239 230 | 26.7 24.6 4.4 | 29.4 27.4 3.8
s9234 | 140 278 225 219 225 218 | 19.6 19.1 0 | 22.1 21.6 3.1
s13207 | 457 890 725 979 725 706 | 20.7 18.5 0 | 22.8 20.7 2.6
s15850 | 454 904 747 776 728 769 | 19.8 19.5 2.5 | 15.3 14.9 -5.6
s35932 | 1728 3456 2737 2861 2737 2737 | 20.8 20.8 0 | 20.8 20.8 0.0
s38417 | 1489 2751 2366 2563 2361 2346 | 20.7 14.2 0.2 | 21.2 14.7 0.6
s38584 | 1319 2633 2422 2456 2368 2375 | 10.2 10.1 2.2 | 10.0 9.8 -0.3
Average | 534 1043 880 945 872 872 | 19.3 20.3 1.9 | 18.7 19.9 -0.7
CEP
AES | 9715 16829 12871 13122 12858 13024 | 33.8 23.6 0.1 | 33.0 22.6 -1.3
DES3 | 436 842 573 574 545 526 | 37.5 35.3 4.9 | 39.7 37.5 3.5
SHA256 | 1574 3308 2523 2490 2522 2490 | 19.9 23.8 0 | 20.9 24.7 1.3
MD5 | 804 1780 996 1007 976 943 | 39.3 45.2 2 | 41.4 47.0 3.4
Average | 3132 5690 4241 4298 4225 4246 | 32.6 31.9 1.8 | 33.7 33.0 1.7
CPU
Plasma | 1606 2357 2078 2134 2064 2065 | 35.7 12.4 0.7 | 35.7 12.4 0.0
RISCV | 2795 5312 4084 3714 3809 3726 | 31.9 28.3 6.7 | 33.3 29.9 2.2
ArmM0 | 1397 2713 2290 1834 1940 1823 | 30.6 28.5 15.3 | 34.8 32.8 6.0
Average | 1933 3461 2817 2561 2604 2538 | 32.7 23.1 7.6 | 34.6 25.0 2.7
Average | 1344 2479 1950 1959 1906 1899 | 24.5 23.3 2.8 | 24.7 23.6 0.4
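As a worked check of how the savings are defined, the short sketch below recomputes the GR3 Post savings for two ISCAS89 rows of Table 4.3. The helper name is hypothetical; the register counts are taken from the table, and the FF baseline is doubled because one FF corresponds to two latches.

```python
def saving_percent(baseline, retimed):
    """Percentage reduction of `retimed` relative to `baseline`."""
    return round(100.0 * (baseline - retimed) / baseline, 1)


# s1196: FF = 18, M-S = 36, O3 Post = 26, GR3 Post = 24
s1196 = (
    saving_percent(2 * 18, 24),  # versus FF (twice the FF count)
    saving_percent(36, 24),      # versus master-slave
    saving_percent(26, 24),      # versus original 3-phase post
)

# s5378: FF = 163, M-S = 317, O3 Post = 250, GR3 Post = 239
s5378 = (
    saving_percent(2 * 163, 239),
    saving_percent(317, 239),
    saving_percent(250, 239),
)
```

Both rows reproduce the GR3 Post savings columns of Table 4.3 (33.3, 33.3, 7.7 and 26.7, 24.6, 4.4).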
Similar to the original 3-phase designs, the GR 3-phase designs show the least benefit on ISCAS89 circuits and, in particular, no benefit on s1488. This suggests that 3-phase designs, even with retiming, bring limited benefits to control-dominated designs that have a predominance of FFs with combinational feedback. Notice that the new retiming algorithm reduces the number of registers compared to the traditional retiming approach, except for s1423, where the benefit is negative.

Table 4.4 summarizes the statistics of enable nodes after graph-based retiming in the GR 3-phase post and GR 3-phase pre latch-based designs. The new retiming algorithm prefers enable nodes that can be shared among different $p_2$ latches. On average, an enable node combines 1.05 enable nets and drives 41 $p_2$ latches.

Table 4.4: The statistics of enable nodes after the new retiming on the GR 3-phase post latch (GR3 Post) and the GR 3-phase pre latch-based designs
Design | GR 3-phase Post: not gated, enable nodes, avg fanin, avg fanout | GR 3-phase Pre: not gated, enable nodes, avg fanin, avg fanout
ISCAS
s1196 | 6 0 - - | 7 0 0 0
s1238 | 6 0 - - | 7 0 0 0
s1423 | 38 7 1 4 | 39 5 1 3.8
s1488 | 6 0 - - | 7 0 0 0
s5378 | 74 0 - - | 40 1 1 27
s9234 | 40 5 1 8.8 | 32 5 1 8.8
s13207 | 133 12 1 10.6 | 129 9 1.3 13
s15850 | 123 16 1 9 | 144 17 1.1 9.5
s35932 | 1009 0 - - | 1009 0 0 0
s38417 | 494 25 1 14.3 | 514 21 1.2 15.9
s38584 | 444 86 1 6.4 | 428 87 1 6.4
Average | 215.7 13.7 1 8.9 | 214.2 13.2 0.7 7.7
CEP
AES | 3522 2 1 162.5 | 3617 2 1 162.5
DES3 | 97 1 1 8 | 52 2 1 19
SHA256 | 38 4 1 227.3 | 5 4 1 227.3
MD5 | 40 5 1 26.2 | 7 5 1 26.2
Average | 924.3 3.0 1.0 106.0 | 920.3 3.3 1.0 108.7
CPU
Plasma | 5 19 1 23.7 | 16 17 1 25.8
RISCV | 147 36 1.5 17.9 | 115 36 1.1 26.4
ArmM0 | 296 17 1.1 13.1 | 182 16 1 14.7
Average | 149.3 24 1.2 18.2 | 104.3 23.0 1.0 22.3
Average | 362.1 13.1 1.05 40.9 | 352.8 12.6 0.82 32.6
89 Table 4.5: Area decomposition (m 2 ) of ip- op (FF), master-slave latch (M-S), orig- inal 3-phase post (O3Post), GR 3-phase post, original 3-phase pre (O3Post) and GR 3-phase pre latch-based designs Design FF M-S Orig 3-Phase Post GR 3-Phase Post Orig 3-Phase Pre GR 3-Phase Pre Seq Comb Total Seq Comb Total Seq Comb Total Seq Comb Total Seq Comb Total Seq Comb Total ISCAS s1196 68 173 240 59 170 228 44 174 218 41 170 211 50 193 243 43 175 218 s1238 68 171 238 59 170 229 44 167 211 41 178 219 50 178 228 43 179 221 s1423 305 286 591 265 206 471 239 302 541 251 315 566 253 326 579 236 287 523 s1488 23 195 217 25 207 232 20 214 233 20 215 235 28 215 244 21 214 236 s5378 612 552 1164 519 412 931 420 476 897 404 495 899 498 583 1081 393 537 930 s9234 525 377 902 466 284 750 374 368 742 375 352 727 365 381 746 363 354 717 s13207 1691 983 2675 1473 562 2035 1217 802 2019 1209 752 1961 1584 923 2508 1185 754 1939 s15850 1678 1207 2885 1499 1105 2603 1243 1086 2329 1216 1106 2322 1300 1444 2744 1279 1239 2518 s35932 6486 5283 11770 5640 3716 9356 4584 3524 8108 4585 4415 9000 4766 4312 9079 4584 4039 8623 s38417 5558 3837 9395 4553 2706 7259 3970 3645 7614 3950 3613 7563 4252 4069 8321 3941 4804 8746 s38584 4829 4526 9355 4384 3475 7859 3990 3875 7865 3888 3817 7705 4044 4255 8299 3892 3568 7460 Average 1986 1599 3585 1722 1183 2905 1468 1330 2798 1453 1402 2855 1563 1535 3097 1453 1468 2921 CEP AES 25363108015 133377 24748 96844 121592 18916100728 119643 18916101245 120160 19295102341 121635 19152102481 121633 DES3 1142 1569 2711 1251 1486 2737 842 1612 2454 801 1557 2357 843 1554 2397 773 1556 2329 SHA256 4115 5881 9996 4898 4685 9583 3707 4959 8666 3709 5054 8763 3660 4929 8589 3660 4579 8239 MD5 2100 4924 7023 2655 4168 6823 1472 5451 6923 1441 5472 6913 1480 4478 5958 1388 5211 6599 Average 8180 30097 38277 8388 26796 35183 6234 28187 34421 6216 28332 34548 6319 28325 34645 6243 28457 34700 CPU Plasma 4840 4104 8944 3687 3858 7544 3231 4798 8029 3218 4536 7754 3315 
4514 7829 3216 4373 7589 RISC-V 6948 7504 14453 7932 7365 15296 5986 8016 14002 5606 7985 13591 5466 8784 14250 5486 7833 13320 ARM-M0 4090 6600 10690 4112 6836 10947 3524 8240 11764 2972 7882 10854 2870 8108 10979 2827 7643 10470 Average 5293 6069 11362 5243 6019 11263 4247 7018 11265 3932 6801 10733 3884 7136 11019 3843 6616 10459 Average 3913 8677 12590 3790 7681 11471 2990 8247 11237 2925 8287 11211 3007 8422 11428 2916 8324 11239 Table 4.6: Comparasons on area (m 2 ) among the original FF, conventional master- slave latch (M-S), the original 3-phase post latch (O3 Post), the original 3-phase pre latch (O3 Pre), the GR 3-phase post latch (GR3 Post), and the GR 3-phase pre latch- based designs. The saves (%) for GR3 Post latch-based designs are compared with FF, M-S, and the O3 Post designs and The saves (%) for GR3 Pre latch-based designs are compared with FF, M-S, and the O3 Pre designs. Design Area (m 2 ) GR 3-phase Save (%) FF M-S Original 3-phase GR 3-phase GR3 Post Save (%) GR3 Pre Save (%) Post Pre Post Pre FF M-S O3 Post FF M-S O3 Pre ISCAS s1196 240 228 219 230 211 218 12.2 7.6 3.8 9.3 4.5 -3.2 s1238 238 229 215 229 219 221 8.1 4.5 -1.7 7.0 3.4 -1.1 s1423 591 466 524 573 566 523 4.2 -21.5 -8.1 11.6 -12.2 7.6 s1488 217 232 239 245 235 236 -8.2 -1.2 1.7 -8.5 -1.5 -0.2 s5378 1164 930 914 1068 899 930 22.8 3.3 1.7 20.1 0.0 -3.4 s9234 902 752 741 728 727 717 19.4 3.3 1.9 20.5 4.6 1.3 s13207 2675 2058 2056 2455 1961 1939 26.7 4.7 4.6 27.5 5.8 1.1 s15850 2885 2565 2315 2468 2322 2518 19.5 9.5 -0.3 12.7 1.8 -8.4 s35932 11770 9356 9054 8816 9000 8623 23.5 3.8 0.6 26.7 7.8 4.2 s38417 9395 7272 7863 8123 7563 8746 19.5 -4 3.8 6.9 -20.3 -15.6 s38584 9355 7683 7961 7775 7705 7460 17.6 -0.3 3.2 20.3 2.9 3.2 Average 3585 2888 2918 2974 2855 2921 15 0.9 1 14.0 -0.3 -1.3 CEP AES 133115 121960 119174 121886 120160 121633 9.7 1.5 -0.8 8.6 0.3 -1.2 DES3 2711 2738 2449 2369 2357 2329 13 13.9 3.7 14.1 14.9 1.2 SHA256 9996 9461 8594 8370 8763 8239 12.3 7.4 -2 17.6 12.9 6.0 
MD5 7023 6823 6947 5587 6913 6599 | 1.6 -1.3 0.5 | 6.0 3.3 4.5
Average 38212 35246 34291 34553 34548 34700 | 9.2 5.4 0.4 | 11.6 7.9 2.6
CPU
Plasma 8944 7546 8029 7680 7754 7589 | 13.3 -2.8 3.4 | 15.2 -0.6 2.1
RISCV 14453 15268 14002 13416 13591 13320 | 6 11 2.9 | 7.8 12.8 2.0
ArmM0 10690 11007 11514 10427 10884 10470 | -1.8 1.1 5.5 | 2.1 4.9 3.8
Average 11362 11274 11182 10508 10743 10459 | 5.8 3.1 3.9 | 8.4 5.7 2.6
Average (all) 12576 11476 11267 11247 11213 11239 | 12.2 2.3 1.4 | 12.5 2.5 0.2

Table 4.5 decomposes the total area into two groups: sequential elements and combinational logic. Table 4.6 summarizes the total area (μm²) of the original FF-based, conventional master-slave latch-based, original converted 3-phase latch-based, and GR 3-phase latch-based designs. According to the results, GR 3-phase post designs show average area improvements of 12.2%, 2.3%, and 1.4% compared to the FF, master-slave, and original 3-phase post designs, respectively. Similar area savings are observed for GR 3-phase pre designs: compared to the FF, master-slave, and original 3-phase pre designs, they improve the area by 12.5%, 2.5%, and 0.2%, respectively. The CPU benchmarks show the highest average savings when comparing GR 3-phase to original 3-phase designs, in terms of both the number of registers and the total area. This suggests that graph-based retiming efficiently saves area by reducing the number of sequential elements.

Table 4.7: Power dissipation (mW) based on back-annotated gate-level simulation of flip-flop (FF), master-slave latch (M-S), original 3-phase post (O3Post), GR 3-phase post, original 3-phase pre (O3Pre), and GR 3-phase pre latch-based designs.
Design | FF Power: Clock Seq Comb Total | M-S Power: Clock Seq Comb Total | Orig 3-Phase Post Power: Clock Seq Comb Total | GR 3-Phase Post Power: Clock Seq Comb Total | Orig 3-Phase Pre Power: Clock Seq Comb Total | GR 3-Phase Pre Power: Clock Seq Comb Total
ISCAS
s1196 0.08 0.04 0.18 0.30 | 0.09 0.04 0.18 0.32 | 0.07 0.03 0.18 0.28 | 0.06 0.03 0.18 0.27 | 0.06 0.03 0.18 0.27 | 0.06 0.03 0.17 0.26
s1238 0.08 0.04 0.17 0.29 | 0.10 0.04 0.18 0.32 | 0.07 0.03 0.17 0.27 | 0.06 0.03 0.18 0.26 | 0.06 0.03 0.17 0.26 | 0.06 0.03 0.18 0.27
s1423 0.56 0.08 0.17 0.82 | 0.42 0.08 0.12 0.63 | 0.50 0.11 0.15 0.75 | 0.40 0.11 0.15 0.67 | 0.47 0.10 0.15 0.71 | 0.42 0.09 0.13 0.63
s1488 0.03 0.01 0.13 0.17 | 0.04 0.02 0.13 0.19 | 0.03 0.01 0.12 0.17 | 0.03 0.01 0.12 0.16 | 0.03 0.01 0.12 0.16 | 0.03 0.01 0.14 0.18
s5378 0.82 0.25 0.37 1.44 | 0.84 0.25 0.24 1.34 | 0.59 0.28 0.26 1.13 | 0.56 0.25 0.27 1.08 | 0.55 0.20 0.27 1.02 | 0.58 0.21 0.26 1.05
s9234 0.69 0.10 0.10 0.89 | 0.62 0.11 0.05 0.78 | 0.55 0.10 0.08 0.73 | 0.46 0.10 0.07 0.63 | 0.52 0.09 0.06 0.67 | 0.44 0.09 0.06 0.59
s13207 2.04 0.43 0.42 2.89 | 1.98 0.50 0.20 2.69 | 1.53 0.46 0.22 2.21 | 1.37 0.45 0.23 2.06 | 1.79 0.45 0.22 2.46 | 1.38 0.47 0.22 2.07
s15850 2.13 0.31 0.53 2.98 | 2.14 0.30 0.44 2.87 | 1.81 0.30 0.35 2.47 | 1.62 0.31 0.33 2.26 | 1.78 0.31 0.34 2.43 | 1.71 0.31 0.35 2.37
s35932 11.50 2.70 4.32 18.50 | 10.60 3.01 3.11 16.80 | 8.12 2.83 3.06 14.00 | 7.22 2.83 3.41 13.50 | 7.39 2.89 2.72 13.00 | 7.46 2.90 3.03 13.40
s38417 6.34 0.88 2.05 9.26 | 6.27 0.96 1.40 8.62 | 4.81 0.96 1.47 7.24 | 4.40 0.69 1.53 6.63 | 4.82 0.90 1.44 7.16 | 4.65 0.67 2.57 7.89
s38584 7.11 2.50 4.88 14.50 | 7.04 2.68 3.54 13.30 | 7.31 3.02 3.40 13.70 | 6.70 2.89 3.68 13.30 | 7.28 2.84 3.49 13.60 | 6.75 2.74 3.35 12.80
Average 2.85 0.67 1.21 4.73 | 2.74 0.73 0.87 4.35 | 2.31 0.74 0.86 3.90 | 2.08 0.70 0.92 3.71 | 2.25 0.71 0.83 3.79 | 2.14 0.69 0.95 3.77
CEP
AES 18.80 0.05 0.20 19.10 | 14.30 0.06 0.17 14.50 | 7.94 0.06 0.26 8.27 | 7.97 0.07 0.27 8.31 | 8.31 0.05 0.22 8.59 | 8.29 0.05 0.22 8.57
DES3 0.26 0.14 0.51 0.91 | 0.21 0.12 0.41 0.74 | 0.20 0.10 0.41 0.72 | 0.16 0.11 0.39 0.67 | 0.20 0.11 0.41 0.71 | 0.18 0.11 0.39 0.68
SHA256 0.13 0.05 0.13 0.31 | 0.27 0.06 0.09 0.42 | 0.13 0.05 0.13 0.30 | 0.10 0.04 0.13 0.27 | 0.11 0.05 0.13 0.29 | 0.10 0.05 0.13 0.27
MD5 0.11 0.02 0.28 0.40 | 0.00 0.00 0.00 0.56 | 0.09 0.02 0.25 0.36 | 0.05 0.02 0.26 0.33 | 0.11 0.02 0.27 0.39 | 0.04 0.01 0.24 0.29
Average 4.82 0.06 0.28 5.18 | 3.70 0.06 0.17 4.05 | 2.09 0.06 0.26 2.41 | 2.07 0.06 0.26 2.39 | 2.18 0.06 0.26 2.50 | 2.15 0.05 0.25 2.45
CPU
Plasma 0.59 0.44 0.65 1.68 | 0.99 0.19 0.45 1.63 | 0.64 0.17 0.54 1.36 | 0.66 0.15 0.52 1.34 | 0.74 0.14 0.46 1.35 | 0.59 0.14 0.48 1.21
RISC-V 0.52 0.11 0.37 1.01 | 0.87 0.07 0.30 1.25 | 0.54 0.07 0.30 0.92 | 0.48 0.07 0.29 0.84 | 0.59 0.07 0.30 0.97 | 0.47 0.07 0.31 0.85
ARM-M0 0.54 0.31 1.14 2.00 | 1.23 0.23 1.34 2.90 | 0.50 0.11 1.22 1.84 | 0.62 0.17 1.03 1.82 | 0.59 0.15 1.00 1.74 | 0.58 0.15 0.98 1.72
Average 0.55 0.29 0.72 1.56 | 1.03 0.16 0.70 1.92 | 0.56 0.12 0.69 1.37 | 0.59 0.13 0.61 1.33 | 0.64 0.12 0.59 1.35 | 0.55 0.12 0.59 1.26
Average (all) 2.91 0.47 0.92 4.30 | 2.67 0.48 0.69 3.88 | 1.97 0.49 0.70 3.15 | 1.83 0.46 0.73 3.02 | 1.97 0.47 0.66 3.10 | 1.88 0.45 0.73 3.06

Table 4.8: Comparison of power dissipation (mW) among the original FF, conventional master-slave latch (M-S), original 3-phase post latch (O3 Post), original 3-phase pre latch (O3 Pre), GR 3-phase post latch (GR3 Post), and GR 3-phase pre latch (GR3 Pre) based designs. The savings (%) for GR3 Post designs are relative to the FF, M-S, and O3 Post designs; the savings (%) for GR3 Pre designs are relative to the FF, M-S, and O3 Pre designs.
Design | Total Power: FF M-S O3Post GR3Post O3Pre GR3Pre | GR3 Post Save (%) vs. FF M-S O3Post | GR3 Pre Save (%) vs. FF M-S O3Pre
ISCAS
s1196 0.30 0.32 0.28 0.27 0.27 0.26 | 12.87 16.56 6.18 | 14.21 17.85 1.73
s1238 0.29 0.32 0.27 0.26 0.26 0.27 | 9.94 17.36 3.70 | 7.13 14.78 -5.78
s1423 0.82 0.63 0.75 0.67 0.71 0.63 | 17.52 -7.48 10.14 | 22.41 -1.10 11.43
s1488 0.17 0.19 0.17 0.16 0.16 0.18 | 3.64 13.91 3.70 | -6.80 4.58 -9.88
s5378 1.44 1.34 1.13 1.08 1.02 1.05 | 25.35 19.49 4.60 | 27.08 21.36 -3.03
s9234 0.89 0.78 0.73 0.63 0.67 0.59 | 28.39 18.83 12.97 | 33.30 24.40 11.97
s13207 2.89 2.69 2.21 2.06 2.46 2.07 | 28.82 23.42 6.75 | 28.34 22.90 15.74
s15850 2.98 2.87 2.47 2.26 2.43 2.37 | 23.95 21.20 8.27 | 20.29 17.40 2.27
s35932 18.50 16.80 14.00 13.50 13.00 13.40 | 27.03 19.64 3.57 | 27.57 20.24 -3.08
s38417 9.26 8.62 7.24 6.63 7.16 7.89 | 28.47 23.16 8.49 | 14.82 8.50 -10.23
s38584 14.50 13.30 13.70 13.30 13.60 12.80 | 8.28 0.00 2.92 | 11.72 3.76 5.88
Average 4.73 4.35 3.90 3.71 3.79 3.77 | 19.48 15.10 6.48 | 18.19 14.06 1.55
CEP
AES 19.10 14.50 8.27 8.31 8.59 8.57 | 56.49 42.68 -0.54 | 55.16 40.93 0.28
DES3 0.91 0.74 0.72 0.67 0.71 0.68 | 27.16 10.26 7.30 | 25.62 8.37 5.00
SHA256 0.31 0.42 0.30 0.27 0.29 0.27 | 11.51 35.06 10.79 | 11.29 34.90 4.76
MD5 0.40 0.56 0.36 0.33 0.39 0.29 | 19.05 41.16 10.10 | 28.61 48.11 26.07
Average 5.18 4.05 2.41 2.39 2.50 2.45 | 28.55 32.29 6.91 | 30.17 33.08 9.03
CPU
Plasma 1.68 1.63 1.36 1.34 1.35 1.21 | 20.29 17.84 1.55 | 28.04 25.83 10.60
RISC-V 1.01 1.25 0.92 0.84 0.97 0.85 | 16.19 32.44 7.92 | 15.93 32.24 12.69
ARM-M0 2.00 2.90 1.84 1.82 1.74 1.72 | 8.72 37.11 0.87 | 13.78 40.59 1.15
Average 1.56 1.92 1.37 1.33 1.35 1.26 | 15.07 29.13 3.45 | 19.25 32.89 8.15
Average (all) 4.30 3.88 3.15 3.02 3.10 3.06 | 20.76 21.26 6.07 | 21.03 21.42 4.31

Table 4.7 reports the power dissipation based on back-annotated gate-level simulation. The total power is divided into three groups, namely the clock network (Clock), register/sequential logic (Seq), and combinational logic (Comb).
Table 4.8 compares the total power dissipation (mW) of the GR 3-phase designs with that of the FF, master-slave latch, and original 3-phase designs. The proposed retiming approach on 3-phase post designs achieves average power reductions of 20.8%, 21.3%, and 6.1% in comparison to the FF, master-slave, and original 3-phase post designs, respectively. The GR 3-phase pre latch designs, in turn, show average power savings of 21.0%, 21.4%, and 4.3% with respect to the FF, master-slave, and original 3-phase pre designs.

Table 4.9: Power savings (%) of original 3-phase post (O3Post), GR 3-phase post (GR3Post), original 3-phase pre (O3Pre), and GR 3-phase pre (GR3Pre) latch-based designs, compared to flip-flop (FF) and master-slave latch (M-S) designs.

Design | Power Save w.r.t. FF: O3Post GR3Post O3Pre GR3Pre | Power Save w.r.t. M-S: O3Post GR3Post O3Pre GR3Pre
ISCAS
s1196 7.12 12.87 12.70 14.21 | 11.06 16.56 16.40 17.85
s1238 6.48 9.94 12.20 7.13 | 14.19 17.36 19.44 14.78
s1423 8.21 17.52 12.40 22.41 | -19.62 -7.48 -14.15 -1.10
s1488 -0.06 3.64 2.80 -6.80 | 10.61 13.91 13.17 4.58
s5378 21.75 25.35 29.22 27.08 | 15.61 19.49 23.67 21.36
s9234 17.72 28.39 24.23 33.30 | 6.73 18.83 14.12 24.40
s13207 23.67 28.82 14.95 28.34 | 17.87 23.42 8.49 22.90
s15850 17.10 23.95 18.44 20.29 | 14.10 21.20 15.49 17.40
s35932 24.32 27.03 29.73 27.57 | 16.67 19.64 22.62 20.24
s38417 21.83 28.47 22.73 14.82 | 16.03 23.16 16.99 8.50
s38584 5.52 8.28 6.21 11.72 | -3.01 0.00 -2.26 3.76
Average 13.97 19.48 16.87 18.19 | 9.11 15.10 12.18 14.06
CEP
AES 56.72 56.49 55.03 55.16 | 42.99 42.68 40.77 40.93
DES3 21.42 27.16 21.71 25.62 | 3.18 10.26 3.55 8.37
SHA256 0.82 11.51 6.85 11.29 | 27.21 35.06 31.64 34.90
MD5 9.96 19.05 3.43 28.61 | 34.55 41.16 29.81 48.11
Average 22.23 28.55 21.76 30.17 | 26.99 32.29 26.44 33.08
CPU
Plasma 19.03 20.29 19.51 28.04 | 16.54 17.84 17.04 25.83
RISC-V 8.99 16.19 3.72 15.93 | 26.63 32.44 22.39 32.24
ARM-M0 7.92 8.72 12.78 13.78 | 36.56 37.11 39.90 40.59
Average 11.98 15.07 12.00 19.25 | 26.58 29.13 26.44 32.89
Average (all) 15.47 20.76 17.15 21.03 | 16.00 21.26 17.73 21.42

Table 4.9 summarizes the power savings achieved by the four different 3-phase latch approaches. The GR 3-phase pre latch approach yields the highest power savings, with average reductions of 21.0% and 21.4% compared to FF and master-slave latch-based designs, respectively. The GR 3-phase post latch approach shows slightly lower power savings on the CEP and CPU designs, but somewhat higher savings on the ISCAS benchmarks.

Table 4.10: Area savings (%) in combinational logic and sequential elements w.r.t. flip-flop (FF) designs achieved by master-slave latch (M-S), original 3-phase post (O3Post), GR 3-phase post, original 3-phase pre (O3Pre), and GR 3-phase pre latch-based designs

Design | M-S: Seq Comb Total | Orig 3-Phase Post: Seq Comb Total | GR 3-Phase Post: Seq Comb Total | Orig 3-Phase Pre: Seq Comb Total | GR 3-Phase Pre: Seq Comb Total
ISCAS
s1196 13.04 1.89 5.02 | 34.78 -0.57 9.37 | 39.12 1.89 12.35 | 26.08 -11.90 -1.22 | 36.94 -1.32 9.43
s1238 13.04 0.29 3.90 | 34.78 2.01 11.30 | 39.12 -4.01 8.22 | 26.08 -4.39 4.25 | 36.94 -4.68 7.12
s1423 13.31 28.01 20.42 | 21.86 -5.70 8.53 | 17.85 -10.27 4.25 | 17.16 -13.92 2.13 | 22.82 -0.34 11.62
s1488 -13.06 -6.21 -6.92 | 13.06 -9.90 -7.52 | 13.06 -10.66 -8.19 | -26.11 -10.74 -12.33 | 6.53 -10.23 -8.50
s5378 15.20 25.40 20.04 | 31.29 13.69 22.94 | 33.93 10.38 22.76 | 18.57 -5.59 7.11 | 35.77 2.78 20.12
s9234 11.07 24.80 16.81 | 28.78 2.42 17.76 | 28.50 6.71 19.38 | 30.46 -0.95 17.32 | 30.74 6.15 20.45
s13207 12.91 42.80 23.90 | 28.05 18.46 24.52 | 28.53 23.47 26.67 | 6.34 6.08 6.24 | 29.95 23.27 27.50
s15850 10.66 8.50 9.76 | 25.91 10.07 19.28 | 27.53 8.39 19.52 | 22.49 -19.61 4.87 | 23.74 -2.60 12.72
s35932 13.04 29.67 20.51 | 29.33 33.30 31.11 | 29.31 16.44 23.53 | 26.52 18.38 22.86 | 29.33 23.55 26.73
s38417 18.09 29.46 22.73 | 28.58 5.00 18.95 | 28.93 5.84 19.50 | 23.51 -6.06 11.43 | 29.09 -25.22 6.91
s38584 9.21 23.23 15.99 | 17.36 14.38 15.92 | 19.47 15.67 17.63 | 16.24 6.00 11.28 | 19.40 21.17 20.25
Average 10.59 18.89 13.83 | 26.71 7.56 15.65 | 27.76 5.80 15.06 | 17.03 -3.88 6.72 | 27.39 2.96 14.03
CEP
AES 2.42 10.34 8.84 | 25.42 6.75 10.30 | 25.42 6.27 9.91 | 23.92 5.25 8.80 | 24.49 5.12 8.81
DES3 -9.53 5.32 -0.93 | 26.31 -2.75 9.49 | 29.91 0.80 13.06 | 26.18 0.97 11.59 | 32.32 0.85 14.11
SHA256 -19.02 20.34 4.14 | 9.92 15.69 13.31 | 9.88 14.06 12.34 | 11.06 16.19 14.08 | 11.06 22.14 17.58
MD5 -26.43 15.34 2.85 | 29.88 -10.70 1.43 | 31.38 -11.14 1.57 | 29.53 9.05 15.17 | 33.91 -5.84 6.05
Average -13.14 12.84 3.72 | 22.88 2.25 8.63 | 24.15 2.50 9.22 | 22.67 7.87 12.41 | 25.44 5.57 11.63
CPU
Plasma 23.83 5.99 15.65 | 33.24 -16.93 10.22 | 33.52 -10.54 13.30 | 31.52 -10.01 12.46 | 33.55 -6.56 15.15
RISC-V -14.15 1.86 -5.84 | 13.85 -6.82 3.12 | 19.31 -6.40 5.96 | 21.33 -17.06 1.40 | 21.04 -4.39 7.84
ARM-M0 -0.53 -3.57 -2.41 | 13.84 -24.85 -10.05 | 27.32 -19.42 -1.54 | 29.81 -22.85 -2.70 | 30.88 -15.80 2.06
Average 3.05 1.43 2.47 | 20.31 -16.20 1.10 | 26.72 -12.12 5.91 | 27.56 -16.64 3.72 | 28.49 -8.91 8.35
Average (all) 4.06 14.64 9.69 | 24.79 2.42 11.67 | 26.78 2.08 12.24 | 20.04 -3.40 7.49 | 27.14 1.56 12.55

Table 4.11: Power savings (%) in combinational logic and sequential elements w.r.t. flip-flop (FF) designs achieved by master-slave latch (M-S), original 3-phase post (O3Post), GR 3-phase post, original 3-phase pre (O3Pre), and GR 3-phase pre latch-based designs

Design | M-S: Seq Comb Total | Orig 3-Phase Post: Seq Comb Total | GR 3-Phase Post: Seq Comb Total | Orig 3-Phase Pre: Seq Comb Total | GR 3-Phase Pre: Seq Comb Total
ISCAS
s1196 -12.44 0.81 -4.43 | 15.56 1.68 7.12 | 26.54 4.01 12.87 | 27.47 3.09 12.70 | 23.46 8.23 14.21
s1238 -12.10 -6.81 -8.98 | 15.33 0.35 6.48 | 26.85 -1.86 9.94 | 27.58 1.51 12.20 | 26.33 -6.29 7.13
s1423 21.32 30.52 23.26 | 6.31 15.26 8.21 | 19.30 10.92 17.52 | 12.03 13.82 12.40 | 21.16 27.05 22.41
s1488 -45.42 -1.49 -11.93 | -7.53 2.19 -0.06 | -1.91 5.32 3.64 | -0.88 3.92 2.80 | -1.41 -8.54 -6.80
s5378 -2.32 34.62 7.27 | 18.42 31.16 21.75 | 24.88 26.72 25.35 | 29.45 28.33 29.22 | 26.32 29.21 27.08
s9234 7.38 46.64 11.78 | 17.08 22.80 17.72 | 28.35 28.70 28.39 | 22.43 38.56 24.23 | 32.16 42.43 33.30
s13207 -0.28 50.81 7.06 | 19.80 46.74 23.67 | 26.39 43.44 28.82 | 9.55 47.07 14.95 | 25.26 46.50 28.34
s15850 0.32 17.90 3.49 | 13.46 33.53 17.10 | 21.02 37.26 23.95 | 14.73 35.37 18.44 | 17.23 34.24 20.29
s35932 4.15 28.14 9.19 | 22.94 29.21 24.32 | 29.24 21.04 27.03 | 27.61 37.14 29.73 | 27.05 29.78 27.57
s38417 -0.12 31.69 6.91 | 19.98 28.36 21.83 | 29.40 25.18 28.47 | 20.73 29.83 22.73 | 26.25 -25.48 14.82
s38584 -1.18 27.39 8.28 | -7.58 30.29 5.52 | 0.15 24.46 8.28 | -5.33 28.46 6.21 | 1.22 31.37 11.72
Average -3.70 23.66 4.72 | 12.16 21.96 13.97 | 20.93 20.47 19.48 | 16.85 24.28 16.87 | 20.46 18.96 18.19
CEP
AES 23.84 14.26 24.08 | 57.54 -32.54 56.72 | 57.37 -38.83 56.49 | 55.61 -12.18 55.03 | 55.75 -12.69 55.16
DES3 17.61 19.77 18.83 | 23.23 19.98 21.42 | 31.95 23.41 27.16 | 24.51 19.51 21.71 | 28.10 23.67 25.62
SHA256 -88.21 29.79 -36.27 | -4.25 7.26 0.82 | 15.37 6.60 11.51 | 10.18 2.59 6.85 | 17.67 3.11 11.29
MD5 99.71 99.94 -37.57 | 13.63 8.29 9.96 | 45.78 6.59 19.05 | 3.84 3.26 3.43 | 59.25 14.34 28.61
Average 13.24 40.94 -7.73 | 22.54 0.75 22.23 | 37.62 -0.56 28.55 | 23.53 3.30 21.76 | 40.19 7.11 30.17
CPU
Plasma -14.53 30.67 2.98 | 20.76 16.30 19.03 | 20.73 19.57 20.29 | 13.91 28.39 19.51 | 28.94 26.60 28.04
RISC-V -50.23 20.02 -24.06 | 2.27 20.26 8.99 | 12.39 22.53 16.19 | -5.23 18.74 3.72 | 15.29 16.95 15.93
ARM-M0 -70.58 -17.22 -45.14 | 27.71 -6.73 7.92 | 7.38 9.70 8.72 | 13.15 12.50 12.78 | 13.56 13.99 13.78
Average -45.11 11.16 -22.07 | 16.91 9.94 11.98 | 13.50 17.27 15.07 | 7.28 19.88 12.00 | 19.26 19.18 19.25
Average (all) -6.84 25.41 -2.51 | 15.26 15.24 15.47 | 23.40 15.27 20.76 | 16.74 18.88 17.15 | 24.64 16.36 21.03

It is interesting to compare the difference between the area and power savings. Tables 4.10 and 4.11 compare the area and power savings relative to FF-based designs and decompose the savings into two groups: combinational logic and sequential elements (see footnote 2). One can observe that 3-phase latch-based designs save both sequential area and sequential power by about 25% on average. Interestingly, this is close to the maximum savings we would expect, ignoring the impact of retiming, if the design were a linear pipeline. The combinational logic in 3-phase designs, on the other hand, has limited area savings (0.6%) but higher power savings (16%) on average. This suggests that 3-phase designs have less glitching than their FF-based counterparts. Note that the master-slave designs save little area and power in sequential logic compared to 3-phase designs, but show both area benefits (15%) and power reductions (25%) in combinational logic. This can be explained by the fact that latch-based designs allow time borrowing, which relaxes the normal FF edge-to-edge timing requirements. In addition, master-slave designs have more movable latches than 3-phase designs, which enables additional time-borrowing benefits.

The place-and-routed CPUs (RISC-V and Arm-M0) are also tested on Coremark and Dhrystone, two standard benchmarks for measuring general processor (CPU) performance. In Fig. 4.8, the power decomposition and the total power consumption are labeled in the middle and on top of each bar, respectively.
Footnote 2: The power in the sequential group consists of clock tree and register power.

Figure 4.8: Power dissipation (mW) of RISC-V and Arm-M0 running Dhrystone and Coremark in flip-flop (FF), converted master-slave latch (M-S), original 3-phase post latch (O3Post), the proposed graph-based retiming 3-phase post latch (GR3Post), original 3-phase pre latch (O3Pre), and the proposed graph-based retiming 3-phase pre latch (GR3Pre) based designs

The proposed retiming algorithm on 3-phase post designs saves power by an average of 20.8%, 25.9%, and 6.2% in RISC-V and 10.8%, 22.2%, and -0.8% in Arm-M0 compared to the FF, master-slave, and original 3-phase post designs, respectively. Similarly, the GR 3-phase pre designs save power on average by 19.9%, 25.0%, and 9.2% in RISC-V, and by 13.2%, 24.3%, and 0.5% in Arm-M0 compared to the FF, master-slave, and original 3-phase pre designs, respectively. The obtained average power savings are generally smaller than the savings obtained with the simple CPU testbench programs. This can be explained by the decrease in clock-gating efficiency, especially the data-driven clock-gating (DDCG) efficiency. Note that a positive power saving in Arm-M0 with Coremark could be obtained by tuning the parameters in Algorithm 4.1. For example, the total power is reduced to 1.84 mW (a 2% saving compared to the original 3-phase post Arm-M0) by setting U_en to 16, in other words, by allowing more enable domains to merge.
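The "Save (%)" figures quoted in this section appear to follow the usual relative-reduction formula, (P_ref - P_new) / P_ref x 100, and the reported overall averages appear to be means of the per-design percentages rather than the saving computed from the averaged power. The following minimal Python sketch checks both observations against entries copied from Table 4.8. The function name `saving_percent` and the variable names are illustrative assumptions, not part of the thesis flow.

```python
from statistics import mean


def saving_percent(reference: float, new: float) -> float:
    """Relative reduction of `new` with respect to `reference`, in percent."""
    return (reference - new) / reference * 100.0


# Total power (mW) of s35932, copied from Table 4.8.
s35932 = {"FF": 18.50, "M-S": 16.80, "O3Post": 14.00, "GR3Post": 13.50}

for baseline in ("FF", "M-S", "O3Post"):
    pct = saving_percent(s35932[baseline], s35932["GR3Post"])
    print(f"s35932 GR3Post saving vs {baseline}: {pct:.2f}%")

# Per-design GR3Post savings w.r.t. FF (%), copied from Table 4.8.
gr3post_vs_ff = [
    12.87, 9.94, 17.52, 3.64, 25.35, 28.39,   # ISCAS
    28.82, 23.95, 27.03, 28.47, 8.28,         # ISCAS (continued)
    56.49, 27.16, 11.51, 19.05,               # CEP
    20.29, 16.19, 8.72,                       # CPU
]

# Mean of the per-design percentages (matches the reported overall average).
overall = mean(gr3post_vs_ff)
print(f"mean of per-design savings: {overall:.2f}%")

# Saving computed from the averaged total power instead (a different quantity).
from_averages = saving_percent(4.30, 3.02)
print(f"saving of averaged power:   {from_averages:.2f}%")
```

Rows whose tabulated percentages do not reproduce exactly from the two-decimal totals (s1196, for instance) are presumably affected by rounding of the displayed power values.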
Table 4.12: Run-time (sec) comparison for FF-, master-slave, the original 3-phase post latch, and GR 3-phase post latch-based designs

Design | FF: Syn PnR Total | M-S: Syn PnR Total | Orig 3-phase Post: ILP Syn PnR Total | GR 3-phase Post: Cut Syn PnR Total
ISCAS
s1196 0.4 m 1.7 m 2.1 m | 3.8 m 2.8 m 6.6 m | 4 s 1.8 m 2.8 m 4.7 m | 4 s 4.4 m 2.8 m 7.3 m
s1238 0.4 m 1.5 m 1.9 m | 3.8 m 2.9 m 6.7 m | 4 s 1.8 m 2.8 m 4.6 m | 4 s 2.0 m 3.0 m 5.1 m
s1423 0.4 m 1.7 m 2.1 m | 3.9 m 3.6 m 7.5 m | 4 s 2.3 m 4.0 m 6.3 m | 4 s 8.1 m 3.8 m 12.0 m
s1488 0.3 m 1.5 m 1.8 m | 3.9 m 2.8 m 6.8 m | 3 s 1.6 m 3.0 m 4.7 m | 4 s 2.0 m 3.0 m 5.0 m
s5378 2.9 m 2.1 m 4.9 m | 4.0 m 3.7 m 7.7 m | 3 s 2.8 m 4.2 m 7.1 m | 3 s 6.3 m 4.2 m 10.5 m
s9234 0.4 m 2.2 m 2.6 m | 3.8 m 3.9 m 7.7 m | 4 s 2.4 m 4.1 m 6.6 m | 4 s 3.2 m 3.8 m 7.1 m
s13207 0.5 m 3.2 m 3.7 m | 4.3 m 5.9 m 10.2 m | 4 s 3.3 m 6.3 m 9.6 m | 4 s 28.9 m 6.0 m 35.0 m
s15850 0.5 m 3.5 m 4.0 m | 4.8 m 8.2 m 13.0 m | 5 s 4.2 m 7.0 m 11.3 m | 5 s 40.2 m 7.8 m 48.0 m
s35932 1.2 m 9.3 m 10.4 m | 4.0 m 19.0 m 22.9 m | 4 s 8.9 m 17.9 m 26.9 m | 7 s 34.8 m 23.8 m 1.0 hr
s38417 1.0 m 8.6 m 9.5 m | 6.4 m 15.8 m 22.2 m | 5 s 13.4 m 17.8 m 31.3 m | 10 s 62.2 m 21.1 m 1.4 hr
s38584 1.0 m 9.4 m 10.4 m | 7.2 m 18.9 m 26.1 m | 4 s 12.3 m 20.0 m 32.3 m | 19 s 20.1 m 18.1 m 0.6 hr
Average 0.8 m 4.0 m 4.9 m | 4.5 m 7.9 m 12.5 m | 4 s 5.0 m 8.2 m 13.2 m | 6 s 19.3 m 8.8 m 28.2 m
CEP
AES 14.0 m 3.0 hr 3.3 hr | 1.8 hr 4.8 hr 6.6 hr | 8 s 4.4 hr 5.4 hr 9.8 hr | 11 m 6.5 hr 5.8 hr 12.4 hr
DES3 0.8 m 4.3 m 5.1 m | 5.1 m 8.1 m 13.2 m | 4 s 5.3 m 8.6 m 13.9 m | 4 s 46.0 m 7.7 m 53.8 m
SHA256 0.9 m 10.2 m 11.1 m | 9.5 m 17.8 m 27.3 m | 7 s 16.6 m 18.4 m 35.1 m | 14 s 25.5 m 19.9 m 45.6 m
MD5 2.0 m 10.0 m 12.0 m | 11.0 m 16.7 m 27.7 m | 8 s 10.2 m 25.1 m 35.4 m | 17 s 51.7 m 0.4 hr 1.2 hr
Average 4.4 m 51.9 m 0.9 hr | 32.8 m 83.0 m 115.8 m | 7 s 1.2 hr 1.6 hr 2.8 hr | 3 m 128.4 m 1.6 hr 3.8 hr
CPU
Plasma 3.7 m 10.3 m 13.9 m | 8.9 m 20.4 m 29.3 m | 16 s 6.3 m 21.8 m 28.3 m | 25 s 40.6 m 20.1 m 61.2 m
RISCV 4.9 m 16.4 m 21.3 m | 19.6 m 36.0 m 55.7 m | 14 s 57.3 m 35.6 m 1.6 hr | 34 s 65.3 m 35.7 m 1.7 hr
ArmM0 5.1 m 13.5 m 18.6 m | 11.7 m 27.2 m 38.9 m | 27 s 26.7 m 52.9 m 1.3 hr | 37 s 87.8 m 0.7 hr 2.1 hr
Average 4.5 m 13.4 m 17.9 m | 13.4 m 27.9 m 41.3 m | 19 s 30.1 m 36.7 m 1.1 hr | 32 s 64.6 m 32.1 m 1.6 hr
Average (all) 2.2 m 16.2 m 18.5 m | 12.3 m 27.9 m 40.2 m | 7 s 24.6 m 31.9 m 56.6 m | 49 s 51.1 m 32.7 m 1.4 hr

The run-time comparison of all designs is reported in Table 4.12 and Table 4.13. The design flow for GR 3-phase designs includes the min-cut solver, retiming and clock gating, and place-and-route. According to Table 4.12, GR 3-phase post designs require on average 4.6X, 2.1X, and 1.5X the run-time of the FF, master-slave, and original 3-phase post latch-based designs, respectively. In particular, the min-cut solver consumes at most 11.3 minutes in GR 3-phase post designs among all the benchmarks and is generally a tiny fraction (1%) of the overall run-time.

Table 4.13: Run-time comparison for FF-, master-slave, the original 3-phase pre latch, and GR 3-phase pre latch-based designs

Design | FF: Syn PnR Total | M-S: Syn PnR Total | Orig 3-phase Pre: ILP Syn PnR Total | GR 3-phase Pre: ILP Syn PnR Total
ISCAS
s1196 0.4 m 1.7 m 2.1 m | 3.8 m 2.8 m 6.6 m | 4 s 12.9 m 6.9 m 19.8 m | 3 s 17.2 m 3.4 m 20.7 m
s1238 0.4 m 1.5 m 1.9 m | 3.8 m 2.9 m 6.7 m | 4 s 4.2 m 3.2 m 7.4 m | 3 s 12.7 m 4.2 m 16.9 m
s1423 0.4 m 1.7 m 2.1 m | 3.9 m 3.6 m 7.5 m | 4 s 5.7 m 8.2 m 14.0 m | 2 s 30.4 m 4.4 m 34.8 m
s1488 0.3 m 1.5 m 1.8 m | 3.9 m 2.8 m 6.8 m | 4 s 4.3 m 7.2 m 11.5 m | 2 s 10.8 m 3.9 m 14.7 m
s5378 2.9 m 2.1 m 4.9 m | 4.0 m 3.7 m 7.7 m | 4 s 8.5 m 5.3 m 13.8 m | 3 s 19.4 m 5.6 m 25.1 m
s9234 0.4 m 2.2 m 2.6 m | 3.8 m 3.9 m 7.7 m | 4 s 2.8 m 4.0 m 6.8 m | 2 s 18.8 m 5.0 m 23.8 m
s13207 0.5 m 3.2 m 3.7 m | 4.3 m 5.9 m 10.2 m | 5 s 8.8 m 8.7 m 17.6 m | 4 s 7.8 m 6.6 m 14.4 m
s15850 0.5 m 3.5 m 4.0 m | 4.8 m 8.2 m 13.0 m | 6 s 25.6 m 11.6 m 37.3 m | 5 s 10.5 m 10.9 m 21.5 m
s35932 1.2 m 9.3 m 10.4 m | 4.0 m 19.0 m 22.9 m | 5 s 34.3 m 29.5 m 63.9 m | 9 s 74.7 m 20.2 m 1.6 hr
s38417 1.0 m 8.6 m 9.5 m | 6.4 m 15.8 m 22.2 m | 8 s 170.3 m 42.1 m 212.5 m | 11 s 33.4 m 26.9 m 1.0 hr
s38584 1.0 m 9.4 m 10.4 m | 7.2 m 18.9 m 26.1 m | 6 s 57.1 m 24.5 m 81.7 m | 14 s 31.3 m 29.8 m 1.0 hr
Average 0.8 m 4.0 m 4.9 m | 4.5 m 7.9 m 12.5 m | 5 s 30.4 m 13.7 m 44.2 m | 5 s 24.3 m 11.0 m 35.3 m
CEP
AES 14.0 m 3.0 hr 3.3 hr | 1.8 hr 4.8 hr 6.6 hr | 13 s 2.9 hr 5.4 hr 8.3 hr | 18 s 6.3 hr 8.1 hr 14.3 hr
DES3 0.8 m 4.3 m 5.1 m | 5.1 m 8.1 m 13.2 m | 6 s 7.7 m 10.5 m 18.3 m | 5 s 33.8 m 8.0 m 41.9 m
SHA256 0.9 m 10.2 m 11.1 m | 9.5 m 17.8 m 27.3 m | 9 s 17.4 m 17.2 m 34.7 m | 10 s 19.6 m 15.6 m 35.4 m
MD5 2.0 m 10.0 m 12.0 m | 11.0 m 16.7 m 27.7 m | 12 s 16.0 m 111.2 m 127.4 m | 11 s 45.0 m 0.3 hr 1.1 hr
Average 4.4 m 51.9 m 0.9 hr | 32.8 m 83.0 m 115.8 m | 10 s 0.9 hr 1.9 hr 2.8 hr | 11 s 118.7 m 2.2 hr 4.2 hr
CPU
Plasma 3.7 m 10.3 m 13.9 m | 8.9 m 20.4 m 29.3 m | 35 s 11.5 m 18.6 m 30.7 m | 22 s 92.0 m 21.4 m 1.9 hr
RISCV 4.9 m 16.4 m 21.3 m | 19.6 m 36.0 m 55.7 m | 22 s 38.5 m 34.4 m 1.2 hr | 28 s 120.4 m 34.0 m 2.6 hr
ArmM0 5.1 m 13.5 m 18.6 m | 11.7 m 27.2 m 38.9 m | 62 s 51.5 m 76.1 m 2.1 hr | 35 s 46.5 m 0.6 hr 1.4 hr
Average 4.5 m 13.4 m 17.9 m | 13.4 m 27.9 m 41.3 m | 40 s 33.8 m 43.0 m 1.3 hr | 28 s 86.3 m 30.2 m 2.0 hr
Average (all) 2.2 m 16.2 m 18.5 m | 12.3 m 27.9 m 40.2 m | 12 s 36.2 m 41.3 m 77.6 m | 10 s 55.6 m 41.0 m 1.6 hr

Table 4.13 reports run-times for the original 3-phase pre and GR 3-phase pre designs. GR 3-phase pre designs require on average 5.2X, 2.4X, and 1.2X the run-time of the FF, master-slave, and original 3-phase pre latch-based designs, respectively. Note that an average of 32% of the overall run-time is consumed by loading min-cut results into the graph, a step that could be implemented more efficiently with a more integrated coding approach. Nevertheless, these run-time increases are quite manageable, suggesting that our proposed approach is computationally practical for at least moderately-sized blocks.

4.6 Summary and Conclusions

Chapter 3 shows that automatic conversion of FF-based designs to three-phase latch-based designs can yield significantly lower power with no degradation in performance.
This chapter presented a new graph-based retiming algorithm that takes clock gating into account. The new retiming algorithm enhances the design flow and further increases these gains. The post-PnR results on a broad range of benchmarks show that significant additional power savings can be achieved, yielding overall average power savings of over 20%, of which 13% comes from savings in sequential logic and 7% from combinational logic.

Chapter 5 Summary

This chapter concludes this dissertation. We summarize our work and propose some interesting future work.

5.1 Summary

As VLSI design emerged, two devices, edge-triggered flip-flops (FFs) and level-sensitive latches, were identified as viable means of synchronization and state storage. Compared to FFs, latches have the advantages of time borrowing, skew and jitter tolerance, smaller cell area, and lower capacitance. Latch-based designs can thus consume less power and area than FF-based designs, particularly when process variation is considered. However, the VLSI community gravitated to using FFs because they more easily support the synchronous paradigm captured in most RTL specifications.

This thesis shows how these flows can be easily extended to yield latch-based designs that significantly reduce power consumption with no loss in performance or area. Moreover, the proposed flow for 3-phase latch-based designs is based on commercial synthesis and physical design tools supported by a few custom optimization functions, making the adoption of the proposed approach much easier. We performed extensive experiments to evaluate the approach, including synthesis and place-and-route in a modern technology node. Our resulting latch-based designs save an average of 21.0% and 21.4% in power compared to the more traditional FF-based and master-slave latch-based alternatives, respectively, across a broad range of benchmarks that include three CPU designs.
5.2 Conclusions and Possible Next Steps

This thesis presents algorithms to automatically convert an FF-based design into a 3-phase latch-based design, using an ILP to minimize the number of required latches. To further enhance these gains, optimizations in latch insertion, clock duty cycle selection, retiming, and clock-gating strategies are incorporated into this new flow. The back-annotated results on a broad range of benchmark circuits show that significant savings are possible in both area and power, particularly for pipelined circuits such as multi-stage CPUs, when compared to both FF and master-slave latch-based designs.

Interesting future work includes several potential directions. First, latch-based designs have been demonstrated to have greater power and area benefits when process, voltage, and temperature (PVT) variation is considered. One interesting direction is to quantify the benefits of 3-phase latch-based designs associated with higher tolerance to PVT variations and increased robustness to hold failures.

Second, our current graph-based retiming algorithm optimizes the power dissipation in sequential elements, but does not consider the area benefits in combinational logic. One area of interesting future work is to enhance the retiming algorithm by modeling the gate delay and refining the retiming graph to consider the impact of time borrowing. It is also important to further explore the choice of parameter settings we used to reflect the clock tree and clock-gating costs.

Resilient designs have the advantage of removing the large margins in synchronous designs by operating at average-case path delays most of the time and slowing down the circuit in the presence of timing errors. We believe the 3-phase approach is advantageous when applied to soft-error and timing-resilient templates, in which the decrease in latches also reduces the overhead of the necessary error detection logic.
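To make the "decrease in latches" concrete, the following minimal Python sketch counts latches in an idealized, one-bit-wide linear pipeline. It rests on two assumptions that are consistent with the linear-pipeline figures quoted in this chapter (50% of slave latches, 25% of all latches) but are otherwise illustrative: a master-slave conversion replaces every FF with one master and one slave latch, and a 3-phase conversion keeps every master but only half of the slaves. The function `latch_counts` and its dictionary keys are hypothetical names, not part of the thesis flow, and retiming and clock gating are ignored.

```python
def latch_counts(num_ffs: int) -> dict:
    """Latch counts for an idealized linear pipeline with `num_ffs` FF stages.

    Assumed model (illustrative only):
      - master-slave: one master and one slave latch per FF;
      - 3-phase: all masters are kept, half of the slaves are removed.
    """
    if num_ffs <= 0 or num_ffs % 2 != 0:
        raise ValueError("this idealized model expects a positive, even number of FFs")

    masters = num_ffs
    slaves_master_slave = num_ffs
    slaves_three_phase = num_ffs // 2

    master_slave_total = masters + slaves_master_slave
    three_phase_total = masters + slaves_three_phase

    return {
        "master_slave": master_slave_total,
        "three_phase": three_phase_total,
        "slave_saving_pct": 100.0 * (slaves_master_slave - slaves_three_phase)
        / slaves_master_slave,
        "total_saving_pct": 100.0 * (master_slave_total - three_phase_total)
        / master_slave_total,
    }


if __name__ == "__main__":
    for stages in (2, 8, 64):
        print(stages, latch_counts(stages))
```

Under this model the total saving is 25% regardless of pipeline depth, which is also the order of the average sequential savings reported for the 3-phase designs in Section 4.5. Any latch that must be hardened or paired with error detection logic benefits from the same reduction.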
In particular, soft-error resilient templates can be used to build radiation-hardened designs that handle harsh environments such as nuclear power plants and space. The technology libraries for radiation-hardened designs, however, are relatively expensive in terms of performance, power, and/or area. Because converting to 3-phase latch-based designs reduces the number of latches needed, it inherently reduces the number of radiation-hardened latches required. In the case of linear pipelines, for example, 3-phase designs save 50% of the slave latches compared to their master-slave latch-based counterparts. The resulting 25% reduction in total radiation-hardened latches saves both area and power in the sequential and error-detecting logic (EDL). In addition, adopting 3-phase latches in timing-resilient designs may not only reduce the number of latches, but may also reduce the number of EDLs by smartly choosing between latch phases p1 and p3. In particular, the EDL may be reduced by assigning critical FF-to-FF paths such that the starting FF is converted to a latch clocked by p3 and the ending FF is converted to a latch clocked by p1. In this case, the edge-to-edge timing is relaxed by one duty cycle, and thus satisfying this path may not require error-detecting logic.

Finally, scan design has been the foundation for testing densely packaged circuits. For instance, the shift register latch (SRL) is compatible with level-sensitive scan design (LSSD) testing. One interesting direction for future work is to enhance the testability of 3-phase latch-based designs by implementing a new use of the LSSD scan structure that targets low overhead and higher performance.
Devadas, and A. Ghosh, \Retiming sequential circuits for low power," International journal of high speed electronics and systems, vol. 7, no. 02, pp. 323{340, 1996. 108 [43] C. V. Schimp e, S. Simon, and J. A. Nossek, \Optimal placement of registers in data paths for low power design," in IEEE International Symposium on Circuits and Systems. Circuits and Systems in the Information Age, vol. 3, 1997, pp. 2160{2163. [44] R. K. Ranjan, V. Singhal, F. Somenzi, and R. K. Brayton, \On the opti- mization power of retiming and resynthesis transformations," in IEEE/ACM International Conference on Computer-Aided Design, 1998, pp. 402{407. [45] C. Chu, E. F. Young, D. K. Tong, and S. Dechu, \Retiming with interconnect and gate delay," in International Conference on Computer Aided Design, 2003, pp. 221{226. [46] C. E. Leiserson and J. B. Saxe, \Retiming synchronous circuitry," Digital (Palo Alto, CA US ; Cambridge, MA US). Systems research center, Tech. Rep. D-SRC-13, 1986. [47] S. S. Sapatnekar and R. B. Deokar, \Utilizing the retiming-skew equivalence in a practical algorithm for retiming large circuits," IEEE Trans. on CAD, vol. 15, no. 10, pp. 1237{1248, 1996. [48] N. Shenoy, \Retiming: Theory and practice," Integration, vol. 22, no. 1-2, pp. 1{21, 1997. [49] N. Maheshwari and S. Sapatnekar, \Ecient retiming of large circuits," IEEE Trans. on VLSI, vol. 6, no. 1, pp. 74{83, 1998. [50] D. K. Smith, \Network ows: theory, algorithms, and applications," Journal of the Operational Research Society, vol. 45, no. 11, pp. 1340{1340, 1994. 109
Abstract
The growing use of portable and wireless electronic systems and Internet-of-Things (IoT) applications motivates the desire for small, energy-efficient designs. However, with the end of Moore's law, the market must adapt to a relatively fixed technology base, which will make improvements in area and energy efficiency more challenging. This thesis presents a novel way to reduce power consumption and increase energy efficiency by revisiting a decision made in the early days of Very Large Scale Integration (VLSI).

As VLSI design emerged, two devices, edge-triggered flip-flops (FFs) and level-sensitive latches, were identified as viable means of synchronization and state storage. Compared to FFs, latches have the advantages of time borrowing, skew and jitter tolerance, smaller cell area, and lower capacitance. Latch-based designs can thus consume less power and area than FF-based designs, particularly when process variation is considered. However, the VLSI community gravitated to FFs because they more easily support the synchronous paradigm captured in most RTL specifications. In fact, over the past four decades, the computer-aided-design industry has focused on building sophisticated software tools that compile these RTL specifications to a mixture of combinational logic and FFs, with limited support for latch-based alternatives. This thesis shows how these flows can be easily extended to yield latch-based designs that significantly reduce power consumption with no loss in performance and no increase in area.

The approach taken involves the introduction of a novel 3-phase clocking scheme that requires a remarkably small number of latches as storage elements to achieve the same performance as FF-based equivalents. To enable the application of this scheme to both new and legacy designs, this thesis develops a novel conversion algorithm that takes any traditional FF-based design and converts it to a more efficient 3-phase latch-based design. Moreover, our proposed flow for 3-phase latch-based designs is based on commercial synthesis and physical design tools supplemented with a few custom optimization functions, making the adoption of the proposed approach much easier. We performed extensive experiments to evaluate the approach, including synthesis and place-and-route in a modern technology node. Our resulting latch-based designs save an average of 21.0% and 21.4% compared to more traditional FF-based alternatives across a broad range of benchmarks that include three CPU designs.

More specifically, this thesis addresses the following topics:
- Development of an automated conversion flow that converts any synchronous RTL specification with a single clock domain to a 3-phase latch-based design.
- Development of optimization strategies for latch insertion, clock duty cycle, retiming, and clock gating.
- Implementation of placement, clock tree synthesis, and routing physical design flows for latch-based designs using commercial tools.
- Evaluation of the benefits of the 3-phase latch-based designs, including their impact on the number of registers, area, and power dissipation.
Asset Metadata
Creator: Cheng, Huimei (author)
Core Title: Automatic conversion from flip-flop to 3-phase latch-based designs
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 07/21/2020
Defense Date: 06/01/2020
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: clock-gating, latch-based design, logic synthesis, low-power design, multi-phase clocking, OAI-PMH Harvest, optimization, re-timing
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Beerel, Peter A. (committee chair), Gupta, Sandeep (committee member), Nakano, Aiichiro (committee member)
Creator Email: huimeich@usc.edu, huimeich92@gmail.com
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-335750
Unique identifier: UC11664336
Identifier: etd-ChengHuime-8708.pdf (filename), usctheses-c89-335750 (legacy record id)
Legacy Identifier: etd-ChengHuime-8708.pdf
Dmrecord: 335750
Document Type: Dissertation
Rights: Cheng, Huimei
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA