Average-case performance analysis and optimization of conditional asynchronous circuits
Average-Case Performance Analysis and Optimization of Conditional Asynchronous Circuits

by
Mehrdad Najibi

A dissertation presented to the FACULTY OF THE USC GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Electrical Engineering)

December 2013

Doctoral Committee:
Professor Peter Beerel, Chair
Professor Melvin Breuer
Professor Massoud Pedram
Professor Leana Golubchik

Ming Hsieh Department of Electrical Engineering

© Mehrdad Najibi 2013
All Rights Reserved

A dedication to my wife, Paniz, my father and mom, Hassan and Mina.

ACKNOWLEDGEMENTS

I would like to express my gratitude to my advisor, Peter Beerel, for his support, patience, and encouragement throughout my graduate studies. It is not often that one finds an advisor and colleague who always finds the time to listen to the little problems and roadblocks that unavoidably crop up in the course of performing research. His technical and editorial advice was essential to the completion of this dissertation and has taught me innumerable lessons and insights on the workings of academic research in general.

My appreciation also goes to the members of my committee, Melvin Breuer, Massoud Pedram, Jeffry Draper, and Leana Golubchik, for reading previous drafts of this dissertation and providing many valuable comments that improved its presentation and contents.

I would like to sincerely thank Andrew Lines, Jonathan Dama, Georgios Dimou, and Prasad Joshi for their insightful comments and valuable discussions that greatly helped me in defining the scope and various applications of this thesis.

I wish to thank many student colleagues for providing a stimulating and fun environment in which to learn and grow. I am very grateful to my colleagues and friends Hooman Hamidi, Kamran Saleh, Mostafa Salehi, Mohammad Mirza Aghatabar, and Hadi Goudarzi for their great support throughout these years.
Arash Saifhashemi, my office mate and close friend, who was always available to help, deserves special appreciation. I learned a lot from Arash and benefited from his great efforts on improving the Proteus design flow.

Lastly and most importantly, I wish to thank my dearest wife, Paniz, my parents, Hassan and Mina, and my brothers, Mehran and Mahyar, who encouraged me, supported me, and loved me. To them I dedicate this thesis.

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
LIST OF FIGURES
ABSTRACT

CHAPTER

I. Introduction
    1.1 Proteus Design Flow
    1.2 Summary of Contributions
    1.3 Organization
II. Petri Net Preliminaries
    2.1 Petri net Basics
    2.2 Unfolded Execution and Timing Simulation
    2.3 Segments and Modes
    2.4 Super Segments and Elevation
    2.5 Sequences and their associated Marked Graphs
III. Performance Analysis of Asynchronous Circuits
    3.1 Problem Statement
    3.2 Related Works
    3.3 Mode-Based Performance Bounds
        3.3.1 Unconditional Bound (UB)
        3.3.2 Analytical Closed-Form Bound (ACFB)
        3.3.3 Linear Programming Mode-Based Bounds
    3.4 Sequence-Based Bounds
        3.4.1 Critical paths in Alternating Sequences
        3.4.2 Formalizing Sequences
        3.4.3 Linear Program Bound based on Sequences (LPBS)
        3.4.4 Selective Elimination of Sequences
        3.4.5 Complexity analysis
    3.5 Proof of Correctness
    3.6 Experimental Results
        3.6.1 Benchmark Circuits
        3.6.2 Analysis
    3.7 Conclusions
IV. Mode-Based Conditional Slack Matching
    4.1 Problem Definition
    4.2 Related Work
    4.3 Performance Modeling
    4.4 Suggested MILP
    4.5 MILP Relaxation
    4.6 Mode Propagation Algorithm
    4.7 Guaranteed Average Cycle-Time
    4.8 Experimental Results
        4.8.1 Analysis of Results
    4.9 Conclusions
V. Integrated Fanout Optimization and Slack Matching
    5.1 Problem Formulation
    5.2 MILP for Integrated Fanout Optimization and Slack Matching
    5.3 Relaxation Algorithm
    5.4 Buffer Sharing and Tree Implementation
        5.4.1 Modifications to the LP to Improve Fanout Sharing
        5.4.2 Implementing the Fanout Trees - The Greedy Algorithm
    5.5 Experimental Results
    5.6 Conclusions
VI. Conclusions and Future Work
    6.1 Performance Driven Clustering of Asynchronous Circuits
    6.2 Automatic mode assignment and mode clustering
    6.3 Reconditioning with Performance Constraints

A. Theorems and Proofs

Bibliography

LIST OF FIGURES

Figure
1.1 Asynchronous ALU with conditional communication.
1.2 Proteus design flow.
1.3 Send and receive cells.
2.1 A Petri net with conditional behavior.
2.2 Split-merge FBCN model for ALU.
2.3 Unfolded execution of the Petri net of Figure 2.1.
2.4 Example of cuts and segments.
2.5 Unfolded execution of the Petri net of the ALU.
2.6 Marked graphs for the segment types illustrated in Figure 2.4.
2.7 A modified unfolded execution with $s_1$ elevated to $s_{12}$.
2.8 Example of sequences (a) $\langle \text{MULT}, \text{ADD} \rangle$, (b) $\langle \text{MULT}, \text{MULT} \rangle$.
2.9 Associated marked graphs to sequences (a) $\langle \text{MULT}, \text{ADD} \rangle$, and (b) $\langle \text{MULT}, \text{MULT} \rangle$.
3.1 Asynchronous ALU with conditional communication.
3.2 Impact of pairwise switching probabilities between modes on measured average cycle time in ALU.
3.3 Problem definition: (a) Sample graph with associated modes, (b) Markov chain model, and (c) modes of operations.
3.4 Linear program to calculate cycle-time in a marked graph.
3.5 Cycle extraction for a given mode sequence.
3.6 Linear program for LPB.
3.7 Linear program for the LPBG.
3.8 Inaccuracy of LPBG for high switching probabilities in ALU.
3.9 Linear program for the LPBS.
3.10 Potential critical path example.
3.11 Management chain benchmark.
3.12 Impact of arrival time gap in MgmtChain Example.
3.13 Improvement of the sequence based bounds for Asynchronous ALU.
3.14 Sequence-based bound accuracy.
4.1 Asynchronous ALU: (a) unconditional (b) conditional communication.
4.2 Split-merge FBCN model for ALU.
4.3 MILP for conditional slack matching.
4.4 Mode propagation example.
4.5 Results for MgmtChain $p_{MN} = 4p_{NM}$.
5.1 Implementation of fanout trees of $X_2$ and $X_3$.
5.2 MILP to solve fanout optimization and slack matching (FOSM).
5.3 Interdependency of branch assignment to fanout tree levels.
5.4 Different implementations of a fanout tree and its impact on buffer sharing.
5.5 Example: Implementation of a tree associated with hierarchical clustering {1, 3, {{4, {5, 6, 7}, 2}, 8}.
5.6 Integrated Fanout & Slack Matching Results.

NOMENCLATURE

ACFB    Analytical Closed-Form Bound
FO      Fanout Optimization
FOSM    Fanout Optimization and Slack Matching
LP      Linear Program
LPB     Linear Programming based Bound
LPBG    Linear Programming based Bound with Gap
LPSB    Linear Programming Sequence based Bound
MILP    Mixed Integer Linear Program
PCHB    Pre-Charged Half Buffer
QDI     Quasi Delay Insensitive
SM      Slack Matching
SVC     SystemVerilogCSP
UB      Unconditional Bound

ABSTRACT

Average-Case Performance Analysis and Optimization of Conditional Asynchronous Circuits

by
Mehrdad Najibi

Chair: Peter A. Beerel

Asynchronous circuits continue to gain interest as an attractive alternative to synchronous design for both low-power and high-performance applications [YVR13, BZL+12, JE12, MNW03, RVFG05]. In both applications, however, providing accurate performance bounds is important to guide performance-aware optimizations as well as to provide a guaranteed performance to the user. In synchronous technologies, performance is largely dictated by the clock frequency, which is usually invariant and independent of data values or circuit operational modes. In contrast, the cycle time of a conditional asynchronous circuit depends on the values of the input stimuli and can become very complex.

In the first part of this thesis, we present accurate average-case performance bounds for conditional asynchronous circuits that demonstrate mode-based behavior. Analyzing the performance of these circuits is challenging because the critical paths cannot be identified without knowing the exact sequence of modes of operation, which is usually unknown at design time. Markov chain processes are used to model mode switching and Petri nets are used as the performance model. We adopt a performance analysis scheme based on decomposing the behavior of the Petri net into marked graph components to reason about performance. The bounds are derived using analytical and linear programming approaches based on the switching probability matrix of the Markov chain.

In the second part of this thesis, we demonstrate the application of the proposed average-case performance bounds in performance-aware optimization of conditional asynchronous circuits. In particular, we propose an exact slack matching algorithm for conditional asynchronous circuits operating in distinct modes of operation with potentially different performance requirements. Given the probability of transitioning between modes of operation and the desired cycle time for each mode, a minimum number of slack-matching buffers is inserted into the circuit and an upper bound on the overall average cycle time is derived. The problem is formulated as a Mixed Integer Linear Program and solved through relaxation. Experimental results on a set of benchmark circuits indicate that our algorithm can save a remarkable number of slack-matching buffers.

Moreover, as a natural continuation of this work, we develop a solution to the integrated conditional slack matching and fanout optimization problem. This is the problem of jointly building fanout trees and slack matching the circuit by inserting a minimum number of asynchronous pipelined buffers to achieve an average-case performance target, such that the fanout count of every node of the circuit is less than a specified limit.

Finally, we propose several further extensions of this research, revisiting several problems that have previously been solved conservatively in the absence of accurate average-case performance bounds. These include reconditioning for power reduction under an average-case performance constraint and clustering of asynchronous pipelines under an average-case performance constraint.

CHAPTER I

Introduction

Asynchronous design is a paradigm shift away from synchronous design: synchronization between components of the system is achieved through local handshaking protocols instead of a global clock.
The increasing uncertainty in the design due to the variability of CMOS technology and the growing complexity of Systems-on-Chip have led to more challenges for the global synchronization paradigm. Asynchronous methodologies, on the other hand, provide viable solutions to some of these challenges for a broad range of applications, from high-performance, ultra-low-power, and low-noise systems to global synchronization schemes such as GALS [YVR13, BZL+12, JE12].

Asynchronous circuits can achieve higher performance and lower latency in certain applications compared to their synchronous counterparts [MLM+97, BDL11, JE12]. This advantage comes partly from efficient fine-grain pipelining [Lin98] and from pipeline optimization techniques which balance asynchronous pipelines to prevent stalls and starvation.

Asynchronous circuits can also be designed to consume less power by taking advantage of conditional communication [SB12, vBBK+94, BZL+12]. In this paradigm, data tokens are conditionally sent to the subset of system components that perform useful computation, leaving the rest of the system idle. An asynchronous circuit can then be modeled as a system that transitions among different modes of operation, with associated pairwise probabilities for switching between modes. In each mode only a subset of channels is active, and the performance requirements of these modes can indeed be different. This fact justifies the need for conditional optimization techniques such as the mode-based conditional slack matching addressed in Chapter IV.

As an example, consider the abstract block diagram of an asynchronous ALU depicted in Figure 1.1, in which individual blocks (often acting as pipeline stages) communicate by sending and receiving tokens along asynchronous channels. Note that these channels abstract the underlying handshaking protocols that implement asynchronous flow control and support pipelining across blocks.
This ALU uses a SPLIT module that sends tokens only to the appropriate functional unit and a MERGE module responsible for receiving tokens from the functional units in the correct order. This type of design is called a conditional asynchronous circuit because not all channels are always active. For example, the SPLIT module does not communicate with the $M_{in}$ channel if the opcode is addition. Similarly, the MERGE expects to receive a token on $M_{out}$ only for multiplication operations. Note that the multiplier is idle during additions because no tokens are sent to its inputs. This type of asynchronous design is essential for power efficiency but complicates performance analysis. In particular, the average performance of the block depends on the relative performance of the pipeline stages that make up the ADD and MULT units as well as on the probabilities of the ADD and MULT operations. In fact, the overall performance also depends on the transient behavior when switching between ADD and MULT operations [NB12], and thus we assert that the performance also depends on the rate of switching between modes of operation.

Figure 1.1: Asynchronous ALU with conditional communication.

We propose to model a conditional asynchronous circuit as a system that transitions among different modes of operation, with associated pairwise probabilities for switching between modes. Each mode is defined as a subset of channels which are active at the same time under a certain operational condition. Since the probabilities and the structure of the modes of the circuit can be quite different, the performance requirements of these modes can be set differently by the user to trade off performance, power, and area. This motivates the need for optimization techniques that understand conditionality and optimize the average-case behavior of such systems. For example, in the ALU of Figure 1.1, the set of all channels that are activated together during a multiplication operation is considered the MULT mode.
Similarly, the ADD mode is composed of all the channels that are activated together to process an addition. Switching between these modes happens as the value of the opcode changes from ADD to MULT or vice versa.

The main problem this thesis addresses is how to optimize such conditional asynchronous circuits under a guaranteed average cycle time constraint for the circuit. We formulate various forms of such optimization problems given an individual target cycle time for each mode of operation and the pairwise probabilities of switching between different modes of operation. The key to formulating and solving these problems is to provide accurate performance bounds to guide performance-aware optimizations.

Providing performance bounds for conditional asynchronous circuits is challenging because, unlike in unconditional circuits, the exact sequence of mode switching has a direct impact on the long-term average cycle time of the circuit. In general, the average performance of a conditional asynchronous circuit can be analyzed based on a notion of a globally critical path that depends on the exact sequence of mode switching. To derive average-case performance bounds, we propose to specify the mode switching using a Markov chain that captures the switching probabilities between all modes. We then derive upper bounds on the average-case delay that hold for any mode sequence which complies with the specified Markov chain probabilities.

The bounds we are interested in are those which are completely mode-sequence agnostic. A mode-sequence-agnostic performance bound holds for any possible mode sequence which complies with the given Markov chain model, without assuming any specific mode switching order. Our formulation also supports incorporating prior knowledge of mode-sequence fragments by clustering a run of consecutive modes into a sequence, which yields a better bound for such circuits.
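
As a small illustration of the Markov-chain view of mode switching described above, the following Python sketch computes the long-run mode frequencies and the long-run switching rate for a two-mode ALU. The transition probabilities, the function names, and the use of power iteration are assumptions made for this example only and are not taken from the thesis; the sketch also assumes the chain is ergodic so that the iteration settles. It does not compute any of the performance bounds developed in later chapters.

```python
def stationary_distribution(P, iterations=10_000, tol=1e-12):
    """Power-iterate a row-stochastic matrix P until the distribution settles.

    Assumes the chain is ergodic; P[i][j] is the probability of moving
    from mode i to mode j on the next operation.
    """
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def switching_rate(P, pi):
    """Long-run fraction of consecutive operations that change mode."""
    n = len(P)
    return sum(pi[i] * P[i][j] for i in range(n) for j in range(n) if i != j)


# Hypothetical two-mode ALU: index 0 is ADD, index 1 is MULT.
MODES = ["ADD", "MULT"]
P = [
    [0.9, 0.1],  # ADD  -> ADD, ADD  -> MULT
    [0.4, 0.6],  # MULT -> ADD, MULT -> MULT
]

pi = stationary_distribution(P)
rate = switching_rate(P, pi)

for mode, share in zip(MODES, pi):
    print(f"{mode}: long-run share {share:.3f}")
print(f"mode switches per operation: {rate:.3f}")
```

With these hypothetical probabilities, ADD accounts for most operations, yet a noticeable share of consecutive operations still cross a mode boundary. That second quantity is the one the discussion above argues the average cycle time is sensitive to, which is why the mode frequencies alone are not treated as sufficient.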

Our performance modeling and bounding approach is applicable to a wide variety of asynchronous full- and half-buffer channel models, which are typically used to abstract asynchronous pipeline stages [BLDK06]. In particular, we believe the bound applies to the QDI [Lin98], GasP [SF01], and MOUSETRAP [SN07] pipeline templates.¹

¹ Some asynchronous circuits can calculate their outputs eagerly, as soon as they receive enough input values to determine the value of the output. Modeling eager evaluation, which is closely related to OR causality, is not addressed in this work and is left as future work.

As a direct application of the proposed bounds, we formulate the slack-matching problem of conditional asynchronous circuits as a Mixed Integer Linear Program (MILP) to guarantee an overall long-term average cycle time for the circuit. Slack matching is a problem unique to asynchronous design, generally defined as adding a minimum number of pipeline buffers to an asynchronous circuit to achieve a given performance metric [GS90, BLDK06].

Based on the given pairwise probabilities of switching between different modes of operation, we find the minimum number of slack-matching buffers to be inserted into the circuit to reach the required performance. In one variation of this problem, an individual target cycle time for each mode of operation can also be given. Conditional slack matching can reduce the number of required slack-matching buffers. For instance, in the conditional ALU in Figure 1.1, given that the multiplier may be activated less frequently than the adder, it may make sense to slack match the multiplier to a more relaxed target cycle time to reduce the number of required slack-matching buffers. In addition, conditional slack matching decreases the number of slack-matching buffers in the adder, since there is no need to slack match the adder's shorter-latency paths to match the longer-latency paths of the multiplier.
The overall expected result of conditional slack matching is lower circuit area and power. To the best of our knowledge, this thesis presents the first exact slack-matching solution for conditional asynchronous circuits that provides a guaranteed average-case performance bound for any possible sequence of mode switchings.

As another innovation, we formulate and solve a generalization of the slack matching problem which concurrently builds fanout trees during slack matching, in an attempt to reduce the overheads associated with pipeline optimization of asynchronous circuits. Fanout optimization in asynchronous circuits is the problem of building pipelined fanout trees to distribute each high-fanout channel among its fanouts such that the fanout count of each node in the circuit is below a certain limit while the target cycle time constraint is also met. We postulate that the two problems are strongly coupled, in that the result of fanout optimization can strongly affect slack matching, and therefore the two problems have to be solved jointly. We develop the joint MILP formulation of the two problems, which turns out to be intractable even for moderate-size circuits. To solve the problem we design a relaxation algorithm which can find high-quality solutions for large circuits in a matter of minutes.²

² While the general formulation for conditional integrated fanout optimization and slack matching is provided in this thesis, our current implementation only handles unconditional circuits; support for conditionality is left as future work.

The next section introduces the Proteus design flow, which is used as a framework to implement and evaluate the proposed schemes for performance bound derivation and conditional slack matching.

1.1 Proteus Design Flow

The performance bound calculations and performance-aware optimization algorithms developed in this thesis are implemented in the Proteus ASIC design flow [BDL11], shown in Figure 1.2.
The design can be specified in standard synthesizable RTL or in SystemVerilogCSP (SVC) [SB11], the standard SystemVerilog language enhanced with predefined channel interfaces that support abstract asynchronous communication. In particular, the tool SVC2RTL compiles the SVC specification into islands of synthesizable RTL blocks surrounded by SEND and RECV cells that implement the conditional communication specified in SVC. It is also possible to start the design by providing a synthesizable RTL specification and skipping SVC2RTL. The latter is called the legacy RTL flow.

A key feature of the flow is that it relies on mature synchronous synthesis and place-and-route tools (shown in dark gray in the figure) and only uses a few asynchronous-specific CAD tools for asynchronous-specific optimizations. The RTL specification, generated by SVC2RTL or provided by the designer, is synthesized with Cadence RTL Compiler using a library of image single-rail gates to generate what we call an image netlist. This netlist is input to the tool ClockFree, which optimizes the logic, clusters logic cells into pipeline stages, optimizes the pipeline structure for performance (also known as slack matching [BLDK06]), and converts the resulting netlist into an asynchronous netlist based on one of several asynchronous design styles. Our proposed optimization algorithms and performance bound analysis are implemented as a part of ClockFree.

Figure 1.2: Proteus design flow.

The design flow is commercialized using a high-performance quasi-delay-insensitive design style based on a 65nm domino-logic library, but alternatives have also been explored [GB11, DBL11, CLH+12].

Figure 1.3: Send and receive cells.

Conditional communication in this flow is implemented by Proteus conditional cells called SEND and RECV, shown in Figure 1.3. Conditional cells may be automatically inserted into the netlist by Proteus for power optimization using operand isolation [SB12].
The SEND cell receives tokens on its input channels, Data and Cond, unconditionally. This means that the module waits until it receives a token on both channels before it can proceed to the next step. After receiving both inputs, if the value of Cond is a logical one, it sends the token received on Data to its output. Otherwise, the received token is discarded and the module waits to receive both inputs again to initiate its next cycle. The output of the SEND, channel Out, is thus a conditional channel.

The RECV cell, on the other hand, has a conditional input channel. In particular, the RECV cell unconditionally waits for a token on channel Cond, but it waits for a token on channel Data, to be sent to the output, only if the value of Cond is a logical one. If Cond is zero, the module proceeds by sending a zero token on the output whether or not there is a token on channel Data. Any token on channel Data remains on the channel waiting to be propagated to the output, which happens only after a logical-one token arrives on Cond. However, the legacy RTL flow referred to in this work does not include this optimization, and thus the generated Proteus circuits do not initially contain any conditional cells.

1.2 Summary of Contributions

As the main result of this thesis, we present five performance bounds for conditional asynchronous circuits and several useful intuitions.

1. Unconditional Bound (UB): A trivial but imprecise bound for the cycle time of any conditional asynchronous circuit is the cycle time of its unconditional marked graph model. We prove this simple performance bound for conditional mode-based circuits to show that one can, for example, slack match a conditional circuit conservatively by slack matching its unconditional marked graph model. This is the approach that was taken to commercialize asynchronous technology in the absence of any better performance bound prior to this work.

2. Analytical Closed-Form Bound (ACFB): Our second bound is an analytical, closed-form, slightly better performance bound for conditional circuits that requires only minimal knowledge: the target cycle time of each mode of operation and the overall frequency of each mode. This is the first closed-form bound for asynchronous conditional circuits that formulates a performance bound based on different target cycle times for different modes of operation. The bound is beneficial in cases where slow modes are dramatically less frequent than faster modes; otherwise the cycle time of the slower mode dominates and the bound quickly approaches the UB in value. Despite this drawback, we include the bound in the thesis for its theoretical value, as it led to numerous intuitions regarding the behavior of mode-based conditional asynchronous circuits.

   The derivation of this performance bound, presented in [NB12], is based on decomposing the circuit behavior into its marked graph components and then bounding the delay of the global critical path using properties of the marked graph associated with the superposition of the modes of operation.

3. Linear Programming Bounds: Thirdly, we develop accurate mode-based performance bounds (LPB, LPBG) for conditional asynchronous circuits using linear programming. Markov chain processes are used to model mode switching and Petri nets are used as the performance model. As we will show, the correctness of the proposed performance bounds is a direct consequence of the existence of static arrival time values, associated with the transitions of the Petri net, derived in the proposed LP formulations.
   The existence of static arrival time values enables us to break down any possible globally critical path into path segments enclosed within consecutive cutlines without actually identifying the globally critical path,³ and then to bound the delay of each path segment to find an upper bound on any possible globally critical path for any sequence of modes complying with the given probabilities.

   ³ Note that given only the pairwise mode switching probabilities, without knowing the exact sequence of modes, the exact unfolded execution of the Petri net, and therefore the globally critical paths, are unknown.

4. Sequence-Based Bounds: The fifth performance bound is developed to improve the precision of the mode-based bounds at higher switching probabilities. While the mode-based bounds are efficient, they can be overly conservative when the probability of mode switching is high. This problem is addressed by introducing sequence-based performance bounds. In these bounds, instead of analyzing each mode individually, short sequences of modes are analyzed to more accurately model the critical paths at high switching probabilities. Sequence-based bounds enable a tradeoff between accuracy and complexity. While the complexity of these bounds is an exponential function of the sequence length, our results indicate that at high switching probabilities considerable improvements can be obtained with very short sequence lengths. We also present a selective elimination heuristic that reduces the number of sequences that have to be analyzed and thus the complexity of the formulation.

As another set of contributions, we demonstrate the application of the formulated performance bounds in the optimization of asynchronous circuits with average-case performance constraints.
The problem of conditional slack matching of asynchronous circuits is defined and solved to guarantee an average cycle time with an optimal number of slack-matching buffers, given the pairwise switching probabilities between modes and without knowing the exact sequence of modes of operation. The presented results indicate a remarkable reduction in the number of slack-matching buffers. The problem of integrated fanout optimization and slack matching is defined as an MILP. Efficient heuristic relaxation algorithms are developed to solve moderate- to large-size problems. We also suggest revisiting the problem of clustering under average-case performance constraints through application of the average-case performance bounds. A theoretical foundation for solving this problem is presented in this thesis and a linear program is suggested for the problem; however, the implementation is left as future work.

1.3 Organization

The rest of the document is organized as follows. Chapter II provides the definitions used throughout the thesis. As these definitions are only used in formal proofs, this chapter can be skipped by the reader and referred back to later as needed. Chapters III and IV each provide a problem statement using well-known graph-theory notation. Chapter III formulates the proposed average performance bounds for conditional asynchronous circuits. The slack matching problem for conditional asynchronous circuits is presented in Chapter IV, which demonstrates the application of the proposed performance bounds to a practical problem. Chapter V presents the MILP and relaxation algorithm for the problem of integrated fanout optimization and slack matching. Finally, our conclusions and future work are discussed in Chapter VI. Note that the theorems and lemmas are presented in Appendix A.

CHAPTER II

Petri Net Preliminaries

This chapter presents definitions related to Petri nets, which are used as the formal model for performance in our work.
The basic definitions of Petri nets are borrowed from [Mur89]. We also introduce several new notions, such as segments and super-segments, to formally capture the notion of a circuit's operational modes and to make it possible to decompose a conditional circuit's behavior into its unconditional components for deriving performance bounds in Chapter III.

2.1 Petri net Basics

Definition II.1. A Petri net is a tuple $\langle P, T, F, M_0 \rangle$ where $P$ is the set of places, $T$ is the set of transitions, $F \subseteq (P \times T) \cup (T \times P)$ is a flow relation, and $M_0 : P \mapsto \mathbb{N}$ is the initial marking.

For $x \in P \cup T$ we define the preset of $x$ as ${}^{\bullet}x = \{ y \mid \langle y, x \rangle \in F \}$ and the postset of $x$ as $x^{\bullet} = \{ y \mid \langle x, y \rangle \in F \}$. A transition $t$ is enabled in marking $M$ if $\forall p \in {}^{\bullet}t,\ m(p) \geq 1$, and may eventually fire by removing a token from each place in ${}^{\bullet}t$ and adding a token to each place in $t^{\bullet}$. A Petri net is called k-safe if no more than $k$ tokens appear in any place for any reachable marking [Mur89]. Petri nets which are k-safe are called bounded, and a 1-safe Petri net is called safe. A Petri net is called live if from any reachable marking every transition can eventually fire. A Petri net is called reversible if the initial marking can be reached from any reachable marking.

A place is a merge if its preset contains more than one transition. A place is a choice if its postset includes more than one transition. A choice place, $p$, is a free-choice place if each transition in its postset has one and only one place in its preset. That is, $\forall t \in p^{\bullet} : |{}^{\bullet}t| = 1$. A choice place is called unique choice if $\forall t_1, t_2 \in p^{\bullet},\ {}^{\bullet}t_1 \cup {}^{\bullet}t_2 \neq \{p\}$, and $t_1$ and $t_2$ cannot be enabled simultaneously. A Petri net is called free choice if all choice places are free choice.

Definition II.2. A Petri net is called unique choice if all choice places are either free choice or unique choice.

Definition II.3. A Petri net is called a marked graph if all places have only one input and one output transition.

As an example, consider the Petri net shown in Figure 2.1, which is a free-choice Petri net.
Place A is a merge place and the place that follows transition $t$ is a choice place. The presence of either a choice or a merge place indicates that this Petri net is not a marked graph.

Figure 2.1: A Petri net with conditional behavior.

Asynchronous circuits can be modeled using Petri nets for performance analysis. As an example, Figure 2.2 shows the simplified Petri net model for the conditional ALU example depicted in Figure 1.1.

Figure 2.2: Split-merge FBCN model for ALU.

Note that for simplicity, we do not show timing arcs related to the opcode channel in the ALU. However, in the actual performance model, the opcode and related control channels are active in both Add and Mult modes. This model is an extension of the Full Buffer Channel Net (FBCN) model, in which each asynchronous pipeline stage is modeled using a transition and asynchronous channels between stages are modeled with a pair of places: a forward place (circles) and a backward place (squares), which are labeled with the forward and backward latencies of the corresponding channel [BLDK06]. Forward latency is the time needed by the pipeline stage in its initially ready state to generate the output after it receives all of its inputs. Backward latency models the time it takes for the pipeline stage to get back to its initially ready state after it generates the output. The sum of forward and backward latencies represents the local cycle time of the channel.

When performing timing analysis, it is typical to limit the analysis to unique choice Petri nets, as this allows the set of possible executions to be independent of timing [HB95]. In this work, to simplify our analysis, we further restrict ourselves to the sub-class of live, safe and reversible unique choice Petri nets for which there exists a reachable marking that marks all simple cycles, and for simplicity we will assume that the initial marking satisfies this condition.
We believe this subclass of Petri nets still captures most underlying models for slack-matching asynchronous systems [BLDK06][PM06], excluding rare cases in which a cycle of pipeline stages (some exhibiting conditional behavior) exists that contains no token buffer [BLDK06].

2.2 Unfolded Execution and Timing Simulation

Definition II.4. An unfolded execution of a unique choice Petri net is an acyclic marked graph $U = \langle \hat{P}, \hat{T}, \hat{F}, \hat{M}_0 \rangle$ in which all the free choices are resolved by the mapping $R : P \times \mathbb{N} \mapsto T$, and all source places are initially marked. $\hat{P} = P \times \mathbb{N}$ and $\hat{T} = T \times \mathbb{N}$ are the sets of unfolded places and transitions, and the function $l : (\hat{P} \cup \hat{T}) \mapsto (P \cup T)$ maps unfolded places and transitions back to the original places and transitions. Formally, $\hat{F}$ is defined as follows:

\[
\begin{cases}
\langle p, t \rangle \in F \Rightarrow \forall k \geq 0,\ \langle p^{(k)}, t^{(k)} \rangle \in \hat{F} \\
\langle t, p \rangle \in F \wedge m(p) = 0 \Rightarrow \forall k \geq 0,\ \langle t^{(k)}, p^{(k)} \rangle \in \hat{F} \\
\langle t, p \rangle \in F \wedge m(p) = 1 \Rightarrow \forall k > 0,\ \langle t^{(k-1)}, p^{(k)} \rangle \in \hat{F}
\end{cases}
\]

where $x^{(k)} \in \hat{T} \cup \hat{P}$ is the event of the $k$-th occurrence of $x$.

An unfolded execution for the unique choice Petri net of Figure 2.1 is illustrated in Figure 2.3. Note that the choice is resolved via the sequence $\langle t_a, t_e, t_a \rangle$.

Figure 2.3: Unfolded execution of the Petri net of Figure 2.1.

Definition II.5. In an unfolded execution we define a path, $\rho$, as a sequence $\langle x_1, x_2, \ldots, x_m \rangle$ where $x_i \in \hat{P} \cup \hat{T}$ and $\langle x_i, x_{i+1} \rangle \in \hat{F}$. A path is a simple path if the following two properties hold:

(i) $\forall \hat{t}_1, \hat{t}_2 \in \hat{T},\ l(\hat{t}_1) \neq l(\hat{t}_2)$
(ii) $\forall \hat{p}_1, \hat{p}_2 \in \hat{P},\ l(\hat{p}_1) \neq l(\hat{p}_2)$

A path is a place-simple path if only property (ii) holds. A path, $\langle x_1, x_2, \ldots, x_m \rangle$, is an unfolded cycle if $l(x_1) = l(x_m)$.

Definition II.6. For a given unfolded execution of a unique choice Petri net with delays assigned to places by $d : P \mapsto \mathbb{R}^{+}$, we define a timing function $\sigma' : \hat{T} \mapsto \mathbb{R}^{+}$ such that:

\[
\begin{cases}
\sigma'(\hat{t}_i) \geq d(\hat{p}) & \text{if } \langle \hat{p}, \hat{t}_i \rangle \in \hat{F} \wedge m(\hat{p}) = 1 \\
\sigma'(\hat{t}_i) \geq \sigma'(\hat{t}_j) + d(\hat{p}) & \text{if } \langle \hat{t}_j, \hat{p} \rangle, \langle \hat{p}, \hat{t}_i \rangle \in \hat{F} \wedge m(\hat{p}) = 0
\end{cases}
\]

Definition II.7. A timing simulation is the unique timing function $\sigma$ which, for all timing functions $\sigma'$, satisfies $\forall \hat{t} \in \hat{T},\ \sigma(\hat{t}) \leq \sigma'(\hat{t})$.

Assuming that each source place of the unfolded execution has an initial token available at time 0, the timing simulation is the unique timing function which assigns the lowest possible firing time to each transition $\hat{t}$, denoted by $\sigma(\hat{t})$.

Definition II.8. A path is critical if for all edges, $\langle \hat{t}_i, p, \hat{t}_j \rangle$, along the path, $\sigma(\hat{t}_j) = \sigma(\hat{t}_i) + d(p)$ holds. A critical path is globally critical if it spans the entire unfolded execution.

Definition II.9. Given an unfolded execution, $U$, the time separation of events between two transition instances is defined as:

\[
\gamma_U^{(k)}(s, t, \varepsilon) = \sigma(t^{(k+\varepsilon)}) - \sigma(s^{(k)}).
\]

Definition II.10. The average cycle time for a transition, $t$, is defined to be

\[
\tau_U(t) = \lim_{n \to \infty} \frac{\gamma^{(0)}(t, t, n)}{n}.
\]

When not otherwise clear, we will use a subscript $U$ to denote the underlying unfolded execution to which the timing simulation is applied, such as in $\gamma_U^{(k)}(s, t, \varepsilon)$ and $\tau_U(t)$.

2.3 Segments and Modes

The unfolded execution of a Petri net is acyclic; therefore $\hat{F}$ defines a partial order on the unfolded places and transitions.

Definition II.11. In particular, a pair of unfolded places or transitions is ordered, $x_i \prec x_j$, if there exists a path from $x_i$ to $x_j$ in the unfolded execution. We use $x_i \preceq x_j$ to show that either $x_i$ and $x_j$ are not ordered or $x_i \prec x_j$.

Definition II.12. A cut is a maximal set of unordered places in the unfolded execution. Every such cut in the unfolded execution is associated with a reachable marking of the Petri net [BD90].

Definition II.13. Two cuts are ordered, $c_i \preceq c_j$, if $\forall p \in c_i \wedge p' \in c_j,\ p \preceq p'$. We say $c_i \prec c_j$ if $c_i \preceq c_j$ and $c_i \neq c_j$.

Definition II.14. For a place or transition, $x$, and cut $c$ we say $x \preceq c$ if $(\exists p \in c \wedge x \prec p) \vee x \in c$.
Similarly, we say $c \preceq x$ if $(\exists p \in c \wedge p \prec x) \vee x \in c$.

Using consecutive cuts we decompose the unfolded execution into segments, which, as we will see, enables us to define the notion of modes formally.

Definition II.15. A segment is defined as $s_i = \langle \tilde{P}_i, \tilde{T}_i, \tilde{F}_i \rangle$ with

\[
\begin{aligned}
\tilde{P}_i &= \hat{P} \cap X_i \\
\tilde{T}_i &= \hat{T} \cap X_i \\
\tilde{F}_i &= \hat{F} \cap (X_i \times X_i) \\
X_i &= \{ \hat{x} \in (\hat{P} \cup \hat{T}) \mid c_i \preceq \hat{x} \preceq c_{i+1} \} \\
&\nexists\, c_j : c_i \prec c_j \prec c_{i+1}
\end{aligned}
\]

where $c_i$, $c_{i+1}$, and $c_j$ are cuts with the same marking $M_0$. $S(M_0)$ is the set of all segments associated with marking $M_0$.

As the cuts are ordered and clearly denote the segment boundaries of an unfolded execution, we often refer to the segment marked by $c_i$ and $c_{i+1}$ as $s_i$. We also use $U[i : j]$ to refer to the segment sequence $\langle s_i, s_{i+1}, \ldots, s_j \rangle$ as a portion of the unfolded execution.

Definition II.16. For each segment sequence, $U$, we define $|U|$ to denote the number of segments in the sequence.

Definition II.17. Let us define the span of each path, $\|\rho\|$, as the number of cutlines that intersect the path. We use $\rho(c_i)$ to denote the place where cutline $c_i$ and path $\rho$ intersect.

Figure 2.4 illustrates the notion of segments using an unfolded execution of the Petri net of Figure 2.1. Note that the choice is resolved via the sequence $\langle t_a, t_e, t_a \rangle$. There exist four cutlines: $c_0, \ldots, c_3$. Cutlines $c_1$ and $c_2$ share three places but are distinguished by different instances of place A. Any two consecutive cutlines mark the boundaries of a segment. In this example, there are two classes of segments for the two possible behaviors of the Petri net.

Figure 2.4: Example of cuts and segments.

In particular, each segment corresponds to a mode of operation in a conditional asynchronous circuit. Figure 2.5 shows an unfolded execution of the ALU performance model depicted in Figure 2.2 for the sequence of modes $\langle ADD, MULT, ADD \rangle$. Note that to improve readability, the figure does not explicitly show the places which are not initially marked in the Petri net. Also note that the unfolded execution includes all the timing paths of the performance model and thereby captures the overall performance of the circuit for the particular mode sequence.

Figure 2.5: Unfolded execution of the Petri net of the ALU.

Definition II.18. In particular, a marked graph, $\Gamma(s_i) = \langle P_i, T_i, F_i, M_i \rangle$, can be associated with each segment $\langle \tilde{P}_i, \tilde{T}_i, \tilde{F}_i \rangle$, where $P_i = l(\tilde{P}_i \setminus c_i \cap c_{i+1})$, $T_i = l(\tilde{T}_i)$, $F_i = l(\tilde{F}_i)$, and $M_i = l(c_i)$. $\mathcal{C}(s_i)$ denotes the set of all the simple cycles in $\Gamma(s_i)$.

As an example, consider again the Petri net in Figure 2.1. The marked graphs associated with the segment types $M_1$ and $M_2$ are shown in Figure 2.6(a) and (b).

Figure 2.6: Marked graphs for the segment types illustrated in Figure 2.4.

Definition II.19. Two segments $s_i, s_j$ are defined to be equivalent if $\Gamma(s_i)$ is the same as $\Gamma(s_j)$.

Definition II.20. Each equivalence class of segments identifies a distinct mode of operation, and we define the function $\mu(s_i)$ to map a segment to its equivalence class, i.e., mode.

To simplify our analysis we will make one more restriction on the systems we analyze. We will assume that the transition of interest $t$ must fire in every equivalence class. This is a practical restriction for mode-based circuits, as otherwise the cycle time target of different modes may need to be based on different transitions of interest.

2.4 Super Segments and Elevation

Definition II.21.
We define a super segment as a segment $s^{*}_{ij} = s_i \otimes s_j$ where the cross operator, $\otimes$, first takes the union of the two segments and then replaces every introduced choice or merge with a fork or join structure, respectively.

Formally, the cross of two segments is done by taking the union of their transitions, places, and flow relations, and choice/merge places are replaced with fork/join structures by adding an additional join transition after each merge place and an additional fork transition after each choice place. The delay of the places in the super-segment is the same as that of the individual segments, except for the extra places that connect to the extra fork/join transitions, which are assigned a delay of 0.

Definition II.22. We say segment $s$ is included in super segment $s^{*}$, $s \sqsubseteq s^{*}$, iff $s \otimes s^{*} = s^{*}$.

Definition II.23. Any unfolded execution, $U[i : j]$, can be elevated by recursively replacing one of its segments, $s_k$, $i \leq k \leq j$, with any arbitrary super segment, $s^{*}$, as long as $s_k \sqsubseteq s^{*}$. We often refer to the elevated unfolded execution as $U^{*}[i : j]$.

The cross operator has an important inclusion property for paths. For any path, $\rho$, between two unfolded places or transitions, $x_i$ and $x_j$, in $U[i : j]$, there exists a path, $\rho^{*}$, between them in $U^{*}[i : j]$ for which $\rho \sqsubseteq \rho^{*}$, meaning that one can reproduce $\rho$ by dropping some places or transitions from $\rho^{*}$. Let us denote the sum of the delays of all places along a path, $\rho$, by $D(\rho)$. This property guarantees that $D(\rho) \leq D(\rho^{*})$, since the transitions and places added by the cross operator all have non-negative delays.

The elevation of an unfolded execution is an important part of our overall proof strategy. To illustrate it, consider the modified unfolded execution of Figure 2.7, in which segment $s_1$ of the unfolded execution of Figure 2.4 is elevated to $s^{*}_{12}$. Consider the path $\rho = \langle t_b^{(0)}, t_d^{(0)}, D^{(1)}, t_b^{(2)} \rangle$ before elevation. In this case $\rho^{*} = \langle t_b^{(0)}, t_d^{(0)}, D^{(1)}, t_b^{(1)}, t_c^{(1)}, t_d^{(1)}, D^{(2)}, t_b^{(2)} \rangle$, from which we can reproduce $\rho$ by dropping $\langle t_b^{(1)}, t_c^{(1)}, t_d^{(1)}, D^{(2)} \rangle$.
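The inclusion property above can be checked mechanically on the example paths. The following is a minimal sketch, not part of the formal development: events are written as (label, occurrence) pairs, the helper names `is_included` and `path_delay` are hypothetical, and the delay assigned to place D is an assumed value chosen only for illustration.

```python
def is_included(path, elevated_path):
    """Return True if `path` can be reproduced from `elevated_path`
    by dropping some places or transitions (order is preserved)."""
    remaining = iter(elevated_path)
    # `event in remaining` advances the iterator until it finds `event`,
    # so the events of `path` must appear in the same order.
    return all(event in remaining for event in path)


def path_delay(path, place_delay):
    """D(path): sum of the delays of the places along the path.
    Transitions are not in `place_delay` and therefore contribute 0."""
    return sum(place_delay.get(label, 0.0) for label, _occurrence in path)


# Paths from the elevation example, as (label, occurrence index) pairs.
rho = [("t_b", 0), ("t_d", 0), ("D", 1), ("t_b", 2)]
rho_star = [("t_b", 0), ("t_d", 0), ("D", 1),
            ("t_b", 1), ("t_c", 1), ("t_d", 1), ("D", 2),
            ("t_b", 2)]

# Assumed (hypothetical) delay for place D; any non-negative value works.
place_delay = {"D": 2.0}

included = is_included(rho, rho_star)
delay_before = path_delay(rho, place_delay)
delay_after = path_delay(rho_star, place_delay)
```

Because the elevated path only adds events with non-negative delay, `delay_before` can never exceed `delay_after` whenever `included` is true, which is the property used in the proofs.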
Using elevation we proved that for any arbitrary unfolded execution, $U$, of a live, safe and reversible unique choice Petri net, we have $\tau_U(t) \leq \tau(s^{*}_m)$ where $s^{*}_m$ is the largest super-segment. This bounds the cycle time of any arbitrary unfolded execution of the Petri net with conditionals without any assumption about the order of the segments (modes) or their frequencies. This result was intuitively accepted by industry, and the bound is used for slack matching of asynchronous circuits with conditional behavior by treating them as unconditional.

Figure 2.7: A modified unfolded execution with $s_1$ elevated to $s^{*}_{12}$.

2.5 Sequences and their associated Marked Graphs

Definition II.24. Similar to segments, we formally define a sequence, $\theta_i = \langle \tilde{P}_i, \tilde{T}_i, \tilde{F}_i \rangle$, as a portion of the unfolded execution of a Petri net which is marked by two cutlines; unlike a segment, however, the cutlines do not need to be consecutive.

\[
\begin{aligned}
\tilde{P}_i &= \hat{P} \cap X_i \\
\tilde{T}_i &= \hat{T} \cap X_i \\
\tilde{F}_i &= \hat{F} \cap (X_i \times X_i) \\
X_i &= \{ \hat{x} \in (\hat{P} \cup \hat{T}) \mid c_i \preceq \hat{x} \preceq c_j \}
\end{aligned}
\]

where $c_i$ and $c_j$ are arbitrary cutlines with the same marking $M_0$. We define the length of a sequence as the number of segments it includes.

Definition II.25. Similar to segments, a marked graph, $\Gamma(\theta_i)$, can be associated with each sequence $\theta_i$ by folding its last cutline back to the first cutline through relabeling each place on the last cutline with the same label as its corresponding place on the first cutline. Each place, $p$, is then marked by $p_C$ tokens, where $p_C$ is the number of cutlines passing through that place in the sequence, ignoring the last cutline.
As before, we insert a dummy place-transition pair at the first cutline of the sequence and we denote the set of dummy transitions by $T_C$.

Figure 2.8: Example of sequences (a) $\langle MULT, ADD \rangle$, (b) $\langle MULT, MULT \rangle$.

Figure 2.9 shows the marked graphs associated with the sequences shown in Figure 2.8. The arcs highlighted as dashed lines are arcs resulting from folding the last cutline back to the first cutline. In particular, note that place $Q^{(0)}$ is marked with more than one token: we first renamed the place $Q^{(1)}$ in the original sequence to $Q^{(0)}$, and then, since two other cutlines besides the last cutline pass through $Q^{(0)}$, we marked the associated place with two tokens.

It turns out that in the ALU example, the marked graph associated with the sequence $\langle MULT, ADD \rangle$ has a lower cycle time than the one associated with the sequence $\langle MULT, MULT \rangle$, as a result of more concurrent operations. This fact can also be observed in Figure 2.9(a): the two highlighted cycles in the figure include one instance of place $q$, which we assumed to have a higher associated delay. These cycles, however, are marked by two tokens, which lowers the cycle metric of these cycles and effectively reduces the cycle time of the marked graph.

Figure 2.9: Associated marked graphs to sequences (a) $\langle MULT, ADD \rangle$, and (b) $\langle MULT, MULT \rangle$.

We will use this observation to formulate a more accurate performance bound in the next section.
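To make the effect of the extra token concrete, recall that the cycle metric of a simple cycle in a marked graph is its total place delay divided by the number of tokens it holds. The worked equation below is only an illustration: the symbols $\mathrm{CM}$, $D(c)$ and $M(c)$ are notation assumed here, and the 1800 ps cycle delay is a hypothetical value, not a measurement from the ALU.

```latex
\[
  \mathrm{CM}(c) = \frac{D(c)}{M(c)}, \qquad
  D(c) = \sum_{p \in c} d_p, \qquad
  M(c) = \sum_{p \in c} M(p)
\]
\[
  D(c) = 1800\,\mathrm{ps}: \qquad
  M(c) = 1 \;\Rightarrow\; \mathrm{CM}(c) = 1800\,\mathrm{ps}, \qquad
  M(c) = 2 \;\Rightarrow\; \mathrm{CM}(c) = 900\,\mathrm{ps}
\]
```

With the delay unchanged, a second token on the cycle halves its cycle metric, which is why folding a two-segment sequence can yield a lower cycle time than analyzing the slow segment alone.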
Here we will mention a few useful properties of the sequences and their associated marked graphs which enable us to apply the linear program of Figure 3.4 to these marked graphs to calculate their cycle times.

CHAPTER III
Performance Analysis of Asynchronous Circuits

The performance model we use for conditional asynchronous circuits is described in Chapter II and is based on a large sub-class of Petri nets (Definition II.1). Consequently, our bounding approach is applicable to a wide variety of asynchronous circuit families. In particular, the bound applies to both full-buffer and half-buffer models typically used to abstract asynchronous channels [BLDK06, BLD11]. We will present the application of the derived performance bounds to a set of benchmark circuits implemented with Precharged Half Buffers [Lin98] in Proteus.

Providing accurate performance bounds is not a trivial task in conditional asynchronous circuits, as the cycle time of an asynchronous conditional circuit depends on the values of input stimuli, which in turn determine the circuit's operational modes. Therefore, it is desirable to account for the average-case behavior of an asynchronous circuit in response to its environment in order to obtain an accurate performance bound.

To appreciate some of the complexities of this problem, let us revisit the conditional ALU example, depicted in Figure 3.1.

Figure 3.1: Asynchronous ALU with conditional communication.

Figure 3.2: Impact of pairwise switching probabilities between modes on measured average cycle time in ALU.

As we mentioned before, the average cycle time depends not only on the probability of each mode but also on the probability of switching between modes.
To demonstrate this, let us assume that in this ALU with the ADD and MULT modes, each mode has a 50% probability of occurrence and that both ADD and MULT modes are designed to run in isolation at a cycle time of 900 ps. Figure 3.2 plots the average cycle time of the ALU as a function of the probability of mode switching. In particular, on the far left of the graph, the probability of switching is near 0% and the ALU tends to stay in one mode for a long time before switching to the other mode, in which it stays for a similarly long period before switching again. On the far right side of the graph, the probability of mode switching is near 100% and the mode sequence generally alternates $\langle ADD, MULT, ADD, MULT, \ldots \rangle$. Interestingly, the figure shows that the ALU's average cycle time changes non-monotonically as the probability of mode switching rises. At first, as the probability of mode switching begins to rise, the average cycle time rises because of the different timing requirements of the two modes, and the circuit slows down. As we will see, our mode-based performance bounds are capable of bounding this effect using a notion we refer to as the switching penalty gap. However, as the probability of mode switching rises further, the switching penalty gap drops back down to 0 because the pipelines associated with each mode are used at half the rate. This allows any local pipeline bottlenecks, including those associated with switching penalties, to be hidden at the system level (footnote 1).

Our goal is to provide accurate performance bounds which capture this non-monotonic impact of mode switching. To do this, we propose to specify the mode switching using a Markov chain that captures the switching probabilities between all modes. We then derive upper bounds on the average-case delay that hold for any mode sequence which complies with the specified Markov chain probabilities.

Traditional performance models for asynchronous circuits use marked graphs (Definition II.3) (or similar mathematical notions, e.g. Event Rule Systems [Bur91]).
Extensive theory is available to analyze the performance of marked graph models based on the notion of cycle metric [Das04]. However, marked graphs are not capable of modeling performance in systems with conditional behavior (usually referred to as choices) and therefore cannot be utilized to achieve accurate average-case performance bounds. On the other hand, performance analysis using Generalized Petri Nets, which can model conditionality using choices, is computationally difficult due to the state explosion problem [XKB99].

It should also be pointed out that the bounds we are interested in are those which are completely mode-sequence agnostic. The transient cycle time of an asynchronous circuit depends not only on the current mode of operation of the circuit, but also on the past history of the modes of operation. In other words, the exact sequence of circuit modes has a direct impact on the long-term average cycle time of the circuit (footnote 2). A mode-sequence-agnostic performance bound holds for any possible mode sequence and therefore should not assume any specific mode sequence (footnote 3).

Footnote 1: Interestingly, this reverse impact of increasing switching probabilities on average cycle time has been observed in real chips [JBRS10].

Footnote 2: As we will see, one major challenge is that given only the pairwise mode switching probabilities, without knowing the exact sequence of modes, the exact unfolded execution of the Petri net and therefore the globally critical paths are unknown.

Footnote 3: Incorporating foreknown mode sequence fragments can be supported in our formulation by clustering a sequence of segments as a quasi-segment to obtain a better bound for circuits with unique choice behavior.

To demonstrate how an accurate conditional performance bound can lead to better trade-offs in circuit design and optimization, consider a network switch application. While the main responsibility of the switch is to route incoming packets, certain management operations must also be performed adequately to utilize system resources effectively. The circuit can be considered to have two distinct modes of operation: in the normal mode, packets are routed based on the routing algorithm, and in the management mode, accounting and system management tasks are undertaken. Since the functionality of the circuit is quite different in its different modes of operation, the performance requirements and circuit structure of each mode are not necessarily similar. For example, a lower cycle time is targeted for the circuit in its normal mode to achieve higher performance in the common case, while the cycle time constraint of the system can be relaxed in management modes to achieve lower power and better area, as management cycles are quite infrequent. One interesting question that the design team needs to answer is how frequently management cycles can be executed so that the overall long-term average performance of the circuit is not less than a desired target. While issuing more management cycles can lead to more effective utilization of system resources, it can degrade performance. Such design trade-offs can be resolved using the performance bounds we formulate in this chapter.

Another main application of the performance bounds is the optimization of conditional circuits, e.g. mode-based slack matching of conditional circuits. If we manage to develop more precise average-case performance bounds that are closer to the circuit's actual long-term average performance, it may be possible to save some slack-matching buffers and recover circuit area.

These are some of the ideas that motivate formulating precise, easy-to-compute performance bounds for conditional asynchronous circuits. These bounds will serve as the theoretical basis for the correctness of our proposed conditional slack-matching scheme
In addition, our bounds can be useful to address other op- timization problems related to conditional asynchronous circuits with performance considerations. Example of such problems includes but is not limited to power opti- mization of conditional asynchronous circuits with performance constraint, clustering and Fanout optimization of conditional asynchronous circuits. Whileformalproofsareprovidedforalltheboundsweformulated,wealsousedan asynchronous version of the ISCAS89 benchmark suite [Naj12] to verify the correct- ness and quality of the bound through comparison to the actual circuit performance. The results are provided in Section 3.6. Thefirstperformanceboundwepresentinthechapteristheunconditionalbound (UB). We show that analyzing the circuit as if all of its channel are unconditionally active, leads to a loose upper bound on the cycle time of the conditional circuit. We then drive an analytical bound on the cycle time (ACFB) which is extremely efficient in calculation but can quickly approach the value of UB as the number of slow modes increases in the sequence of modes. The third family of bounds we develop (LPB and LPBG) are calculated by solving a linear program and use the notion of gap penalties to bound the impact of mode switching. These bounds while being efficient and accurate can be overly conservative at higher probability of switching as they are not able to capture the reverse impact of increasing mode switching. Finally, we present a generalized framework based on the notion of sequences (Definition II.24) that enables a tradeoff between analysis complexity and accuracy which can predict the reverse impact of increasing mode switching probabilities. 3.1 Problem Statement Formulating performance bounds for asynchronous circuits with conditional be- havior can be informally stated as follows. We are given the following: 29 Figure 3.3: Problem definition: (a) Sample graph with associated modes, (b) Markov chain model, and (c) modes of operations. 
• The netlist of an asynchronous circuit as a directed graph $G_c(V_c, E_c)$ where $v \in V_c$ is an asynchronous node, and $c_{ij} = \langle v_i, v_j \rangle \in E_c$ is a channel from $v_i$ to $v_j$.
• A set of the circuit's operational modes $\mathcal{M} = \{ m_1, \ldots, m_M \}$ where $m_k \subseteq E_c$ is the subset of channels activated in the $k$-th mode.
• Forward and backward latency values for each channel.
• A Markov chain matrix containing the pairwise mode transition probabilities, $P = [p_{mm'}]$, where $p_{mm'}$ is the probability of switching from mode $m$ to mode $m'$.

Our goal is to derive accurate and computationally tractable upper bounds on the long-term average cycle time of the circuit. The derived performance bounds should hold for any arbitrary sequence of mode transitions complying with the given transition probabilities.

Consider the graph shown in Figure 3.3(a), in which channels belonging to modes $M_0$ and $M_1$ are shown in blue and purple, respectively. Note that some channels may belong to multiple modes of operation simultaneously. Figure 3.3(b) depicts the Markov chain which models the conditional behavior of the sample circuit. The Markov chain has a state associated with each mode of operation, and the matrix $P = [p_{00}, p_{01}, p_{10}, p_{11}]$ defines the probability of switching between modes. The two modes of operation define two distinct graphs, one associated with each mode of operation, as shown in Figure 3.3(c).

Since mode switching takes place in accordance with a Markov chain, the stationary probability of each mode can be calculated from the pairwise transition probabilities $P = [p_{mm'}]$. In particular, we let $\pi_m$ denote the stationary probability of mode $m$. This is the fraction of circuit cycles in which mode $m$ occurs in a sufficiently long run of the circuit. We assume that the Markov chain model is ergodic to ensure that the stationary probabilities exist and can be calculated by solving the following linear system:

\[
\begin{cases}
\pi_m = \sum_{m'} p_{m'm}\, \pi_{m'} \\
\sum_{m} \pi_m = 1
\end{cases}
\]

The frequency of switching from mode $m$ to mode $m'$ can then be calculated as $\pi_m\, p_{mm'}$.
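The linear system above can be solved directly for small mode sets. The sketch below is illustrative only: the function name `stationary_distribution` and the two-mode matrix are assumptions made for the example, and plain Gaussian elimination is used here simply to keep the code dependency-free.

```python
def stationary_distribution(P):
    """Solve pi = pi * P together with sum(pi) = 1.

    P[m][k] is the probability of switching from mode m to mode k
    (each row sums to 1). One balance equation is redundant, so the
    last one is replaced by the normalization constraint.
    """
    n = len(P)
    # Row i encodes: sum_j P[j][i] * pi_j - pi_i = 0
    A = [[P[j][i] - (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    b = [0.0] * n
    A[n - 1] = [1.0] * n   # normalization: sum of pi_m = 1
    b[n - 1] = 1.0

    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("no unique stationary distribution")
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]

    # Back substitution.
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(A[r][c] * pi[c] for c in range(r + 1, n))
        pi[r] = (b[r] - rest) / A[r][r]
    return pi


# Hypothetical two-mode chain (values assumed for illustration only).
P = [[0.9, 0.1],
     [0.3, 0.7]]
pi = stationary_distribution(P)

# Frequency of switching from mode m to mode k: pi_m * p_mk.
switch_freq = [[pi[m] * P[m][k] for k in range(len(P))]
               for m in range(len(P))]
```

For this assumed matrix the two off-diagonal switching frequencies coincide, as expected in a two-mode chain where every switch away from a mode is eventually matched by a switch back.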
3.2 Related Works

Performance analysis of asynchronous circuits has long been studied, but most methods either do not apply to conditional circuits or are computationally intractable. In particular, in [GS90], Greenstreet et al. provided upper and lower bounds on throughput for any unconditional pipeline where the delays are independent random variables. Williams et al. [WHAY87] and Lines [Lin98] derived canopy graph diagrams for unconditional pipelines based on the number of stages of the pipeline. In [PG97], Pang and Greenstreet derived conservative bounds for the asymptotic performance of an asynchronous mesh. Several works addressed the problem of deriving exact bounds on the time separation between events (TSE) for unconditional systems. Hulgaard et al. [HBAB93] suggested an algorithm to find the exact TSE bound, Chakraborty et al. [CDY01] proposed a pseudo-polynomial-time conservative algorithm, and McGee et al. [MN07] proposed a polynomial algorithm to derive TSE in choice-free concurrent systems. Yahya et al. [YR07] proposed a methodology to evaluate the performance of linear pipelines under delay variability. Smirnov and Taubin [ST09] proposed an analysis method for bottleneck detection in unconditional pipelines based on the enumeration of simple cycles, which can be computationally impractical for large circuits.

Techniques applicable to conditional circuits are more complex than their unconditional counterparts, and their analysis requires more sophisticated approaches. For example, Xie et al. [XB97] developed a Markov-analysis-based method that can be applied to a wide class of circuits with conditional behavior, but the approach suffers from the well-known state explosion problem, which limits its use to relatively small circuit models. They also proposed an efficient Monte-Carlo approach that analyzes free-choice Petri nets in which free-choice decisions are modeled with probability distribution functions [XKB99].
The approach only provides statistical guarantees of the performance. Gill et al. [GGS08] proposed an efficient method for estimating the performance of hierarchical asynchronous circuits with conditional behavior. The method extends the canopy graph method developed in [WHAY87][Lin98], but it does not bound the transient impact of arbitrary switching between modes.

3.3 Mode-Based Performance Bounds †

This section formulates four bounds for the performance of conditional asynchronous circuits by providing an upper bound on the long-term average cycle time. For each bound we also present a sketch of the proof to convey the intuition behind it; the formal proofs can be found in the Appendix.

† The content of this section is partly published as conference papers [NB12, NB13a].

3.3.1 Unconditional Bound (UB)

Theorem A.1, presented in the Appendix, states that the cycle time of the marked graph associated with the largest super-segment of any Petri net, obtained through sequential application of the $\otimes$ operator (Definition II.21) to the segments of the Petri net, is an upper bound on the average cycle time of the conditional circuit for any possible mode sequence order. Once the marked graph associated with the conditional circuit is obtained, one can use the following linear program to calculate the cycle time efficiently.

For any marked graph $MG = \langle P_M, T_M, F_M, M_M^0 \rangle$, the cycle time ($\tau_{MG}$), or the cycle metric, can be obtained by solving the linear program of Figure 3.4 [Mag84].

\[
\begin{aligned}
\text{Minimize } \quad & \tau_{MG} \\
\text{subject to } \quad & \forall \langle t_i, p, t_j \rangle \in F_M : \quad a_j = a_i + d_p - M_M(p)\, \tau_{MG} + f_p \qquad \text{(i)} \\
& a_i \geq 0,\ \forall t_i \in T_M; \qquad f_p \geq 0
\end{aligned}
\]

Figure 3.4: Linear program to calculate cycle-time in a marked graph.

In this program $a_i$ is a static arrival time variable associated with transition $t_i$, $M_M(p)$ is 1 if place $p$ is initially marked and 0 otherwise, $f_p$ is a non-negative slack variable associated with place $p$, and $d_p$ is the delay associated with place $p$.
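For small examples, the value produced by the linear program of Figure 3.4 can be cross-checked without an LP solver. The sketch below swaps in a different technique, named plainly: a binary search on the cycle time combined with Bellman-Ford style relaxation. It relies on the fact that the constraints admit arrival times for a given cycle time exactly when no cycle has positive total weight $d_p - M(p)\,\tau$. The function names, the arc encoding and the small three-transition graph are assumptions made for illustration, and the search bound assumes every cycle holds at least one token.

```python
def has_positive_cycle(n, arcs, tau, eps=1e-9):
    """Check whether a_j >= a_i + d - m * tau is infeasible.

    arcs: list of (i, j, d, m) meaning transition i feeds transition j
    through a place with delay d holding m initial tokens.
    """
    a = [0.0] * n                      # candidate arrival times
    for _ in range(n):
        changed = False
        for i, j, d, m in arcs:
            candidate = a[i] + d - m * tau
            if candidate > a[j] + eps:
                a[j] = candidate
                changed = True
        if not changed:
            return False               # arrival times settled: feasible
    return True                        # still relaxing after n passes


def cycle_time(n, arcs, tol=1e-6):
    """Smallest tau for which the constraints are feasible, i.e. the
    maximum over cycles of (total delay / total tokens)."""
    lo = 0.0
    hi = sum(d for _, _, d, _ in arcs)  # valid if every cycle has a token
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if has_positive_cycle(n, arcs, mid):
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical marked graph with three transitions (0, 1, 2).
# Each channel is a forward place (unmarked) and a backward place (marked).
arcs = [
    (0, 1, 2.0, 0), (1, 0, 3.0, 1),    # cycle 0-1-0: delay 5, one token
    (1, 2, 4.0, 0), (2, 1, 4.0, 1),    # cycle 1-2-1: delay 8, one token
]
tau_mg = cycle_time(3, arcs)
```

In this assumed graph the second channel is the bottleneck, so the search converges to its local cycle time of 8 (within the chosen tolerance); the relaxed values in `a` play the role of the static arrival times $a_i$.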
Constraint (i) states that if $t_j$ follows $t_i$ through place $p$, the associated arrival time $a_j$ has to be at least $a_i$ plus $d_p$ when the place is not initially marked. If the place is initially marked, we have to use the firing time of $t_i$ in the previous execution cycle of the marked graph, and therefore a $\tau_{MG}$ term is subtracted from the right-hand side.

3.3.2 Analytical Closed-Form Bound (ACFB)

The ACFB is an analytical bound that considers only the frequency of the modes and the cycle-times of the marked graph component (Definition II.18) associated with each mode of operation. The importance of the ACFB lies in its simplicity and in the ease of calculating it from already known parameters of the circuit. Mode frequencies are assumed to be provided by the design team, either through simulation trace analysis or from known behavior of the environment. For example, in some systems the environment is allowed to issue slow modes only up to certain maximum frequencies. The cycle times of the circuit modes are either calculated post-synthesis or are the target cycle times enforced during circuit synthesis, e.g. the target cycle times in slack matching. One last parameter used in calculating this bound is the maximum number of tokens in place-simple cycles of the largest super-segment of the circuit ($\kappa$). While finding the exact value of this circuit parameter is NP-Complete, a trivial upper bound exists, namely the total number of nodes in the circuit.

The ACFB can be used directly to slack match the unconditional components of the circuit independently and still achieve a guaranteed average performance. To do this, we calculate the target cycle time of each component from the ACFB formulation, given the desired average cycle time. However, the quality of the bound decreases dramatically as the number of nodes in the circuit increases, or as the frequency of slow modes approaches the frequency of fast modes divided by the number of circuit nodes.
Therefore this bound is less effective for large circuits with comparable frequencies of slow and fast modes. The ACFB is derived in the following two steps.

(i) Analyzing the cycle time of a given segment sequence

In the following analysis we assume that the exact order of modes is known, and therefore the exact global critical path and average cycle time are also known. We consider an unfolded execution of arbitrarily large length $N$ with segment sequence $U = \langle s_0, \ldots, s_N \rangle$ and derive an upper bound on $\tau_U^{(0)}(t, t, N)$. The bound is constructed first by extracting place-simple cycles, $\rho_i$, from the global critical path, $\rho$, and then by bounding the delay of these cycles using elevation.

Place-simple cycle extraction: The idea behind cycle extraction is that the span of each place-simple path is bounded by the number of places on the cutline, namely $\kappa = |M_0|$; therefore a place has to be replicated on the global critical path in any window of $\kappa + 1$ segments. Formally, cycle extraction can be defined as a recursive operation which partitions $U$ into a set $\{E_i\}$, $1 \le i \le N$, as follows:

$$
\begin{cases}
U^0 = U[0 : N] \\
U^i = \langle U^{i-1}[0 : c_s^i - 1],\ U^{i-1}[c_t^i + 1 : N] \rangle \\
E_i = U^{i-1}[c_s^i, c_t^i], \quad 1 \le i \le N
\end{cases}
$$

where $0 \le c_s^i < c_t^i$ are the smallest cutline indices for which $\rho(c_s^i) = \rho(c_t^i)$. Figure 3.5 shows the result of cycle extraction for a given mode sequence with $\kappa = 3$. Cycle extraction results in a partition of the unfolded execution such that each $E_i$ contains a portion of the globally critical path which is a place-simple cycle. Let us denote the portion of the critical path enclosed in $E_i$, excluding the last cutline place, as $\rho_i$. In general $D(\rho) = \sum_i D(\rho_i) + D(\rho_{\mathrm{res}})$, where $\rho_{\mathrm{res}}$ is the residue of the global critical path when $\rho(c_0) \ne \rho(c_N)$.

Bounding place-simple cycles: To bound each $\rho_i$ we elevate all segments in $E_i$ to the maximum super-segment in $E_i$, denoted by $s_{Max}(E_i)$.
We prove that $D(\rho_i) \le |E_i|\,\tau(s_{Max}(E_i))$. The intuition behind this bound is that $\rho_i$ can be reproduced from an unfolded cycle in $MG(s_{Max}(E_i))$, and therefore its delay can be bounded using the largest cycle metric of the marked graph. Finally, we prove that for a known mode sequence, $U$, the following bound holds:

$$
\tau_U(t) \le \lim_{N \to \infty} \frac{1}{N} \sum_i |E_i|\,\tau(s_{Max}(E_i)).
$$

Figure 3.5: Cycle extraction for a given mode sequence.

(ii) Bounding cycle time given mode probabilities

The above bound is derived assuming that the specific sequence of modes is known. However, recall that the goal of this work is to address applications where the specific mode sequence is not known. Thus, we again assume $m$ modes and a sequence of sorted super-segments $S = \{s_1, \ldots, s_m\}$, $\tau(s_1) \le \ldots \le \tau(s_m)$. Also, let $f_i$ denote the given relative frequency of mode $i$. Theoretically, there exists a worst-case mode sequence which results in the longest globally critical path. If we could construct this worst-case mode sequence, we could apply the cycle extraction technique discussed above to bound the delay of the longest critical path and derive the desired mode-sequence-agnostic bound.

Since we do not know the exact mode sequence of the worst case, each $E_i$ can instead be treated as a random entity, meaning that its length and mode distribution are random variables. Imagine $\{E_i\}$, $i \ge 1$, as a set of bins we wish to fill with super-segments from $S$ such that the total number of segments of type $s_i$ equals $f_i N$. We calculate $D_B$ using the following greedy algorithm. Let $s_i$ be the largest available super-segment, and let $U^i_{MIN}(k)$ denote the sorted sequence of the $k$ smallest available super-segments at step $i$. For each step we construct $\hat{E}_i = \langle s_i, U^i_{MIN}(\kappa - 1) \rangle$.
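The greedy bin filling described above can be sketched as follows. This is a minimal illustration rather than the thesis implementation: the names `greedy_bins`, `elevated_delay`, and `kappa` are hypothetical, each segment is represented only by the cycle time of its mode, and the sketch assumes that every bin takes the slowest available segment together with up to `kappa - 1` of the fastest available ones. Each bin is then charged its size times the cycle time of its slowest member, which is the sum that appears in the bound for a known mode sequence.

```python
from collections import deque


def greedy_bins(cycle_times, counts, kappa):
    """Fill bins with one slowest segment plus up to kappa - 1 fastest ones.

    `cycle_times[i]` is the cycle time of mode i and `counts[i]` the number
    of segments of that mode (hypothetical encoding of f_i * N).
    """
    pool = deque(sorted(t for t, c in zip(cycle_times, counts) for _ in range(c)))
    bins = []
    while pool:
        slowest = pool.pop()  # largest available super-segment
        fillers = [pool.popleft() for _ in range(min(kappa - 1, len(pool)))]
        bins.append([slowest] + fillers)
    return bins


def elevated_delay(bins):
    """Charge every bin its size times the cycle time of its slowest member."""
    return sum(len(b) * max(b) for b in bins)


# Ten segments: nine fast (cycle time 1) and one slow (cycle time 3), kappa = 3.
bins = greedy_bins([1.0, 3.0], [9, 1], 3)
total = elevated_delay(bins)
```

With these assumed numbers the single slow segment drags two fast segments up to its cycle time, so the charged delay per segment is 1.6, compared with a plain weighted average of 1.2.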
For an arbitrary mode assignment let us denote:

$$
D_B(\{E_i\}) = \sum_i |E_i|\,\tau(s_{Max}(E_i)).
$$

We prove through an exchange argument that for any arbitrary mode assignment to $\{E_i\}$, $D_B(\{E_i\}) \le D_B(\{\hat{E}_i\})$. It is important to realize that the bound holds for any arbitrary mode assignment, including the one which results in the worst-case mode sequence. Finally, as our main result we formulate the following bound on the average cycle time using the relative frequencies of the modes:

$$
\tau_U(t) \le \lim_{N \to \infty} \frac{1}{N} D_B(\{\hat{E}_i\}) = \sum_{j=1}^{m} \hat{f}_j\,\tau(s_j),
\qquad
\tau_{avg} \le \sum_{j=1}^{m} \hat{f}_j\,\tau(s_j),
$$

where $\hat{f}_j$ is the relative frequency of mode $j$ after elevation in $\hat{E}_i$ and can be calculated as follows:

$$
\begin{cases}
\hat{f}_m = \mathrm{Min}(\kappa f_m,\ 1) \\
\hat{f}_j = \mathrm{Min}\!\left(\kappa f_j,\ 1 - \sum_{k=j+1}^{m} \hat{f}_k\right), \quad 1 \le j < m
\end{cases}
$$

Notably, this closed-form bound can be calculated efficiently given the cycle times of the modes and their frequencies.

3.3.3 Linear Programming Mode-Based Bounds

In this section, we propose two linear programs that derive performance bounds for the specified subset of unique choice Petri nets (Definition II.1), as natural extensions of the linear program proposed for marked graphs. These bounds are derived by solving a linear programming formulation that finds static arrival time values associated with circuit nodes in different modes of operation. Like the ACFB, the LP bounds presented in this section provide an upper bound on the long-term average cycle time of the conditional asynchronous circuit. Unlike the ACFB, however, they take pairwise mode switching probabilities into account and capture structural information about the modes in the form of static arrival times. In fact, the static arrival values calculated during LP minimization effectively provide latency bounds on the exponential set of paths in the unfolding of the circuit behavior in time.

To avoid unnecessary new formalism, we add a dummy cutline transition and a dummy cutline place to each cutline place, as shown in Figure 3.10.
This replacement ensures that any path in the unfolded execution of the Petri net is cut by a dummy transition with a known static arrival time. Let $a_i$ be the static arrival time variable associated with transition $t_i$ (including the dummy cutline transitions) and $d_i$ be the delay associated with place $p_i$.

Since mode switching takes place according to a Markov chain, the stationary probability of each mode can be calculated from the pairwise transition probabilities $P = [p_{mm'}]$. In particular, we let $\pi_m$ denote the stationary probability of mode $m$. This is the fraction of circuit cycles in which mode $m$ occurs in an adequately long run of the circuit. The frequency of switching from mode $m$ to mode $m'$ can then be calculated as $\pi_m p_{mm'}$.

Bound (i) (LPB): We will show that for any arbitrary mode sequence compatible with the given Markov chain probabilities, the value $\tau_{LPB}$ derived by the linear program of Figure 3.6 is an upper bound on the overall average cycle time of the Petri net. In this formulation, $\tau_m$ is a variable which can be interpreted as the optimal cycle-time of $MG(s_m)$, the marked graph associated with mode $m$, which also satisfies the constraints of the linear program.

$$
\begin{aligned}
&\text{Minimize } \tau_{LPB} \text{ subject to} \\
&\tau_{LPB} = \sum_m \pi_m \tau_m \\
&\forall \langle t_i, p, t_j \rangle \in \tilde{F}_m,\ m \in \mathcal{M}:\quad a_j = a_i + d_p - M(p)\,\tau_m + f_p^{(m)} \\
&a_i \ge 0; \quad \tau_m \ge 0; \quad f_p^{(m)} \ge 0
\end{aligned}
$$

Figure 3.6: Linear program for LPB.

Bound (ii) (LPBG): A more accurate performance bound can be obtained using the LP formulation of Figure 3.7. The LPBG formulation extends the LPB bound by introducing the notion of pairwise arrival time gaps between modes of operation ($\delta_{mm'}$). The idea of the LPBG is that, because each mode of a conditional circuit represents an unconditional component of the circuit, we can apply the LP of Figure 3.4 to each mode. The resulting LP is shown in Figure 3.7, in which a separate arrival time variable is associated with each transition in each mode.
For example, $a_j^{(m)}$ is the arrival time associated with transition $j$ in mode $m$.

$$
\begin{aligned}
&\text{Minimize } \tau_{LPBG} \text{ subject to} \\
&\tau_{LPBG} = \sum_m \pi_m \tau_m + \sum_m \sum_{m'} \pi_m\, p_{mm'}\, \delta_{mm'} \\
&\forall \langle t_i, p, t_j \rangle \in \tilde{F}_m,\ m \in \mathcal{M}:\quad a_j^{(m)} = a_i^{(m)} + d_p - M(p)\,\tau_m + f_p^{(m)} \qquad \text{(i)} \\
&\forall m, m' \in \mathcal{M}:\quad a_i^{(m)} - a_i^{(m')} \le \delta_{mm'} \qquad \text{(ii)} \\
&a_i^{(m)} \ge 0; \quad \tau_m \ge 0; \quad \delta_{mm'} \ge 0; \quad f_p^{(m)} \ge 0
\end{aligned}
$$

Figure 3.7: Linear program for the LPBG.

Constraint set (i) in this figure is the same as constraint set (i) in Figure 3.4, replicated for each mode. Constraint set (ii), however, is new and calculates the gap penalties, $\delta_{mm'}$, which represent the maximum difference between the arrival time variables of the same transition across the two modes. Compared to the linear program of Figure 3.6, this linear program has an arrival time constraint per mode per triple $\langle t_i, p, t_j \rangle$. Despite the additional constraints, our experimental results show that the proposed LP can be solved in reasonable time for relatively large circuits.

Note that $\pi_m p_{mm'}$ is the fraction of time the circuit switches from mode $m$ to mode $m'$ and therefore, intuitively, $\delta_{mm'}$ can be interpreted as the timing penalty for switching between operation modes. Also note that the LPBG and LPB formulations are equivalent if $\delta_{mm'} = 0$ for all modes. The solution space of the LPB is therefore a subset of the solution space of the LPBG, which in turn means:

$$
\tau_{LPBG} \le \tau_{LPB}.
$$

The more accurate bound comes at the cost of a more complex LP formulation, as a different set of arrival times is required per mode of operation in the LPBG formulation.

In general, the LPBG has an advantage over the LPB in circuits for which the structures of the modes are quite different, such that accepting the mode switching penalties leads to lower values for the $\tau_m$ variables and to a significantly lower overall bound on the average cycle time.

Figure 3.8: Inaccuracy of LPBG for high switching probabilities in ALU.
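To make the cost function of Figure 3.7 concrete, the sketch below evaluates it for one candidate assignment of cycle times and gap penalties. It reads the cost as the stationary-probability-weighted sum of the per-mode cycle times plus the switching-frequency-weighted sum of the gaps. It only evaluates the objective and does not solve the linear program; the function name `lpbg_objective`, the parameter names `pi`, `tau`, and `delta`, and all numbers are assumptions chosen for illustration.

```python
def lpbg_objective(pi, P, tau, delta):
    """Evaluate sum_m pi_m * tau_m + sum_m sum_m' pi_m * p_mm' * delta_mm'.

    pi    -- stationary probability of each mode
    P     -- pairwise mode switching probabilities, P[m][n]
    tau   -- per-mode cycle time variables
    delta -- pairwise gap penalties, delta[m][n]
    """
    modes = range(len(pi))
    cycle_term = sum(pi[m] * tau[m] for m in modes)
    gap_term = sum(pi[m] * P[m][n] * delta[m][n] for m in modes for n in modes)
    return cycle_term + gap_term


# Hypothetical two-mode example (values in picoseconds are made up).
pi = [0.8, 0.2]
P = [[0.9, 0.1], [0.4, 0.6]]
tau = [1000.0, 2000.0]
delta = [[0.0, 50.0], [30.0, 0.0]]

bound = lpbg_objective(pi, P, tau, delta)
no_gap = lpbg_objective(pi, P, tau, [[0.0, 0.0], [0.0, 0.0]])
```

Setting every gap to zero leaves only the first term, which is the form of the LPB cost function; the gap term grows with the switching frequencies, so rarely taken switches contribute little even when their gaps are large.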
3.4 Sequence-Based Bounds

The mode-based bounds introduced in Section 3.3.3 are efficient but can be overly conservative when the probability of mode switching is high. To demonstrate this point, let us revisit the ALU example. Figure 3.8 plots the value of the LPBG and the measured value of the average cycle time versus the switching probability $p_{MA} = p_{AM}$. As the switching probability increases, the LPBG bound approaches a maximum value and remains constant, failing to capture the reduction in the switching gap penalties caused by increasingly alternating switching between modes, as explained in Section 3.4.1. This behavior of the LPBG can be explained by noting that the linear program of Figure 3.7 trades off the two terms of the cost function against each other. Once the second term, associated with the gap penalties, becomes too large due to the high switching probability, the gap values drop to zero and the gap penalties vanish. As a result, the value of the bound becomes independent of the switching probabilities and remains constant.

Our solution to this problem is to analyze short sequences of modes, which can more accurately model the critical paths at high switching probabilities.

3.4.1 Critical Paths in Alternating Sequences

It is well known that the average cycle time of a conditional circuit may drop as switching probabilities increase [NB12][GGS08][JBRS10]. To illustrate this, consider again an ALU with ADD and MULT operations and assume that the MULT cycle time is considerably longer than the ADD cycle time. This can be modeled by assigning a large delay value to the place q in the Petri net model depicted in Figure 4.2. When sequences of MULT operations happen back-to-back, as illustrated in the unfolding in Figure 2.8(b), this delay shows up in the critical path in every cycle and increases the average cycle time.
However, with an alternating sequence such as $\langle MULT, ADD \rangle$, the ADD and MULT units can be activated concurrently, as illustrated in the unfolding in Figure 2.8(a), reducing the average cycle time. In essence, the MULT unit is used at half rate, allowing the large latency to be effectively removed from the critical path.

Based on this observation, it is expected that in some cases analyzing sequences instead of individual segments will improve the accuracy of our bound.

3.4.2 Formalizing Sequences

Similar to a segment, we formally define a sequence as a portion of the unfolded execution of a Petri net which is marked by two cutlines; unlike a segment, however, the cutlines do not need to be consecutive (Definition II.24). Refer to Figure 2.8 for examples of sequences.

As with segments, a marked graph can be associated with each sequence (Definition II.25). Examples of such marked graphs are presented in Figure 2.8. It turns out that in the ALU example, the marked graph associated with the sequence $\langle MULT, ADD \rangle$ has a lower cycle time than that of the sequence $\langle MULT, MULT \rangle$, as a result of more concurrent operations. This fact can also be observed in Figure 2.9(a); the two highlighted cycles in the figure include one instance of place q, which we assumed to have a higher associated delay. These cycles, however, are marked by two tokens, which lowers their cycle metric and effectively reduces the cycle time of the marked graph. We will use this observation to formulate a more accurate performance bound in the next section. Sequences and their associated marked graphs have useful properties which enable us to apply the linear program of Figure 3.4 to these marked graphs to calculate their cycle times.

3.4.3 Linear Program Bound based on Sequences (LPBS)

In this section we present an alternative linear program, based on the concept of sequences, to solve the problem stated in Section 3.1. For simplicity we pick a fixed sequence length, $\ell$, and formulate the bound.
However, the method can be generalized to variable sequence lengths; one such extension is provided in Section 3.4.4.

The sequence-based bound (LPBS) is a generalization of the LPBG, in the sense that while in the LPBG we decomposed the unfolded execution into segments, here we decompose it into fixed-length sequences. This enables us to bound the delay of each path segment inside a sequence based on the properties of the marked graph associated with that sequence. As in the mode-based bounds, to bound the delay along an arbitrary globally critical path we use the arrival time gaps of sequences on every switch from one sequence to the next.

To formulate the linear program, we need to calculate the stationary probabilities of sequences and the switching probabilities between sequences. This can be achieved by forming an extended Markov chain based on the pairwise switching probability matrix, $P = [p_{mm'}]$, introduced in Section 3.1. Let us define $\hat{P} = [\hat{p}_{ss'}]$ as the pairwise switching probability matrix of the extended Markov chain for sequences of length $\ell$, in which $\hat{p}_{ss'}$ is the probability of switching from sequence $s$ to sequence $s'$. Also let $s(i)$ denote the mode corresponding to the $i$-th segment in sequence $s$. The switching probabilities for sequences can be calculated as follows:

$$
\hat{p}_{ss'} = \mathrm{Prob}(s' \mid s) = \mathrm{Prob}(s' \mid s(\ell)).
$$

The above is true since in a Markov chain the probability of the next state depends only on the current state and not on earlier states. Therefore,

$$
\hat{p}_{ss'} = p_{s(\ell)s'(1)}\, p_{s'(1)s'(2)} \cdots p_{s'(\ell-1)s'(\ell)} = p_{s(\ell)s'(1)} \prod_{i=1}^{\ell-1} p_{s'(i)s'(i+1)}.
$$

Similarly, the stationary probability of sequence $s$ can be calculated as:

$$
\hat{\pi}_s = \pi_{s(1)} \prod_{i=1}^{\ell-1} p_{s(i)s(i+1)}.
$$

The linear program for the performance bound based on sequences is presented in Figure 3.9 and closely resembles the formulation of the LPBG. In fact, it can be easily verified that for $\ell = 1$ the LPBS reduces to the LPBG formulation.
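The sequence probabilities defined above can be computed directly from the mode-level Markov chain. The sketch below is one possible way to do so under stated assumptions: the stationary distribution is approximated by power iteration (which presumes an ergodic, aperiodic chain), sequences are encoded as tuples of mode indices, and the names `stationary`, `chain_product`, and `sequence_chain` as well as the example matrix are hypothetical.

```python
from itertools import product


def stationary(P, iterations=500):
    """Approximate the stationary distribution of P by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[k] * P[k][m] for k in range(n)) for m in range(n)]
    return pi


def chain_product(P, seq):
    """Product of the switching probabilities between consecutive modes of seq."""
    result = 1.0
    for a, b in zip(seq, seq[1:]):
        result *= P[a][b]
    return result


def sequence_chain(P, length):
    """Stationary and switching probabilities for sequences of a fixed length."""
    pi = stationary(P)
    seqs = list(product(range(len(P)), repeat=length))
    # Stationary probability: start in s(1), then follow the sequence.
    pi_hat = {s: pi[s[0]] * chain_product(P, s) for s in seqs}
    # Switching probability: leave the last mode of s, then follow t.
    p_hat = {(s, t): P[s[-1]][t[0]] * chain_product(P, t)
             for s in seqs for t in seqs}
    return pi_hat, p_hat


# Hypothetical two-mode chain; its stationary distribution is (0.8, 0.2).
P = [[0.9, 0.1], [0.4, 0.6]]
pi_hat, p_hat = sequence_chain(P, 2)
```

For a sequence length of one, `chain_product` returns 1 for every sequence, so the result collapses to the mode-level stationary and switching probabilities.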
The major difference between the LPBS and LPBG formulations is the $1/\ell$ multiplicand on the second term of the cost function in the LPBS formulation. The intuition behind this factor is that the gap penalties in the LPBS must be applied only every $\ell$ cutlines, which reduces their impact.

$$
\begin{aligned}
&\text{Minimize } \tau_{LPBS} \text{ subject to} \\
&\tau_{LPBS} = \sum_s \hat{\pi}_s \tau_s + \frac{1}{\ell} \sum_s \sum_{s'} \hat{\pi}_s\, \hat{p}_{ss'}\, \delta_{ss'} \\
&\forall \langle t_i, p, t_j \rangle \in MG(s):\quad a_j^{(s)} = a_i^{(s)} + d_p - M(p)\,\tau_s + f_p^{(s)} \qquad \text{(i)} \\
&a_i^{(s)} - a_i^{(s')} \le \delta_{ss'}, \quad \forall t_i \in T_C \qquad \text{(ii)} \\
&a_i^{(s)} \ge 0; \quad \tau_s \ge 0; \quad \delta_{ss'} \ge 0; \quad f_p^{(s)} \ge 0
\end{aligned}
$$

Figure 3.9: Linear program for the LPBS.

To appreciate the potential benefits of the sequence-based approach, consider the ALU example when the transition probabilities are close to 100%, so that the circuit essentially alternates between the ADD and MULT modes. In the sequence-based analysis the unfolded execution of the circuit can be decomposed into a series of $\langle MULT, ADD \rangle$ sequences rather than into alternating MULT and ADD segments. Using this sequence will likely yield a better bound if the cycle time of the sequence is smaller than the average of the cycle times of the individual MULT and ADD modes plus the potentially conservative penalty gap between them. In particular, this can happen because the arrival times associated with the cutline between the MULT and ADD operations within the sequence are not as constrained as in the mode-based approach.

The proof of correctness for the LPBS can be carried out in exactly the same way as for the LPBG bound (refer to Section 3.5) and is therefore omitted to avoid redundancy. The underlying reason is that Lemma III.1, which is the core of both proofs, works for any k-safe marked graph.

3.4.4 Selective Elimination of Sequences

Unfortunately, analyzing sequences does not always improve the bound beyond the mode-based analysis. For this reason, we propose to selectively use only those sequences which we predict will help, yielding further improvements.
In particular, we propose a simple heuristic to identify potentially bad sequences and show how the formulation of the sequence-based LPBS can be modified to benefit from the results obtained from the mode-based LPBG. Note that this generalization is a special case of supporting sequences with variable lengths: once a sequence is eliminated, we effectively decompose it into a series of sequences of length one, to which the LPBG result can be applied.

Assume we first compute the mode-based LPBG and, as an example, consider the sequence $\langle MULT, ADD \rangle$ for the ALU. The LPBG yields arrival time values for each transition in each mode, and the delay of any path enclosed in such a sequence can be bounded by

$$
\tilde{\tau}_s = \tau_{MULT} + \delta_{MA} + \tau_{ADD}.
$$

Motivated by this analysis, we propose to solve the LPBS with all the sequence gap constraints removed to obtain the lowest possible values of $\tau_s$. If this value is not less than $\tilde{\tau}_s$ by more than an arbitrary threshold, we deem the sequence potentially bad and eliminate it.

This elimination is carried out by removing the constraints associated with the potentially bad sequence from the LPBS formulation, forcing the arrival times of the transitions in the sequence to be equal to their LPBG values, and fixing $\tau_s$ to be equal to $\tilde{\tau}_s$. In a subsequent linear program run, we add back all the gap constraints to the LPBS. The gap constraints to the eliminated sequences are kept unchanged, as the gap values will capture the arrival time difference between any sequence and the first mode of the eliminated sequence. The gap constraints from the eliminated sequence are modified to represent the gap from the last mode of the eliminated sequence to any other sequence. Note also that the gaps between the modes internal to the sequence are already taken into account in the fixed value of $\tilde{\tau}_s$.

We can prove the correctness of the elimination method in the same way we proved the LPBG and sequence-based bounds.
Here, we decompose any potential critical path at the boundary cutlines of the remaining sequences and, wherever a sequence is eliminated, at those of the individual modes. Note that the LPBG arrival times satisfy the constraints for individual modes and the gap constraints between them. These arrival times also yield gap values between each mode and each remaining sequence, which can be used in the proof. Also note that by forcing the arrival times of the modes of an eliminated sequence to the corresponding LPBG values, we trade accuracy for calculation time, as the LPBG arrival times are not necessarily optimal for the resulting sequence-based formulation after elimination.

3.4.5 Complexity Analysis

The cost paid to improve the accuracy of the bound is the larger complexity of the LPBS formulation. In particular, a sequence length of two yields a linear program with four times as many variables and constraints. This increase, however, is somewhat mitigated by using variable-length sequences through the elimination of useless sequences, as suggested above. For example, for two modes, rather than having the four mode sequences 00, 11, 01, 10, the reduced set of sequences 0, 1, 01 may produce similar results with only twice as many variables and constraints. In fact, our experiments show diminishing returns for larger sequence lengths, and thus modest sequence lengths may be sufficient in many applications.

3.5 Proof of Correctness

Since the LPB is a special case of the LPBG, we present the proof for the latter. The proof is based on the following basic lemma.

Lemma III.1. In a marked graph, the delay of any path, $\rho$, starting at transition $t_s$ and ending at transition $t_e$ is bounded by:

$$
D(\rho) \le a(t_e) - a(t_s) + \tau M(\rho)
$$

where the $a(t)$ are a set of arrival times satisfying the linear program of Figure 3.4, $\tau$ is the cycle-time of the marked graph, $M(\rho)$ is the total number of initially marked places on the path, and $D(\rho)$ is the total delay of the places on the path.

Proof.
The constraints of the linear program of Figure 3.4 can be written as:

$$
Ax = b
$$

in which $A$ and $x$ can be rewritten as:

$$
A_{|P| \times (|T|+|P|)} = \begin{bmatrix} C_{|P| \times |T|} & -I_{|P| \times |P|} \end{bmatrix},
\qquad
x_{(|T|+|P|) \times 1} = \begin{bmatrix} a_{|T| \times 1} \\ f_{|P| \times 1} \end{bmatrix}.
$$

$C = [c_{ij}]$ is the connectivity matrix,

$$
c_{ij} =
\begin{cases}
+1 & t_i \in p_j^{\bullet} \\
-1 & t_i \in {}^{\bullet}p_j \\
0 & \text{otherwise}
\end{cases}
$$

$a$ is the set of arrival times associated with the transitions, $I$ is the identity matrix, and $f$ are the non-negative slack variables. Finally $b = [b_i]$, in which $b_i = d_i - \tau\, m(p_i)$, and $d_i$ is the delay associated with place $p_i$. In this setting, the path $\rho = [\rho_i]$ is the vector where

$$
\rho_i =
\begin{cases}
1 & p_i \in \rho \\
0 & \text{otherwise}
\end{cases}
$$

To complete the proof, multiply both sides of the constraint equation by $\rho^T$:

$$
\rho^T (Ax) = \rho^T b.
$$

This operation sums all the rows on both sides associated with the places on the path. For any transition on the path other than $t_s$ and $t_e$, there is exactly one incoming place, $c_{ij} = +1$, and one outgoing place, $c_{jk} = -1$, and therefore the arrival time variables of these transitions cancel out, resulting in the following:

$$
a(t_e) - a(t_s) - \sum_{p \in \rho} f_p = \sum_{p \in \rho} \left( d_p - \tau\, m(p) \right).
$$

Note that $\sum_{p \in \rho} f_p \ge 0$ and as a result:

$$
\sum_{p \in \rho} d_p \le a(t_e) - a(t_s) + \tau \sum_{p \in \rho} m(p).
$$

As stated in [NB12], any arbitrary unfolded execution of a Petri net can be decomposed into classes of segments, each representing a mode of operation of the Petri net. This property enables us to apply the above lemma to the marked graph associated with the segment of each mode, $MG(s_m)$, and bound the delay of the path-segments between any place on the input cutline and any place on the output cutline of the segment. To do so we need a set of arrival times which satisfy the constraints of the LP formulation of Figure 3.4 for $MG(s_m)$.

Lemma III.2. Any set of arrival times which satisfies the linear program of Figure 3.7 also satisfies the linear program of Figure 3.4 for $MG(s_m)$ with $\tau = \tau_m$.

Proof.
Since the LPBG formulation includes all the constraints of the linear program in Figure 3.4, its solution space is a subset of the solution space of the linear program of Figure 3.4, and any set of arrival times which satisfies the LPBG formulation also satisfies the linear program of Figure 3.4.

Using Lemmas III.1 and III.2, the delay of each path-segment between any two places located on the input and output cutlines of a segment can be bounded as follows.

The order of mode switching governs the overall cycle time of each unfolded execution, and without knowing the exact sequence of mode switching it is not possible to identify the globally critical path in an unfolded execution. Therefore, in our proof we assume that any path can potentially be a globally critical path. Each potential globally critical path in an unfolded execution of the performance model is decomposed into several path segments by the cutline dummy transitions. Consider the sample unfolded execution shown in Figure 3.10, in which one potential globally critical path spanning three cutlines is highlighted. This path starts from cutline transition $t_R^{(0)}$, passes through $t_P^{(1)}$ and $t_R^{(2)}$, and ends at $t_R^{(3)}$. Lemma III.1 can be applied to the path segments cut by cutline transitions. For the first path segment, $\rho_1$, from $t_R^{(0)}$ to $t_P^{(1)}$, since this is a path in the marked graph associated with the addition mode, $MG(s_{ADD})$, one can write:

$$
D(\rho_1) \le a^{(ADD)}(t_P^{(1)}) - a^{(ADD)}(t_R^{(0)}) + \tau_{ADD}.
$$

Note that the only place on $\rho_1$ which is initially marked in $MG(s_{ADD})$ is $P$. By Lemma III.2, we can use the arrival times derived by the linear program of Figure 3.7 in this formulation.
In the same way we can bound the latency of the path $\rho_2$, from $t_P^{(1)}$ to $t_R^{(3)}$, as:

$$
D(\rho_2) \le a^{(MULT)}(t_R^{(3)}) - a^{(MULT)}(t_P^{(1)}) + \tau_{MULT}.
$$

If the value of $\delta_{ADD,MULT}$ is zero, then $a^{(ADD)}(t_P^{(1)}) = a^{(MULT)}(t_P^{(1)})$, and therefore we can get a bound on the delay of the path segment $\rho_{12}$, from $t_R^{(0)}$ to $t_R^{(3)}$, by simply adding the above two inequalities:

$$
D(\rho_{12}) \le a^{(MULT)}(t_R^{(3)}) - a^{(ADD)}(t_R^{(0)}) + \tau_{MULT} + \tau_{ADD}.
$$

In the presence of non-zero gap values, however, there is an added $\delta_{ADD,MULT}$ term on every switch from mode ADD to mode MULT, and we have the following:

$$
D(\rho_{12}) \le a^{(MULT)}(t_R^{(3)}) - a^{(ADD)}(t_R^{(0)}) + \tau_{MULT} + \tau_{ADD} + \delta_{ADD,MULT}.
$$

Figure 3.10: Potential critical path example.

Theorem III.1. $\tau_{LPBG}$ derived by the linear program of Figure 3.7 is an upper bound on the long-term average cycle-time for any arbitrary sequence of mode switching compatible with the pairwise switching probabilities:

$$
\tau_U(t) \le \tau_{LPBG}.
$$

Proof. For any potential globally critical path, $\rho$, we can write the following for $\rho_i$, the $i$-th path-segment of $\rho$, enclosed between $t_C^{(i)}$ and $t_C^{(i+1)}$, the $i$-th and $(i+1)$-th cutline transitions on $\rho$:

$$
D(\rho_i) \le a(t_C^{(i+1)}) - a(t_C^{(i)}) + \tau^{(i)}.
$$

In the above inequality $\tau^{(i)}$ is equal to $\tau_m$ if the $i$-th segment is in the class of mode $m$. Consider an unfolded execution with $N$ segments, where $N$ is adequately large. Given that the total number of segments of type $m$ is $\pi_m N$ and the number of switches from mode $m$ to mode $m'$ is $\pi_m p_{mm'} N$, and that the arrival-time terms of consecutive path-segments cancel up to the gap penalties, we can write:

$$
D(\rho) = \sum_{i=0}^{N} D(\rho_i) \le \sum_m \pi_m N \tau_m + \sum_m \sum_{m'} \pi_m\, p_{mm'}\, N\, \delta_{mm'}.
$$

Since we assume that the transition of interest fires in all modes of operation, the total number of firings of the transition of interest is also $N$, and therefore in the limit we get the following, which completes the proof.
$$
\tau_U(t) \le \sum_m \pi_m \tau_m + \sum_m \sum_{m'} \pi_m\, p_{mm'}\, \delta_{mm'}.
$$

The beauty of the proof is that the result is independent of the selection of the globally critical path, and therefore the bound holds for any sequence of modes complying with the given switching probabilities.

In conclusion, the number of constraints increases in the worst case by a factor of $|\mathcal{M}|$, and $|\mathcal{M}| - 1$ extra variables are added for the cycle times of the modes. Notably, in practical designs only a subset of channels belongs to more than one mode, and conditional performance bounds are not expected to be effective in circuits where all the channels belong to almost all the modes. This suggests that for practical conditional circuits the increase in constraints should be well below a factor of $|\mathcal{M}|$. Also note that for circuits where a large number of modes can be defined, several modes can be combined into a super mode to trade off computational complexity against the quality of the bound. We will compare the execution time of the LP solver for conditional circuits with that for the equivalent unconditional circuit by introducing the product of bound precision and LP execution time, to provide some sense of such trade-offs.

3.6 Experimental Results

The formulations of the LPB and LPBG bounds are implemented in Proteus to measure the accuracy of the proposed bounds on a set of benchmark circuits, including randomly generated conditional asynchronous benchmarks (ISCAS89a) as well as one example inspired by industry called MgmtChain.

3.6.1 Benchmark Circuits

There is a lack of conditional asynchronous circuit benchmarks that can be used to evaluate performance analysis and optimization techniques that consider conditional communication. Earlier unconditional slack matching work [BLDK06] was evaluated using ISCAS89 netlists [Naj12] that were automatically converted to unconditional asynchronous circuits.
In particular, the gate-level structure of the base benchmark in ISCAS89 was maintained, and it was assumed that each gate represents a distinct unconditional asynchronous pipeline stage and that each wire between gates represents an asynchronous channel. For this work, we randomly add conditionality to these circuits. In particular, for each benchmark circuit and for each mode of operation, a subset of channels is randomly selected as the seeds of the mode, and in subsequent steps adjacent channels are identified using Breadth-First Search (BFS) and probabilistically added to the mode. To implement the resulting conditional circuit, a new input which encodes the mode of operation is added to the benchmark circuit, and SEND and RECV cells [BDL11], which implement conditional communication, are added to the circuit to ensure that each channel is activated in its set of associated modes.

Figure 3.11: Management chain benchmark.

Our industrial benchmark, shown in Figure 3.11, is composed of an ALU with variable data width and a parametric Management Chain (MgmtChain), and is inspired by an industrial design of asynchronous circuits [Dam12]. It is composed of $N$ computational nodes, each calculating a watermark function on a stream of input data and some locally stored value. The local values, however, need to be set by a central controller, and in order to reduce implementation cost, the computational nodes are organized in a serial chain instead of a parallel distribution tree. During the Normal mode of operation the circuit determines, at maximum throughput, whether the input value is below the watermark. During a rarer Management mode, the controller updates the stored value of a specific node by sending a command which includes the node address through the chain, shown with dotted lines, at a lower targeted throughput.
The block diagram of the circuit is shown in Figure 3.11. Since the results generated at the DataOut output channels need to be stored in a wide memory, the data tokens need to arrive at the same time to avoid any performance stalls. Note, however, that due to the completely different functionality of the two modes, the arrival time variables associated with each circuit node in different modes of operation can be quite different for large values of $N$.

3.6.2 Analysis

The benchmarks are synthesized by Proteus using a quasi-delay-insensitive precharged half buffer pipeline template [Lin98]. The proposed performance bound formulations are then applied to a conservative Full Buffer Channel Net model proposed for half buffer circuits [BLD11]. To measure the accuracy of the proposed bounds, the half buffer circuits are then simulated post synthesis. Timing information for the bound formulation and the post synthesis simulation is extracted from a library of gates with back annotated post layout timing information.

Bench    τ_LPBG   τ_0       τ_1       Max δ     τ_sim    τ_sim^SWC  p_01     p_10     π_0      π_1      Err_LPBG
         (ps)     (trans.)  (trans.)  (trans.)  (ps)     (ps)       (prob.)  (prob.)  (prob.)  (prob.)  (%)
as27     1064     32        18        7         1014     1014       0.04     0.01     0.23     0.77     4.74
as298    2849     54        58        2         2499     2600       0.5      0.5      0.51     0.49     8.75
as344    1080     34        18        18        1012     1023       0.04     0.01     0.21     0.79     5.26
as386    1062     18        34        12        975      989        0.01     0.04     0.81     0.19     6.88
as349    1503     30        30        4         1500     1501       0.04     0.01     0.17     0.83     0.11
as382    3007     46        78        32        2483     2735       0.01     0.01     0.57     0.43     9.04
as400    3151     58        70        10        2823     2823       0.01     0.01     0.59     0.41     10.39
as420    1105     36        18        35        975      1058       0.04     0.01     0.20     0.80     4.22
as444    1120     36        18        26        992      1034       0.04     0.01     0.22     0.78     7.67
as510    2002     38        42        8         1612     1849       0.01     0.01     0.50     0.50     7.68
as526    1091     18        34        14        989      991        0.01     0.04     0.78     0.22     9.11
as526a   1067     18        36        21        1003     1004       0.01     0.04     0.83     0.17     5.98
as713    1420     30        28        1         1301     1379       0.04     0.01     0.19     0.81     2.88
as820    1387     36        26        7         1274     1368       0.04     0.01     0.17     0.83     1.40
as820    1534     29        36        7         1401     1464       0.01     0.04     0.78     0.22     4.52
as832    1468     34        28        6         1357     1413       0.04     0.01     0.21     0.79     3.78
as838    1071     35.33     18        34        978      1024       0.04     0.01     0.17     0.83     4.36
as1196   1154     36        18        44        1010     1067       0.04     0.01     0.24     0.76     7.50
as1488   1548     35.60     30        8         1415     1516       0.04     0.01     0.15     0.85     2.10
as1238   1120     35.33     18        37        1040     1061       0.04     0.01     0.22     0.78     5.26
as5378   4680     98        90        12        3741     4112       0.01     0.01     0.43     0.57     12.14
as9234   4952     98        100       9         3703     4203       0.01     0.01     0.52     0.48     15.13
as15850  4778.26  98        94        16.00     3500.00  3697.62    0.01     0.01     0.36     0.64     22.62

Table 3.1: Performance bound result for ISCAS89a.

Table 3.1 presents the resulting LPBG performance bound for the ISCAS89a benchmark. In this table the following values result from solving the linear program of Figure 3.7: $\tau_0$ and $\tau_1$ are the cycle time variables associated with mode 0 and mode 1, and $\delta$ is the maximum value of the gap between the two modes. $\tau_{sim}$ is the average cycle time of the benchmark obtained from post synthesis simulation.
Since our bounds do not address eager evaluation of the QDI templates used in the implementation of the benchmarks, we also report the simulation results, $\tau^{SWC}_{sim}$, for a semi-weak conditioned (SWC) version of the library, which always waits for all inputs to arrive before evaluating. $\pi_0$ and $\pi_1$ are the stationary values calculated post simulation. Note that since the simulation is performed with finite Markov chain stimuli, the stationary probabilities post simulation differ slightly from the theoretically calculated values. To cancel out the impact of finite simulation, the post-simulation stationary values are back annotated in the LPBG formula and the resulting value is reported in the table.

The inaccuracy of the performance bound is calculated as:

\[ Err_{LPBG} = \frac{\left| \tau^{SWC}_{sim} - \tau_{LPBG} \right|}{\tau^{SWC}_{sim}} . \]

Although this work does not guarantee an error margin for the proposed bounds, our results show that the average inaccuracy for the ISCAS89a benchmark is about 7%.

To illustrate the impact of arrival time gaps on the quality of the derived performance bounds, the values resulting from the LPB and LPBG formulations for the MgmtChain example with different numbers of computational nodes are depicted in Figure 3.12. The transitional probabilities between the Normal and Management modes are set to $p_{NM} = 0.01$ and $p_{MN} = 0.04$, which result in stationary probabilities of $\pi_N = 0.8$ and $\pi_M = 0.2$. The diagram indicates that as the number of nodes increases, the arrival time gap between the two modes of operation, $\delta$, increases, and therefore the impact of the gap becomes more apparent. As shown in the figure, the error of the LPB bound increases radically with the number of nodes, while the error of the LPBG bound remains almost constant. The reason is that when the structures of the modes are quite different, the LPB constraints can only be satisfied by larger values of $\tau_N$ and $\tau_M$. In contrast, these values can remain constant in the LPBG formulation while the constraints are satisfied through larger pairwise arrival time gap values.
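As a small illustration of how the stationary probabilities quoted above follow from the transitional probabilities, the sketch below runs a plain power iteration on the two-mode Markov chain. The function name `stationary_distribution` and the row-stochastic matrix layout (row = current mode, column = next mode) are assumptions made for this example and are not part of Proteus or of the original tool flow.

```python
def stationary_distribution(P, tol=1e-12, max_iter=100_000):
    """Power iteration for the stationary distribution of an ergodic chain.

    P[i][j] is the probability of switching from mode i to mode j
    (each row sums to 1). This layout is an assumption of the sketch.
    """
    n = len(P)
    pi = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(pi, nxt)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("power iteration did not converge")


# Normal/Management example from the text: p_NM = 0.01, p_MN = 0.04.
P_MGMT = [
    [0.99, 0.01],  # Normal -> Normal, Normal -> Management
    [0.04, 0.96],  # Management -> Normal, Management -> Management
]
pi_normal, pi_management = stationary_distribution(P_MGMT)
```

For a two-mode chain the same values follow in closed form, $\pi_N = p_{MN} / (p_{NM} + p_{MN})$ and $\pi_M = p_{NM} / (p_{NM} + p_{MN})$, which gives 0.8 and 0.2 for the probabilities above.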
To introduce conditionality and mimic different modes of operation, for each mode we randomly select a set of edges in the circuit as cluster seeds to form conditional islands and then grow each cluster through a BFS traversal of the graph until a certain number of edges have been added to the mode. For each mode the algorithm takes the number of cluster seeds and the number of edges in each cluster as inputs and creates mode islands by instantiating send and receive modules appropriately on the boundary of each island. It is important to do this such that the circuit remains deadlock free for any order of modes, otherwise performance measurement would not be possible. To control the modes of operation of the circuit, we also added a tree of control signals to the circuit to distribute the mode control values to the instantiated sends and receives.

[Figure 3.12: Impact of arrival time gap in MgmtChain Example.]

To illustrate the benefits of examining sequences, we implemented a 32-bit asynchronous ALU, similar to the example shown in Figure 4.1, and examined all sequences of ADD/MULT modes of lengths 2 and 3. The result is plotted in Figure 3.13 for the LPBG bound and the two sequence based bounds, labeled SL2 and SL3 respectively, as a function of the mode switching probability. As can be seen in the figure, the inaccuracy of SL3 is below 1%. Moreover, the accuracy of the bounds appears to show diminishing returns on sequence length very quickly. In particular, sequences of length 2 already capture the cycle-time reduction caused by the reverse impact of mode switching on the average cycle time.
This is important to note, as the run-time of the linear program increases quickly from 10 seconds for sequences of length 1 to 25 minutes for sequences of length 3.

[Figure 3.13: Improvement of the sequence based bounds for Asynchronous ALU. Horizontal axis: probability of mode switching; vertical axis: cycle time.]

Figure 3.14 shows the impact of the sequence based bound with length 3 (SL3) for the ISCAS89a benchmarks. In this experiment, the probabilities of switching, $p_{01} = p_{10}$, are swept from 0.5 to 0.9 with a step size of 0.1. Figure 3.14(a) shows the inaccuracy of the LPBG for these benchmarks over the range of probabilities. Note that LPBG is equivalent to the sequence bound with length one (SL1). Figure 3.14(b), in turn, plots the effect of the sequence based bound with length 3, including the impact of the heuristic selective elimination algorithm discussed in Section 3.4.4. The selective elimination of sequences improves the bound precision on average by 4% and reduces the calculation time of the bound on average by 17%. Compared to the LPBG, the calculation time of the SL3 increases by a factor of 50. The maximum calculation time of the SL3 occurs for as1196 and is 45 minutes, compared to 10 seconds for the LPBG.

3.7 Conclusions

Asynchronous circuits with conditional behavior often have distinct modes of operation, each of which can be modeled as a marked graph with its own performance target. In this chapter we derived performance bounds for conditional circuits based on the cycle times of successively larger collections of these underlying modes. This is particularly challenging, as cycles caused by the transition between modes can affect the globally critical path for many cycles. This section bounds this impact.
In doing so, it proves the somewhat intuitive result that slack matching conditional circuits with a conservative unconditional model guarantees the circuit will meet the performance of the unconditional model.

[Figure 3.14: Sequence-based bound accuracy. Panel (a): error of LPBG/SL1 (%). Panel (b): error reduction of SL3 compared to SL1 (%).]

Our future work is to apply these bounds to slack matching mode-based conditional circuits and quantify the resulting improvement. In particular, we suggest a direct application of these improved bounds to mode-based slack matching algorithms that guarantee a given performance with fewer slack matching buffers and thus yield circuits with smaller area and reduced power consumption.

CHAPTER IV

Mode-Based Conditional Slack Matching (footnote †)

The problem this chapter addresses is how to slack-match a conditional asynchronous circuit and guarantee an overall long-term average cycle time for the circuit, given an individual target cycle time for each mode of operation and the probabilities of switching between the different modes of operation.

Let us revisit the conditional ALU example in comparison to the unconditional ALU shown in Figure 4.1. In Figure 4.1(a) all channels are unconditional. Since the inputs of the MUX are unconditional, it has to receive tokens on both $M_{out}$ and $A_{out}$, independent of the value of the opcode, before it can compute and send the result on channel $Out$ (footnote 1). Similarly, the DEMUX needs to finish sending a value to both outputs $M_{in}$ and $A_{in}$ before processing the next input. On the other hand, in Figure 4.1(b), the SPLIT and MERGE have conditional channels.
The SPLIT will only send a value to $M_{in}$ when the opcode is a Mult and keeps $A_{in}$ inactive. Similarly, the MERGE will only wait for a value on $M_{out}$ during a Mult.

Conditional communication saves power. For example, in Figure 4.1(a), because of the absence of conditionality, a dummy value must be sent through the multiplier even if the desired operation is an addition, to avoid a deadlock in the MUX. Although the MUX module will drop the dummy result, the multiplier has unnecessarily burnt power processing the dummy values.

Footnote †: Content of this section is partially published as a conference paper [NB13b].

Footnote 1: We assumed that the MUX module cannot eagerly evaluate when only $M_{out}$ has received a value and the opcode is Mult. Slack-matching that takes advantage of eager evaluation is an open problem and out of the scope of this work.

[Figure 4.1: Asynchronous ALU: (a) unconditional (b) conditional communication.]

More relevant to this chapter, conditionality can improve performance. The dummy token, passing through the long-latency path of the multiplier, may arrive at the MUX later than the actual result from the adder, causing an unnecessary performance stall and reducing throughput. By introducing conditionality using the SPLIT and MERGE, fewer stalls will occur because tokens are not sent along the long-latency multiplier path during additions. These performance stalls can be removed using slack matching (footnote 2).

In the absence of a conditional slack matching approach, unconditional slack matching approaches have been applied to conditional circuits by assuming that all the paths are active during each operation (e.g., [BLDK06][PM06][BDL11]). The problem with these approaches is that they unnecessarily add extra pipeline buffers to balance all pipeline paths, many of which are, in reality, not simultaneously active.
The conditional slack matching approach presented in this chapter, however, can save many slack matching buffers by more intelligently balancing conditional pipeline paths. In addition, our approach enables efficient tradeoffs between area and performance. For example, given that the multiplier may be activated less frequently than the adder, it may make sense to design the multiplier to have a higher target cycle time and save more slack matching buffers within the multiplier.

Footnote 2: Note that due to the complete flow control implemented with asynchronous handshaking, slack matching buffers only affect performance and not functional correctness.

An underlying challenge in conditional slack matching is analyzing the average performance of such conditional circuits, which is addressed in Chapter III. This chapter goes beyond analysis and discusses how to optimize such circuits for average-case performance. Specifically, this chapter proposes the first exact slack-matching algorithm for such conditional asynchronous circuits that guarantees an average cycle time for the circuit, given an individual target cycle time for each mode of operation and pairwise probabilities of switching between modes of operation.

Our experiments demonstrate that our linear-programming-based solution is both efficient, handling circuits with over 2000 channels in less than 4 minutes, and effective, saving up to 73% of slack-matching buffers (an average of 33%) compared to the traditional approach.

Moreover, because our performance model is based on a wide class of Petri nets and a generic model of asynchronous channels, we believe the approach is applicable to a wide variety of styles of asynchronous pipelines, including PCHB and WCHB [Lin98], GasP [SF01], and mousetrap [SN07].

4.1 Problem Definition

The conditional slack matching problem for asynchronous circuits with mode-based behavior can be informally stated as follows.
We are given the following:

- The netlist of the asynchronous circuit as a directed graph $G_c(V_c, E_c)$, where $v \in V_c$ is an asynchronous node and $c = \langle v_i, v_j \rangle \in E_c$ is the asynchronous channel from $v_i$ to $v_j$.
- The set of operational modes for the circuit, $\mathcal{M} = \{m_1, \ldots, m_M\}$, where $m_k \subseteq E_c$ is the subset of channels activated by the $k$-th mode.
- Abstract performance metrics of each channel in the form of their forward and backward latencies (described in more detail in Section 4.3).
- The desired target cycle time vector for the modes, $\tau = [\tau^{(m)}]$.
- The matrix of pairwise transitional probabilities, $P = [p_{mm'}]$, where $p_{mm'}$ is the conditional probability of switching from mode $m$ to mode $m'$.

Our goal is to insert a minimum number of slack-matching buffers in the graph, such that:

- The cycle time of the circuit running solely in mode $m$ is less than the desired target cycle time, $\tau^{(m)}$.
- The average cycle time of the circuit is lower than an upper bound function, $B(P, \tau)$, for any arbitrary sequence of mode transitions complying with the given Markov chain.

We assume that the Markov chain is ergodic, and therefore the stationary probability of each mode can be calculated as a function of the pairwise transitional probabilities. In particular, we let $\pi_m$ denote the stationary probability of mode $m$. This is the percentage of circuit cycles in which mode $m$ occurs in an adequately long run of the circuit. The frequency of switching from mode $m$ to mode $m'$ can then be calculated as $\pi_m p_{mm'}$.

In particular, the upper bound function we propose is of the following form:

\[ B(P, \tau) = \sum_{m} \pi_m \tau^{(m)} + \sum_{m} \sum_{m'} \pi_m \, p_{mm'} \, \delta_{mm'} . \]

The $\delta_{mm'}$ are the gap values, which are obtained by our proposed slack matching MILP formulation and represent the natural asymmetry of timing requirements between pairs of modes. The first term in the above function is the average of the target cycle times of the modes weighted by their probabilities of occurrence. The second term bounds the total impact of switching between modes.
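The two terms of the bound function can be evaluated directly once the stationary probabilities, target cycle times, transitional probabilities and gap values are known. The following is a minimal sketch under stated assumptions: the function name `average_cycle_time_bound`, the list-of-lists layout, the convention that a mode has zero gap with itself, and all numeric values (cycle times and gaps in ps) are hypothetical and chosen only for illustration.

```python
def average_cycle_time_bound(pi, P, tau, delta):
    """Weighted target cycle times plus the expected mode-switching penalty.

    pi[m]       : stationary probability of mode m
    P[m][k]     : probability of switching from mode m to mode k
    tau[m]      : target cycle time of mode m
    delta[m][k] : arrival time gap charged on a switch from m to k
                  (delta[m][m] is assumed to be 0 in this sketch)
    """
    n = len(pi)
    weighted_targets = sum(pi[m] * tau[m] for m in range(n))
    switching_penalty = sum(
        pi[m] * P[m][k] * delta[m][k] for m in range(n) for k in range(n)
    )
    return weighted_targets + switching_penalty


# Hypothetical two-mode instance (values are illustrative, not measured).
PI = [0.8, 0.2]
P = [[0.99, 0.01], [0.04, 0.96]]
TAU = [1000.0, 1500.0]
DELTA = [[0.0, 200.0], [50.0, 0.0]]
bound_ps = average_cycle_time_bound(PI, P, TAU, DELTA)
```

With these hypothetical numbers the first term contributes 800 + 300 ps and the switching term contributes 1.6 + 0.4 ps, which shows how rarely taken transitions add only a small penalty on top of the weighted target cycle times.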
Recall that $\pi_m p_{mm'}$ represents the total switching from mode $m$ to mode $m'$, and each time a switch happens, the $\delta_{mm'}$ penalty has to be added to the weighted average.

4.2 Related Work

In [BLDK06], Beerel et al. presented an exact mixed integer linear program (MILP) formulation that guarantees a target cycle time for unconditional asynchronous pipelines, modeled with a class of marked graphs, by adding the minimal number of pipeline buffers. Theory that proves conditions under which the MILP formulation admits a polynomial time solution is also provided. Concurrently, Prakash et al. [PM06] proposed a more complex MILP formulation for a more detailed model of unconditional pipelines.

Gill et al. [GGS08] proposed an efficient method for estimating the performance of hierarchical asynchronous circuits with conditional behavior. The method extends the "canopy graph" method developed in [Lin98] to conditional behavior. In [GS09] the method is utilized for heuristic bottleneck detection in asynchronous circuits. While slack matching of conditional circuits is addressed, no proof is provided for the correctness of the method and no worst-case or average-case performance bound is claimed for the slack-matched circuit. Venkataramani et al. [VG06] proposed a heuristic, iterative algorithm for slack matching an asynchronous circuit with conditional behavior. The performance evaluation is based on simulation of a given input sequence. No proof has been provided of the optimality of the approach, and no performance bound is derived for anything other than the given input sequence.

A practical slack matching approach used in industry models conditional circuits as if they were unconditional [BDL11]. The general idea is that this may require more slack-matching buffers than absolutely needed, but it enables the use of efficient LP-based formulations [BLDK06] and guarantees that the cycle time of the conditional circuit is no worse than the cycle time of the slack-matched unconditional model.
In [NB12], it has been proved that treating conditional circuits as unconditional in slack matching leads to a guaranteed bound, but as our results indicate, the number of slack-matching buffers is far from optimal. [NB12] has also introduced the first performance bound for conditional mode-based circuits, but it is easy to show that the performance bound guaranteed by our approach is more precise and leads to a better area-performance tradeoff.

4.3 Performance Modeling

In order to model the performance of asynchronous circuits we use the notion of Petri nets as presented in Definition II.1. We use an extension of the Full Buffer Channel Net (FBCN) model as the performance model, in which each asynchronous pipeline stage is modeled using a transition and each asynchronous channel between stages is modeled with a pair of places: a forward place (circles) and a backward place (squares), which are labeled with the forward and backward latencies of the corresponding channel [BLDK06]. Forward latency is the time needed by the pipeline stage, in its initially ready state, to generate the output after it receives all of its inputs. Backward latency models the time it takes for the pipeline stage to get back to its initially ready state after it generates the output. The sum of the forward and backward latencies represents the local cycle time of the channel.

As an example, Figure 4.2 shows the simplified Petri net model for the conditional ALU example. Note that for simplicity, we do not show the timing arcs related to the opcode channel in the ALU. However, in the actual performance model, the opcode and related control channels are active in both the Add and Mult modes.

[Figure 4.2: Split-merge FBCN model for ALU.]

4.4 Suggested MILP

This section presents the MILP formulation of the conditional slack matching problem as a natural extension of unconditional slack matching.
An unconditional asynchronous circuit is slack matched for target cycle time $\tau$ if the following MILP can be solved [BLDK06].

\[
\begin{aligned}
\text{Minimize} \quad & \sum_{c} cost_c \, s_c \\
\text{subject to} \quad & \forall\, c = \langle v_i, v_j \rangle \in E_c:
\begin{cases}
a_j = a_i + l(c) - M(c)\,\tau + f_c + l_s s_c \\
0 \le f_c \le \tau - \tau_c + s_c\,(\tau - \tau_s)
\end{cases} \\
& a_i \ge 0, \quad \forall\, v_i \in V_c \\
& s_c \in \mathbb{Z}^{+}
\end{aligned}
\]

Here $s_c$ and $cost_c$ are the number and the cost of the slack-matching buffers added to channel $c$, respectively. $a_i$ is the arrival time variable associated with the asynchronous node $v_i$. $M(c)$ is 1 if channel $c$ has an initial data token upon reset and is 0 otherwise (footnote 3). $l(c)$ and $\bar{l}(c)$ are the forward and backward latencies associated with the channel. $\tau_c = l(c) + \bar{l}(c)$ is the local cycle time of the channel, and finally $l_s$ and $\tau_s$ are the forward latency and local cycle time of the slack-matching buffers.

Footnote 3: In Proteus, a TokBuf is a module which sends a zero token to its output upon reset, and channels connected to the output of TokBufs have $M(c) = 1$.

Our proposed MILP formulation of the conditional slack matching problem, shown in Figure 4.3, is based on the FBCN model described in Section 4.3. The following constants are given to the problem for all channels $c$ and modes $m$.

- $\tau^{(m)}$: desired target cycle time for mode $m$.
- $cost_c$: cost of a slack matching buffer on channel $c$.
- $\tau_c$: the local cycle time of channel $c$.
- $m(c)$: initial marking of channel $c$ (footnote 4).
- $l_c$: forward latency of channel $c$.
- $l_s$: forward latency of a slack matching buffer.
- $\tau_s$: local cycle time of a slack matching buffer.

The program is solved for the following variables.

- $s_c$: number of slack matching buffers inserted on each channel $c$.
- $a^{(m)}_i$: arrival time variable associated with each node $v_i$ in every mode $m$.
- $\delta_{mm'}$: pairwise static arrival time gap variables.
- $f^{(m)}_c$: positive auxiliary variables.

This formulation is based on the linear program for mode-based performance analysis of Petri nets in Figure 3.7, assuming the circuit is described using the FBCN model presented in Section 4.3. Recall this means that each pipeline stage is represented by a transition and each channel is associated with one forward and one backward place.
Footnote 4: A token buffer is a pipeline module that sends a token on its output upon reset and subsequently acts as a buffer. Channels connected to the output of token buffers have $m(c) = 1$.

\[
\begin{aligned}
\text{Minimize} \quad & \sum_{c} cost_c \, s_c \\
\text{subject to} \quad & \forall\, c = \langle v_i, v_j \rangle \in E_c,\ m \in \mathcal{M}:
\begin{cases}
a^{(m)}_j = a^{(m)}_i + l_c - m(c)\,\tau^{(m)} + f^{(m)}_c + l_s s_c & \text{(i)} \\
0 \le f^{(m)}_c \le \tau^{(m)} - \tau_c + s_c\,(\tau^{(m)} - \tau_s) & \text{(ii)}
\end{cases} \\
& \forall\, m, m' \in \mathcal{M}: \quad a^{(m)}_i - a^{(m')}_i \le \delta_{mm'} \qquad \text{(iii)} \\
& a^{(m)}_i \ge 0, \quad \forall\, v_i \in V_c \\
& s_c \in \mathbb{Z}^{+}
\end{aligned}
\]

Figure 4.3: MILP for conditional slack matching.

Constraints (i) and (ii) together implement the constraints for the forward and backward places, modified to capture the impact of the $s_c$ buffers added to each channel $c$. In particular, constraint (i) is based on constraint (i) in Figure 3.7. $f^{(m)}_c \ge 0$ guarantees that for each channel $c = \langle v_i, v_j \rangle$ with no initial token, the arrival time of node $v_j$ is at least as large as the arrival time of node $v_i$ plus the forward latency of the channel ($l_c$) plus the forward latency of the $s_c$ slack matching buffers that will be added to the channel. Note that the constraint for channels with an initial token includes a $-\tau^{(m)}$ term to account for the execution cycle difference between the two transitions.

The constraints associated with the backward places in the FBCN model are implemented by constraining the upper bound of the auxiliary variable $f^{(m)}_c$. In particular, the $\tau^{(m)} - \tau_c$ term represents the free slack of channel $c$ in mode $m$. This free slack represents the maximum time the channel can be stalled without impacting the target cycle time. Similarly, $s_c\,(\tau^{(m)} - \tau_s)$ represents the additional free slack made available on this channel by the addition of $s_c$ buffers. These free-slack-based constraints are an extension of the unconditional slack matching formulation in [BLDK06] to mode-based conditional circuits.

Note that our current formulation obtains the maximum pairwise gap values but does not bound them (footnote 5).
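To make the roles of constraints (i) to (iii) concrete, the sketch below checks a candidate assignment of buffer counts, arrival times and gaps against them. It does not solve the MILP. The `Channel` record, the function name `satisfies_constraints`, the choice to check a channel only in the modes that activate it, and the numeric example are all assumptions made for illustration; the auxiliary variable $f^{(m)}_c$ is eliminated by solving constraint (i) for it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Channel:
    src: str              # source node v_i
    dst: str              # sink node v_j
    fwd_latency: float    # l_c
    local_cycle: float    # tau_c
    token: int            # m(c): 1 if the channel holds a token at reset
    modes: frozenset      # modes that activate this channel (assumption)


def satisfies_constraints(channels, slack, arrival, tau, delta, l_s, tau_s, eps=1e-9):
    """Return True if the candidate solution meets constraints (i)-(iii)."""
    for c, s in zip(channels, slack):
        if s < 0 or int(s) != s:          # s_c must be a non-negative integer
            return False
        for m in c.modes:
            # Constraint (i) solved for the auxiliary variable f.
            f = (arrival[m][c.dst] - arrival[m][c.src] - c.fwd_latency
                 + c.token * tau[m] - l_s * s)
            # Constraint (ii): f lies between 0 and the available free slack.
            upper = tau[m] - c.local_cycle + s * (tau[m] - tau_s)
            if f < -eps or f > upper + eps:
                return False
    for m in arrival:
        if any(t < -eps for t in arrival[m].values()):
            return False
        for k in arrival:
            if m == k:
                continue
            # Constraint (iii): per-node arrival time gap between two modes.
            for node, t in arrival[m].items():
                if node in arrival[k] and t - arrival[k][node] > delta[(m, k)] + eps:
                    return False
    return True


# Hypothetical single-channel example with two modes.
CH = Channel("u", "v", fwd_latency=2.0, local_cycle=6.0, token=0,
             modes=frozenset({"add"}))
TAU = {"add": 10.0, "mult": 10.0}
DELTA = {("add", "mult"): 0.0, ("mult", "add"): 5.0}
ARRIVAL_ON_TIME = {"add": {"u": 0.0, "v": 2.0}, "mult": {"u": 5.0}}
ARRIVAL_LATE = {"add": {"u": 0.0, "v": 8.0}, "mult": {"u": 5.0}}
```

In this example the late arrival at `v` exceeds the channel's own free slack of 4 time units, so it is rejected with zero buffers and accepted once one buffer (with `l_s = 2`, `tau_s = 6`) is placed on the channel.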
Once the MILP is solved, it can be shown that the slack matched circuit works at the guaranteed performance bound derived by the linear program of Figure 3.7. This can be shown by finding arrival times for the inserted buffers in each mode and showing that these arrival times also meet the constraints of Figure 3.7. Arrival times of the buffers can be derived by spreading them appropriately between the arrival time of the source of the channel, $a^{(m)}_i$, and that of the sink of the channel, $a^{(m)}_j$. In the simplest case, where the slack matching buffers have the same forward latency and cycle time as the channel, the arrival time values can be spaced evenly along the channel, and it is easy to verify that this set of arrival times satisfies the constraints of each mode and the pairwise gap penalties between all mode pairs.

Since the general form of the slack-matching problem has been demonstrated to be NP-Complete, the presented MILP is solved approximately by first solving the relaxed LP version of the problem, in which $s_c$ can take non-integer values, and then obtaining a sub-optimal integer solution by applying a ceiling function to the fractional values, similar to [BLDK06].

Note that the complexity of the new formulation compared to unconditional slack matching has increased by a factor of $O(M)$, where $M$ is the number of modes, since a constraint per channel per mode is now added to the formulation. Our results indicate that the relaxed LP solution and the sub-optimal integer solution can be found in practical time.

Footnote 5: Our experiments show that the problem cannot be solved for arbitrary bounds on the pairwise gap variables. While it is possible to minimize the gap values by adding them to the cost function, the general solution of the problem with bounds on the gaps is left as future work.

4.5 MILP Relaxation

It has been shown that slack matching of asynchronous circuits is an NP-Hard problem [KB00]. The MILP formulation of even relatively small practical instances of the problem is intractable. Traditionally, the MILP formulation of the problem is relaxed to an LP version in which a fractional number of buffers is accepted. The result of the LP problem is then converted to integers using the ceiling function, which rounds the fractional number of buffers up to the next integer value.

An LP relaxation algorithm for the slack-matching problem is provided in [BLDK06]. The relaxed LP problem is formulated using variables for the free slack of the channels instead of variables for the number of slack-matching buffers, because the former LP is totally unimodular in most cases and yields integer values for the free slack of the channels. Since each slack-matching buffer adds a known amount of free slack to its input and output channels, given the free slack values of each channel one can calculate the number of buffers, which can be a fractional number. The calculation is done simply by dividing the free slack of the channel obtained from the LP by the slack of a slack-matching buffer to get the number of buffers needed on each channel. The resulting fractional numbers of buffers are then rounded up if the algorithmic cycle time is not violated. The algorithmic cycle time is the cycle metric associated with the circuit when only forward delays are considered. It has been shown experimentally that relaxation generates close to optimal solutions for unconditional slack matching.

If rounding a fractional variable violates the algorithmic cycle time, the problem will not have a feasible solution after buffer insertion unless the target cycle time constraint is relaxed. The method in [BLDK06] uses an algorithm that conservatively marks globally critical channels to avoid rounding fractional numbers of buffers on critical channels. Also note that any relaxation algorithm may need to iterate several times, either for convergence or to try to achieve a lower total number of buffers.
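The conversion from a channel's free slack to a buffer count and the subsequent ceiling step can be sketched as follows. This is one reading of the free-slack bound of Section 4.4, $f_c \le \tau - \tau_c + s_c(\tau - \tau_s)$, solved for $s_c$; the function name `buffers_for_free_slack` and the example numbers are hypothetical, and the sketch ignores the algorithmic cycle time check that the text attaches to the rounding step.

```python
import math


def buffers_for_free_slack(f_c, tau, tau_c, tau_s):
    """Fractional and rounded buffer counts needed to supply free slack f_c.

    Assumes each buffer contributes (tau - tau_s) of free slack and the
    channel itself already provides (tau - tau_c), as in the bound
    f_c <= tau - tau_c + s_c * (tau - tau_s).
    """
    per_buffer = tau - tau_s
    if per_buffer <= 0:
        raise ValueError("a buffer must be faster than the target cycle time")
    own_slack = tau - tau_c
    fractional = max(0.0, (f_c - own_slack) / per_buffer)
    return fractional, math.ceil(fractional)


# Hypothetical channel: target 10, local cycle time 6, buffer cycle time 6.
fractional_buffers, rounded_buffers = buffers_for_free_slack(10.0, 10.0, 6.0, 6.0)
```

Here the channel needs 6 more units of free slack than it provides on its own, each buffer supplies 4, so the LP-style answer is 1.5 buffers and the ceiling step inserts 2.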
While it has been shown that the relaxation algorithm of [BLDK06] yields good sub-optimal solutions, within 10% of the optimal result for small examples that can be solved optimally using MILP, it is not clear how far the solution is from optimal for larger, more practical circuits.

The MILP relaxation problem becomes even harder for conditional slack matching, especially considering the notion of arrival time gaps. In conditional slack matching with the arrival time gap constraint, the arrival time of a node in two modes can be different as long as the gap constraint is met. The gap constraint makes the LP relaxation harder in the following sense: as a result of relaxation, the arrival time of a node may be updated in only one of the modes, which may then result in a violation of the arrival time gap constraint. Consider a case in which some fractional number of buffers is needed on an input channel of a node. If the input channel belongs to only a single mode, the arrival time of the node in the other modes does not need to be updated unless a gap constraint is violated. In that case, if the arrival time of the node is updated in a second mode, more slack matching buffers might be needed in that mode. The chain of relaxation dependencies can lead to a change in arrival time in a third mode, and so on. If these relaxation chains create a circular dependency, more buffers than expected might be added to the circuit to make the solution feasible (footnote 6).

Another complication, which makes application of the relaxation algorithm of [BLDK06] impossible for conditional slack matching, is the fuzziness of the notion of global criticality in conditional asynchronous circuits. While we can easily define critical cycles in unconditional asynchronous circuits based on the notion of the algorithmic cycle in marked graphs, in conditional asynchronous circuits global criticality can only be defined based on an infinite unrolling of the Petri net model of the circuit.
Due to the above complications we decided to develop a new relaxation algorithm, which is described in detail here. The main idea is that no notion of global criticality is needed for the relaxation algorithm, as the conditional or unconditional slack-matching LP formulation already captures all the timing arcs of the circuit. The secondary idea is that relaxation of a fractional slack variable can be done in the same LP formulation by adding a pseudo-binary variable to hold the fractional part; maximizing the pseudo-binary variable rounds the fractional value up to 1. Note that the LP formulation for slack matching minimizes the number of buffers, while the modified LP maximizes the pseudo-binary variables.

Footnote 6: A similar complication may arise in unconditional slack matching with an arrival time alignment constraint.

After we solve the original LP to get fractional values for the number of buffers, we iteratively solve the same LP with the following modifications. For each slack variable with a fractional value in the solution, the slack variable is fixed to the floor of the resulting value and a pseudo-binary variable is added to all of its channel constraints (footnote 7). The pseudo-binary variable is a continuous variable between zero and one. For each added pseudo-binary variable, a term with a large negative coefficient is added to the cost function of the LP. As a result, the pseudo-binary variables are pushed hard toward 1. The modified LP is then solved iteratively. Since the pseudo-binary variables are added to the cost function with relatively large negative coefficients, it is expected that all the pseudo-binary variables are maximized to 1; otherwise the problem has no solution and the target cycle time must be relaxed. Since one iteration of this algorithm may introduce fractional values on slack variables that already had integral values, several iterations of the algorithm might be needed for convergence.
It is expected that the above algorithm relaxes the results faster than the method suggested in [BLDK06], since a set of fractional values can be relaxed in one iteration of the LP, while in the other method the globally critical channels must be reexamined after rounding each fractional value. Our results show up to 30 times faster convergence for some examples.

Footnote 7: Recall that in the conditional formulation a forward and a backward constraint exist per mode per channel.

Another advantage of the proposed algorithm is that the quality of relaxation and the execution time can be traded off against each other. For example, the current version of the algorithm uses a linear schedule to increase the pseudo-binary coefficients in the cost function instead of using a very large value from the first iteration of relaxation. The intuition is to relax, in the early iterations, only those fractional values that have the least negative impact on the number of buffers, hoping that the added extra slack reduces the need to round costly fractional values in later iterations. A second heuristic is to relax only a subset of the fractional values based on a criterion function and gradually enlarge the set until all fractional values are relaxed. Another heuristic, which we currently use, is to update the coefficients of the slack variables in order to reduce the number of fractional values prior to relaxation.

4.6 Mode Propagation Algorithm

In the current implementation of conditional slack matching, mode assignment is done by the user. The designer is responsible for decomposing the circuit channels into multiple modes of operation by assigning input and output ports of the design components, or specific channels, to different modes of operation, and for assigning appropriate target cycle time values to each mode. Any channel which is not explicitly assigned to a mode is then conservatively assigned to the appropriate modes of operation through a simple mode propagation algorithm.
The procedure is based on the principle that synchronized channels should be assigned to the exact same set of modes, assuming that the circuit is slack elastic [MM98] (footnote 8). Two channels in a circuit are called synchronized if the difference between the numbers of completed send actions on them is always bounded.

Footnote 8: Slack-elastic circuits are circuits in which adding an arbitrary number of asynchronous buffers to different channels will not cause a deadlock. All the circuits we deal with in this work are assumed to be slack elastic.

Inputs and outputs of any unconditional gate, such as an and2 with inputs $A_a$ and $B_a$ and output $O_a$, are synchronized channels. Let $\|C\|$ denote the number of completed send actions on channel $C$. Then,
\[
-1 \le \|A_a\| - \|B_a\| \le 1, \qquad
0 \le \|A_a\| - \|O_a\| \le 1, \qquad
0 \le \|B_a\| - \|O_a\| \le 1 .
\]
The reason is that the and2 cannot complete any communication on channel $A_a$ until it receives a token on channel $B_a$, and vice versa. Similarly, no token can be sent at the output until the corresponding tokens are received at both inputs. In the same manner one can show the following for a SEND module with input $A_s$, condition $E_s$, and output $O_s$:
\[
-1 \le \|E_s\| - \|A_s\| \le 1, \qquad
0 \le \|A_s\| - \|O_s\| \le \infty, \qquad
0 \le \|E_s\| - \|O_s\| \le \infty .
\]
Note that $E_s$ and $A_s$ are synchronized, while $A_s$ and $O_s$ are not.

It must be noted that if synchronized channels are activated under inappropriate conditions there is a possibility of deadlock, and since asynchronous circuits are always designed to avoid deadlock, one can safely assume that synchronized channels are in the same mode of operation. Based on this concept the mode propagation algorithm works as follows. Modes assigned to each channel, perhaps initially by the designer, must be propagated to synchronized channels, therefore:
• For unconditional gates, modes need to be propagated from any input or output to all other terminals, as all terminals are synchronized channels.
• For SEND, modes are propagated only from input to enable, and from enable to input.
• For RECV, modes are propagated only from enable to output, and from output to enable.

Figure 4.4: Mode propagation example, panels (a) and (b).

As an example, consider the simple netlist of Figure 4.4. The channels marked by the user are shown in Figure 4.4(a), and using the above mode propagation algorithm the modes are propagated to obtain the mode assignment of Figure 4.4(b). It is notable that, as a result of mode propagation, modes 1, 2, and 3 end up with the exact same set of channels and are therefore equivalent.

Note that the whole process is a top-down approach starting from the user, who has a clear understanding of the functionality of the circuit and specifies the mode assignment of the channels. Mode assignment, however, can impact the quality of any optimization. Although an experienced user may prefer to have full control over detailed mode assignment and target cycle time selection, the procedure is tiresome and time-consuming and can be automated.

4.7 Guaranteed Average Cycle-Time

The proposed MILP formulation slack matches the circuit such that the overall average cycle time of the circuit is bounded by the B function presented in Section 3.1. In this section we provide a sketch of the proof; the detailed proof is provided in Appendix A. The proof is based on the following theorem.

Theorem IV.1. The delay of any path segment enclosed within two consecutive cut-lines, starting at place $p_I$ and ending at place $p_O$, is bounded by:
\[
D(\langle p_I, p_O \rangle) \le a^{(m)}(p_O) - a^{(m)}(p_I) + \tau^{(m)}
\]
where $m$ is the mode marked by the two cut-lines, $a^{(m)}(p_I)$ and $a^{(m)}(p_O)$ are the arrival times associated with the places in mode $m$ derived by our proposed MILP formulation, and $\tau^{(m)}$ is the stationary target cycle time associated with mode $m$.
Each globally critical path in the unfolded execution of the performance model is decomposed into several path segments by the cut-lines. Repeated application of the above theorem to each path segment results in an upper bound on the latency of the globally critical path. Note that whenever a mode switch happens in the unfolded execution, there is a gap between the arrival times of the starting and ending places of the path segments, and a gap term for the pair of modes involved has to be added to the sum to guarantee an upper bound on the latency of the globally critical path. Similar to [NB12], we assume there exists a transition of interest which fires in all the modes, and therefore bounding the delay of the critical path also bounds the long-term average cycle time in the limit.

4.8 Experimental Results

Our proposed conditional slack-matching scheme is applied to a set of benchmarks including the synthetic conditional asynchronous benchmarks (ISCAS89a) and Management Chain, already introduced in Section 3.6.1 and used for performance bound evaluation.

4.8.1 Analysis of Results

The proposed MILP is implemented as a part of the Proteus design flow and the results are compared to the unconditional slack-matching algorithm [BLDK06], which is already implemented in Proteus and has been used in industry for several years.

Table 4.1 presents the results of the conditional slack-matching approach compared to unconditional slack matching, which treats all the conditional channels unconditionally. The probabilities of mode switching are set to $p_{01} = 0.01$ and $p_{10} = 4p_{01}$, which result in stationary mode probabilities of $\pi_0 = 0.8$ and $\pi_1 = 0.2$. Column B shows the value of the bound in terms of the number of signal transitions. Given that in the Proteus flow each signal transition takes 50 ps, it can be verified that in all cases the reported average cycle time ($\tau_{sim}$) is bounded by $B \times 50$ ps. The number of slack-matching buffers is reduced on average by 33% for this set of benchmarks.
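As a quick sanity check on the numbers quoted above, the following sketch computes the stationary distribution of a two-state Markov chain from its switching probabilities and converts a bound given in transitions to picoseconds. The function names are illustrative and not part of the Proteus flow; the as27 figures are copied from Table 4.1.

```python
def stationary_two_state(p01, p10):
    """Stationary probabilities (pi0, pi1) of a two-state Markov chain.

    p01 is the probability of switching from mode 0 to mode 1 and p10 the
    probability of switching back. Balance requires pi0 * p01 == pi1 * p10.
    """
    total = p01 + p10
    return p10 / total, p01 / total


def bound_in_ps(bound_transitions, ps_per_transition=50.0):
    """Convert a cycle-time bound from signal transitions to picoseconds."""
    return bound_transitions * ps_per_transition


p01 = 0.01
pi0, pi1 = stationary_two_state(p01, 4 * p01)
print(round(pi0, 3), round(pi1, 3))

# as27 row of Table 4.1: B = 20.90 transitions, simulated cycle time 1014.06 ps
print(bound_in_ps(20.90) >= 1014.06)
```

With $p_{10} = 4p_{01}$ the balance equation gives $\pi_0 / \pi_1 = 4$, which reproduces the 0.8 and 0.2 stated in the text, and the as27 bound of 20.90 transitions corresponds to 1045 ps.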
The table also lists the increase in cycle time compared to unconditional slack matching, where the circuit is slack matched according to the smaller mode target cycle time, to demonstrate the tradeoff between the number of slack-matching buffers and the average cycle time.

Figure 4.5 shows the results for the MgmtChain circuit with different numbers of computational nodes. As the number of nodes increases, the difference between the arrival times of the nodes in Normal and Management modes increases, and therefore the gap values also increase. This effect can be observed in the figure, as the maximum gap between the two modes increases from 24 transitions for 8 nodes to 449 transitions for 64 nodes. In unconditional slack matching, buffers usually have to be inserted along the shorter paths to slack match re-convergent fanouts. In conditional slack matching, whenever a re-convergent fanout is formed between the two modes, far fewer buffers are needed, as the gap values can grow to accommodate a larger difference between arrival times in different modes. A reduction of 24% with 8 nodes to 32% with 64 nodes is reported in the number of slack-matching buffers in this example.

Figure 4.5: Results for MgmtChain, $p_{MN} = 4p_{NM}$.

On the other hand, large gap values can result in performance loss, as B increases when the gap values and the switching probabilities between the modes are significantly higher. To demonstrate the impact of the gap on performance, Figure 4.5 also plots the average cycle time of the circuit measured from post-synthesis simulation. When the probability of transitioning from Normal to Management is increased from 0.01 to 0.02, the cycle time increases by approximately 12% for 64 nodes. It is notable that the number of slack-matching buffers remains the same in both cases, as the stationary probabilities, which are part of the MILP formulation, are kept constant ($\pi_0 = 0.2$, $\pi_1 = 0.8$) by setting $p_{MN} = 4p_{NM}$. One can reduce the overall average cycle time by decreasing the stationary target cycle time of either mode in the presence of large gap values, but in general it is desirable to be able to bound the gap values in the MILP formulation.

Bench     SM-Buf    SM Time   Max gap    B          τ_0        τ_1        τ_sim      SM-Buf     τ_sim
          (count)   (sec.)    (trans.)   (trans.)   (trans.)   (trans.)   (ps)       (Red. %)   (Incr. %)
as27      9         0         7          20.90      18         32         1014.06    35.71      12.67
as298     77        2         4          19.65      18         26         935.97     6.10       4.00
as386     105       2         12         21.33      18         34         989.59     18.60      -0.97
as349     38        1         4          30.06      30         30         1501.55    59.14      23.33
as382     114       3         4          25.63      24         26         1273.84    31.74      16.00
as400     131       2         12         21.56      18         35         1004.90    20.61      11.66
as420     166       2         35         21.90      18         36         1058.66    31.97      17.63
as444     137       3         26         21.82      18         36         1034.22    42.44      14.91
as510     36        6         4          29.26      28         34         1457.98    73.91      12.16
as526a    176       3         19         21.77      18         36         1043.59    39.93      15.95
as713     133       3         1          28.41      28         30         1379.17    27.72      6.29
as820     86        8         7          30.52      29         36         1464.75    42.28      4.57
as838     436       11        28         21.44      18         34         1004.55    22.28      11.62
as1196    811       51        40         21.79      18         35.33      1081.56    26.00      20.17
as1488    124       35        8          31.23      30         35.6       1516.00    62.87      1.72
as1238    1006      52        37         21.80      18         35.33      1061.75    17.94      17.97
as5378    1384      215       36         23.49      20         36         1107.04    12.85      0.64
as9234    316       99        18         42.58      36         44         2011.46    30.55      -4.22

Table 4.1: Conditional slack matching results for ISCAS89a.

4.9 Conclusions

High performance in asynchronous circuits often comes at the expense of higher area and power, partly due to the need for adding pipeline buffers to balance asynchronous pipelines. The problem of minimizing this cost for mode-based conditional asynchronous circuits, given mode switching probabilities, has been defined and solved using an MILP formulation.
Our solution guarantees an upper bound on the average performance of the circuit for any arbitrary sequence of mode switching, based on a notion of mode switching gap penalties. Our results show an average 33% reduction in the number of slack-matching buffers for a new benchmark of conditional asynchronous circuits based on ISCAS89a gate-level netlists.

CHAPTER V

Integrated Fanout Optimization and Slack Matching

Fanout optimization is the standalone problem of building pipelined buffer trees for nodes with high fanout to make the circuit implementable in a target technology. The problem must be solved under the maximum fanout per gate constraint and should guarantee the average targeted cycle time. The goal of fanout optimization for asynchronous circuits is to form fanout trees such that:
• Each high-fanout channel is distributed among all its sink nodes.
• Every node of the circuit, including the fanout buffers used to build the tree, has a fanout count which satisfies the maximum fanout constraint.
• Finally, the post-fanout-optimized circuit satisfies the algorithmic target cycle time.

We postulate that fanout optimization and slack matching (FOSM) are highly correlated problems and therefore have to be solved in an integrated fashion to achieve optimal results. In particular, we believe the shape of the fanout tree determines the static arrival times associated with the asynchronous nodes, which highly impact slack matching. Therefore we expect that by properly skewing the fanout trees the total buffer count, including both slack-matching and fanout buffers, will be reduced. However, given the large set of feasible fanout trees and the fact that the slack-matching problem is itself NP-complete, solving the two problems together is very challenging.

The state-of-the-art approach, currently used by industry, is to solve the fanout optimization problem first with no or limited slack-matching considerations. Unfortunately, this means fanout optimization may generate fanout trees which are not properly skewed for slack matching, which may result in a higher number of slack-matching buffers and therefore inferior results. In this chapter we provide a complete MILP formulation and a heuristic algorithm to solve the problem through LP relaxation. While the MILP version is intractable for even small circuits, the LP relaxation technique is quite fast and can easily handle large circuits in a matter of a few minutes.

A generalization of this work can be defined as conditional slack matching and fanout optimization, which is left as future work. Similar to conditional slack matching, for a circuit with multiple modes of operation we can define a mode-based approach for fanout optimization. The idea is that since different modes of operation have different assigned target cycle times, the same gate may have higher maximum fanout constraints in slower modes of operation. As a result, fanout trees in the slower modes may be implemented with fewer slack-matching buffers and lower tree height.

5.1 Problem Formulation

The new problem of integrated slack matching and fanout optimization can be defined as follows. We are given the following:
• The netlist of an asynchronous circuit (image netlist) as a directed graph $G_c(V_c, E_c)$, where $v \in V_c$ is an asynchronous node and $c_{ij} = \langle v_i, v_j \rangle \in E_c$ is a channel between nodes $v_i$ and $v_j$.
• Forward and backward latency values for each channel.
• The maximum acceptable fanout count (outgoing edges from each node) for node $v_i$, denoted $FO_i$.

Our goal is to concurrently build fanout trees by inserting fanout buffers into the graph such that the fanout count of node $v_i$ does not exceed $FO_i$, and to insert slack-matching buffers into the graph to achieve the performance target with a minimum number of inserted fanout and slack-matching buffers.
The conditional version of the integrated slack-matching and fanout optimization adds the following items to the givens of the problem (footnote 1):
• A set of the circuit's operational modes $\mathcal{M} = \{m_1, \ldots, m_M\}$, where $m_k \subseteq E_c$ is the subset of channels activated in the $k$-th mode.
• A Markov chain matrix containing pairwise mode transition probabilities, $P = [p_{mm'}]$.
• The maximum acceptable fanout count for node $v_i$ in mode $m$, denoted $FO_i^{(m)}$.

The problem has to be solved under a targeted average-case performance for any arbitrary sequence of mode transitions compatible with the given transition probabilities, instead of an unconditional performance bound (footnote 2).

Footnote 1: The algorithms presented in this chapter are believed to be applicable to the conditional fanout optimization and slack matching problem in a straightforward manner, but the current implementation only handles unconditional circuits.

Footnote 2: Note that in Proteus, since conditionality is implemented using send and receive cells, the fanout branches always belong to the same modes, and therefore the mode of each gate can be determined statically as long as sends and receives are not moved.

Although there exists no prior work which considers solving the two problems jointly, the relation between slack matching and fanout optimization is discussed in [Dim09] for buffer sharing. The method suggests the use of one slack variable at the root of each fanout tree to represent the maximum value of the slack variables on the fanout channels. By using only the maximum slack-matching variable in the cost function of the slack-matching MILP, Dimou demonstrated that the slack-matching buffers can be shared among different branches of the tree and that the cost of buffer sharing is accurately modeled.

5.2 MILP for Integrated Fanout Optimization and Slack Matching

In this section we present the general form of the proposed MILP to solve the FOSM problem concurrently.
We refer to a high-fanout node along with its buffer tree as a fanout cone, and to the fanout channels of the node as branches. For the $i$-th fanout cone, with a fanout of $F_i$, a feasible fanout tree is a tree which distributes the value of the root channel to all the sinks such that all the fanout buffers and the source gate meet their fanout constraints. There are many such fanout trees with different skews, and such implementations can be partially specified by assigning a level to each branch, using an $F_i \times F_i$ matrix $X^i = [x^i_{jl}]$:
\[
x^i_{jl} =
\begin{cases}
1 & \text{branch } j \text{ is assigned to level } l \\
0 & \text{otherwise}
\end{cases}
\]
At the minimum, each branch has to be assigned to one and only one level, meaning that $\sum_l x_{jl} = 1$. This property, however, does not guarantee that the tree specified by an arbitrary $X$ is feasible. As an example, consider a single-rooted fanout tree (footnote 3) with $F_i = 4$ and fanout buffers with a maximum of 2 outgoing branches. $X_1$ is not implementable because branches 1 and 2 are assigned to level one, and since each buffer has at most two branches and there is only a single root, the tree cannot grow to branches 3 and 4. On the other hand, $X_2$ is implementable, with all branches assigned to level 2. $X_3$ is also implementable, forming a skewed fanout tree. Both $X_2$ and $X_3$ can be implemented with 3 buffers, as shown in Figure 5.1, but as these implementations can result in different static arrival times for slack matching, one of them may be superior.

Footnote 3: Some fanout trees may have multiple roots, as the root node may support multiple fanouts by itself.

Figure 5.1: Implementation of the fanout trees of $X_2$ and $X_3$.
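The implementability argument for the three example assignments can be checked mechanically with a slot count per level. The sketch below works from the deepest level back to the root and computes how many buffers each level must host so that the next level has enough outputs; the function names are illustrative, and the root driver is counted as a buffer with the same maximum fanout, which is an assumption made for this example.

```python
from math import ceil


def levels_from_matrix(X):
    """Count branches per level from a 0/1 assignment matrix X (rows = branches)."""
    n_levels = len(X[0])
    return [sum(row[l] for row in X) for l in range(n_levels)]


def min_buffers_per_level(branches_per_level, max_fanout):
    """Fewest buffers needed at each level, computed from the leaves to the root.

    Level l - 1 must offer enough outputs for the branches assigned to level l
    plus the buffers sitting at level l. Index 0 of the result is the number
    of roots required; indices 1.. are the buffers at levels 1, 2, ...
    """
    n = len(branches_per_level)
    buffers = [0] * (n + 1)  # no buffers are needed at the last level
    for level in range(n, 0, -1):
        demand = branches_per_level[level - 1] + buffers[level]
        buffers[level - 1] = ceil(demand / max_fanout)
    return buffers


def is_implementable(X, max_fanout, roots=1):
    """True if the level assignment X fits under `roots` root drivers."""
    return min_buffers_per_level(levels_from_matrix(X), max_fanout)[0] <= roots


X1 = [[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]
X2 = [[0, 1, 0, 0]] * 4
X3 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0]]

for name, X in (("X1", X1), ("X2", X2), ("X3", X3)):
    print(name, is_implementable(X, max_fanout=2),
          min_buffers_per_level(levels_from_matrix(X), 2))
```

Under these assumptions $X_1$ would need two roots and is rejected, while $X_2$ and $X_3$ each need three buffers including the root, in line with the discussion above.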
\[
X_1 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\qquad
X_2 = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}
\qquad
X_3 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}
\]

To eliminate unimplementable matrices, we need to keep track of used and available branches in each level as follows. Let us define an integer $b_l$ as the number of buffers connected to the tree at level $l$; $b_0$ always equals the number of roots of the fanout tree. The number of branches available at level $l$ can be calculated as $F\, b_{l-1}$, where $F$ is a constant equal to the maximum number of outgoing branches of each fanout buffer. For instance, in the $X_2$ example, $b_0 = 1$, $b_1 = 2$, and $b_2 = 0$. For a tree to be implementable we need to guarantee that there are enough buffer outputs available at each level of the tree to support all the branches assigned to that level as well as the buffers placed there to feed the next level. This property can be stated as follows:
\[
F\, b^i_{l-1} \ge \sum_j x^i_{jl} + b^i_l
\]
Given the $b_l$ and $x_{jl}$ values one can optimally build the tree, as the number of buffers and the level of each branch are determined. The total number of fanout buffers can be calculated as $\sum_l b_l$. Based on the above formulation an MILP can be used to find optimal implementable trees with an appropriate cost function, although solving such an MILP for optimal trees can be quite impractical, as the set of feasible trees can grow exponentially as the fanout count increases.

The MILP presented in Figure 5.2 jointly solves fanout optimization and slack matching. Note that the MILP is developed based on a simplifying conservative assumption that fanout buffers do not have any free slack [BLDK06], and therefore the impact of a fanout buffer is modeled only through the forward latency constraint. This assumption is not always true, as in some rare cases the fanout buffers can provide some free slack and therefore can also help slack matching of the circuit.
But our assumption is conservative, meaning that ignoring the free slack of fanout buffers leads to a correct slack matching, although at a slightly higher number of slack-matching buffers. Modeling the true free slack of fanout buffers seems to be impossible in the current formulation and is left as future work.

In the above formulation, $l_f$ is the latency of each fanout buffer and superscript $i$ represents the $i$-th fanout cone rooted at node $v_i$. Note that the forward (i) and backward (ii) constraints on the static arrival times can be traced back to the slack-matching formulation shown in Figure 4.3. The only change here is that the delay of the fanout buffers is accounted for in the forward paths by the term $l_f \sum_l l\, x^i_{jl}$. Also note that the backward delay constraints are kept unchanged because we assumed fanout buffers provide zero free slack. It is also worth mentioning that the true free slack of fanout buffers can only be calculated once the fanout tree shape is fully determined. Therefore, as will be described in Section 5.3, the slack-matching problem can be reformulated to account for the true slack of the fanout buffers once the fanout trees are implemented.

\[
\begin{aligned}
\text{Minimize}\quad & \sum_c \mathrm{cost}_c\, s_c + \sum_i \sum_l \mathrm{cost}_b\, b^i_l \\
\text{subject to}\quad & \forall\, c = \langle v_i, v_j \rangle \in E_c,\ \forall\, l \le F_i: \\
& a_j = a_i + l_c - m(c)\,\tau + f_c + l_s s_c + l_f \sum_l l\, x^i_{jl} && \text{(i)} \\
& 0 \le f_c \le \tau - \tau_c + s_c (\tau - \tau_s) && \text{(ii)} \\
& F\, b^i_{l-1} \ge \sum_j x^i_{jl} + b^i_l && \text{(iii)} \\
& \sum_l x^i_{jl} = 1 && \text{(iv)} \\
& a_i \ge 0 \ \ \forall\, v_i \in V_c, \qquad s_c,\, b^i_l \in \mathbb{Z}^+, \qquad x^i_{jl} \in \{0, 1\}
\end{aligned}
\]

Figure 5.2: MILP to solve fanout optimization and slack matching (FOSM).

5.3 Relaxation Algorithm

The MILP proposed in Section 5.2 turned out to be intractable for even moderately sized circuits. Therefore, to be able to solve the problem in reasonable time, we developed a heuristic algorithm which solves related sub-problems iteratively to obtain high-quality suboptimal solutions.
To do so we use the same formulation presented in Figure 5.2, but we relax the binary and integer variables to real values, meaning that we let:
\[
s_c,\, b^i_l \ge 0, \qquad x^i_{jl} \in [0, 1]
\]
The main reason for the complexity of the presented MILP is that the MILP solver has to branch and bound over a relatively large search space resulting from the constraints associated with potentially large fanout trees. The relaxed LP version of the problem can be solved relatively fast, which enables us to define a few LP sub-problems and reach a high-quality suboptimal result in feasible time.

The main idea is that instead of solving the complete MILP problem to concurrently derive the shape of all fanout trees, a subproblem is defined which simply decides whether to assign a specific branch to the current level of the fanout tree or to postpone its assignment to later iterations of the algorithm, that is, to the next levels of the tree. Our iterative relaxation algorithm is presented below and is composed of three major steps. In each iteration, starting from level $L = 1$ of the fanout trees, the algorithm performs the following three steps and proceeds until all branches are assigned to a level.
• Step 1: Concurrently assign all the critical branches to level $L$ of the fanout trees.
• Step 2: Round the fractional number of buffers ($b^i_L$) connected to level $L$ up to the next integer and fix them.
• Step 3: Greedily assign non-critical branches to the current level $L$ of the fanout tree if there is room for those branches and only if doing so reduces the overall cost of the FOSM problem.

Step 1 of the algorithm considers all the branches of all fanout trees concurrently and decides which branches can be safely postponed to levels further from the root of the trees. We define a branch as critical if it has to be assigned to the current level of the tree; otherwise a timing constraint of the problem will be violated.
Heuristic Rule I: Critical branches are assigned to the current level of the fanout tree because they cannot be postponed.

Note that the criticality of each branch depends on the assignment of other branches which are being processed concurrently in the same iteration and on the assignment of branches which were processed in previous iterations of the algorithm.

Figure 5.3: Interdependency of branch assignment to fanout tree levels.

To demonstrate this point, let us assume that two fanout trees $f_1$ and $f_2$, as shown in Figure 5.3, are located in a loop. Let us assume that the slack-matching target cycle time for the circuit is 18 transitions, and also that before fanout optimization there are 6 pipeline stages in this loop and therefore the latency along the loop is 12 transitions. At iteration 1 of the algorithm, neither of the branches $b_{11}$ and $b_{21}$ shown in the figure is critical, as the assignment of both branches can be postponed to level 2, which results in a latency of 16 around the loop. Note that in the next iteration, the available slack for fanout buffers is only 2, as postponing either of these branches leads to a latency of 18. This means that if the assignment of $b_{11}$ is postponed to the next level of the fanout tree, $b_{21}$ becomes critical and has to be assigned to level 2. Also note that $b_{12}$ becomes critical in the next iteration and has to be assigned to level 3.

Detection of critical branches at iteration $L$ of the relaxation algorithm is done by solving the relaxed linear program with a simple modification of the cost function. The following term is added to the cost function of the LP:
\[
\sum_{i,j} Z_{ij}\, x^i_{jL}
\]
where $Z_{ij}$ is a large positive coefficient, unique to each branch of each fanout cone. Since the coefficient is large, this modification guarantees that the $x^i_{jL}$ are pushed strongly toward zero. Once the LP is solved, any branch with non-zero $x^i_{jL}$ is conservatively considered to be critical and is forced to be assigned to level $L$.
Note that if the branch is already assigned to a previous level of the tree, $x^i_{jl}$ is 1 for some $l < L$ and therefore $x^i_{jL}$ is forced to be zero. If the branch was not assigned to any level in previous iterations of the algorithm, $x^i_{jl} = 0$ for all $l < L$, and the fact that $x^i_{jL}$ is non-zero suggests that the branch cannot be pushed to a later level without a timing violation. On the other hand, if $x^i_{jL}$ equals zero for an unassigned branch, it is guaranteed that the branch can be pushed back to a later level of the tree, and therefore it is non-critical.

Once critical branches are assigned to level $L$, at Step 2 we use the ceiling function to fix the value of $b^i_L$ to the next integer. For example, if fanout cone $i$ needs 2.3 buffers to ensure a feasible implementation, $b^i_L$ will be forced to 3. This heuristic guarantees that once the relaxation is complete all $b^i_l$ values are integers. The idea behind this heuristic is that if, due to this ceiling operation, some unassigned branch which tends to stay at the current level is forced to be scheduled at the next level of the tree, there is always room available for that branch in the next level. For instance, if each fanout buffer has a fanout of 4, in the previous example we may push a branch with $x^i_{jL} = 0.7$ from level $L$ to $L + 1$ by rounding 2.3 up to 3, but we know that $0.7 \times 4$ branches of availability are added to the next level of the tree. Recall that the branches postponed at Step 2 are not critical; otherwise they would have been forced to level $L$ at Step 1.

Heuristic Rule II: At iteration $L$ of the algorithm, once critical branches are assigned, the fractional $b^i_L$ values are rounded up to the next integer.

At Step 3, we solve the original relaxed LP once again after the modification of the values of $b^i_L$ at Step 2. We then assign branches which have a tendency to be scheduled at level $L$ to this level by forcing their $x^i_{jL} = 1$, provided there is a branch available at level $L$.
To determine the availability of branches we examine the values of $b^i_{L-1}$, $b^i_L$, and $x^i_{jL}$ to ensure that the availability constraints of the LP are not violated after the assignment of the branch to this level. A branch has a tendency to be scheduled at level $L$ if $x^i_{jL}$ has the maximum value among all $x^i_{jl}$.

Heuristic Rule III: At Step 3, branches with a tendency to be scheduled at level $L$ are greedily scheduled to level $L$.

5.4 Buffer Sharing and Tree Implementation

As mentioned before, in the formulation presented for integrated fanout optimization and slack matching (Figure 5.2), a slack buffer variable is added for each channel. These slack variables are independent of each other, even for channels with the same source node, which together compose a fanout cone. It is however expected that by sharing slack buffers between channels of the same fanout cone the total number of slack-matching buffers can be considerably reduced. For example, if a fanout cone with 3 channels turns out to need 2, 4, and 5 slack-matching buffers on its channels, it would be possible to share 2 slack-matching buffers at the root of the fanout tree and add 0, 2, and 3 slack buffers to the channels. As a result, the total slack-matching cost can be reduced from 11 to 7 buffers.

Although sharing buffers at the source is relatively straightforward, further extension of this idea highly depends on the shape of the fanout tree. Consider Figure 5.4, which shows different implementations of the fanout tree of the example above with different multilevel sharing of the slack-matching buffers. Figure 5.4(a) depicts the original implementation of the tree without sharing. Figure 5.4(b) shows single-level sharing of the slack-matching buffers at the root of the tree. The two different implementations of the tree shown in Figure 5.4(c) and (d), while having the same functionality at the circuit level, result in different slack-matching costs.
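The single-level sharing arithmetic of Figure 5.4(a) versus (b) can be reproduced with a few lines. This is only a sketch of root sharing; the function names are illustrative, and multilevel sharing as in Figure 5.4(c) and (d) depends on the tree shape and is not modeled here.

```python
def share_at_root(slacks):
    """Split per-channel slack counts into a shared root part and remainders.

    The shared part is the largest count common to every channel of the cone,
    i.e. the minimum of the per-channel counts.
    """
    shared = min(slacks)
    remainders = [s - shared for s in slacks]
    return shared, remainders


def cost_without_sharing(slacks):
    """Total slack buffers when every channel carries its own buffers."""
    return sum(slacks)


def cost_with_root_sharing(slacks):
    """Total slack buffers when the common part is placed once at the root."""
    shared, remainders = share_at_root(slacks)
    return shared + sum(remainders)


slacks = [2, 4, 5]
print(share_at_root(slacks))
print(cost_without_sharing(slacks), cost_with_root_sharing(slacks))
```

For the cone in the text this gives a shared count of 2 with remainders 0, 2, and 3, and the cost drops from 11 to 7.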
In Figure 5.4(c), since the first two branches, labeled with 0 and 2 slack-matching buffers, share the same source fanout buffer, further sharing is impossible, while in Figure 5.4(d) sharing can be done more effectively to obtain an even lower slack-matching cost of 5. Compared to Figure 5.4(a), further sharing of slack-matching buffers requires a different implementation of the fanout tree with one extra fanout buffer, and therefore the overall reduction in the number of buffers is 5 instead of 6.

Figure 5.4: Different implementations of a fanout tree and their impact on buffer sharing, panels (a) to (d).

The above discussion highlights the dependency between fanout optimization and slack matching. In this section we discuss ideas on shaping the fanout tree optimally to reduce the total cost of fanout tree and slack-matching buffers. Note that the solution of the FOSM linear program only partially determines the shape of the tree, by determining the number of available buffers and the level assignment of the branches. The number of required slack-matching buffers for each branch is also known from solving the LP. We will present a greedy algorithm to shape the fanout trees to minimize total buffer cost based on the number of slack-matching buffers and the level of each branch. In the next section, we also present several heuristic modifications to the LP to improve the possibility of sharing slack-matching buffers through reshaping the fanout tree.

5.4.1 Modifications to the LP to Improve Fanout Sharing

The FOSM linear program presented in Figure 5.2 can be slightly modified to improve fanout sharing. These heuristic modifications are designed to push the final solution of the LP toward fanout tree shapes which enable the maximum possible buffer sharing.

Buffer sharing at the root of the tree can be formulated in the LP at the cost of a slightly more complicated LP. A similar technique is introduced in [Dim09], which is extended in our formulation as follows.
For each set of channels in the same fanout cone we add a separate slack-matching variable to its channels which models the number of slack-matching buffers shared at the root of the tree:
\[
\begin{cases}
a_j = a_i + l(c) - M(c)\,\tau + f_{ij} + l_s (s_i + s_{ij}) + l_f \sum_l l\, x^i_{jl} \\
0 \le f_c \le \tau - \tau_c + (s_{ij} + s_i)(\tau - \tau_s)
\end{cases}
\]
Note that all the channels in the $i$-th fanout cone share the same slack-matching variable, $s_i$, in the above formulation, and therefore the value of $s_{ij}$ can be traded off against $s_i$. Since the cost of $s_i$ is the same as that of $s_{ij}$ in the formulation, this can reduce the cost. For example, if $s_{12} = 3$ and $s_{13} = 1$, by setting $s_i = 1$ and reducing $s_{12}$ and $s_{13}$ by one, the cost function can be reduced by 1 unit. Note that the above change in the formulation improves the precision of the cost function and enables the LP to find solutions which favor sharing buffers at the root of the tree.

Extension of this idea to sharing at different levels of the tree is not straightforward, as adding dedicated slack variables at the outputs of the fanout buffers in the mid-levels of the tree requires knowledge of the exact shape of the tree, which is not available in the LP. However, a heuristic can be used to mimic the same concept.

Heuristic Rule IV: Slack-matching buffers are placed closer to the root of the tree because this can lead to better buffer sharing.

To enable the above heuristic in the LP, a slack-matching variable is added per level of the tree for each fanout cone. That is, for a fanout cone with $l$ branches, the fanout tree can have $l$ levels. We add $l$ slack variables, one for each level of the tree, namely $s^1_i, \ldots, s^l_i$, where $s^1_i$ models the number of slack-matching buffers at the root of the tree. To encourage the LP to use slack variables closer to the root, we set the cost of the variables to increase with $l$, that is, with their distance from the root.
Note that intuitively, $s^1_i$ should have $\frac{1}{4}$ of the cost of $s^3_i$, as each buffer at level 1 can potentially be shared among 4 branches at level 3. This modification leads to a more complicated LP, and to keep the implementation feasible this optimization is only enabled for fanout cones with fewer than 10 channels in the current implementation.

\[
\begin{cases}
a_j = a_i + l(c) - \tau M(c) + f_{ij} + l_s \sum_{l} \left(l\, s^{l}_{i}\right) + l_f \sum_{l} l\, x^{i}_{jl} \\
0 \le f_c \le \tau - \tau_c + \sum_{l} \left(s^{l}_{i}\right)(\tau - \tau_s)
\end{cases}
\]

During level-by-level relaxation of the FOSM, once a branch is assigned to a particular level of the tree, say $L$, the $s^l_i$ variables for any $l > L$ are dropped from the constraint related to that branch, to indicate that sharing of the slack matching buffers associated with the branch is not possible in higher levels of the tree. Our results showed that this heuristic is very effective, mainly because it increases the precision of the cost function.

5.4.2 Implementing the Fanout Trees - The Greedy Algorithm

As mentioned before, the solution of the LP partially determines the shape of the tree by deriving the level of each branch ($x^i_{jl}$) and the number of available fanout buffers at each level ($b^l_i$). For a given solution of the LP, however, different implementations of the fanout trees are possible. To ensure that the solution does not lead to extra slack matching buffers, it is essential to keep the arrival times of the nodes unchanged.

The actual implementation of the fanout trees can be obtained through hierarchical clustering of the branches of each tree. The presented greedy algorithm forms the fanout trees simply by finding a clustering of the branches from the last level of the tree toward the root. As an example, for a fanout tree with 8 branches, a hierarchical clustering such as $\{1, 3, \{\{4, \{5, 6, 7\}, 2\}, 8\}$ completely defines the shape of the tree, as shown in Figure 5.5.

Figure 5.5: Example: Implementation of a tree associated with hierarchical clustering $\{1, 3, \{\{4, \{5, 6, 7\}, 2\}, 8\}$.
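To make the link between a hierarchical clustering and a tree shape concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the Proteus implementation: the nested-list encoding, the function names `branch_levels` and `count_fanout_buffers`, and the small example clustering are all hypothetical. It also assumes that the outermost cluster is driven directly by the source node, that each nested cluster is driven by exactly one fanout buffer, and that levels are numbered from 1 at the root.

```python
def branch_levels(cluster, level=1, out=None):
    """Return {branch_id: level} for a nested-list clustering.

    Assumption: items of the outermost list sit at level 1 (the root),
    and each nesting step moves one level further from the root.
    """
    if out is None:
        out = {}
    for item in cluster:
        if isinstance(item, list):
            # A nested cluster hangs below one fanout buffer.
            branch_levels(item, level + 1, out)
        else:
            out[item] = level
    return out


def count_fanout_buffers(cluster):
    """Count nested clusters; each is assumed to need one fanout buffer."""
    return sum(
        1 + count_fanout_buffers(item)
        for item in cluster
        if isinstance(item, list)
    )


# Hypothetical 6-branch clustering (not the one in Figure 5.5).
example = [1, [2, 3], [4, [5, 6]]]
print(branch_levels(example))
print(count_fanout_buffers(example))
```

Under these assumptions the clustering alone fixes both the level of every branch and the number of fanout buffers, which is why the greedy algorithm can search over clusterings instead of over explicit trees.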
Therefore the goal of our greedy algorithm is to find a clustering which maximizes buffer sharing. Note that in the process of building fanout trees, new channels have to be added to the initial set of channels to connect fanout buffers to each other. We denote the set of channels which includes the original and the added channels as $\hat{E}$. A cluster, $C_k \subseteq \hat{E}$, is a set of channels which are connected to the output of the $k$-th fanout buffer. Obviously the size of each cluster cannot exceed the maximum number of fanout channels a fanout buffer can support, $|C_k| \le FO_{Max}$.

Heuristic Rule V: Only branches which are at the same level of the tree are clustered together, to guarantee feasibility and to keep the arrival time variable associated with the sink node unchanged, avoiding violation of the target cycle time.

Our algorithm sorts the channels in each level of the fanout cone in descending order of the number of slack matching buffers inserted on the channels, and then adds the channels one by one to the current cluster until the value of the cluster drops so far that the creation of a new cluster is justified.

Heuristic Rule VI - Value of the cluster: The value of each cluster is defined as the minimum value of the slack matching variables associated with the channels included in the cluster.

\[
V(C_k) = \min_{c_{ij} \in C_k} s_{ij}
\]

As an example, if cluster $C_k$ includes two channels with slack matching values $\{4.1, 3.3\}$, the value of the cluster is $V(C_k) = 3.3$. Adding a new branch with a slack variable equal to $1.2$ to this cluster reduces the value of the cluster by $(3.3 - 1.2) = 2.1$. If the cost of creating a new cluster is $C_{cost} = 1$, then, given that there are enough available buffers in this level, it makes sense to be greedy and terminate the current cluster with just two branches and add the new branch to a new cluster, to minimize the reduction in the total value. However, if we assume that $FO_{Max} = 4$, terminating this cluster with only 2 branches may lead to buffer availability problems.
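The cluster value of Heuristic Rule VI and the greedy split test can be checked numerically with a short Python sketch. The function names and the list-of-floats encoding are hypothetical, and the sketch covers only the value-drop comparison against $C_{cost}$; it deliberately leaves out any buffer-availability check.

```python
def cluster_value(slacks):
    """V(C_k): the minimum slack matching value among the cluster's channels."""
    return min(slacks)


def value_drop(cluster_slacks, new_slack):
    """How much V(C_k) falls if a channel with `new_slack` joins the cluster."""
    return cluster_value(cluster_slacks) - cluster_value(cluster_slacks + [new_slack])


def should_start_new_cluster(cluster_slacks, new_slack, c_cost):
    """Greedy test only: split when the value drop exceeds the cluster cost."""
    return value_drop(cluster_slacks, new_slack) > c_cost


cluster = [4.1, 3.3]
print(cluster_value(cluster))                       # 3.3
print(round(value_drop(cluster, 1.2), 6))           # 2.1
print(should_start_new_cluster(cluster, 1.2, 1.0))  # True
```

With the numbers from the example, the drop of 2.1 exceeds $C_{cost} = 1$, so the purely greedy choice is to open a new cluster for the third branch.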
Therefore the algorithm has to look ahead and ensure that terminating a cluster is safe. For example, if 5 branches would be left unassigned at this level once the cluster is terminated and the number of available buffers is 2, then the cluster cannot be terminated. This is because we have to use one buffer to implement the current cluster, and the next buffer can only implement 4 branches. Therefore at this point we have to accept the cost of adding the current branch to the current cluster despite the drop in the cluster value to 1.2.

The complete algorithm is defined as follows. For a fanout cone $i$ with at most $L_i$ levels, set the current level $curLevel = L_i$ and pick the next empty cluster, $C_k$. Repeat until $curLevel = 0$:

1. Set $UsedBuffers = 0$.
2. Sort all the branches assigned to $curLevel$ in descending order of the number of slack matching buffers added to each channel, $s_{ij}$, into the list of Available Branches ($AB$).
3. Remove the first branch $c_{ij} = \langle i, j \rangle$, which has the maximum number of slack matching buffers, from $AB$: $c_{ij} = AB.pop()$. Add $c_{ij}$ to the current cluster: $C_k.Add(c_{ij})$.
4. Repeat until $AB$ is empty: select the next channel from $AB$, $c_{ij} = AB.pop()$. Terminate $C_k$ as follows if $|C_k| = FO_{Max}$, or if $(V(C_k) - V(C_k + c_{ij})) > C_{cost}$ and $|AB| + 1 < FO_{Max}\,(b_i^{curLevel} - UsedBuffers)$:
   - Instantiate a fanout buffer with input channel $c_{buf}$ and output channels $C_k$.
   - $\hat{E} = \hat{E} + c_{buf}$.
   - Add $c_{buf}$ to the list of branches assigned to level $curLevel - 1$.
   - $UsedBuffers = UsedBuffers + 1$.
   - Pick a new empty cluster to continue, $k = k + 1$.
   - Add $c_{ij}$ to the new $C_k$.
   Otherwise, add $c_{ij}$ to $C_k$.
5. $curLevel = curLevel - 1$.

5.5 Experimental Results

To demonstrate the effectiveness of the presented method for the integration of fanout optimization and slack matching, the LP formulation, the relaxation, and the greedy algorithms presented in the previous sections are implemented in Proteus, and results are obtained for the ISCAS89 benchmark circuits.
As before, we considered each gate as a PCHB pipeline stage, and wires between gates are implemented using asynchronous four-phase dual-rail channels using the legacy RTL flow. The results obtained from the FOSM method compared to the current fanout optimization method are presented in Table 5.1 and Figure 5.6. The new integrated fanout optimization and slack matching method saves on average 22% of the total needed fanout and slack matching buffers. This achievement is mainly due to the creation of skewed fanout trees, which are suitable for slack matching, and to the greedy buffer sharing. While the original fanout optimization algorithm uses the minimal number of fanout buffers, our results indicate that ignoring slack matching constraints in fanout optimization leads to more slack matching buffers in the later steps of the synthesis flow.

Table 5.1: Integrated Fanout & Slack Matching Results

| Benchmark | Original SM Count | Original FO Count | Original Tot Buf | Original Time | FOSM SM Count | FOSM FO Count | FOSM Tot Buf | FOSM Time | Buf. Count Red (%) | Orig. FO Count (%) | FOSM FO Count (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| s27 | 7 | 2 | 9 | 1 | 7 | 2 | 9 | 1 | 0.00 | 22.22 | 22.22 |
| s298 | 60 | 10 | 71 | 1 | 37 | 24 | 61 | 1 | 14.08 | 14.08 | 39.34 |
| s344 | 63 | 9 | 73 | 1 | 44 | 26 | 70 | 1 | 4.11 | 12.33 | 37.14 |
| s349 | 63 | 9 | 73 | 1 | 45 | 26 | 71 | 1 | 2.74 | 12.33 | 36.62 |
| s382 | 79 | 10 | 92 | 1 | 39 | 36 | 75 | 2 | 18.48 | 10.87 | 48.00 |
| s386 | 71 | 11 | 88 | 1 | 26 | 38 | 64 | 2 | 27.27 | 12.50 | 59.38 |
| s400 | 84 | 11 | 98 | 1 | 41 | 38 | 79 | 1 | 19.39 | 11.22 | 48.10 |
| s420 | 78 | 12 | 97 | 1 | 44 | 35 | 85 | 1 | 12.37 | 12.37 | 41.18 |
| s444 | 78 | 12 | 93 | 1 | 39 | 38 | 77 | 2 | 17.20 | 12.90 | 49.35 |
| s641 | 100 | 19 | 133 | 1 | 54 | 48 | 102 | 2 | 23.31 | 14.29 | 47.06 |
| s713 | 92 | 19 | 127 | 1 | 47 | 48 | 95 | 1 | 25.20 | 14.96 | 50.53 |
| s820 | 167 | 36 | 228 | 2 | 42 | 81 | 125 | 7 | 45.18 | 15.79 | 64.80 |
| s838 | 194 | 30 | 234 | 1 | 89 | 104 | 199 | 6 | 14.96 | 12.82 | 52.26 |
| s953 | 224 | 41 | 287 | 3 | 83 | 94 | 177 | 13 | 38.33 | 14.29 | 53.11 |
| s1196 | 419 | 38 | 508 | 5 | 163 | 171 | 334 | 30 | 34.25 | 7.48 | 51.20 |
| s1238 | 455 | 36 | 542 | 4 | 180 | 186 | 366 | 33 | 32.47 | 6.64 | 50.82 |
| s1423 | 411 | 48 | 478 | 5 | 151 | 115 | 266 | 18 | 44.35 | 10.04 | 43.23 |
| s5378 | 695 | 94 | 825 | 14 | 391 | 264 | 657 | 104 | 20.36 | 11.39 | 40.18 |
| s9234 | 656 | 87 | 763 | 16 | 288 | 244 | 534 | 64 | 30.01 | 11.40 | 45.69 |
| s13207 | 1421 | 218 | 1681 | 39 | 941 | 476 | 1427 | 382 | 15.11 | 12.97 | 33.36 |
| Average | | | | | | | | | 22.32 | | |
In addition to the percentage reduction in total buffer count, Figure 5.6 also depicts the ratio of the number of fanout buffers to the total buffer count. As can be seen, the optimally skewed fanout trees tend to use almost 50% of the total buffer count in the formation of fanout trees, compared to a fanout buffer ratio of about 10% in the original fanout optimization algorithm, and yet achieve a 22% lower overall buffer count on average.

Figure 5.6: Integrated Fanout & Slack Matching Results (per-benchmark percentages for the ISCAS89 circuits s27 through s13207).

5.6 Conclusions

The problem of integrated fanout and slack matching optimization of asynchronous circuits is formulated. The overall result is a saving in the total fanout and slack matching buffers of 20% on average, which translates to about 8% average reduction in the total area of the circuit, because up to 40% of the area of these circuits is composed of fanout and slack matching buffers. Our results clearly indicate the dependency of the FO and SM problems. The resulting area improvement comes at the cost of up to a 10 times increase in execution time for the larger benchmark circuits, although all the benchmarks could be processed in a matter of a few minutes.

In addition, since our formulation processes all the fanout trees concurrently while making decisions, we believe our approach also improves the feasibility of fanout optimization by finding solutions which the original fanout optimization fails to achieve.
In comparison, the original fanout optimization builds fanout trees sequentially in a random order, and because of the order dependency in forming fanout trees it may fail if the timing constraints are tight and the circuit includes competing large fanout cones.

CHAPTER VI

Conclusions and Future Work

Asynchronous circuits have been studied for a long time and have recently gained more interest as global synchronization becomes more challenging due to the growing variability and increasing complexity of Systems-on-Chip. Asynchronous circuits can be considered a viable solution in a broad set of applications [YVR13, BZL+12, JE12]. The design of efficient asynchronous circuits, however, requires intelligent use of conditional communication.

This thesis provides a theoretical foundation to analyze the performance of conditional asynchronous circuits with mode-based behavior. Our proposed scheme models a conditional circuit as a system which switches between different modes of operation. Each mode of operation is composed of the set of channels which are activated together during that operation, and switching between these modes is modeled using a Markov chain. The underlying performance model is a wide class of Petri nets, which enables the application of the analysis framework to a broad class of pipeline templates developed to implement asynchronous circuits, including QDI PCHB [Lin98], self-timed GasP [SF01], and micro-pipeline mousetraps [SN07].

The main contribution of the thesis is the development of several performance bounds which precisely bound the average cycle time of a conditional circuit based on the given Markov chain model and the set of modes of operation of the system. Our performance bound makes no assumption on the exact sequence of mode switching and therefore can be utilized in the optimization of circuits for which the sequence of modes is not known at design time.
The calculation of the performance bounds is carried out by solving a linear program, which enables us to easily integrate these bounds into optimization problems which can be formulated as linear programs as well. Based on the formulation of the proposed performance bounds, we formulated and solved several optimization problems for conditional asynchronous circuits, including conditional slack matching and fanout optimization.

Our experiments demonstrated that incorporating information about the dynamic nature of conditional asynchronous circuits in the optimization of these circuits can improve their efficiency considerably. In the case of conditional slack matching, we achieved up to 74% reduction in the number of pipeline buffers as a result of using the more accurate performance bounds.

As for future work, we suggest revisiting several optimization problems with average-case performance constraints through the application of the proposed performance bounds. Examples of such problems include performance driven clustering of asynchronous circuits, automatic mode assignment and mode clustering, and power optimization of conditional asynchronous circuits with performance constraints, which are discussed in more detail in the following sections.

6.1 Performance Driven Clustering of Asynchronous Circuits

Clustering is a major step in implementing asynchronous circuits. It determines the pipeline granularity of the circuit and therefore has a major role in managing asynchronous circuit overheads.

In general, wide pipelines with high logic depth are desirable because they share asynchronous controllers over a larger portion of the logic, but they may lead to poor circuit performance. In particular, a controller is needed to implement each cluster, and each data line crossing a cluster boundary requires a state holding element. Therefore an optimal tradeoff between the total number of wires crossing cluster boundaries and the total number of clusters can minimize the total overhead of the circuit.
It is also notable that, since clustering enforces the structure of the circuit, e.g. reconvergent fanouts (fork-joins) and the number of stages in a ring, it has a major impact on the result of slack matching. Also note that clustering determines the number of outgoing edges from each cluster, which in turn can impact fanout optimization.

The goal of performance driven clustering is to build a micro-architecture for an asynchronous circuit by clustering logic gates into pipeline stages under a performance constraint, such that the overall cost of the circuit implementation, including the required slack/fanout buffers and the total latch/controller cost, is minimized.

We suggest developing the solution using a strategy similar to that of conditional slack matching (Chapter IV) and integrated fanout optimization and slack matching (Chapter V): first formulate a complete MILP for the problem, which is expected to be intractable, and then develop a proper relaxation algorithm which can find high-quality sub-optimal solutions.

One way to formulate the MILP is to assign a binary variable, $b_{ij}$, to each channel $c_{ij}$, which denotes whether the channel is on a cutline after clustering.

\[
b_{ij} =
\begin{cases}
1 & \text{if channel } c_{ij} \text{ is between two different clusters,} \\
0 & \text{if channel } c_{ij} \text{ resides inside a cluster.}
\end{cases}
\]

In order to capture the impact of clustering on slack matching, we suggest using a static arrival time variable per cluster. This idea requires careful development of the arrival time constraints, as the shape and latency of the clusters are subject to change.

6.2 Automatic mode assignment and mode clustering

In the current implementation of conditional slack matching, mode assignment is done by the user.
The designer is responsible for decomposing the circuit channels into multiple modes of operation, through assignment of the input and output ports of the design components or of specific channels to different modes of operation, and for assigning appropriate target cycle time values to each mode. Note that the whole process is a top-down approach, starting from a user who has a clear understanding of the functionality of the circuit and specifies the mode assignment of the channels. Mode assignment, however, can impact the quality of the optimization. Although an experienced user may prefer to have full control over the detailed mode assignment and target cycle time selection, the procedure is tiresome and time-consuming.

In a bottom-up approach, one can instead consider the output channels of the SEND modules and the input channels of the RECV modules which are controlled by the same enable signal to be in the same mode. While some modes may end up being equivalent through the application of the above mode propagation algorithm, the number of modes may still be quite large. A large number of modes can make the process of optimization time-consuming and in some cases impractical. Therefore it is desirable to develop a scheme to conservatively cluster different modes of operation into a predetermined number of modes, in such a way that the performance bound calculated from the new set of modes is still conservative.

The problem of mode assignment to the channels of a circuit can be defined as follows. We are given the following:

- The netlist of an asynchronous circuit as a directed graph $G_c(V_c, E_c)$, where $v \in V_c$ is an asynchronous node and $c_{ij} = \langle v_i, v_j \rangle \in E_c$ is a channel between nodes $v_i$ and $v_j$.
- A set of the circuit's operational modes, $\{m_1, \ldots, m_M\}$, where $m_k \subseteq E_c$ is the subset of channels activated in the $k$-th mode.
- Forward and backward latency values for each channel.
- A Markov chain matrix containing the pairwise mode transition probabilities, $P = [p_{ij}]$.
- The maximum number of modes, $N$.
Our goal is to cluster the modes of operation into a new set of modes, $\{m_1, \ldots, m_N\}$, and to obtain a new matrix of pairwise mode switching probabilities, $P = [p_{ij}]$, such that the performance bound derived using the new set of modes is conservative and the error of the new performance bound is minimized.

In a simpler version of the problem, the mode clusters, $C = \{c_1, \ldots, c_N\}$, are given, where each $c_i$ is a subset of the original modes, and the goal is to find the modified Markov chain matrix, $P = [p_{ij}]$, which results in a conservative performance bound.

6.3 Reconditioning with Performance Constraints

The reconditioning problem is originally defined as relocating send and receive cells in the image netlist appropriately in order to minimize the overall power consumption. An MILP which formulates the reconditioning problem, together with a list of conditions for preserving correct functionality and avoiding deadlock, is presented in [Sai12].

An extension of the problem is to solve the reconditioning problem under performance constraints through application of the proposed performance bounds for conditional circuits. A challenge which must be addressed in order to enable the application of our performance bounds to reconditioning is how to formulate the impact of relocating send and receive modules on the circuit's operational modes. The location of the send and receive cells in fact determines the conditionality of the channels, and therefore moving these cells can make conditional channels unconditional, or move a channel from one mode of operation to another. The impact of such mutations of the netlist on the overall average performance of the circuit is not yet well understood. This study would not only enable us to solve power optimization under performance constraints, but would also empower us to solve the reconditioning problem for different cost functions.
Reconditioning and conditional slack matching is the problem of finding the optimal location of send and receive cells in the netlist such that the number of slack matching buffers is minimized.

Reconditioning for performance is the problem of finding the optimal location for send and receive modules so as to hit a specific performance target.

Appendix A

Theorems and Proofs

The first set of lemmas supports our claim that each segment can be mapped to a live safe marked graph.

Lemma A.1. Let $\hat{p}_1$ and $\hat{p}_2$ be two places in a segment of a live safe and reversible unique choice Petri net with all cycles initially marked. If $\hat{p}_1$ is not on the boundaries of the segment, then $l(\hat{p}_1) \neq l(\hat{p}_2)$.

Proof. (By contradiction) If $l(\hat{p}_1) = l(\hat{p}_2) = p$, there must exist a path between $\hat{p}_1$ and $\hat{p}_2$ in the segment. Otherwise $\hat{p}_1$ and $\hat{p}_2$ would be concurrent, which would imply the existence of a marking in the original Petri net with multiple tokens in $p$, contradicting the fact that the Petri net is safe. This path corresponds to a cycle, $c$, in the original Petri net containing $p$ which is not initially marked, contradicting our initial assumption.

Lemma A.2. For any two arbitrary transitions in any segment of a live safe and reversible unique choice Petri net with all cycles marked initially, $l(\hat{t}_1) \neq l(\hat{t}_2)$.

Proof. (By contradiction) All the transitions in any segment have all their input places in the segment. If $l(\hat{t}_1) = l(\hat{t}_2)$, then either there exist two input places with $l(\hat{p}_1) = l(\hat{p}_2)$, or there exists one input place which has both transitions $\hat{t}_1$ and $\hat{t}_2$ in its postset. The latter suggests that the segment exhibits choice, which is not possible by the definition of a segment. The former is not possible according to Lemma A.1.

Lemma A.3. Let $s_i = \langle \tilde{P}_i, \tilde{T}_i, \tilde{F}_i \rangle$ be a segment of an unfolded execution of a live safe and reversible unique choice Petri net. The graph to which $s_i$ is mapped is a live safe marked graph.

Proof.
The graph associated with $s_i$ is a marked graph because, by the definition of a marked graph, all unfolded places have at most one output and one input transition, and by Lemma A.1 and Lemma A.2 no two internal unfolded places/transitions in a segment map to the same original place/transition. It is live because $s_i$ is acyclic, so by marking all places on its input cutline, all the paths in $s_i$ are marked by exactly one token. Therefore each cycle $c$ of this graph is marked by at least one token, a sufficient condition to guarantee that it is live. Moreover, from each input boundary place $p$ there must exist a path to its next occurrence $p'$. Otherwise, $p$ and $p'$ are concurrent in the unfolded execution and a reachable marking exists in which $l(p)$ is marked with more than one token, contradicting the fact that the original Petri net is safe. Thus, every initially marked place in the graph associated with $s_i$ is part of a cycle that contains exactly one token, a sufficient condition to guarantee that it is safe.

The next lemmas prove key properties of super segments and elevation.

Lemma A.4. Let $s_i$ and $s_j$ be two segments of an unfolded execution of a live safe and reversible unique choice Petri net that has every cycle marked with at least one token. Let $s_{ij}$ be the super segment obtained by applying the cross operator to $s_i$ and $s_j$. The graph associated with $s_{ij}$ is a live safe marked graph.

Proof. By definition, the cross operator does not introduce any choice or merge; therefore the graph associated with $s_{ij}$ is a marked graph. The cross operator generates acyclic super segments; otherwise there must exist a cycle in the original Petri net that has no token, a contradiction to our initial assumption. Thus, using the same argument as in the proof of Lemma A.3, we know that the graph associated with $s_{ij}$ is live. Moreover, if either $s_i$ or $s_j$ has a path from a place $p$ to its next occurrence $p'$, $l(p') = l(p)$, a path would exist between these two places in $s_{ij}$. If neither $s_i$ nor $s_j$ has such a path, then $p = p'$ and this place will not exist in $s_{ij}$. Thus, every
This means that every initially marked place in (s ij ) is part of a cycle that contains exactly one token, a sufficient condition to guarantee that a marked graph is safe [Mur89]. Lemma A.5. Let U [i : j] be a sequence of segments obtained from U[i : j] by elevating s k to s k . For any path, , between two unfolded places or transition, x i and x j in U[i :j], there exists a path, , between them in U [i :j] such that : Proof. Let cross s k boundaries at places p s and p e : Case 1: Ifthereexistsapathbetweenp s andp e suchthatp s 6=p e ,thentheremust exist a path from p s = p s to p e = p e in s k because the cross operator only add arcs between unfolded transition or places. Moreover, essentially passes along the same path and p s = p s to p e = p e . The cross operator might insert additional transitions on this path to convert choices/merges constructs to fork/join, but by dropping these additional transitions k1 can be reproduced. Case 2: If there is no path between p s and p e then we must have p s =p e . There are two sub-cases: Case 2-1: p s = p e . In this case, p s = p s and is the same as . Case 2-2: p s 6= p e . In this case, p s = p s and p e is the next occurrence of the p s in U [i : j]. Since (s k ) is live and safe, every initially marked place in (s k ) must exist in a cycle containing exactly one token [Mur89]. Thus, there must exist a path between any place on the input cutline ofs k to the next occurrence of the same place on the output cutline of s k . In this case, passes along the path from p s to p e and can be reproduced by dropping all transitions and places along this path. Lemma A.6. Let U be a sequence of segments obtained from U by elevating s j to s j . U (t) U j (t). Proof. Let the globally critical path in U be . By Lemma A.5 this path is a subset 108 ofa path containedinU . Moreover, as all delaysinU arenon-negative, weknow the critical path in U is at least as long as that of U. 
As introduced in Chapter 3.3, we assume the Petri net has m segment types S =fs 1 ;:::;s m gcorrespondingtomodeset =f 1 ;:::; m g. Weconsideranunfolded execution of length N with mode sequence U =hU 0 ;:::;U N i where U j 2 . Our goal is to bound the length of the longest path from the first to the last instance of the transition of interest t in this sequence, i.e., to bound (0) U (t;t;N). Wedefineincreasinglylargersupersegments S =fs 1 ;:::;s m g, wheres 1 =s 1 , and 8i 6= 1;s i = s i1 s i . We let i be the cycle time for the marked graph associated with super segment s i . Lemma A.7. 8i> 1; i i1 . Proof. Based on Lemma A.6 since s i1 s i we can replace every s i1 in U i1 with s i to get U i . Theorem A.1. For any arbitrary unfolded execution of a live safe and reversible unique choice Petri net, U, we have U (t) m . Proof. By Lemma A.6 since 8s2;ss m we can replace each segment in U to get U m which completes the proof. The bound of Theorem A.1 is not optimal. When applied to slack matching, this conservative bound results in extra slack matching buffers. In the following we will try to obtain tighter upper bounds for the cycle time considering more assumptions about the mode switching behavior of the Petri net. To do this, as introduced in Chapter 3.3, we consider an unfolded execution U = hU 0 ;:::;U N i and start by proving a useful lemma about the time separation of events within subsequences of segments that have the same type. 109 Lemma A.8. 8 8 ^ p 1 ;^ p 2 2 ^ P;l(^ p 1 )6=l(^ p 2 ))jjjj< Proof. Let us assume that there exists a place-simple path, 0 , with k = jj 0 jj > . Letfc 1 ;c 2 ;:::;c k g be the set of cutlines intersecting with 0 andf^ p 1 ;^ p 2 ;:::;^ p k g be the places on the cutlines which intersect the path, ^ p i = 0 (c i ). Since 0 is a place-simple cycle we know l(^ p 1 ) 6= l(^ p 2 ) 6= ::: 6= l(^ p k ) and since these places are on the cutlines fl(^ p 1 );l(^ p 2 );:::;l(^ p k )gM 0 which is a contradiction since k> by assumption. 
Lemma A.9. D( i )jE i j(s Max (E i )) Proof. LetuselevateallthesegmentsofE i tos Max (E i )toobtainE i . BylemmaA.5, weknowthatthereexistsapath ~ i inE i suchthat i ~ i andD( i )D(~ i ). Note that i is a cycle generated by cycle extraction which crosses the cutlines jE i j times. Aselevationdoesnotchangethenumberofcutlines, ~ i isalsoacyclewithjE i jinitial tokens. Also note that E i is the unfolded execution of the marked graph (s Max ), and ~ i is the cycle in that marked graph, therefore D(~ i )jE i j(s Max (E i )). Theorem A.2. U (t) lim N!1 1 N X i jE i j(s Max (E i ) . Proof. Since is the global critical path to t N , we know that the simulation time of t N ,(t N ) =D( ) = P i D( i )+D( ). Since is the residue path of cycle extrac- tion, its span is less than so the value ofD( ) is finite and lim N!1 D( )=N = 0. Also note that the value of (t (0) ) is finite and lim N!1 (t N )=N = 0. Therefore, U (t) = lim N!1 1 N X i D( i ) and by lemma A.9: U (t) = lim N!1 1 N X i jE i j(s Max (E i ) : 110 TheoremA.3. ForanarbitrarymodeassignmenttofE i g, 1i, letD B (fE i g) = P i jE i j(s Max (E i )) then D B (fE i g)D B =D B (f ^ E i g), where ^ E i =hs i ;U i MIN (1)i: Proof. Exchange Argument: We prove that any arbitrary fE i g can be converted into f ^ E i g through a series of mutations. Initially, f ~ E i g =fE i g and let us assume at step j of conversion ~ E 1:j = ^ E 1:j , and D B (fE i g)D B (f ~ E i g) we show that after step j +1 we will have ~ E 1:j+1 = ^ E 1:j+1 and D B (fE i g)D B (f ~ E i g): Atstepj+1some ~ E k ,kj+1hasthelargestsupersegment. Letusswap ~ E k and ~ E j+1 . We then swap the largest super segment in ~ E j+1 with its first segment. Clearly these two exchanges will not change D B (f ~ E i g). We then exchange each segment in ~ E j+1 , from second to the last segment, with the j ~ E j+1 j1 smallest segments in ~ E ij+1 . Let assume that the next smallest segment, s MIN , falls in some ~ E k . 
If the currentsegmentof ~ E j+1 , exchangedfors MIN , islargerthans Max ( ~ E k )thenD B (f ~ E i g) increases as ~ E k has to be elevated to a larger segment otherwise there would be no change after elevation and D B (f ~ E i g) remains constant. Note that the elevation of ~ E j+1 is governed by its first segment which remains unchanged. Finally, we increase the length of ~ E j+1 one at a time until we reach to . This is done by removing the next smallest segment from say ~ E k>j+1 , and appending it to ~ E j+1 . On this exchange D B (f ~ E i g) will be increased by = D B (s MAX ( ~ E j+1 )) D B (s MAX ( ~ E k )) and we know that 0 because ~ E j+1 contains the maximum seg- 111 ment. By the sequence of exchanges applied, we now have ~ E j+1 = ^ E j+1 and as shown by each exchange D B (f ~ E i g) either remains constant or increases and therefore: D B (fE i g)D B (f ~ E i g) which completes the proof. TheoremA.4. For any arbitrary long sequence U with arbitrary mode order, U (t) P m j=1 ^ f j (s j ) where 8 > > < > > : ^ f m =Min(f m ;1) ^ f j =Min(f j ;1 P m k=j+1 ^ f k ); 1j <m Proof. By theorem A.3 we know that: U (t) lim N!1 1 N X i j ^ E i j(s Max ( ^ E i )) : Forthebasecase, ^ E 1 ;:::; ^ E m areaffectedbymodes m where m =Min(f m N;), Therefore lim N!1 1 N X i=1 j ^ E i j(s Max ( ^ E i )) = lim N!1 1 N m X i=1 (s m )+ X i=m j ^ E i j(s Max ( ^ E i )) The first term can be reformulated as follows: 112 lim N!1 1 N m X i=1 (s m ) = lim N!1 1 N Min(f m N;)(s m ) = lim N!1 1 N Min(f m N;N)(s m ) = Min(f m ;1)(s m ) = ^ f m (s m ): Note that the total number of elevated segments to s m equals to: N m = 8 > > < > > : f m N f m N N otherwise or equivalently N m = ^ f m N. 
After elevating for modes $m, m-1, \ldots, j+1$ we have:

$$\lim_{N\to\infty} \frac{1}{N} \sum_{1 \le i} |\hat{E}_i|\, \tau\big(s_{\mathrm{Max}}(\hat{E}_i)\big) = \sum_{k=j+1}^{m} \hat{f}_k\, \tau(s_k) + \lim_{N\to\infty} \frac{1}{N} \Big[ \sum_{\nu_{j+1} < i \le \nu_j} \lambda\, \tau(s_j) + \sum_{\nu_j < i} |\hat{E}_i|\, \tau\big(s_{\mathrm{Max}}(\hat{E}_i)\big) \Big]$$

The middle term can be reformulated as:

$$\begin{aligned} \lim_{N\to\infty} \frac{1}{N} \sum_{\nu_{j+1} < i \le \nu_j} \lambda\, \tau(s_j) &= \lim_{N\to\infty} \frac{1}{N}\, N_j\, \tau(s_j) \\ &= \lim_{N\to\infty} \frac{1}{N}\, \mathrm{Min}(\lambda f_j N,\ N - N_{m:j+1})\, \tau(s_j) \\ &= \mathrm{Min}\Big(\lambda f_j,\ 1 - \sum_{k>j} \hat{f}_k\Big)\, \tau(s_j) \\ &= \hat{f}_j\, \tau(s_j) \end{aligned}$$

In the above derivation, we used the fact that before the $j$th elevation, $N_{m:j+1} = \sum_{k>j} N_k = \sum_{k>j} \hat{f}_k N$ segments are already elevated and $N - N_{m:j+1}$ segments remain non-elevated. On the $j$th elevation we elevate $N_j = \mathrm{Min}(\lambda f_j N,\ N - N_{m:j+1})$ segments to $s_j$.

Lemma A.10. The marked graph associated with a sequence, as defined in Definition II.25, is live.

Proof. If the marked graph is not live, there must exist an unmarked cycle in it. Such a cycle cannot be cut by any cutline, because otherwise our algorithm would have marked it with at least one initial token. Therefore the cycle has to exist inside a segment, which contradicts the fact that segments are acyclic.

Lemma A.11. The marked graph associated with a sequence, as defined in Definition II.25, is $k$-safe, where $k$ is the length of the sequence.

Proof. If the marked graph is not $k$-safe, there must exist a cycle in it with more than $k$ initial tokens, which means that the cycle has to be cut by more than $k$ cutlines. This is impossible, as a sequence of length $k$ spans $k+1$ cutlines and the last cutline is eliminated in the folding process.
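As an illustration of how the bound in Theorem A.4 could be evaluated numerically, the following is a minimal sketch and not the author's implementation. It assumes that the symbol lost from the extracted formula before each $f_j$ is a sequence-length factor (called `length` here), that modes are supplied in index order with the last entry being $s_m$, and that the per-mode values are cycle times. The function name and argument names are hypothetical.

```python
def average_case_bound(freqs, cycle_times, length):
    """Evaluate sum_j f_hat[j] * cycle_times[j] using the capped frequencies
    of Theorem A.4.

    freqs[j] and cycle_times[j] describe mode s_(j+1); the last entry is s_m.
    `length` is an assumed reading of the factor that multiplies each f_j.
    Returns the bound together with the list of capped frequencies.
    """
    if len(freqs) != len(cycle_times):
        raise ValueError("freqs and cycle_times must have the same length")

    remaining = 1.0                      # fraction of segments not yet elevated
    f_hat = [0.0] * len(freqs)

    # Process modes m, m-1, ..., 1, as the proof does.
    for j in range(len(freqs) - 1, -1, -1):
        f_hat[j] = min(length * freqs[j], remaining)
        remaining = max(remaining - f_hat[j], 0.0)  # guard against rounding

    bound = sum(fh * ct for fh, ct in zip(f_hat, cycle_times))
    return bound, f_hat
```

For example, with frequencies `[0.5, 0.3, 0.2]`, cycle times `[1.0, 2.0, 4.0]` and `length = 2`, the last two modes absorb all of the available weight (0.4 and 0.6), so the first mode contributes nothing and the bound is about 2.8. With `length = 1` the capped frequencies equal the original ones.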
Abstract
Asynchronous circuits continue to gain interest as an attractive alternative to synchronous design for both low-power and high-performance applications. In both applications, however, providing accurate performance bounds is important to guide performance-aware optimizations as well as to provide guaranteed performance to the user. While in synchronous technologies the performance is largely dictated by the clock frequency, which is usually invariant and independent of data values or circuit operational modes, the cycle time of a conditional asynchronous circuit depends on the values of the input stimuli and can become very complex.

In the first part of this thesis, we present accurate average-case performance bounds for conditional asynchronous circuits that demonstrate mode-based behavior. Analyzing the performance of these circuits is challenging, as the critical paths cannot be identified without knowing the exact sequence of modes of operation, which is usually unknown at design time. Markov chain processes are used to model mode switching and Petri nets are used as the performance model. We adopt a performance analysis scheme based on decomposing the behavior of the Petri net into marked graph components to reason about performance. The bounds are derived using analytical and linear programming approaches based on the switching probability matrix of the Markov chain.

In the second part of this thesis, we demonstrate the application of the proposed average-case performance bounds in performance-aware optimization of conditional asynchronous circuits. In particular, we propose an exact slack matching algorithm for conditional asynchronous circuits operating in distinct modes of operation with potentially different performance requirements.
Asset Metadata
Creator
Najibi, Mehrdad (author)
Core Title
Average-case performance analysis and optimization of conditional asynchronous circuits
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering (VLSI Design)
Publication Date
06/06/2014
Defense Date
11/25/2013
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
asynchronous circuits,average performance analysis,Markov chain,OAI-PMH Harvest,Petri nets,pipeline optimization
Format
application/pdf (imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Beerel, Peter A. (committee chair), Breuer, Melvin A. (committee member), Golubchik, Leana (committee member), Pedram, Massoud (committee member)
Creator Email
mehrdad.najibi@gmail.com,najibiko@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c3-356419
Unique identifier
UC11297374
Identifier
etd-NajibiMehr-2212.pdf (filename),usctheses-c3-356419 (legacy record id)
Legacy Identifier
etd-NajibiMehr-2212-0.pdf
Dmrecord
356419
Document Type
Dissertation
Rights
Najibi, Mehrdad
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA