A TEMPLATE-BASED STANDARD-CELL ASYNCHRONOUS DESIGN METHODOLOGY

by

Recep Ozgur Ozdag

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

May 2004

Copyright 2004 Recep Ozgur Ozdag

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

UMI Number: 3140528
Copyright 2004 by Ozdag, Recep Ozgur
All rights reserved.

INFORMATION TO USERS

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleed-through, substandard margins, and improper alignment can adversely affect reproduction. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

UMI Microform 3140528
Copyright 2004 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.

ProQuest Information and Learning Company
300 North Zeeb Road
P.O. Box 1346
Ann Arbor, MI 48106-1346

Dedication

This dissertation is dedicated to all my family members, including my wife Jemelle Ozdag, my parents Huseyin and Nurten Ozdag, and my brother Cenk Ozdag.

Acknowledgments

There are a lot of people I would like to thank for a number of reasons. Firstly, I would like to thank my research advisor, Peter Beerel. I could not have imagined having a better advisor and mentor for my PhD.
I am grateful for his continuous supervision, encouragement, and support. I would also like to thank professors Massoud Pedram, Won Namgoong, Keith Chugg, and Roger Zimmermann, who served on my qualification examination committee, and particularly Massoud Pedram and Roger Zimmermann for also serving on my defense committee and spending the time and effort to enhance the contents of my thesis. I would also like to thank professor Steve Nowick from Columbia University and Montek Singh from The University of North Carolina at Chapel Hill for their effort on our joint research.

I would like to thank the USC Asynchronous Research Group members that overlapped with me: Marcos Ferretti for finding that my high speed pipelines would not work at the last minute, Sangyun Kim for his feedback on many research related issues, and Sunan Tugsinavisut for his help on the development of the asynchronous Fano design and the first publicly available asynchronous library. I am also happy that I have had Hoshik Kim and Pei-Chuan Peter Yeh as part of our research group. I would especially like to thank Mr. Jay Moon, who was always one step ahead of me and therefore always knew the answers to CAD tool related questions. His help has been invaluable. A big thank you goes to Mr. Sachit Chandra, who designed most of the cells in our library. On a different note, I would like to thank the coffee producers of the world for keeping me thinking.

Last but not least, a big thank you to my wife Jemelle Ozdag, who has kept up with my grumpiness, tolerated my late work hours, and always supported me; to my parents Nurten and Huseyin Ozdag; and to my little naughty brother Cenk Ozdag for their support, encouragement, and love. You can only appreciate your parents when you are older and realize you are more like them than you thought.
Table of Contents

Dedication ii
Acknowledgments iii
List of Tables viii
List of Figures ix
Abstract xii

1. Introduction 1
   1.1 Asynchronous Circuit Design Flow 7
   1.2 Expected Contributions of the Thesis 9
   1.3 Thesis Organization 10

2. Background 11
   2.1 Data Encoding Styles 11
   2.2 Handshaking Styles 12
   2.3 Delay Models 15
   2.4 Synthesis Based Design 16
       2.4.1 Fundamental Mode Huffman Circuits 16
       2.4.2 Burst-Mode Circuits 18
       2.4.3 Event-Based Design 19
   2.5 Template-Based Design 21
       2.5.1 Template-Based Compilation Systems 22
             2.5.1.1 Caltech's Design Methodology 22
             2.5.1.2 Tangram and Balsa 23
       2.5.2 Micropipelines 24
       2.5.3 Ad Hoc Design 26
   2.6 Linear and Non-Linear Asynchronous Pipelines 26
       2.6.1 Linear Pipelines 27
       2.6.2 Fine Grain Pipelining 30
       2.6.3 Performance Analysis of Linear Pipelines 31
       2.6.4 Non-Linear Pipelines 34

3. New High Speed QDI Asynchronous Pipelines 37
   3.1 Caltech's QDI Templates 37
       3.1.1 WCHB 37
       3.1.2 PCHB and PCFB 39
       3.1.3 Why Input Completion Sensing? 41
   3.2 New QDI Templates 42
       3.2.1 RSPCHB 44
       3.2.2 Loops Using RSPCHB 51
       3.2.3 RSPCFB 56
       3.2.4 FSM Design 58
       3.2.5 Simulation Results 60
       3.2.6 Conclusions 63

4. Timed Pipelines 64
   4.1 Williams' PS0 Pipeline 65
   4.2 Lookahead Pipelines Overview (Single Rail) 67
   4.3 Lookahead Pipelines Overview (Dual Rail) 70
   4.4 High Capacity Pipelines (Single Rail) 71
   4.5 Designing Non-Linear Pipeline Structures 72
       4.5.1 Slow and Stalled Right Environments in Forks 72
       4.5.2 Slow and Stalled Left Environments in Joins 73
   4.6 Lookahead Pipelines (Single Rail) 74
       4.6.1 Solution 1 for LPsr2/2 75
       4.6.2 Solution 2 for LPsr2/2 76
       4.6.3 Pipeline Cycle Time 77
   4.7 Lookahead Pipelines (Dual Rail) 77
       4.7.1 Joins 78
       4.7.2 Forks 78
   4.8 High Capacity Pipelines (Single Rail) 81
       4.8.1 Handling Forks and Joins 82
       4.8.2 Pipeline Cycle Time 83
   4.9 Conditionals 84
   4.10 Loops 87
   4.11 Simulation Results 88
   4.12 Conclusions 91

5. A Design Example: The Fano Algorithm 92
   5.1 The Fano Algorithm 92
       5.1.1 Background on the Algorithm 92
   5.2 The Synchronous Design 94
       5.2.1 Normalization and Its Benefits 94
       5.2.2 Register-Transfer Level Design 95
       5.2.3 Chip Implementation 100

6. The Asynchronous Fano 102
   6.1 The Asynchronous Fano Architecture 103
   6.2 The Skip-Ahead Unit 105
   6.3 The Memory Design 107
   6.4 The Fast Data and Decision Registers 109
   6.5 Simulation Results and Comparison 110
   6.6 Skip-Ahead Unit with RSPCHB 113

7. An Asynchronous Semi-Custom Physical Design Flow 115
   7.1 Physical Design Flow Using Standard CAD Tools 115

8. Conclusion and Future Work 125

9. References 128

List of Tables

Table 4.1: Cycle time (ns) of original linear pipelines vs. proposed non-linear pipelines 89

List of Figures

Figure 1.1: Asynchronous circuit design flow under development 8
Figure 2.1: Handshaking protocols: two-phase versus four-phase 14
Figure 2.2: Pipeline channels 28
Figure 2.3: Synchronous vs. asynchronous pipelines 29
Figure 2.4: Throughput vs. tokens graphs 33
Figure 2.5: a) A fork and b) a join 35
Figure 2.6: Fundamental non-linear pipeline structures 36
Figure 3.1: WCHB 38
Figure 3.2: a) PCHB and b) PCFB templates 39
Figure 3.3: a) PCHB and b) PCFB STG 39
Figure 3.4: An OR gate implementation using weak-conditioned logic 42
Figure 3.5: Optimized PCHB for a 1-of-N+1 channel 43
Figure 3.6: a) Abstract and b) detailed QDI RSPCHB pipeline template 45
Figure 3.7: The STG of the RSPCHB 46
Figure 3.8: Conditional a) join and b) split using RSPCHB 48
Figure 3.9: A RSPCHB 1-bit memory 51
Figure 3.10: Performance slowdown in a 33-stage loop using RSPCHB 53
Figure 3.11: Request breaker for RSPCHB loops 54
Figure 3.12: a) The loop at time t=0; data and request are synchronized. b) Data travels around the loop and stalls. c) Request breakers prevent the stall 55
Figure 3.13: Performance improvement using request breakers in a 33-stage RSPCHB loop 55
Figure 3.14: a) Abstract and b) detailed RSPCFB 57
Figure 3.15: a) Abstract and b) detailed RSPCFB 58
Figure 3.16: An abstract asynchronous FSM 59
Figure 3.17: Throughput versus tokens for a) the PCHB and RSPCHB and b) the PCFB and RSPCFB linear pipelines 62
Figure 4.1: Williams' PS0 pipeline stage 65
Figure 4.2: The STG of the PS0 pipeline 67
Figure 4.3: a) LPsr2/2, b) LP3/1, and c) HC pipelines 69
Figure 4.4: a) Modified first stage after the fork. b) Detailed implementation of the gates in the dotted box 76
Figure 4.5: The LPsr2/2 pipeline stage with a symmetric C-element 77
Figure 4.6: The LP3/1 pipeline with a modified CD to handle joins 79
Figure 4.7: a) Modified first stage after the fork. b) Detailed implementation of the additional gates 79
Figure 4.8: The LP3/1 stage with a C-element 80
Figure 4.9: a) Original and b) new HC stage 82
Figure 4.10: A 2-way join, 2-way fork HC stage 84
Figure 4.11: a) Conditional read and b) write 86
Figure 4.12: A one-bit LPsr2/2 memory 86
Figure 4.13: HSPICE waveforms: a) linear pipeline, b) two-way fork, and c) two-way join 90
Figure 5.1: Flow-chart of the Fano algorithm 94
Figure 5.2: RTL architecture of the synchronous Fano algorithm 97
Figure 5.3: Finite state machine describing the RTL 99
Figure 6.1: RTL architecture of the asynchronous implementation 104
Figure 6.2: Detailed implementation of the Skip-Ahead Unit 106
Figure 6.3: Implementation of the Received Memory 108
Figure 6.4: Implementation of a 1-bit fast shift register 110
Figure 6.5: Layout of the asynchronous Fano 111
Figure 6.6: a) Error-free and b) error-region operation waveforms 112
Figure 6.7: The critical loop of the Skip-Ahead Unit 113
Figure 7.1: Physical design flow using standard CAD tools 116
Figure 7.2: Asynchronous circuit design flow proposed, covering both gate and leaf cell based physical design flow 117
Figure 7.3: The functional description of a dynamic buffer 119
Figure 7.4: The transistor view of a dynamic buffer 120
Figure 7.5: The layout view of a dynamic buffer 121
Figure 7.6: Cell placement in Silicon Ensemble 122
Figure 7.7: Routed counter block with Silicon Ensemble 123
Figure 7.8: Extracted netlist of a block 124

Abstract

Asynchronous design is increasingly becoming an attractive alternative to synchronous design because of its potential for high speed, low power, reduced electromagnetic interference, and faster time to market.
To support these design efforts, numerous design styles and supporting CAD tools have been proposed. We adopt a template-based methodology that facilitates hierarchical design using standard asynchronous channel protocols, removes the need for complicated hazard-free logic synthesis, and naturally provides fine-grain pipelines with high throughput. We propose seven different templates that provide tradeoffs between throughput and robustness to timing. The most robust templates are quasi-delay-insensitive in that they work correctly regardless of delays on individual gates. The most aggressive templates use timing assumptions that can be satisfied with additional care during transistor sizing, floorplanning, and layout. We propose a complete design methodology for template-based designs using standard hardware description languages and the Cadence design framework. We demonstrate the advantages of the templates and methodology by designing an asynchronous sequential channel decoder based on the Fano algorithm. Spice simulations, on the extracted layout, show that the circuit runs at 450 MHz and consumes 32 mW at 25°C. The asynchronous chip runs about 2.15 times faster and consumes 1/3 the power of its synchronous counterpart.

Chapter 1

1. Introduction

Digital VLSI circuit design styles can be mainly classified as synchronous, asynchronous, or some mixture of the two. Synchronous designs are controlled by one or more clocks that govern synchronization and communication between blocks, and have dominated the design space since the 1960s. Combinational logic is placed in between clocked registers that hold the data. The delay through the combinational logic plus the relevant setup time should be smaller than the clock cycle time.
In fact, the data at the inputs of the registers may exhibit glitches or hazards as long as they are guaranteed to settle before the sampling clock edge arrives. Asynchronous methodologies, in contrast, use event-based handshaking to control synchronization and communication between blocks. This chapter first reviews various synchronous design methodologies and then describes some potential advantages of asynchronous design, before providing a more detailed overview of the thesis.

Synchronous design methodologies can be classified in one of two main categories: standard cell design and full custom design. Semi-custom standard-cell-based design methodologies offer good performance with typically 12-month design times [27]. They are supported by a large array of mature CAD tools that range over simulation, synthesis, verification, and test. The synthesis task is divided into architecture definition, logic/gate-level design, and physical design. A large library of standard-cell components that have been carefully designed, verified, and characterized supports the synthesis task. This library is generally limited to static CMOS based gates for a variety of reasons. Compared to more advanced dynamic logic families, standard CMOS static logic has higher noise margins and thus requires far less analog verification, significantly reducing design time. Standard-cell designs also use standard clocking strategies to facilitate more automation and reduced design times. The forms of gated clocking are limited, reducing power efficiency. Standard flip-flop based designs are used to simplify timing analysis despite incurring significant data-to-output overheads. Moreover, the time-to-market advantage of standard-cell based designs is being attacked by the increasingly difficult task of estimating wire delay.
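The timing budget described above can be expressed as a simple check: combinational delay plus wire delay plus register setup time (and any clock skew) must fit within the clock period. The function name and delay values below are purely illustrative, not from the thesis; real flows rely on static timing analysis tools.

```python
# Illustrative static-timing check for one synchronous pipeline stage.
# All names and numbers are hypothetical.

def meets_timing(clock_period_ns, logic_delay_ns, wire_delay_ns,
                 setup_ns, skew_ns):
    """Return True if the worst-case path fits within the clock period."""
    worst_path_ns = logic_delay_ns + wire_delay_ns + setup_ns + skew_ns
    return worst_path_ns <= clock_period_ns

# A 500 MHz clock gives a 2.0 ns period.
print(meets_timing(2.0, 1.4, 0.3, 0.1, 0.1))  # True: the 1.9 ns path fits
print(meets_timing(2.0, 1.4, 0.6, 0.1, 0.1))  # False: extra wire delay breaks timing
```

When the check fails, the synchronous designer must slow the clock or re-optimize the path, which is exactly the timing-closure loop that late wire-delay estimates make so difficult.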
In submicron designs, the process of architecture, logic, and technology-mapping design could proceed somewhat independently from placement and routing of the cells, power grid, and clocks because wire delays were negligible compared to gate delays. In deep-submicron design, however, the relative delays of long-range wires are increasing and becoming harder to estimate. This is causing the traditional separation of logic synthesis and physical design tasks to break down, because synthesis does not properly account for actual wire delays. This timing-closure problem has forced numerous shipment schedules to slip. EDA vendors have now developed a new suite of emerging CAD tools that address aspects of physical design much earlier in the design process. In the future, predictions suggest that long-range wires may have delays of 5 to 20 clock cycles, making estimation particularly critical [27]. In particular, it is predicted that high-speed clock regions communicating at perhaps reduced frequencies may become prevalent, but the semi-custom CAD support for multiple clock domains is just emerging. The simplest approach involves adding synchronizers between clock domains, which incurs a significant latency penalty. Some manufacturers have extended the standard cell design technique to the design of datapaths and other higher-level functions such as microprocessors and their peripherals. On the other hand, the design can also be implemented by optimizing every transistor of the layout. This technique is called full custom design, and is generally preferred when one or many aspects of the chip need to be optimized beyond what is readily available in a semi-custom approach.
Since the designer controls the transistor sizes, the placement of the smallest functional blocks, and the main routing method, the end result is in general much better than standard cell design. In the full custom method, design time is traded for higher performance, reduced area, or lower power consumption, since all possible circuit techniques can be applied, whereas in standard cell design the CAD tool only has a limited number of pre-laid-out cells that need to be broad enough to suit every customer's needs. Full-custom design houses have found that these challenges with standard cell design can be overcome with longer design cycles of an average of 36 months. In particular, the use of advanced dynamic logic styles has been an area of growing interest in full-custom designs [55] [6] [26] [25]. Domino logic is estimated to be 30% faster than static logic because of the improved logical effort derived from the removal of PMOS logic. Traditional domino logic, however, still suffers from overhead associated with clock skew and latch delays. More advanced flip-flops and latches have been developed that somewhat improve the clock skew overhead and reduce the latch delays. At the extreme, the latch delays can be removed using multiple overlapping clocks in a widely used technique recently named skew-tolerant domino logic [25]. In addition to the problems of clock distribution and skew is the problem of heat and power consumption. Many of the gates switch because they are connected to the clock, not because they have new input data to evaluate. The biggest gate of all is the clock driver, and it must switch all the time to provide the correct timing, even if only a small part of the chip has anything useful to do. Although gating the clock is an option to send the clock signal to only those who need it, stopping and starting a high-speed clock is not easy.
To reduce power consumption, particularly in memories and in long-distance on-chip and off-chip communication, low-voltage signaling has been commonly used. These techniques also suffer from reduced noise margins, requiring more manual design practices and extensive analog simulation. The basic cost of achieving this higher performance and lower power is the reduced noise margin and the increased need for careful, manual design practices and extensive analog verification, both pre- and post-layout.

The increasing limitations and growing complexity of both standard-cell and full-custom synchronous design have led to a change of focus in digital circuit design. In particular, circuits that lack a global controlling clock, namely asynchronous circuits, have demonstrated potential benefits in many aspects of system design (e.g. [28], [23], [7], [54], [38], [64], [20], [70]). Asynchronous circuits have several advantages over their synchronous counterparts, including:

1) Elimination of clock skew: Clock skew is defined as the difference in arrival time of the clock signal at different parts of the circuit. In standard cell design, the clock period is generally increased to assure correct operation in the presence of skew, which yields slower running circuits. In full custom design, buffer insertion or careful clock tree design and analysis to improve clock routing and clock power are some of the methods synchronous designers use to handle this problem. Although the full custom approach leads to reduction or even elimination of clock skew, for synchronous design this is still a problem that needs to be worked on. On the other hand, since asynchronous circuits have no global clock that controls the data flow, there is no clock skew problem.
2) Lower power consumption: In general, the constant activity of the clock signal causes synchronous systems to consume power even when some parts of the circuit are not processing any data. Even though improvements in full custom design, such as clock gating, avoid sending the clock signal to inactive parts, the clock driver has to constantly provide a powerful clock able to reach all parts of the circuit. Although asynchronous circuits in general have more transitions due to the handshaking hardware overhead, they generally have transitions only in areas that are active in the current computation.

3) Average-case performance: Synchronous circuit designers have to consider the worst-case scenario when setting the clock speed to ensure that all the data has stabilized before being latched. Asynchronous circuits, however, detect and react when the computation is completed, yielding average-case rather than worst-case performance [70].

4) Easing of global timing issues: Since in synchronous circuits the slowest path dictates the clock speed, designers try to optimize all the paths to achieve the highest possible clock rate. In particular, there may be long wires that require large buffers and consume significant power even though they may be non-critical or may be infrequently driven. In contrast, in asynchronous circuits optimizing the frequently used paths is easier [54].
5) Better technology migration potential: Since the technology in which a circuit is implemented improves rapidly, for synchronous circuits better performance often can only be achieved by migrating all the system components to the new technology, whereas in asynchronous design communication between blocks occurs only when completion of the processing is detected. Therefore, components with the different delays introduced by different technologies can easily be substituted into a system without altering other structures.

6) Automatic adaptation to physical properties: The delay on a path may change due to variations in the fabrication process, temperature, and power supply voltage. Synchronous system designers must consider the worst case and set the clock period accordingly. Asynchronous circuits, however, naturally adapt to changing conditions, since a slowdown on any path does not affect the functionality of the system [24].

7) Improved EMI: In a synchronous design, all activity is locked to a very precise frequency. As a result, nearly all the energy is concentrated in very narrow spectral bands at the clock frequency and its harmonics, and there is substantial electrical noise at these frequencies. Activity in an asynchronous circuit is uncorrelated, resulting in a more distributed noise spectrum and a lower peak noise value [41].

1.1 Asynchronous Circuit Design Flow

The USC Asynchronous CAD and VLSI group, jointly with the Columbia Asynchronous group, is currently developing a complete asynchronous circuit design methodology that will support automated design exploration of both high-performance and low-power asynchronous circuits. The basic steps of the methodology are illustrated in Figure 1.1. First, a language-based model, such as CSP [37] or Verilog [51], is used as the input description.
This input description describes the desired top-level functionality of the chip and may be annotated with overall constraints on power, energy consumption, throughput, latency, chip area, etc. Note that details regarding internal structure or the specific asynchronous protocols used are deliberately not included in the description. After generating this input description and verifying its correctness, the next step in the methodology is to explore and finalize a basic architecture for the design. This basic architecture should identify the number and relative characteristics of the basic blocks in the design (register files, ALUs, multipliers, etc.). To automate this step we expect to adapt variations of classical high-level synthesis, i.e., scheduling, resource sharing, and binding. After architectural design is complete, the next step in the methodology is micro-architecture design. In this step the designer can choose to implement the architecture with various methods, ranging from template-based fine-grain pipelines using delay-insensitive cells to components relying on bounded delays with no pipelining at all. Depending on the style chosen, various optimizations can be applied, namely selection of the handshaking protocol, definition of the level of pipelining, and slack optimization for pipelined designs. Once this initial micro-architecture is created, the next step is to identify critical components and perform handshaking optimization to achieve higher performance and lower power. Based on the final micro-architecture, a gate- or transistor-level design is generated. This can be done either automatically, using new template-based synthesis techniques that our group is creating, or manually.
[Figure 1.1: Asynchronous circuit design flow under development — language-based input description (CSP, Verilog, C), architectural design, micro-architectural design, gate-level design optimization, placement and routing.]

Finally, placement and routing will be applied very similarly to the way they are in synchronous circuit design. At every step of the design process, verification and performance analysis tools are used to verify correct functionality and overall performance. The focus of this proposal is the generation of new templates for template-based design, as well as helping to develop the above CAD framework for the automated design of asynchronous systems.

1.2 Expected Contributions of the Thesis

Our research group's goal is to produce a complete design method for asynchronous systems, including specification, synthesis, verification, simulation, and testing; to develop a suite of CAD tools supporting the design method; and to use these CAD tools to design high-performance and energy-efficient asynchronous microprocessors and systems-on-a-chip. As part of ongoing research to accomplish these goals, we:

• Develop two new quasi-delay-insensitive, high-speed templates targeted at non-linear pipelines, which are faster and smaller than other quasi-delay-insensitive templates. Quasi-delay-insensitive templates are the most robust asynchronous building blocks for template-based designs. By using templates we can mimic the ease of design of the standard-cell design methodology in synchronous design. We also show the implementation of some of the non-linear structures.

• To achieve higher speeds, we then develop five new bounded-delay pipeline templates by modifying and further improving the templates developed by Columbia University, which are based on timing assumptions to shorten
handshaking time and achieve higher speeds. In particular, the templates developed by Columbia University were targeted at linear pipelines such as FIFOs. Real-life designs, however, require more complex structures that require the templates to also function correctly in non-linear pipelines. To extend the existing pipelines, we modify each template to handle non-linear pipelines with little impact on performance.

• We then implement a communication algorithm as a design example in both synchronous and asynchronous methods, to show the advantages of asynchronous design over synchronous design as well as to help the development of a CAD environment mainly targeted at template-based design. The asynchronous implementation of the algorithm will also be used to study the trade-offs among different asynchronous templates, from timed to delay-insensitive.

1.3 Thesis Organization

The organization of the remainder of this proposal is as follows. Chapter 2 presents background on asynchronous circuit design styles and on linear and non-linear pipeline applications. Chapter 3 presents the new high-speed QDI pipelines. Chapter 4 presents the extension of the pipelines introduced by Columbia University and the introduction of five new timed templates. Chapter 5 presents the design example in synchronous form, and Chapter 6 in asynchronous form. Chapter 7 presents our semi-custom asynchronous design flow, and finally Chapter 8 covers the conclusion and future work.

Chapter 2

2. Background

This section presents the basics of asynchronous circuit design and classifies many of the existing asynchronous circuit design styles according to data encoding method, handshaking style, granularity of pipelining, and circuit style.
Then we describe the differences between logic-synthesis-based methodologies and those that rely more on a template-based methodology. We then focus on existing templates that support the design of complex fine-grain pipelines and analyze their performance.

2.1 Data Encoding Styles

Single-rail [52] communication between functional blocks consists of one request wire and one wire per data bit from the sender to the receiver, and one acknowledgment wire from the receiver to the sender. Dual-rail communication often consists of two wires per data bit from the sender to the receiver and one acknowledgment wire from the receiver to the sender. In addition, dual-rail designs can have an additional request line [57]. 1-of-N communication is a generalization of dual-rail communication in which ⌊log₂N⌋ bits are sent using N wires. An acknowledgment signal from the receiver to the sender is used to tell the sender that the data is no longer needed. The logic that drives this acknowledgment signal often involves completion-sensing circuitry that helps determine when the receiver is done using the current data bits. In single-rail communication, completion-sensing circuits are implemented with bundled data lines [52] or more sophisticated speculative completion-sensing circuitry [47], [5] that includes delay lines matching the critical paths of the functional unit. On the other hand, completion sensing in dual-rail designs can be done using specialized logic that actively identifies when the computation is done. This latter logic relies on the dual-rail nature of the data and can be implemented without relying on timing assumptions; thus, it is more robust to variations in delay than its delay-line counterparts.
Completion sensing, however, requires more circuitry than delay lines and, if not done wisely, can incur a significant performance, power, and area penalty. The functional units can be implemented using static or dynamic logic. Functional units that communicate using dual-rail or 1-of-N styles are often implemented using dual-rail dynamic logic [67] [33], but static logic is also possible [67]. Functional units that communicate using single rail are more commonly implemented using static logic, which is often smaller and consumes less power than its dynamic counterparts. Designs implemented with dynamic logic, however, can generally achieve higher throughput than their static-logic counterparts. Consequently, they can run at lower voltages to achieve a given throughput requirement and thus may yield a lower-power design than their single-rail counterparts.

2.2 Handshaking Styles

Asynchronous circuits consist of functional units that communicate control and data information using various handshaking styles. The most dominant handshaking styles, two-phase [62] and four-phase handshaking [21], are shown in Figure 2.1. In the two-phase handshake protocol, a request and an acknowledge wire are used to implement handshaking between the sender and the receiver. All transitions are functional, and consequently every pair of consecutive request/acknowledge transitions forms a complete handshake. Two-phase single-rail communication is usually seen with static-logic functional units that use bundled data for completion sensing. Due to some difficulties in designing complex two-phase control circuits, a novel single-track handshaking protocol has been suggested by van Berkel and Bink [8]. This handshaking protocol combines the request and acknowledge lines into one wire and is illustrated in Figure 2.1 (b).
Where two-phase handshaking involves two events per cycle, four-phase handshaking requires four, as shown in Figure 2.1 (c). Since four events make up a complete handshaking cycle, half of them are essential for functional computation and the other half are not actively used to communicate data. Nevertheless, this reset phase is very useful for precharging dynamic units. Figure 2.1 (d) shows a four-phase handshaking protocol for dual-rail dynamic units [67] [33]. Other protocols extend the data-valid region through the reset phase [52] [12] to more efficiently use four-phase handshaking with static functional units.

[Figure 2.1: Handshaking protocols, two-phase versus four-phase — (a) two-phase handshaking protocol, (b) single-track handshaking protocol, (c) four-phase handshaking protocol, (d) four-phase handshaking protocol for dual-rail dynamic units.]

2.3 Delay Models

Most design techniques require some timing assumptions or constraints on the wires and/or components to ensure correct operation. For example, in synchronous circuit design, the data input to every register must satisfy all setup and hold times. The delay assumptions in asynchronous circuits vary widely based on design style, as outlined below.

• Delay insensitive (DI): Delay-insensitive designs [19] [65] require no timing assumptions on either wires or gates. That is, DI circuits work correctly for any arbitrary, time-varying gate and wire delay. This is the most conservative and robust design style, but it has been shown that very few gate-level delay-insensitive designs can exist [35]. That said, delay insensitivity
can more easily and practically be achieved at the block level, where blocks communicate only through delay-insensitive channels.

• Quasi delay insensitive (QDI): Quasi-delay-insensitive design [34] [33] is a practical approximation to delay-insensitive design. QDI circuits work correctly regardless of delays in gates and all wires, except in the case of wire forks designated isochronic. The difference between the times at which the signal arrives at the ends of an isochronic fork must be less than the minimum gate delay. If these isochronic forks are guaranteed to be local to a small component, these circuits can be practically as robust as DI circuits. The QDI assumption has also been extended to include assumptions of isochronic propagation through a number of logic gates [9].

• Speed independent (SI): SI design [67] [3] assumes that gate delay can be arbitrary but that all wire delay is negligible. From a delay perspective, SI design basically assumes that all forks are isochronic. For the design of small control circuits, this timing assumption is generally satisfied.

• Scalable delay insensitive (SDI): SDI approaches [44] [47] are motivated by the observation that SI design should not be used for any circuit that spans significant chip area. Consequently, in SDI design the chip area is divided into many regions; SI circuit design is used within each region, and communication between regions is done delay-insensitively.

• Bounded delay: In bounded-delay models each gate is given a minimum and maximum delay, and the circuit must work if the delays of all gates are within these bounds. These timed circuits can often be faster, smaller, and lower power than their QDI or SI counterparts, but require more careful timing verification during physical design [42].

• Relative timing:
In relative-timing-based circuits, a list of relative orderings of events identifies sets of path pairs, where for each pair of paths, one path must be longer or shorter than the other to ensure correctness. These circuits can have the same benefits as timed circuits and may be easier to validate [61] [30] [17].

2.4 Synthesis-Based Design

2.4.1 Fundamental-Mode Huffman Circuits

In this model, the circuit design flow is similar to that of synchronous circuits [24]. The circuit is usually expressed as a flow table [66]. The flow table has a row for each internal state and a column for each combination of inputs. The entries indicate the next state entered and the output generated when the column's input combination is seen while in the row's state. States where the next state is identical to the current state are called stable states. It is assumed that each unstable state leads directly to a stable state, with at most one transition occurring on each output variable. Similar to finite state machine synthesis in synchronous systems, state reduction and state encoding are performed on the flow table, and Karnaugh maps are generated for each of the resulting signals. There are several points that need to be considered for this design method. The system responds to input changes rather than clock ticks; therefore the circuit may enter some intermediate states if multiple inputs change at the same time. It must consequently be guaranteed that these intermediate states still lead to the intended stable state, irrespective of the order in which the inputs change. Another concern is hazard removal. Since hazards, static or dynamic, can cause the circuit to enter an unstable state, they must be eliminated by adding a sum-of-products circuit that has functionally redundant products.
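The flow-table operation just described can be sketched as a small lookup: rows are internal states, columns are input combinations, and an entry is stable when its next state equals the row's state. The two-state table below is an invented example for illustration, not one taken from the dissertation.

```python
# Flow table: (state, inputs) -> (next_state, output).
# An entry is stable when next_state == state; unstable entries are
# assumed to lead directly to a stable state.
FLOW_TABLE = {
    ('A', (0, 0)): ('A', 0),   # stable
    ('A', (0, 1)): ('B', 1),   # unstable: leads to B
    ('B', (0, 1)): ('B', 1),   # stable
    ('B', (0, 0)): ('A', 0),   # unstable: leads back to A
}

def settle(state, inputs):
    """Follow unstable entries until a stable state is reached,
    mimicking fundamental-mode operation after one input change."""
    while True:
        next_state, output = FLOW_TABLE[(state, inputs)]
        if next_state == state:        # stable: the circuit has settled
            return state, output
        state = next_state

print(settle('A', (0, 1)))   # ('B', 1): entry A/01 is unstable
```

The fundamental-mode assumption corresponds to the requirement that `settle` runs to completion before the inputs change again.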
Due to the restriction of only one input to the combinational logic changing at a time, several requirements need to be enforced on the implementation of sequential circuits. First, the combinational logic must settle in response to a new input before the present-state entries change. The state encoding must ensure a single-bit transition for state transitions. The last requirement is that the next external input transition cannot occur until the entire system settles to a stable state. While the fundamental-mode assumption makes logic design easy, it also increases cycle time. There are proposed solutions which carefully analyze an implementation to relax the fundamental-mode assumption; however, because of the limitations on multiple input changes, this design methodology has never achieved wide acceptance for complex system design. Burst-mode circuits, covered in the next section, overcome the limitations on multiple input changes.

2.4.2 Burst-Mode Circuits

The burst-mode design style developed by [45], [46], [71], based on earlier work at HP Laboratories by [18], attempts to move even closer to synchronous design than the Huffman method [24]. In this method, circuits are specified via a standard state machine, where each arc is labeled by a non-empty set of inputs (an input burst) and a set of outputs (an output burst). The assumption is that, in a given state, only the specified inputs in one of the input bursts leaving that state can occur. The inputs are allowed to occur in any order. The machine reacts to the inputs only when all of the expected inputs have occurred; it then fires the specified output burst and enters the specified next state. New inputs are only allowed to occur after the system has completely reacted to the previous input burst.
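A burst-mode specification of the kind just described can be sketched as a state machine that accumulates input events and reacts only once a complete input burst has been seen. The states, signal names, and burst sets below are hypothetical, chosen only to illustrate the mechanism.

```python
# Burst-mode sketch: per state, an input burst (a frozenset of input
# events that may arrive in any order) maps to (output burst, next state).
SPEC = {
    'S0': {frozenset({'a+', 'b+'}): (['x+'], 'S1')},
    'S1': {frozenset({'a-'}):       (['x-'], 'S0')},
}

def run(state, events):
    pending = set()
    for ev in events:                  # inputs may arrive in any order
        pending.add(ev)
        for burst, (outs, nxt) in SPEC[state].items():
            if pending == burst:       # react only on a complete burst
                print(f"{state} --{sorted(burst)}/{outs}--> {nxt}")
                state, pending = nxt, set()
    return state

run('S0', ['b+', 'a+', 'a-'])   # either order of {a+, b+} is accepted
```

The subset restriction mentioned below — no input burst may be a subset of another burst leaving the same state — is what makes the `pending == burst` test unambiguous.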
Therefore, the burst-mode method still requires the fundamental-mode assumption, but only between transitions in different input bursts. Another restriction is that no input burst can be a subset of another input burst leaving the same state. Burst-mode circuits can be implemented in various ways, including techniques similar to those of Huffman circuits. The problem with both fundamental-mode and burst-mode circuits that restricts their use is the fact that circuits often are not simple single-gate small state machines, but instead complex systems with multiple control state machines and datapath elements. These methods do not discuss system decomposition for complex circuits. Also, these methodologies cannot design datapath elements, because datapath elements tend to have multiple input signals changing in parallel, and the fundamental-mode assumption would be easily violated. Although one solution for datapath implementation is to use synchronous components with careful ad hoc optimization, another issue is the delay added by the extra delay elements needed to satisfy the fundamental-mode assumption. Not only is the delay increased, but it must also cover the worst-case scenario.

2.4.3 Event-Based Design

Petri nets and other graphical notations are a widely used alternative for specifying and synthesizing asynchronous circuits. In this model, an asynchronous system is viewed not as state-based, but rather as a partially ordered sequence of events. A Petri net [40] is a directed bipartite graph which can describe both concurrency and choice. The net consists of two kinds of vertices: places and transitions. Tokens are assigned to the various places in the net. An assignment of tokens is called a marking, which captures the state of the concurrent system.
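The place/transition structure and markings can be sketched as a tiny simulator (the token-flow firing rule it implements is spelled out in the following paragraph). The two-transition fork/join net below is an invented example for illustration.

```python
# A Petri net as two maps: transition -> input places / output places.
# A transition is enabled when all of its input places hold a token;
# firing consumes those tokens and marks the successor places.
PRE  = {'t1': ['p1'], 't2': ['p2', 'p3']}
POST = {'t1': ['p2', 'p3'], 't2': ['p1']}

def enabled(marking, t):
    return all(marking.get(p, 0) > 0 for p in PRE[t])

def fire(marking, t):
    assert enabled(marking, t)
    m = dict(marking)
    for p in PRE[t]:
        m[p] -= 1                      # remove tokens from input places
    for p in POST[t]:
        m[p] = m.get(p, 0) + 1         # mark the successor places
    return m

m = {'p1': 1}                          # initial marking
m = fire(m, 't1')                      # t1 forks tokens into p2 and p3
print(enabled(m, 't2'))                # True: both p2 and p3 are marked
m = fire(m, 't2')                      # t2 joins them back into p1
print(m)                               # {'p1': 1, 'p2': 0, 'p3': 0}
```

Here t1 models concurrency (one token forks into two), while t2 is a join that synchronizes the two concurrent tokens.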
When all the conditions preceding a transition are true, the transition may fire, which removes the tokens from the preceding places and marks the successor places. Hence, starting from an initial marking, tokens flow through the net, transforming the system from one marking to another. As tokens flow, they fire transitions in their path according to certain firing rules. Patil proposed the synthesis of Petri nets into asynchronous logic arrays. In this approach, the structure of the Petri net is mapped directly into hardware. Many modern synthesis methods use a Petri net as a behavioral specification only, not as a structural specification. Using reachability analysis, the Petri net is typically transformed into a state graph, which describes the explicit sequencing behavior of the net. An asynchronous circuit is then derived from the state graph. More general classes of Petri nets include Molnar et al.'s I-Nets [39] and Chu's Signal Transition Graphs, or STGs [13]. These nets allow both concurrency and a limited form of choice. Chu developed a synthesis method which transforms an STG into a speed-independent circuit, and applied the method to a number of examples. Previous work in this area can be found in [16]. Petrify is a tool for manipulating concurrent specifications and for the synthesis and optimization of asynchronous control circuits [16]. Given a Petri net or an STG, it generates another Petri net or STG which is simpler than the original description, and produces an optimized netlist of an asynchronous controller in the target gate library while preserving the specified input-output behavior. The ability to back-annotate to the specification level helps the designer control the design process. To transform a specification, petrify performs a token-flow analysis of the initial Petri net and produces a transition system.
In the initial transition system, all transitions with the same label are considered as one event. The transition system is then transformed and transitions are relabeled to fulfill the conditions required to obtain a safe irredundant Petri net. For synthesis of an asynchronous circuit, petrify performs state assignment by solving the Complete State Coding problem. State assignment is coupled with logic minimization and speed-independent technology mapping to a target library. The final netlist is guaranteed to be speed-independent, i.e., hazard-free under any distribution of gate delays and any multiple input changes satisfying the initial specification. The tool has been used for the synthesis of Petri nets and Petri net compositions and for the synthesis and re-synthesis of asynchronous controllers.

2.5 Template-Based Design

A different approach to asynchronous design is to view the system as communicating blocks or processes, called templates, that encapsulate all the design constraints inside the modules. These templates have requirements on their environment that must be met, which restrict how the templates are used. However, such restrictions or internal timing constraints are much simpler than those of most other methodologies, and the proper template will usually be obvious from the functionality required. Template-based design is somewhat similar to standard-cell design in synchronous logic. Templates can either be pre-designed to implement simple logic functions with handshaking, or be synthesized to create more complex ones. The advantage of template-based design is the ease of manual design. In general a datapath is created, and the control unit is designed around the datapath.
Once a general architecture is created the rest o f the task is to implement the blocks o f the architecture 21 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. using templates. Also template-based design has the potential advantage, which is currently being investigated, of being able to be used as a backend to a synchronous CAD tool. The highly optimized synchronous design can be converted to an asynchronous one by replacing every gate with its asynchronous handshaking counterpart template. However additional optimization might be required to improve the performance o f the system. 2.5.1 Template-Based Compilation Systems Although template-based system can ease manual design, their main power is seen when they are coupled with a high-level language and automatic translation software. The following section presents some well-known methodologies, which have their own language for easy compilation of asynchronous systems. 2.5.1.1 Caltech’s Design Methodology Caltech’ s communicating processes compilation technique [36], translates programs written in a language similar Communicating Sequential Processes into asynchronous circuits, which communicate on channels. The source language describes circuits by specifying the required sequences of communications in the circuit. Caltech’ s translation process is accomplished in several steps: (1) in process decomposition, a process is refined into an equivalent collection of interacting simpler processes; (2) in handshaking expansion, each “communication channel” between processes is replaced by a pair o f wires, and each atomic “communication action” is replaced by a handshaking protocol on the wires; (3) in production-rule expansion, each handshaking expansion is replaced by a set of “production rules (PRs)”, where each 22 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
rule has a "guard" that ensures it is activated (i.e., "fires") under the same semantics as specified by the earlier handshaking expansion; and finally, (4) in operator reduction, PRs are grouped into clusters, and each cluster is then mapped to a basic hardware component. It is important to realize that many of these steps require subtle choices that may have significant impact on circuit area and delay. Although heuristics are provided for many of the choices, much of the effort is directed towards aiding a skilled designer rather than creating autonomous tools. This has the benefit that the designer can usually make better decisions, provided that the designer is skilled enough. Caltech has later moved to using more standardized, pre-designed, less complex building blocks, which simplify the design method explained above. Caltech's template-based design methodology has moved from the synthesis of complex templates to chip implementation using smaller and simpler templates which have very standard design guidelines. These templates are in general targeted at implementing fine-grain pipelined chips.

2.5.1.2 Tangram and Balsa

Another compiler-based approach, developed by van Berkel, Rem, and others [2] at Philips Research Laboratories and Eindhoven University of Technology, uses the Tangram language. Tangram, which is based on CSP, is a specification language for concurrent systems. A system is specified by a Tangram program, which is then compiled by syntax-directed translation into an intermediate representation called a handshake circuit. A handshake circuit consists of a network of handshake processes, or components, which communicate asynchronously using handshaking protocols. The circuit is then
Although Tangram is also syntax derived like Caltech’ s design methodology, it also targets non-pipelined designs, which can support non-linear sequential processing as well as pipeline processing. The Tangram compiler has been successfully used at Philips for several experimental DSP designs and electronics; including counter, decoders, image generators, and an error corrector for a digital compact cassette player. Balsa [2], developed at University^ of Manchester, adopts syntax-directed compilation into handshaking components and closely follows Tangram. A circuit described in Balsa is compiled into a communicating network composed from a small (— 35) set o f handshake components. Balsa can be thought as o f an public extension to Tangram. In particular the support for separate compilation and the use o f a flexible communication enclosed input choice mechanism are claimed as useful additions to the expressiveness o f Tangram. New handshake components (which are the constituent parts of handshake circuits) are proposed which are used to implement this choice mechanism as well as more generalized forms of the existing Tangram system components. 2.5.2 M icropipelines Micropipehnes, introduced by Ivan Sutherland, use standard synchronous datapath logic to build asynchronous pipelines [62]. A micropipeline has altering computation stages separated by storage elements and control circuitry. This approach uses transition 24 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. signaling for control along with bundled data, Sutherland describes several designs for the storage elements, called “event-controlled registers”, which respond symmetrically to rising and falling transitions on inputs. Computation on data in a micropipeHne is accomplished by adding logic computation blocks between register stages. 
Since these blocks will slow down the data mo\dng through them, the accompanying transition is delayed as well by the explicit delay elements, which must have at least as much delay in them as the worst-case logic block delay. The major benefit of the micropipeline design style is that the registers or latches at the boundaries of pipeline stages filter out logic hazards within the combinational logic. Thus, standard synchronous combinational logic design st)des and supporting CAD tools can be used. Although micropipelines is a powerful design style, which elegantly implements elastic pipelines, there are some problems with them as well. It delivers worst-case performance by adding delay elements to the control path to match worst-case computation times. Also there are delay assumptions that must be carefully verified. Finally, there is little guidance currently on how to use micropipelines for more complex (add speculative completion pros and cons) systems. 25 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 2.5.3 Ad H oc D esign Our final design methodology is ad hoc design. Although it may not seem like a design methodology, the ad hoc design approach implemented buy a skilful designer can lead to very competitive results. A design can be completely implemented in an ad hoc fashion, or can be initially developed using one of the methods above and then be optimized in an ad hoc sense. An asynchronous design can be implemented the same way a synchronous design would, using synchronous components for the datapath. A matched delay can be used to indicate the completion of the computation. The control circuit can be implemented by modif\dng a synchronous FSM to work with input transitions rather than a global clock. Another approach is the use self-resetting logic. 
Although self-resetting logic has a number of difficult-to-satisfy timing assumptions, careful ad hoc design can achieve high throughput with self-resetting asynchronous circuits. The synchronous parts of the circuit can be replaced with self-resetting logic. Important aspects of self-resetting design, such as data insertion and pulse generation, would require an ad hoc approach. Alternatively, an asynchronous circuit can be implemented using any of the approaches presented above and later optimized for speed, area, or power using verifiable ad hoc optimizations.

2.6 Linear and Non-Linear Asynchronous Pipelines

This section presents the basics of linear and non-linear fine-grain asynchronous pipelines where each pipeline stage is derived through one of several basic templates.

2.6.1 Linear Pipelines

A pipeline is a linear sequence of functional stages where the output of one stage is connected to the input of the next stage. Data signals, which flow from the inputs to the outputs of the pipeline, are also called data tokens. A linear pipeline has no forking or joining stages. The tokens in the pipeline remain in first-in first-out (FIFO) order. In synchronous design the sequential functional stages are registers. These registers hold the data tokens and are controlled by a global clock signal. Depending on the implementation, on the rising or falling edge of the clock, all the registers sample the new data values waiting at their inputs. Since all the registers "see" the clock signal at the same time, the movement of one data token to the next register is synchronized with all other data tokens, and they all move at the same time. However, there is no central global clock in asynchronous design; therefore a data token in one stage only moves to the next stage if the next stage is empty.
The handshaking protocol between the two stages (the sender and the receiver) determines how the two stages inform each other when there is an empty space, when the data has been sent, whether the data has been received by the next stage (the receiver), and when the previous data-holding stage (the sender) can reset its data. The handshaking protocol is accomplished through a communication channel between the sender and the receiver. Although in this section we explain a communication channel in the context of pipelines, a communication channel can exist between any two asynchronous units. An asynchronous communication channel, shown in Figure 2.2, is a bundle of wires and a protocol to communicate data between a sender and a receiver. For single-rail encoding, one wire per bit is used to transmit the data and an associated request line is sent to identify when data is valid. The associated channel is called a bundled-data channel. Alternatively, for dual-rail encoding the data is sent using two wires for each bit of information. Extensions to 1-of-N encoding also exist. Both single-rail and dual-rail encoding schemes are commonly used, and there are tradeoffs between each. Dual-rail and 1-of-N encoding allow data validity to be indicated by the data itself and are often used in QDI designs. Single-rail, in contrast, requires the associated request line, driven by a matched delay line, to always be slower than the computation, as we described in section 2.1.

Figure 2.2: Pipeline channels: (a) abstract channel, (b) bundled-data channel, (c) dual-rail channel
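The dual-rail convention can be made concrete with a small sketch (the function names are ours, for illustration only): each bit travels on a false rail and a true rail, so the data itself signals validity, whereas a bundled-data channel would need the separate request line.

```python
def dual_rail_encode(bits):
    # Each bit becomes a (false_rail, true_rail) pair: 0 -> (1, 0), 1 -> (0, 1).
    return [(0, 1) if b else (1, 0) for b in bits]

def is_valid(word):
    # Valid codeword: exactly one rail high for every bit.
    return all(f + t == 1 for f, t in word)

def is_neutral(word):
    # Neutral spacer: all rails low (the return-to-zero state).
    return all(f == 0 and t == 0 for f, t in word)

w = dual_rail_encode([1, 0, 1])
print(w)              # [(0, 1), (1, 0), (0, 1)]
print(is_valid(w))    # True
print(is_neutral(w))  # False
```

The cost of this self-indicating property is two wires per bit instead of one, which is precisely the single-rail vs. dual-rail tradeoff mentioned above.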
Figure 2.3: Synchronous vs. asynchronous pipelines: (a) a synchronous pipeline, (b) an asynchronous pipeline

Figure 2.3 illustrates the difference between typical synchronous and asynchronous linear pipelines. Abstractly, the operation of a general asynchronous pipeline with four-phase handshaking can be described as follows. Initially the pipeline is empty, and all the data lines as well as the handshaking signals req (the request signal) and ack (the acknowledgment signal) are de-asserted. If the data lines are single-rail, the request signal req can be used to inform the next stage of the arrival of data. On the other hand, if the data lines are implemented with dual-rail encoding, conventionally there is no need for the req signal. When the first stage evaluates and generates an output, the req signal is also asserted. When the second stage evaluates, it asserts its req signal as well as the ack signal to acknowledge to the first stage that it has consumed the data. The first stage responds to this acknowledgment signal by resetting its outputs. The first stage can only generate new data when the acknowledgment signal is de-asserted, indicating that the second stage is ready to consume the second data token. When the third stage evaluates, it will generate an ack signal to the second stage, which will cause it to reset its outputs as well as lower its ack and req signals. Since the second stage has lowered its ack signal, it can now consume a second data token.

2.6.2 Fine-Grain Pipelining

The design methodology in this thesis targets fine-grain pipelining and small cells, where the forward latency is two gate delays. Fine-grain pipelining is achieved by dividing the processing blocks into smaller cells, where each cell has its own input and output completion detector.
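The four-phase handshake described in Section 2.6.1 can be traced event by event. The listing below is an illustrative model of a single sender/receiver channel, not a circuit from the dissertation:

```python
def four_phase_transfer(data):
    """Event trace of one four-phase (return-to-zero) transfer,
    following the sequence in the text: req rises with valid data,
    ack rises when it is consumed, then both return to zero before
    the next token may be sent."""
    return [
        ("sender",   "data",  data),  # drive valid data
        ("sender",   "req",   1),     # announce data to the next stage
        ("receiver", "ack",   1),     # data consumed
        ("sender",   "req",   0),     # reset phase: data/request withdrawn
        ("receiver", "ack",   0),     # channel idle, ready for next token
    ]

for event in four_phase_transfer(0b1010):
    print(event)
```

Each transfer therefore costs a full return-to-zero round trip, which is why the reset (spacer) phase figures in the cycle-time analyses later in this chapter.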
For example, a 32-bit multiplier can be implemented by using a 32-bit input completion detector at the inputs and a 32-bit output completion detector at the outputs. When the multiplier completes its processing and generates a 32-bit output, the output completion detector detects it and, combined with the input completion detector, generates an acknowledgment. However, the multiplier can accept a new input only when the whole multiplier has finished processing. Therefore the throughput is limited by how fast the multiplier can multiply two numbers, generate an acknowledgment, and then reset. As in the synchronous case, the throughput of the multiplier can be increased by further pipelining the multiplier. In asynchronous design, this can be done by constructing the multiplier out of small cells, such as adders and other logic gates, which have their own input and output completion detectors. Not only can the multiplier now accept a new input as soon as the first row of logic in the multiplier has evaluated and reset, but the 32-bit completion detectors are also simplified into 1-bit input and output completion detectors. For a two-dimensional structure such as a multiplier this is called 2D Fine-Grain Pipelining. Also, since fine-grain pipelining uses pre-designed templates, it has the added benefits of cell reuse and faster design time.

2.6.3 Performance Analysis of Linear Pipelines

Determining the performance of an asynchronous pipeline can be more complex than determining the performance of a synchronous pipeline. In an asynchronous pipeline, control signals govern token flow with local handshaking. Each four-phase token is composed of a data element and a reset spacer. At any instant, the pipeline stages not occupied by data elements or reset spacers can be described as containing a hole or bubble.
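The throughput argument above can be put in rough numbers. All delays here are invented for illustration; the dissertation gives no such figures:

```python
# Coarse-grain: the whole 32-bit multiplier must evaluate, generate an
# acknowledgment, and reset before it can accept the next input.
t_eval, t_ack, t_reset = 32, 4, 32        # hypothetical unit delays
coarse_cycle = t_eval + t_ack + t_reset   # one input every 68 units

# Fine-grain (2D pipelined): a new input is accepted as soon as the
# first row of cells has cycled, so throughput is set by one small
# cell's cycle time, not the whole array's.
cell_cycle = 18                           # hypothetical unit delays

print(coarse_cycle, cell_cycle)  # 68 18
```

Under these made-up numbers the fine-grain version accepts inputs almost four times as often, while also replacing the wide 32-bit completion detectors with 1-bit ones, as described above.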
Control logic only allows an element to flow forward when the stage it will occupy is empty. When an element does flow forward, it leaves behind an empty slot. Thus, bubbles flow backward as they displace forward-flowing data elements and reset spacers. The performance can be limited by the supply of tokens, the supply of bubbles, or the local control handshaking between two pipeline stages. In a pipeline, the left or input environment supplies data tokens and the right or output environment supplies bubbles. In an asynchronous pipeline, the time it takes for a data token to flow from the inputs to the outputs of one pipeline stage is defined as the forward latency. The reverse or backward latency specifies the delay from the acknowledgment of a stage's output to the acknowledgment of the predecessor's output. The time difference between two tokens passing through the same pipeline stage is called the cycle time. The cycle time is the total of the forward and backward latency. In an asynchronous pipeline, the per-stage forward or backward latency depends on the implementation of the circuit and the handshaking protocol. Pipeline stages that can hold one data token using only one stage are called full buffers (also known as high capacity, or slack 1). Pipeline stages that need two stages to hold one data token are called half buffers. Assuming that the right environment is not operating, or has stalled handshaking with the last stage of an asynchronous pipeline, and the left environment keeps inserting as many data tokens as it can, the maximum number of tokens that the pipeline can hold is defined as the static slack of the pipeline.
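These definitions translate directly into arithmetic. A minimal sketch (the function names are ours, not the dissertation's):

```python
def cycle_time(forward_latency, backward_latency):
    # The cycle time is the total of the forward and backward latency.
    return forward_latency + backward_latency

def static_slack(n_stages, full_buffer=True):
    # A full buffer holds one token per stage; a half buffer needs
    # two stages to hold one token.
    return n_stages if full_buffer else n_stages // 2

print(cycle_time(2, 10))                   # 12
print(static_slack(8, full_buffer=True))   # 8 tokens in 8 stages
print(static_slack(8, full_buffer=False))  # 4 tokens in 8 stages
```

The full-buffer vs. half-buffer distinction matters exactly when the pipeline is stalled: with the right environment blocked, a half-buffer pipeline can absorb only half as many tokens as a full-buffer pipeline of the same length.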
Assuming that the left environment is asserting and the right environment is consuming data tokens as fast as the pipeline can operate, the number of tokens needed for the pipeline to operate at the highest throughput is called the dynamic slack of the pipeline. For a pipeline where the forward latency is less than the backward latency, the cycle time is dominated by the backward latency. In the opposite case the cycle time is dominated by the forward latency. Figure 2.4 illustrates the throughput vs. number of tokens for a linear asynchronous pipeline. The left side of the triangle shows the characteristic of an asynchronous pipeline operating in a data-limited region. In this region, as the data tokens are inserted more frequently, the pipeline operates at a higher throughput. The speed of the pipeline is limited by how fast data can be inserted into the pipeline. The right side of the triangle shows the characteristic of an asynchronous pipeline operating in a bubble-limited region. In this region the right environment cannot consume the data provided by the asynchronous pipeline, and therefore the data tokens start to accumulate in the pipeline. Another way to view this region is to say that the handshaking between pipeline stages limits the throughput at which tokens can be processed, and therefore the overall pipeline performance starts to degrade. The figure has two throughput vs. tokens triangles; the left one is for a forward-latency-limited pipeline and the right one is for a backward-latency-limited pipeline.

Figure 2.4: Throughput vs. tokens graphs (each triangle peaks at the dynamic slack, with a data-limited region on the left and a bubble-limited region ending at the static slack on the right)

In order to determine the latencies and cycle time of a pipeline built out of a particular configuration of components in each stage, it is necessary to analyze the dependencies of the required sequences of transitions. These dependencies can be drawn in a marked graph [15], in which the nodes of the graph correspond to specific rising and falling transitions of circuit components, and the edges depict the dependencies of each transition on the output of other components. Unfolded dependency graphs are functionally equivalent to Signal Transition Graphs. STGs can be used to determine both the forward latency and the cycle time. The local cycle time is determined by cyclic paths in the STG. These cycles occur because a pipeline processes successive data tokens and the components in each stage go through a series of transitions. The transitions eventually return a stage to the same state, where the state is defined by the output values of each component. Each transition in an STG can fire only when all of its predecessors have executed their specified transitions, and cannot fire again until all of its predecessors have fired again.

2.6.4 Non-Linear Pipelines

Recently many new asynchronous pipelines have been introduced. However, most of them have been targeted at linear pipeline applications such as FIFOs. Real designs, however, require more complicated non-linear pipeline structures. In particular, linear pipeline stages have only a single input and a single output channel, whereas non-linear pipeline stages can have multiple input and output channels. This section presents an overview of the challenges involved in designing non-linear pipelines. In particular we address issues with (i) synchronization with multiple destinations (for forks), and (ii) synchronization with multiple sources (for joins).
To introduce these issues we focus on forks and joins. A join is a pipeline stage with multiple input channels whose data is merged into a single output channel. A fork is a pipeline stage with one input channel and multiple output channels. Complex forks and joins can involve conditionally reading from or writing to channels based on the value of a control channel that is unconditionally read, as in a merge or split channel. Abstract illustrations of these channels are shown in Figure 2.5.

Figure 2.5: a) a fork and b) a join

Since a fork has multiple output channels, it must receive an acknowledgment signal from all of them before it precharges. A join, on the other hand, receives inputs from multiple channels and must broadcast its acknowledgment signal to all its input stages. A join acts as a synchronization point for data tokens. The acknowledgment from the join should only be generated when all the input data has arrived. Otherwise a stage feeding a join, referred to as A, that is particularly slow in generating its data token may receive an acknowledgment signal when it should not, violating the four-phase protocol. If the acknowledgment signal is de-asserted before the slow stage A generates its token, the token is not consumed by the join, as it should be. In fact, this token may cause the join to generate an extra token at its output, thereby corrupting the intended synchronization. A conditional split is a combined fork and join where a control channel is used to determine which output is generated. The control may indicate to send the input data to any of the output channels, any combination of the output channels, or none of them. The third option is also known as a skip.
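The synchronization a join performs — acknowledge only when all inputs have arrived, release only when all have reset — is exactly the behavior of a Muller C-element. A behavioral sketch (ours, not a circuit from the dissertation):

```python
def c_element(inputs, prev_out):
    """Behavioral Muller C-element: the output rises only when all
    inputs are high, falls only when all are low, and otherwise holds
    its previous value. This all-arrived / all-reset behavior is the
    synchronization a join's acknowledgment requires."""
    if all(v == 1 for v in inputs):
        return 1
    if all(v == 0 for v in inputs):
        return 0
    return prev_out

print(c_element([1, 1], 0))  # 1: every input arrived -> acknowledge
print(c_element([1, 0], 1))  # 1: mixed inputs -> hold previous output
print(c_element([0, 0], 1))  # 0: every input reset -> release
```

The hold state in the mixed case is what prevents the premature acknowledgment described above: a slow input stage simply keeps the C-element from switching until its token arrives.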
A conditional join is a join where the control signal, select, comes from another pipeline stage. The select signal controls which incoming channel should be read.

Figure 2.6: Fundamental non-linear pipeline structures: a) fork, b) join, c) conditional splits, d) conditional joins

Chapter 3

3. New High Speed QDI Asynchronous Pipelines

In this chapter we introduce two new QDI templates that provide significant performance improvements over those proposed by Caltech without sacrificing quasi delay insensitivity. The key idea is to reduce the complexity of the internal circuitry by intelligently reducing concurrency and using an additional wire for communication between pipeline stages. We present two templates: a half-buffer template, which requires two pipeline stages to hold one data token, and a full-buffer template, which can itself hold one data token. We first give background on Caltech's commonly used QDI templates: the Weak-Conditioned Half Buffer (WCHB), the Precharged Half Buffer (PCHB), and the Precharged Full Buffer (PCFB) templates [33].

3.1 Caltech's QDI Templates

3.1.1 WCHB

Figure 3.1 shows a WCHB template for a linear pipeline with a left (L) and right (R) channel and an optimized WCHB dual-rail buffer. L0 and L1, R0 and R1 identify the false and true dual-rail inputs and outputs, respectively. Lack and Rack are active-low acknowledgment signals. Note that we do not show the staticizers that are required to hold state at the output of all C-elements. The operation of the buffer is as follows. After the buffer has been reset, all data lines are low and the acknowledgment lines, Lack and Rack, are high. When data arrives by one of the input rails going high, the corresponding C-element output will go low, lowering the left-side acknowledgment Lack. After the data is propagated to the outputs through one of the inverters, the right environment will assert Rack low, acknowledging that the data has been received. Once the input data resets, the template raises Lack and resets the output. Since the L and R channels cannot simultaneously hold two distinct data tokens, this circuit is said to be a half buffer, or to have slack 1/2 [33]. This WCHB buffer has a cycle time of 10 transitions, which is significantly faster than buffers based on other QDI pipeline templates. Another feature of the WCHB template is that the validity and neutrality of the output data R implies the validity and neutrality of the corresponding input data L. This is called weak-conditioned logic [57], and we will discuss its advantages and disadvantages after we discuss non-linear pipeline templates.

Figure 3.1: WCHB

3.1.2 PCHB and PCFB

Figure 3.2 shows the template for a precharged half buffer (PCHB). Unlike the WCHB, validity and neutrality are checked using an input completion detector. The input completion detector is denoted as LCD and the output completion detector as RCD.

Figure 3.2: a) PCHB and b) PCFB templates

Figure 3.3: a) PCHB and b) PCFB STGs
The function block need not be weak-conditioned logic and thus can evaluate before all the inputs have arrived (if the logic allows). However, the template only generates an acknowledgment signal Lack after all the inputs have arrived and the output has evaluated. In particular, the LCD and the RCD are combined using a C-element to generate the acknowledgment signal. A few minor aspects of this template should also be pointed out. First, because the C-element is inverting, the acknowledgment signal is an active-low signal. Second, the Lack signal is often buffered using two inverters before being sent out. Another two inverters are also often added to buffer the internal signal en that controls the function block. For simplicity, these buffering inverters will not be shown in the figures in this thesis. The protocol for a PCHB pipeline stage is captured by the STG for a three-stage pipeline illustrated in Figure 3.3. From the STG, it is possible to derive the pipeline's analytical cycle time. Due to the extra buffering and bubble shuffling, the cycle time generally amounts to 14 gate delays, or transitions. The PCFB template and its STG are shown in Figure 3.2(b) and Figure 3.3(b). The PCFB is more concurrent than the PCHB because its L and R handshakes reset in parallel, at the cost of requiring an additional state variable. The PCFB's analytical cycle time generally amounts to 12 transitions. Here the function block takes two transitions, one of the C-elements takes one transition, and the other takes two transitions.

3.1.3 Why Input Completion Sensing?

Recall from Section 2.6.4 that a join is a pipeline stage with multiple input channels whose data is merged into a single output channel, and a fork is a pipeline stage with one input channel and multiple output channels.
As discussed in Section 2.6.4, a fork must receive an acknowledgment signal from all of its output channels before it precharges, while a join must broadcast its acknowledgment signal to all its input stages and acts as a synchronization point for data tokens. The acknowledgment from the join should only be generated when all the input data has arrived; otherwise a particularly slow input stage may receive an acknowledgment signal when it should not, violating the four-phase protocol, and the unconsumed token may cause the join to generate an extra token at its output, corrupting the intended synchronization. Validity of the data should therefore be checked on all input channels before the acknowledgment signal is asserted, to prevent the incorrect insertion of a token caused by a slow or late input channel. Neutrality should be checked to guarantee that the previous stages have been precharged, so that the acknowledgment signal is not de-asserted too early, thereby violating the four-phase protocol on any stage slow to precharge. The templates presented in this section check validity and neutrality in different ways. Because the function block in the WCHB template is weak-conditioned, the output completion detector implicitly checks the validity and neutrality of the input data token. In the WCHB buffer the weak-conditioned function block is a simple C-element.
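The input completion detector used by the PCHB can be modeled behaviorally: each dual-rail bit's two rails feed an OR gate, and the OR outputs are combined (in hardware, by a C-element tree). The sketch below is our behavioral model under the usual dual-rail convention, not a transistor-level description from the dissertation:

```python
def input_completion_detector(word):
    """Each dual-rail bit's two rails feed an OR gate; the word is
    'valid' when every OR output is high (every bit has arrived) and
    'neutral' when every OR output is low (every bit has reset)."""
    ors = [f | t for f, t in word]
    if all(ors):
        return "valid"
    if not any(ors):
        return "neutral"
    return "transitioning"   # some bits have arrived, some have not

print(input_completion_detector([(0, 1), (1, 0)]))  # valid
print(input_completion_detector([(0, 0), (0, 0)]))  # neutral
print(input_completion_detector([(0, 1), (0, 0)]))  # transitioning
```

This is the explicit check that weak-conditioned logic performs implicitly: the WCHB's output already implies its input's validity and neutrality, while the PCHB pays for a separate detector in exchange for a simpler function block.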
However, for more complex non-linear pipelines, weak-conditioned function blocks unfortunately require complex nmos and pmos networks. This results in slower forward latency and bigger transistor sizes. As an example, a weak-conditioned dual-rail OR gate is shown in Figure 3.4.

Figure 3.4: An OR gate implementation using weak-conditioned logic

3.2 New QDI Templates

One optimization that can be applied to the PCHB and PCFB templates is to merge the LCD of one stage with the RCD of the other by adding an additional request line to the channel. This is shown in Figure 3.5 for a PCHB template.

Figure 3.5: Optimized PCHB for a 1-of-N+1 channel

The request line indicates the assertion/de-assertion of the input data, as in the bundled-data channel. However, in contrast to a bundled-data channel, the data is sent using 1-of-N encoding, yielding what we call a 1-of-N+1 channel. The request line, at least from the channel point of view, may appear redundant. However, the request line enables the removal of the input completion detector, thereby saving area and reducing capacitance on the data lines. Moreover, the request line does not significantly impact performance, the template is still QDI, and the communication between stages remains delay-insensitive. In this section we propose two new 1-of-N+1 QDI templates that intelligently reduce concurrency to reduce the stack size of the function blocks and thereby improve performance.

3.2.1 RSPCHB

The key goal of the RSPCHB [49], compared to the PCHB, is to eliminate the need for the enable signal en in the control of the function block.
We now explain that this enable signal exists only to support concurrency in the system that effectively does not improve performance. More specifically, in the PCHB template the outputs of the LCD and RCD are combined using a C-element to generate the acknowledgment signal Lack. This supports the integration of the handshaking protocol with the checking of validity and neutrality of both input and output data, which removes the need for the function block to be weak-conditioned, but also requires the use of the en signal. It is this replacement, however, that introduces more concurrency than is necessary. In particular, in the case of a join, the non-weak-conditioned function block may generate an output as soon as one of the input channels provides data. In response, the RCD of the join will assert its output. Meanwhile, any subsequent stage can receive this data, evaluate, assert both its LCD and RCD outputs, and assert its acknowledgment signal. Although the join can receive this acknowledgment, it will not precharge until after en is asserted. The en signal delays the precharge of the circuit until after the acknowledgment to the input stages has been asserted. This delay is critical to prevent the precharge from triggering the RCD to de-assert, which would prevent the C-element from ever generating the acknowledgment. If only the generation of the acknowledgment signal from any stage subsequent to the join were delayed until all input data to the join has arrived and been acknowledged, then the en signal could be safely removed. In fact, such a delay of the acknowledgment would not generally impact performance because the join is the performance bottleneck for the subsequent stages. Therefore, this added concurrency is essentially unnecessary.
We propose a different pipeline template, which reduces this unnecessary concurrency to eliminate the internal en signal, thereby reducing the transistor stack sizes in the function block. We refer to this new QDI pipeline template, illustrated in Figure 3.6(a), as a Reduced Stack Precharged Half Buffer (RSPCHB). A specific form of this template for dual-rail data is shown in Figure 3.6(b). Notice that we optimized the RCD block by tapping its inputs before the output inverter and using a NAND gate instead of an OR gate. The unique feature of the RSPCHB is that it derives the request line from the output of the C-element instead of the RCD. (In particular, since the output of the C-element is active-low and the request line is active-high, the output of the C-element is sent through an inverter before driving Rreq.) The impact of this change is that the assertion/de-assertion of Rreq is delayed until after all Lreqs are asserted/de-asserted.

Figure 3.6: a) Abstract and b) detailed QDI RSPCHB pipeline template

Figure 3.7: The STG of the RSPCHB

As a consequence, the acknowledgment from a subsequent stage of the join may be delayed until well after its data inputs and outputs are valid. More specifically, the stage will delay the assertion of its acknowledgment signal until all Lreqs are asserted, which can occur arbitrarily later than the associated data lines becoming valid. This extra delay, however, has no impact on steady-state system performance because the join stage is the bottleneck, waiting for all its inputs to arrive before generating its acknowledgment. In fact, this change yields a template with no less concurrency than the WCHB. The advantage of this generation of the request line is that the function block does not need to be guarded by the enable signal.
In particular, it is now sufficient to guard the function block solely by the Pc signal, because the Pc signal now properly identifies when inputs and outputs are valid. Namely, the function block is allowed to evaluate when Pc is de-asserted, which occurs only after all input and output data lines are reset. Similarly, it is allowed to precharge when Pc is asserted, which occurs only after all input and output data lines are valid. The RSPCHB is still QDI; however, the communications along the input channels to joins become QDI instead of delay-insensitive (other channels remain delay-insensitive). In particular, the assumption that must be satisfied is that the data should reset before the join stage enters a subsequent evaluation cycle. If we assume that the fork between the function block, the RCD, and the next stage is isochronic [9], this assumption is satisfied. In particular, the data line at the receiver side is then guaranteed to reset before the request line Rreq resets, because only after the data lines reset can the RCD trigger the C-element, subsequently triggering Rreq. The analytical expression for the timing margin associated with this isochronic fork assumption can be derived from the abstract STG of the RSPCHB shown in Figure 3.7: the delay difference between the resetting of the data and the associated request line must stay within a margin of between 6 and 8 gate delays, depending on buffering, which is easily satisfied with modern routers. Notice that this timing assumption only applies to input channels of join stages, because non-join stages must receive both valid data and a valid Lreq before generating valid output data or a valid Rreq. The analytical cycle time of the RSPCHB can be derived from the STG shown in Figure 3.7. With bubble shuffling, RSPCHB and PCHB have equal numbers of transitions per cycle. The advantage of the RSPCHB is the lack of an LCD and the reduced stack size of the function block, which reduce capacitive load and yield significantly faster overall performance. The cost of this increase in performance is that it requires one extra communicating wire between stages. A fork can be implemented easily either by using a C-element to combine the acknowledgment signals from the forking stages or by increasing the stack size of the function block. Similarly, a join can be implemented by combining the request lines in the C-element and forking back the acknowledgment signal.

Figure 3.8: Conditional a) join and b) split using RSPCHB

Consider the slightly more complicated template for a conditional join, in which a control channel S is used to select which input channel to read, writing the read data token to the single output channel, as illustrated in Figure 3.8(a). The template has one C-element per input channel, each responsible for generating the associated acknowledgment signal. Each C-element is triggered not only by the RCD output, but also by the corresponding control channel bit. The C-element outputs are simply ORed to generate Rreq, because the C-elements are mutually exclusive. This template can be easily extended to handle more complex conditionals in which multiple inputs can be read for some values of the control. The template for the conditional fork is shown in Figure 3.8(b). Here, the function block, the RCD, and the C-element are repeated for each output channel. The select data lines ensure that only one function block evaluates.
All C-elements are combined using an AND gate to generate the acknowledgment for the select channel. (This is because both the C-element outputs and the acknowledgment signal are active-low.) This template can easily be extended to handle the generation of multiple outputs in response to some values of the control. A common example of a conditional fork is a skip, in which, depending on the control value, the input is consumed but no output is generated. The implementation has a skip output acting as an internal N+1 output rail that is not externally routed and is triggered upon the skip control value. A skip in which all control values generate no output is called a bit bucket [53]. Figure 3.9 shows a one-bit memory implemented using an RSPCHB template. A and C represent the input and output channels. B is the internal storage. S is an input control channel that selects the write or read operation. When S0 is high, the memory stores the value at the input channel A to the internal storage B. When S1 is high, on the other hand, the memory is read; that is, the stored memory value is written to the output channel C. For a write, both the input data and control channels are acknowledged, while for a read, only the control channel is acknowledged. The write and read operations are as follows. After reset, the memory, stored in the dual-rail Memory Unit, MU (similar to [33]), is initialized to some value and one of the rails of the internal signal B is high. When an input A is applied and S0 is high, one of the rails of B is asserted high, thereby storing the data. The Memory Completion Detector, MCD, detects that the value in the memory is updated and asserts its output. The output of the MCD, as well as the request lines from the data and control channels, drives a C-element, which generates the acknowledgment signal Lack_A.
When S1 is high, on the other hand, the internal data stored in B is sent to the output channel C. When an acknowledgment is received from the output channel C, the outputs are reset but the stored data remains unchanged. The control channel S is acknowledged for both write and read operations using an AND gate driven by the two C-element outputs. Notice that the memory is actually implemented by merging two RSPCHB units: the first is used to store data (write), and the second to send it to the outputs (read).

The MCD detects the completion of the write operation and resets when all inputs are lowered. The MCD can be simplified by replacing the pmos transistors driven by A0 and A1 with a single pmos transistor driven by Lack. However, this requires that the delay difference between the data lines of channel A and its associated request line not be long enough to cause short-circuit current. This restriction can be removed by also controlling the nmos stack, adding one more nmos transistor driven by the Lack signal. The overall benefit, however, is not clear.

Figure 3.9: A RSPCHB 1-bit memory

3.2.2 Loops Using RSPCHB

One can notice from the simple buffer shown in Figure 3.6 that the delay from the left request, Lreq, to the right request, Rreq, is 2 gate delays, consisting of a simple 2-input C-element and an inverter, which closely matches the delay through the domino logic datapath. For a more complex gate with multiple input and output channels, however, the delay from Lreq to Rreq will increase, and for some cells will be larger than the delay through the domino logic. It is important to realize that for RSPCHB designs with loops, this delay may become the performance bottleneck.
In particular, the fact that the data moves forward faster than its request signal causes a problem. Given a loop with L stages, assume that the assertion of the data and the assertion of the request signal happen to be synchronized at stage 0, and let the data move forward faster than the request signal at the rate t_r/t_d, where t_d is the forward latency of the data and t_r is the forward latency of the request signal (t_r >= t_d). Pipeline stages in the loop that have evaluated will be able to re-enter the evaluation phase only when the request signal is asserted, allowing them to precharge, and then subsequently de-asserted. Therefore, as the data attempts to lap the request signal around the loop (which may take many loop iterations), the data will stall waiting for the request signal of the subsequent stage to de-assert. From this point on, the forward latency of the data will slow down to match the forward latency of the request signal.

Figure 3.10 demonstrates this slowdown for a large 33-stage loop that highlights the issue. The loop consists of 33 pipeline stages, where t_d = 2, t_r = 4, and t_c = 18 (the cycle time of each pipeline stage); simulation results of this loop show that the long-term average cycle time is 33*4 unit delays.

Figure 3.10: Performance slowdown in a 33-stage loop using RSPCHB

Our solution to prevent this negative effect is to insert request breakers into the loop. Request breakers are pipeline stages that generate an Rreq signal as soon as they evaluate rather than waiting for an Lreq signal from the previous stage.
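The numbers in the 33-stage example can be checked with a few lines of arithmetic (an idealized model using only the per-stage latencies quoted above; the stall condition mirrors the one derived in this section):

```python
# Idealized numeric check of the 33-stage loop example:
# data forward latency t_d = 2, request forward latency t_r = 4,
# stage cycle time t_c = 18 (all in unit delays).
L, t_d, t_r, t_c = 33, 2, 4, 18

# Once the data laps the request wave, forward progress is limited by the
# request signal, so the long-term average cycle time degrades to L * t_r.
assert L * t_r == 132   # the 33*4 unit-delay cycle time seen in simulation

# First stage N at which the data catches the request: by then the data has
# crossed L + N - 2 stages (one lap minus two stages) while the request has
# crossed N stages and that stage has completed its cycle.
N = next(n for n in range(1, L + 1)
         if t_c + n * t_r >= (L + n - 2) * t_d)
print(N)  # -> 22: a request breaker every 22 stages prevents the stall
```

At N = 22 the two sides are exactly equal (106 unit delays), which is why a breaker every 22 stages is just sufficient for these parameters.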
A request breaker, shown in Figure 3.11, is a modified PCHB that generates a request signal from the output of its output completion sensing but also handles the left-side handshake.

Figure 3.11: Request breaker for RSPCHB loops

The period at which request breakers must be inserted into the loop depends on the number of stages in the loop (L), the forward latency of the request signal (t_r), the forward latency of the data (t_d), and the cycle time (t_c) of the stages used. We will compute the number of stages N needed for the data to lap the request; that is, the request signal has arrived at stage N and the data is about to be stalled at stage L+N-2 (with a loop index of (L+N-2)%L = (N-2)%L). More specifically, for the data to stall, the time it takes the request signal to enter stage N, plus the cycle time of the stage to complete its precharge and re-enter its evaluation phase, must be larger than the time it takes the data to go through the loop and enter stage N-2. Thus,

(L + N - 2) · t_d = t_c + N · t_r
N = ((L - 2) · t_d - t_c) / (t_r - t_d)

The above formula gives N = 22 in our 33-stage loop example. This means that a request breaker should be inserted every 22 stages to prevent a slowdown, as shown in Figure 3.12. Figure 3.13 shows the waveforms for a loop with the breakers inserted, confirming that a slowdown is avoided. Note that if N is larger than L, only one request breaker is needed.

Figure 3.12: a) The loop at time t=0.
Data and request are synchronized. b) Data travels around the loop and stalls. c) Request breakers prevent the stall.

Figure 3.13: Performance improvement using request breakers in a 33-stage RSPCHB loop

3.2.3 RSPCFB

Our second new 1-of-N+1 QDI pipeline template is a full buffer constructed by merging our RSPCHB with a modified WCHB. An abstract illustration of this reduced stack pre-charged full buffer (RSPCFB) [49] is shown in Figure 3.14(a), and a more detailed implementation for dual-rail data is shown in Figure 3.14(b).

The RSPCFB has two new features. First, the inverters from both of the half buffers have been removed to keep the forward latency of the new template at two gate delays. We assert that the inverters between the two half buffers can safely be removed because the RSPCHB has little gate load, and wire load can be minimized by placing and routing this template as a single unit. The output inverters are only necessary if this unit is driving a significant load and can be added as needed. (However, a staticizer, not shown, is still necessary.) Second, the WCHB has to be modified to accept an input request signal and generate an output request signal. This input request signal drives a C-element whose other input is the RCD output. This C-element then triggers the internal acknowledgment to the RSPCHB part instead of the RCD alone. In addition, the output request signal is implemented by simply tapping off a signal from the RCD output. One other difference is that the request signal is now active low because the inverters have changed locations (i.e., bubble shuffling [36]).
The circuit operates as follows. The RCD of the RSPCHB part detects the evaluation of the function block and asserts its output. The output of the RCD drives the C-element, which generates the acknowledgment signal Lack to the previous stage once all the request lines associated with the data have also arrived. If the next stage is ready to accept new data, the acknowledgment signal Rack should already be de-asserted, allowing the C-elements in the forward path to pass the data to the next stage. Subsequently, the WCHB's RCD will assert its output, asserting the request signal to the next stage. The output of this RCD also drives a second C-element, which asserts the internal acknowledgment back to the RSPCHB part, allowing the function block to precharge. When the acknowledgment signal Rack is de-asserted, the C-elements in the forward path will de-assert their outputs. This triggers the WCHB's RCD, and in turn the second C-element, to de-assert the internal acknowledgment back to the RSPCHB, thereby enabling the function block to re-evaluate.

Figure 3.14: a) Abstract and b) detailed RSPCFB

Notice that the Rreq of the RSPCFB is taken from the output of the RCD instead of the C-element, unlike in the RSPCHB. This is because the WCHB part has weak-conditioned logic, which will not reset until all inputs, including inputs from the RSPCHB part, have reset. This implicitly avoids the problem of preventing the assertion of the acknowledgment back to the RSPCHB part that delaying Rreq solved. The advantage is that the Rreq can be generated earlier. The disadvantage is that this reduces our timing margin on input channels to joins to 5 to 7 gate delays, depending on buffering.
The RSPCFB has 10 transitions per cycle, fewer than Caltech's PCFB, which has 12 transitions. The analytical cycle time, using the STG in Figure 3.15, can be expressed as:

T_RSPCFB = Max(3·t_C + 2·t_Eval + 2·t_RCD, t_Eval + t_RCD + t_NAND + t_AND)

Figure 3.15: The STG of the RSPCFB

The RSPCFB can be extended to handle non-linear pipeline structures in the same way as the RSPCHB, without any additional timing assumptions.

3.2.4 FSM Design

One of the most important aspects of a complete system design is the implementation of the controller. An FSM is essentially a state-holding circuit, which changes its state only when the expected inputs for that state are available. One way to build an asynchronous FSM is to feed the outputs of the pipeline stage back to its inputs using buffers to hold the data (also proposed in [33]). This technique is similar to the synchronous case. In addition, it requires no new circuits and can easily be applied to template-based design. Figure 3.16 shows an abstract FSM.

Figure 3.16: An abstract asynchronous FSM

Each channel either is an input, an output, or holds state. The next-state and current-state channels can be implemented with 1-of-N+1 channels, which are ideally suited to one-hot state encoding of the FSM. The next-state and output logic blocks are complex QDI pipeline stages, which can have multiple function blocks inside. These multi-input, multi-output conditional blocks are implemented the same way as the conditional read and write blocks shown previously.
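At the token level, this feedback-buffer FSM can be sketched in a few lines of Python (a behavioral model with hypothetical names, not a circuit): the state token circulates through a FIFO standing in for the feedback buffers, while the next-state and output logic are ordinary functions, just as in a synchronous description.

```python
from collections import deque

def fsm_step(next_state, output, state_fifo, x):
    """One firing of the FSM stage: read the current-state token, emit an
    output token, and write the next-state token back through the buffers."""
    s = state_fifo.popleft()             # current-state channel
    state_fifo.append(next_state(s, x))  # next-state channel -> feedback
    return output(s, x)                  # output channel

# Hypothetical 2-state example: toggle the state on input 1, output the state.
nsl = lambda s, x: s ^ x
out = lambda s, x: s
fifo = deque([0])   # reset-state token initially held in the feedback buffer
outs = [fsm_step(nsl, out, fifo, x) for x in (1, 1, 0, 1)]
print(outs)  # -> [0, 1, 0, 0]
```

The single reset token in the FIFO plays the role of the initialized feedback buffer: without it, the stage could never fire, which is the dataflow analogue of an FSM with no reset state.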
The simplicity of this method for designing FSMs allows all known synchronous design techniques for generating Boolean next-state and output expressions to be applied directly. Also, the next-state logic can be implemented as several stages of pipelined logic, reducing the number of necessary feedback buffers.

Aside from using feedback buffers, which for a large number of states can yield a large circuit, there are also other ways to design circuits that hold state. One alternative is simply not to generate an acknowledgment signal, which avoids the reset of the input data. Although this technique can be used for specific problems like loop control [33], it is very limited. A more general way is to use the memory block for state holding presented in the previous section. This memory can be further modified by adding one more internal state to allow read and write operations in the same cycle, making it more suitable for use as a register in FSMs.

3.2.5 Simulation Results

Both Verilog and HSPICE simulations were performed to check functional correctness and to measure the performance of all the proposed linear and non-linear pipelines. A structural Verilog netlist was generated with both random and unit delays. The Verilog code is written such that, in the case of any hazard on any of the signals, the simulator asserts a warning or error. The Verilog simulations with unit delays were performed for cycle-time analysis, and the simulations with random delays were performed to verify intuitively that the circuits are QDI. No assertion failures were found for random delays, and the unit-delay simulations confirm the transition counts.

HSPICE simulations were performed using a 0.25um TSMC process with a 2.5V power supply at 25°C.
The purpose of these simulations was to confirm the results obtained from the Verilog simulations and to compare the throughputs of the proposed pipelines with the pipelines presented in the background section. Since the goal was comparison, no attempt was made to fine-tune the transistor sizing to achieve optimum performance. In particular, all transistors were sized to roughly achieve a gate delay equal to that of a small inverter (Wnmos=0.8um, Wpmos=2um, and L=0.24um) driving a same-sized inverter. For the purposes of this comparison, wire delay has also been ignored.

For the half buffers, the PCHB and the RSPCHB, a linear dual-rail pipeline of buffers with 60 stages was constructed to achieve a static slack of 30, which means that it can hold 30 distinct data tokens. For the full buffers, the PCFB and the RSPCFB, 30 stages were used to achieve the same static slack. All pipelines can hold 30 distinct tokens.

Figure 3.17(a) shows throughput-versus-tokens triangles for the half buffers, and Figure 3.17(b) shows them for the full buffers. The triangles for the PCHB and PCFB are indicated with dotted lines. Approximately 15 distinct points were obtained per pipeline for the triangle graphs using HSPICE simulation. One key result obtained from this simulation is the dynamic slack of each pipeline, which is the number of tokens required to achieve maximum throughput [67], [33]. The PCHB achieves a maximum throughput of 772MHz with a dynamic slack of 7.3. The RSPCHB is faster, with a maximum throughput of 920MHz and a dynamic slack of 8.25; the throughput improvement is approximately 20%. For the full buffers, the PCFB achieves a maximum throughput of 707MHz and a dynamic slack of 3.7. The RSPCFB is faster, with a maximum throughput of 1000MHz and a dynamic slack of 5.9.
The speed improvement is approximately 40%; however, due to the C-elements in the forward path of the RSPCFB, the forward latency is about 15% longer. For both the half and full buffers, we achieved higher dynamic slack. This means that our templates support more system-level concurrency and higher stage utilization.

Figure 3.17: Throughput versus tokens for a) the PCHB and RSPCHB and b) the PCFB and RSPCFB linear pipelines

Notice that although the PCFB has 12 and the PCHB has 14 transitions per cycle, the PCFB was slower. This is partially due to the heavier load on the internal wiring in the PCFB compared to the PCHB. Clearly, careful transistor sizing and buffering can improve the performance of all pipeline templates; however, we expect the relative performance to remain approximately the same.

3.2.6 Conclusions

This chapter has introduced new high-speed QDI asynchronous pipeline templates for non-linear dynamic pipelines, including forks, joins, and more complex configurations in which channels are conditionally read and/or written. Timing analysis and HSPICE simulation results demonstrate that our new RSPCHB achieves ~20% throughput improvement over its PCHB counterpart and our new RSPCFB achieves ~40% throughput improvement over its PCFB counterpart.

Chapter 4

4. Timed Pipelines

A number of fast asynchronous fine-grain pipeline templates have been proposed for high-speed design, including IPCMOS [55] and GasP [63], [14]. These ultra-high-speed designs have very aggressive timing assumptions that introduce stringent transistor sizing requirements and high demands on post-layout verification.
Researchers from Columbia University have recently proposed several high-speed dynamic-logic pipeline templates that achieve comparable performance with much less stringent timing assumptions [59], [58]. These pipelines are based on Williams' well-known PS0 pipeline, which is an optimized version of Caltech's PCHB in which the input completion detector is removed and a timing assumption is added to assure correct operation. The Columbia pipelines, which also have PS0's timing assumption, were introduced for linear datapaths (i.e., without forks and joins), although preliminary solutions for handling joins were proposed in [58]. In addition, an initial approach to handling slow or stalled environments for the limited case of linear pipelines was also proposed in [59]. However, the synchronization problems that arise when using arbitrary forks and joins are much more complex and challenging, and the approaches of [59], [58] do not address these issues. This chapter attempts to fill this void.

The contribution of this chapter is a set of five new non-linear pipeline templates that extend the Columbia pipelines to handle non-linear datapaths. Both of Columbia's dynamic-logic pipeline styles are targeted: lookahead pipelines (LP) [59] and high-capacity pipelines (HC) [58]. Several distinct lookahead pipeline styles were proposed in [59], both single-rail and dual-rail. This chapter builds upon one representative each of the single-rail (LPsr2/2) and dual-rail (LP3/1) lookahead pipelines, and also upon the single-rail high-capacity pipeline (HC). The ideas presented here, however, can easily be adapted to the remaining styles. First we present Williams' PS0 pipeline.
Then we review Columbia's three asynchronous pipelining styles: (i) LPsr2/2, a single-rail lookahead pipeline, (ii) LP3/1, a dual-rail lookahead pipeline, and (iii) HC, the high-capacity pipeline. Finally, we present solutions to extend these pipelines for non-linear applications.

4.1 Williams' PS0 Pipeline

Figure 4.1 shows one stage of Williams' PS0 pipeline [67]. The pipeline stage consists of a dual-rail function block and a completion detector. The output of the completion detector is fed back to the previous stage as the acknowledgment signal. The completion detector checks the validity or absence of data at the outputs. There is no input completion detector.

Figure 4.1: Williams' PS0 pipeline stage

The function block is implemented using dynamic logic. The precharge/evaluation control input, Pc, of each stage comes from the output of the next stage's completion detector. The precharge logic can hold its data outputs even when its inputs are reset; therefore, it also provides the functionality of an implicit latch. Each completion detector verifies the completion of every computation and precharge of its associated function block.

The operation of the PS0 pipeline is quite simple. Stage N is precharged when stage N+1 finishes evaluation. Stage N evaluates when stage N+1 finishes reset. This protocol ensures that consecutive data tokens are always separated by reset tokens, or holes. The complete cycle of events for a pipeline stage is derived by observing how a single data token flows through an initially empty pipeline. The sequence of events from one evaluation by stage 1 to the next is: (1) stage 1 evaluates, then (2) stage 2 evaluates, then (3) stage 2's completion detector detects completion of evaluation, and then (4) stage 1 precharges.
At the same time, after completing step (2), (3') stage 3 evaluates, then (4') stage 3's completion detector detects completion of evaluation and initiates the precharge of stage 2, then (5) stage 2 precharges, and finally, (6) stage 2's completion detector detects the completion of precharge, thereby releasing the precharge of stage 1 and enabling stage 1 to evaluate once again. Thus there are six events in the complete cycle of a stage from one evaluation to the next.

The protocol for a PS0 pipeline stage is captured by the STG for a four-stage pipeline illustrated in Figure 4.2. From the STG, it is possible to derive the pipeline's analytical cycle time:

T_PS0 = 3·t_Eval + 2·t_CD + t_Prech

Figure 4.2: The STG of the PS0 Pipeline

Williams has simplified the pipeline stage at the expense of sacrificing delay insensitivity: the PS0 pipeline carries a timing assumption that must be verified during physical design.

4.2 Lookahead Pipelines Overview (Single Rail)

Figure 4.3(a) shows the structure of one stage of the LPsr2/2 lookahead single-rail pipeline [59]. (The 2/2 label characterizes the operation of a stage of the pipeline: 2 component delays in the evaluation phase and 2 component delays in the precharge phase, forming a complete cycle.) Each stage has a dynamic function block and a control block. The function block alternately evaluates and precharges. The control block generates the bundling signal, done, to indicate completion of evaluation (or precharge). The bundling signal is passed through a suitable delay line, allowing time for the dynamic function block to complete its evaluation (or precharge).
Note that there is one function block (F) for each individual output rail of the stage, and different function blocks can sometimes share precharge and evaluate (foot) transistors. This pipeline style has two important features. First, the completion signal, done, is sent to the previous stage as an acknowledgment (Lack) by tapping off from before the matched delay. This early tap-off is safe because a dynamic function block typically is immune to a reset of its inputs as soon as the input data has been absorbed by the first level of dynamic logic. The second feature is that the control signal, Pc, is applied to both the control block and the function block in parallel. Therefore, the function block can be precharge-released even before the arrival of new input data. This early precharge release is safe because the dynamic logic will compute only upon the receipt of actual data. Both of these features eliminate critical delays from the cycle time, resulting in very high throughput.

The analytical cycle time can be expressed using the following components: t_Eval, the delay of function block evaluation, and t_gc, the delay of the control (a generalized C-element).

Figure 4.3: a) LPsr2/2, b) LP3/1, and c) HC pipelines

For correct operation, the matched delay must satisfy t_matched >= t_Eval - t_gc. For ideal operation, we will assume that t_matched is no larger than necessary, t_matched ~ t_Eval - t_gc. Note that, to simplify the analytical expressions, we assume that the completion delay is longer than the evaluation delay, which is generally true for fine-grain pipelines.
Using the above notation and assumptions, the pipeline's analytical cycle time is:

T_LPSR2/2 = 2·t_matched + 2·t_gc

The per-stage latency of the pipeline is:

L_LPSR2/2 = t_Eval

4.3 Lookahead Pipelines Overview (Dual Rail)

Figure 4.3(b) shows the structure of one stage of the dual-rail LP3/1 pipeline [59]. In this pipeline there are no matched delays. Instead, each stage has an additional logic unit, called a completion detector, to detect the completion of evaluation and precharge of that stage. Unlike most existing approaches, such as Williams and Horowitz's pipelines [67], [68], each stage of the LP3/1 pipeline synchronizes with two subsequent stages, i.e., not only with the next stage but also with its successor. Consequently, each stage has two control inputs. The first input, Pc, comes from the completion detector (CD) of the next stage, and the second control input, Eval, comes from the completion detector two stages ahead. The benefit of this extra control input is a significantly shorter cycle time: the Eval input allows the current stage to evaluate as soon as the subsequent stage has started precharging, instead of waiting until the subsequent stage has completed precharging.

The analytical cycle time can be expressed as:

T_LP3/1 = 3·t_Eval + t_CD

The per-stage latency of the pipeline is:

L_LP3/1 = t_Eval

As with the previous pipeline style, the 3/1 label characterizes the operation of a stage of the pipeline: 3 component delays in the evaluation phase and 1 component delay in the precharge phase, forming a complete cycle.

4.4 High Capacity Pipelines (Single Rail)

Finally, the structure of one stage of the HC pipeline [58] is shown in Figure 4.3(c).
A key feature of this pipeline style is that it uses decoupled control of evaluation and precharge: separate Eval and Pc signals are generated by each stage's control. Precharge occurs when Pc is asserted and Eval is de-asserted. Evaluation occurs when Pc is de-asserted and Eval is asserted. When both signals are de-asserted, the gate output is effectively isolated from the gate inputs; this is the isolate phase. To avoid short circuit, Pc and Eval are never simultaneously asserted. An asymmetric C-element, aC, is used as a completion detector. The aC element output is fed through a matched delay, which (combined with the completion detector) matches the worst-case path through the function block.

Unlike most existing pipelines, the HC pipeline stage cycles through three phases. After it completes the evaluate phase, it enters the isolate phase (where both Eval and Pc are de-asserted) and subsequently the precharge phase, after which it re-enters the evaluate phase, completing the cycle. Furthermore, unlike the other pipelines covered in this chapter, as well as the PS0 style in [68], the HC pipeline has only one explicit synchronization point between stages: once the subsequent stage has completed its evaluate phase, it enables the current stage to perform its entire next cycle.

The analytical cycle time can be expressed as:

T_HC = t_Eval + t_Prech + t_aC + t_NAND + t_INV

The per-stage latency of the pipeline is:

L_HC = t_Eval

4.5 Designing Non-linear Pipeline Structures

The basic assumption in linear pipelines is that each pipeline stage has a single input and a single output channel. Non-linear pipeline stages, however, may have multiple input and output channels. This section presents an overview of the challenges involved in designing non-linear pipelines using timed templates.
In particular, we address issues with (i) synchronization with multiple destinations (for forks), and (ii) synchronization with multiple sources (for joins). Subsequent sections provide our detailed solutions for each of the three pipeline styles reviewed above and then briefly describe how these solutions are extended to channels that are conditionally read or written.

4.5.1 Slow and Stalled Right Environments in Forks

Figure 2.5(b) shows an abstract two-way fork in which the forking stage S1 drives stages S2 and S3. For correct operation, S1 must receive (and recognize) acknowledgments from both S2 and S3. A problem is that S2 and S3, and the subsequent stages of each, may be operating largely independently of each other. One of these stages may get arbitrarily stalled, thus potentially stalling its acknowledgment from either S2 or S3. If the pipeline templates designed for linear pipelines were naively extended to a datapath with a fork, by expecting S1 to synchronize on all of the acknowledgments from the forked stages using a C-element to combine them, then the resulting pipeline may malfunction.

In particular, the acknowledgments generated in most linear pipeline structures are non-persistent. That is, after a stage asserts its acknowledgment, it assumes that the precharge of the previous stage is fast. Therefore, it does not explicitly check for the completion of that precharge before de-asserting the acknowledgment. We call this restriction/assumption the fast precharge constraint.
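The hazard this creates at a fork can be seen in a toy trace (a Python sketch; the waveforms are illustrative, not simulated circuit data): if S1 combines the two acknowledgments in a C-element and S2's acknowledgment is a non-persistent pulse, a stalled S3 means the C-element never sees both inputs high, so S1's precharge is never enabled.

```python
# Toy trace of the fork hazard: S1 waits on a C-element of the acks from
# S2 and S3. S2's ack is non-persistent (a pulse, per the fast precharge
# constraint), while S3 is stalled and acknowledges only much later.
def c_element(a, b, prev):
    """Muller C-element: follows the inputs when they agree, else holds."""
    return a if a == b else prev

ack_s2 = [1, 1, 0, 0, 0, 0]   # early pulse, then de-asserted
ack_s3 = [0, 0, 0, 0, 1, 1]   # stalled branch acknowledges late
out, trace = 0, []
for a, b in zip(ack_s2, ack_s3):
    out = c_element(a, b, out)
    trace.append(out)
print(trace)  # -> [0, 0, 0, 0, 0, 0]: the combined ack never rises
```

Because the two pulses never overlap, the combined acknowledgment stays low forever: this is exactly the deadlock scenario the two solutions in the next subsection are designed to prevent.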
In particular, WiUiams’ classic PSO pipeUnes [67] along with the recent lookahead and high-capacit)? pipeUnes aU have this problem. We propose two general solutions. The first solution is to modify only the immediate stages after a fork, such that, even after precharging, they maintain the assertion o f their acknowledgment signal and are expUcidy prevented from re-evaluating until after the forking stage is guaranteed to have precharged. The key is to modify the stages after a fork to guarantee their acknowledgments are properly received while still guaranteeing that these stages satisfy the fast precharge constraint. The second solution is to modify every pipeUne stage such that they maintain the assertion o f their acknowledgment signal until after its predecessor stages are guaranteed to have precharged. In other words, this solution is to modify the entire pipeUne to remove the fast precharge constraint, impUcitiy solving the SHE problem. This solution must be appUed to aU stages because an unmodified stage may otherwise assume its predecessors satisfy the fast precharge constraint, which may not be the case. 4.5.2 Slow and Stalled Left Environments in Joins The second chaUenge is one of synchronization with multiple input channels, as needed in a join. Figure 2.5(a) shows a two-way join structure for an abstract pipeUne 73 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. where the data from each input stage, SI and 82, must be consumed by the join stage S3. The data outputs of 81 and 82 are gathered together and presented to S3 as its inputs. Subsequently, S3 sends an acknowledgment to both 81 and 82 once it has consumed the input data. Thus, a two-way join represents a synchronization point between the outputs of two senders. A problem can arise if the logic implementation o f stage S3 is “eager”, i.e. 83 may produce output after consuming one but not both of its data inputs (see [68]). 
For example, if S3 contains a dual-rail OR function that evaluates eagerly (i.e., as soon as one high input bit arrives), then, after evaluation it will send an acknowledgment to both S1 and S2, even though one of them may not have produced data at all. As a result, if one of the input stages is particularly slow or stalled, it may receive an acknowledgment from S3 too soon. This can cause the insertion of a new unwanted data token at the output of the slow stage and thus corrupt the synchronization between the stages. We call this the stalled left environment (SLE) problem. One solution is to allow join stages to have eager function blocks but still ensure that the generation of the acknowledge signal occurs only after consuming data from all of the input stages. This solution has been used extensively in quasi-delay-insensitive templates [33].

4.6 Lookahead Pipelines (Single Rail)

Handling joins in single-rail lookahead pipelines is straightforward, and was initially proposed in [58]. The join stage receives multiple request inputs (Lreq's), all of which are merged together in the asymmetric C-element (aC) that generates the completion signal. In particular, each additional request is accommodated by adding an extra series transistor in the pull-down stack of the aC element. The aC will only acknowledge the input sources after all of the Lreq's are asserted and the stage evaluates. To handle forks, on the other hand, a C-element must be added to the forking stage to combine the acknowledgments from its immediate successors. In addition, the other stages of the pipeline must also be modified to overcome the SRE problem of Section 4.5.1.
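In software terms, the aC-based join completion described above behaves like the following sketch (a behavioral abstraction with invented names; the real implementation is the transistor stack, not this function):

```python
def join_done(lreqs, evaluated):
    """Completion signal of an LPsr2/2-style join stage (sketch): asserted
    only when every incoming request is high, mimicking the extra series
    transistors in the aC pull-down stack, and the stage has evaluated."""
    return int(all(lreqs) and evaluated)

assert join_done([1, 1], evaluated=1) == 1   # both senders arrived: acknowledge
assert join_done([1, 0], evaluated=1) == 0   # one sender stalled: hold off
assert join_done([1, 1], evaluated=0) == 0   # requests in, data not yet computed
```

Because the completion signal, and hence the acknowledgment, cannot form until every request is present, a stalled sender is never acknowledged prematurely.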
As indicated, the problem is that the acknowledge signal from an immediate successor of a fork stage can be regarded as a pulse, which may be de-asserted before its predecessor forking stage has precharged, causing deadlock. This section gives two distinct solutions for handling such forks in LPsr2/2.

4.6.1 Solution 1 for LPsr2/2

The first solution [50] is to modify the immediate successor stages of forking stages to latch their Lack acknowledgment signals and delay their re-evaluation until after all predecessors have precharged. For LPsr2/2, this solution is achieved by modifying the Lack logic and the control of the foot transistor, as shown in Figure 4.4. Assume the forked stage has just evaluated and the acknowledgment signal Lack has just been asserted. At this time, the right environment will assert Rack, causing the output of the latch, X, to be asserted (X=0, i.e., active low), effectively latching the non-persistent acknowledgment signal. The X output is held low even when Rack is de-asserted. In particular, X is de-asserted (X=1) only after Done goes low, caused by Lreq going low, implying that the forking input stage has precharged. Effectively, the foot transistor now prevents re-evaluation until after X goes low, delaying re-evaluation until all inputs (including any slow input) are guaranteed to have precharged. These modifications ensure that even late acknowledgments from a stage S3 immediately after a fork are guaranteed to be properly received while still ensuring that S3 satisfies the fast precharge constraint, thereby solving the SRE problem.
Figure 4.4: a) Modified first stage after the fork, b) Detailed implementation of the gates in the dotted box

The only new timing assumption that this template introduces compared to LPsr2/2 is that the Lack pulse width must be long enough to properly latch it. This pulse width assumption, however, is looser than the original timing assumption that remains: the pulse width must be longer than the stage's precharge time.

4.6.2 Solution 2 for LPsr2/2

The second solution [50] is to modify each stage so that it does not de-assert its acknowledgments until after all input stages are guaranteed to have precharged. This solution can be implemented using the modified LPsr2/2 template shown in Figure 4.5 in which the asymmetric C-element is converted to a symmetric C-element. As suggested earlier, this modification removes the fast precharge constraint, implicitly solving the SRE problem.

Figure 4.5: The LPsr2/2 pipeline stage with a symmetric C-element

4.6.3 Pipeline Cycle Time

For the first solution, the cycle time expressions do not change if the additional acknowledgment signals simply increase stack height and do not add additional gates. For multi-way forks and joins, however, the cycle time will increase by the additional C-elements needed to combine them. For the second solution, the cycle time becomes:

T_LPsr2/2 = 2t_Eval + 2t_Prech + 2t_C

4.7 Lookahead Pipelines (Dual Rail)

This section extends a dual-rail lookahead pipeline, LP3/1, to handle forks and joins. Since both the stalled left environment (SLE) and the stalled right environment (SRE) problems of Section 4.5 can arise in dual-rail pipelines, detailed solutions are presented for both forks and joins.
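Before turning to the dual-rail case, the solution-2 discipline can be made concrete with a toy model (names and trace invented, behavioral only) of a symmetric C-element generating Lack: the acknowledgment rises when the stage completes in response to a pending request, and is held until the predecessor's request is withdrawn, which is exactly the removal of the fast precharge assumption:

```python
def symmetric_c(state, done, lreq):
    """Symmetric C-element producing Lack (sketch): rises when the stage has
    evaluated (done=1) while a request is pending, and falls only after the
    predecessor has precharged (lreq=0); otherwise it holds its value."""
    if done == 1 and lreq == 1:
        return 1
    if done == 0 and lreq == 0:
        return 0
    return state

lack, history = 0, []
# evaluate; the stage itself precharges; finally the predecessor precharges
for done, lreq in [(1, 1), (0, 1), (0, 0)]:
    lack = symmetric_c(lack, done, lreq)
    history.append(lack)
print(history)   # [1, 1, 0] -- Lack persists across the stage's own precharge
```

With the asymmetric C-element, Lack would instead drop as soon as the stage's own completion dropped, reinstating the fast precharge assumption.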
4.7.1 Joins

Unlike LPsr2/2, the LP3/1 pipeline has no explicit request line and thus may not function correctly unless it is modified to handle the SLE problem in joins. Our proposed solution still allows the use of eager function blocks; however, it ensures that no acknowledgment is generated from a stage until after all its input stages have evaluated. In particular, our solution is to add request signals to the input channels of the joins and feed them into the join stage's completion detector, as illustrated in Figure 4.6. The join's completion detector now delays asserting its acknowledgment until not only the function block is done computing, but also until after all the input stages have completed evaluation, thereby solving the left environment problem. Note that the additional request signals are taken from the outputs of the preceding stages' completion detectors. While this modification does not affect the latency of the pipeline, the analytical cycle time changes to:

T_LP3/1 = t_Eval + t_CD + 2t_NAND

4.7.2 Forks

As in the single-rail lookahead pipeline, LPsr2/2, we propose two solutions [50] for the slow or stalled right environments. These solutions are similar in essence to the solutions for the single-rail case, but adapted to dual-rail. The implementation of solution 1 is very similar to LPsr2/2, as shown in Figure 4.7. First, the completion detector (CD) has been modified such that the acknowledgment signal is de-asserted only after the forking stage has precharged. In addition, we delay the re-evaluation of the function block until after the forking stage has precharged using a decoupled foot transistor controlled by the Y signal.
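The modified join completion detector can be sketched behaviorally as follows (names invented; the dual-rail validity check stands in for the per-bit OR / C-element tree of an actual CD):

```python
def bit_valid(rail0, rail1):
    """A dual-rail bit is valid when exactly one of its rails is asserted."""
    return rail0 ^ rail1

def join_ack(output_bits, input_reqs):
    """Modified LP3/1 join completion detector (sketch): the acknowledgment is
    asserted only when every dual-rail output bit is valid AND every added
    request input (taken from the predecessors' completion detectors) is high,
    so an eager function block can no longer acknowledge a stalled sender."""
    outputs_done = all(bit_valid(r0, r1) for r0, r1 in output_bits)
    return int(outputs_done and all(input_reqs))

# An eager OR has already produced a valid output, but the second sender
# (input_reqs[1] == 0) is stalled: no acknowledgment is generated.
assert join_ack([(0, 1)], input_reqs=[1, 0]) == 0
assert join_ack([(0, 1)], input_reqs=[1, 1]) == 1
```

The extra conjunction over the request inputs is the software analogue of feeding the predecessors' completion signals into the join's completion detector.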
Figure 4.6: The LP3/1 pipeline with a modified CD to handle joins

Figure 4.7: a) Modified first stage after the fork, b) Detailed implementation of the additional gates

The second solution is to add a request line to all LP3/1 channels and delay de-assertion of the acknowledgment (Lack1 in this case) until after all immediate predecessors have precharged, as shown in Figure 4.8. The request line is generated via a C-element that combines the incoming request line(s) and the output of the completion detection. The output of this C-element becomes the new Lack1. Because the C-element de-asserts its acknowledgment only after the incoming request is de-asserted, the fast precharge constraint is removed, solving the SRE problem. For solution 1, compared to the original LP3/1 template, the cycle time is slightly increased by the delay of the added latch in the acknowledgment path. For solution 2, the cycle time increases further by the delay of the added C-element.

Figure 4.8: The LP3/1 stage with a C-element

4.8 High Capacity Pipelines (Single Rail)

Since the high capacity pipeline template uses single-rail encoding, it has a request line associated with the data and thus does not have the slow or stalled left environment problem in joins. However, because the acknowledgment signals in the high capacity pipelines are also non-persistent (effectively, timed pulses), they do have problems with a slow or stalled right environment in forks. The simple modification [50] to the original stage controller of the high capacity pipeline illustrated in Figure 4.9 delays de-asserting the acknowledgment until after the request line goes low, thus removing the fast precharge constraint and solving the SRE problem using solution 2.
In particular, by replacing the NAND3 gate with the state-holding generalized C-element, the acknowledgment signal Rack only triggers the assertion of the precharge control signal, Pc. The de-assertion of Pc is caused by the input request signal Rreq going low. Thus, Pc remains asserted until after precharge is completed, and is unaffected by the acknowledge signal from the next stage getting de-asserted. Furthermore, the inverter is replaced by a NOR2 gate with an additional input to delay the stage's re-evaluation until after the stale input data is reset. In the new version of the HC pipeline stage, the state variable, ok2pc, belongs to the channel between stage N-1 and N. The reasoning is as follows. The function of the state variable is to keep track of whether stages N-1 and N are computing the same token, or distinct (consecutive) tokens; precharge of N-1 is inhibited if the tokens are different. If there are two stages, N-1(1) and N-1(2), supplying data for stage N, we propose to have two separate state variables, one to keep track of whether stages N-1(1) and N have the same token, and the second to keep track of whether stages N-1(2) and N have the same token. Similarly, if stage N had two successors, N+1(1) and N+1(2), we propose to have two distinct state variables, one each for the pair (N, N+1(1)) and the pair (N, N+1(2)).

Figure 4.9: a) Original and b) New HC stage

The aC element, which implements the state variable ok2pc, is pulled out of the stage controller and placed in between stages N-1 and N (i.e., moved into the channel). In addition, the pC element is also moved into the channel to avoid extra wiring.
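The per-channel bookkeeping can be sketched as follows (token identifiers and names are invented; real stages track this with the ok2pc aC element, not integers):

```python
def ok_to_precharge(sender_token, receiver_token):
    """Per-channel ok2pc state variable (sketch): the sender may precharge only
    when it and its receiver are computing the same token, i.e., the receiver
    has caught up with the sender's data."""
    return sender_token == receiver_token

# Stage N with two predecessors gets one state variable per input channel.
receiver_token = 4
senders = {"N-1(1)": 4, "N-1(2)": 3}
ok2pc = {ch: ok_to_precharge(t, receiver_token) for ch, t in senders.items()}
print(ok2pc)   # {'N-1(1)': True, 'N-1(2)': False} -- only the in-sync channel may precharge
```

Keeping one such variable per channel, rather than per stage, is what allows the state to move into the channel along with the aC and pC elements.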
4.8.1 Handling Forks and Joins

Figure 4.10 shows the implementation of a template for stage N for the case where stage N is both a fork as well as a join. The multiple reqin's, ok2eval's and ack's are handled by simple modifications to the linear pipeline of Figure 4.9(b), as shown in Figure 4.10.

Multiple reqin's. Each additional reqin is handled by adding a single series transistor to the aC element that makes up the completion generator, much like it was done for LPsr2/2 in Section 4.6. Hence, done is generated only after all the input data streams have been received.

Multiple ok2eval's. Each additional ok2eval is handled by adding it as an extra input to the NOR gate that produces the eval signal. Consequently, the stage is enabled to evaluate (eval asserted) only after all of the ok2eval signals are asserted, i.e., after all of the senders have precharged.

Multiple ack's. Multiple ack's are handled by OR'ing them together. Since the ack's are all asserted low, the OR gate output goes low only when all the ack's are asserted, thus ensuring that precharge occurs only after the stage's data outputs have been absorbed by all of the receivers. The OR gate is actually implemented as a NAND with bubbles (inverters) on the ack inputs. This NAND has an additional input, the stage's completion signal, whose purpose is to ensure that, once precharge is complete, Pc is quickly cut off. Otherwise, Pc may get de-asserted slightly after Eval is asserted, causing a momentary short circuit between supply and ground inside the dynamic gates.

4.8.2 Pipeline Cycle Time

If only joins are present, the cycle time is only slightly increased.
Compared with the cycle time obtained in [58], the new cycle time equation has a NOR delay instead of an inverter delay, and a gC delay instead of a NAND3 delay:

T_HC = t_Eval + t_Prech + t_gC + 2t_NOR

Figure 4.10: A 2-way join, 2-way fork HC stage

If forks are also present, then the cycle time increases by the delay of the OR gate, which is needed to combine the multiple acknowledgments:

T_HC = t_Eval + t_Prech + t_gC + 2t_NOR + t_OR

4.9 Conditionals

Other complex pipeline stages allow conditionally reading and writing data and can have internal state. This section briefly covers the implementation of these cells for the LPsr2/2 template; however, a similar approach can also be applied to the other pipeline styles. Figure 4.11(a) shows a conditional read, where the stage reads only one of the input channels depending on the value of the select channel. Only the channels read are acknowledged. Figure 4.11(b) shows a conditional write, where the stage reads the input channel and outputs the data (writes) to only one of the output channels depending on the value of the select channel. It receives an acknowledgment only from the output channel where the data is written. Note that the C-elements are only symmetric for the Rack input and asymmetric for all others. Figure 4.12 shows a one-bit memory implemented using an LPsr2/2 template. A and C represent the input and output channels. B is the internal storage. S is an input control channel that selects the write or read operation. When S0 is high, the memory stores the value at the input channel A to the internal storage B. Both the input A and the select channels are acknowledged.
The implementation of how data is stored is shown in the dotted box (similar to [33]). Assuming that there is already data stored, one of the dual-rail bits of B is high and the other is low. When an input A is applied and S0 is high, first both rails are lowered and then one of them is asserted high, thereby storing the data. The C-element, which generates the acknowledgment of the input channel, LackA, through a matched delay line, is reset using its own output, since it doesn't receive an acknowledgment from an output. The delay of the delay line is matched to the delay of writing the internal node B. When S1 is high, on the other hand, the internal data stored in B is sent to the output channel C. When an acknowledgment is received from the output channel C, the outputs are reset; however, the data stored remains unchanged.

Figure 4.11: a) Conditional read and b) conditional write

Figure 4.12: A one-bit LPsr2/2 memory

4.10 Loops

One can notice from the simple LPsr2/2 buffer shown in Figure 4.3 that the delay from the left request, Lreq, to the right request, Rreq, consists of a simple non-inverting asymmetric C-element (implemented using two gates) and an externally added delay to match the data. For a more complex gate with multiple input and output channels, however, the delay from Lreq to Rreq will increase, and for some cells will be larger than the delay through the domino logic. It is important to realize that for loops implemented with multi-input and multi-output channels, this delay may become the performance bottleneck. In particular, the fact that the data moves forward faster than its request signal causes a problem.
Given a loop with L stages, assuming that both the assertion of the data and the request signal happen to be synchronized at stage 0, let the data move forward faster than the request signal at a rate of t_d/t_r, where t_d is the forward latency of the data and t_r is the forward latency of the request signal (t_r >= t_d). Pipeline stages in the loop which have evaluated will be able to re-enter the evaluation phase only when the request signal is asserted, allowing them to precharge, and then subsequently de-asserted. Therefore, as the data attempts to lap the request signal around the loop (which may take many loop iterations), the data will stall waiting for the request signal of the subsequent stage to de-assert. From this point on, the forward latency of the data will slow down to match the forward latency of the request signal. The LPsr2/2 pipeline modified according to solutions 1 and 2, and the LP3/1 pipeline modified according to solution 2, will still have this problem, since the Rreq signal is generated the same way and therefore the delay between the Lreq and Rreq signals increases as the number of input and output channels increases. A detailed solution is given in Section 3.2.2.

4.11 Simulation Results

HSPICE simulations were performed using a 0.25um TSMC process with a 2.5V power supply at 25°C. The purpose of these simulations was only to quantify the performance overhead of using the fork-join structures of this paper, compared with linear pipelines. Hence, no attempt was made to fine-tune the transistor sizing to achieve optimum performance. In particular, all transistors were sized in order to roughly achieve a gate delay equal to a small inverter (Wnmos=0.8um, Wpmos=2um, and L=0.24um) driving a same-sized inverter. For the purposes of this comparison, wire delay also has been ignored.
The simulation results for all linear and non-linear pipelines discussed in this paper are presented in Table 4.1. The original linear pipelines appear under the Sol1 columns and the Linear1 row because solution 1 involves only modifying the first stages after a fork, and forks do not exist in linear pipelines. The Linear2 row and Sol2 columns have the cycle times for linear pipelines where each stage has been modified according to solution 2. Note that while the joins add only ~5% to the cycle time, the forks increase the cycle time by ~20% because of the additional C-element needed. The waveforms in Figure 4.13(a) show the data signals of an LPsr2/2 one-bit linear pipeline. Note also that the cost of the more robust solution 2 compared to solution 1 is generally less than 5%. Figure 4.13(b) shows waveforms for a fork with a slow right environment channel called Data4, and Figure 4.13(c) shows a join with a slow left environment channel called DataB.

Table 4.1: Cycle time (ns) of original linear pipelines vs. proposed non-linear pipelines

          LPsr2/2        LP3/1         HC
          Sol1   Sol2    Sol1   Sol2   Sol2
Linear1   0.99   N/A     1.20   N/A    N/A
Linear2   N/A    1.06    N/A    1.28   0.93
Fork      1.23   1.29    1.41   1.45   1.20
Join      1.05   1.10    1.31   1.34   1.01

Figure 4.13: HSPICE waveforms: a) linear pipeline, b) two-way fork and c) two-way join
4.12 Conclusions

In this chapter we introduced new high-speed asynchronous circuit templates for non-linear dynamic pipelines, including forks, joins, and more complex configurations in which channels are conditionally read and/or written. Two sets of templates arise from adapting the LPsr2/2 and LP3/1 pipelines and one set of templates arises from adapting the HC pipelines. Timing analysis and HSPICE simulation results demonstrate that forks and joins can be implemented with a ~5% to 20% performance penalty over linear pipelines. All pipeline configurations have timing margins of at least two gate delays, making them a good compromise between speed and ease of design.

Chapter 5

5. A Design Example: The Fano Algorithm

In this chapter we present the Fano algorithm, a convolutional code decoder, and its efficient semi-custom synchronous implementation. The algorithm is used in communication systems to decode the symbols received over a noisy communication channel. Our goal is to later develop an efficient asynchronous counterpart, through which we explore the challenges in designing asynchronous chips. In this chapter we first present the Fano algorithm. Then we present the synchronous implementation of the algorithm.

5.1 The Fano Algorithm

5.1.1 Background on the Algorithm

The Fano algorithm [1] [32] [69] is a tree-search algorithm that achieves good performance with a low average complexity at a sufficiently high signal-to-noise ratio (SNR). A tree comprises nodes and branches; associated with each branch is a branch metric (or weight, or cost). A path is a sequence of nodes connected by branches, with the path metric obtained as the sum of the corresponding branch metrics.
An optimal tree-search algorithm determines the complete path (i.e., from the root to a leaf) with minimum path metric, while a good (suboptimal) tree-search algorithm finds a path with metric close to this minimum. The Fano algorithm searches through the tree sequentially, always moving from one node to a neighboring node until a leaf node is reached. The Fano algorithm is a depth-first tree-search algorithm [1], meaning that it attempts to search as few paths as possible to obtain a good path. Thus, the metric of a path being considered is compared against a threshold T. The relation between T and the metric is determined by the statistics of the branch metrics (i.e., the underlying model) and the results of partial path exploration. The latter is reflected by dynamically adjusting the threshold to minimize the number of paths explored. The key steps of the algorithm involve deciding which way to move (i.e., forward, or deeper, into the tree, or backward) and threshold adjustment. Intuitively, it moves forward only when the partial path to that node has a path weight that is greater than T. If no forward branches satisfy this threshold condition, the algorithm backtracks and searches for other partial paths that satisfy the threshold test. If all such partial paths are exhausted, it will loosen the threshold and continue. In addition, if the current partial path metric is significantly above the threshold, it may tighten the threshold. Threshold tightening prevents always backtracking to the root node at the cost of potentially missing the optimal path. Moreover, a maximum traceback depth limit is often imposed to limit worst-case complexity. The details of the Fano algorithm are illustrated in the flow chart depicted in Figure 5.1, and a more detailed explanation can be found in [69] [32].
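The control flow of Figure 5.1 can be captured in a much-simplified software sketch (the binary tree, metric function, first-visit test, and Δ value below are invented for illustration, and details such as the traceback depth limit are omitted):

```python
def fano_search(depth, metric, delta):
    """Much-simplified Fano tree search over a binary tree (illustrative
    sketch): metric(d, b) is the branch metric for taking bit b at depth d.
    Returns the decoded bits and the final path metric."""
    def ranked(d):
        # candidate branches at depth d, best metric first
        return sorted((0, 1), key=lambda b: -metric(d, b))

    T = 0               # running threshold
    path = []           # branch ranks taken so far (0 = best, 1 = next-best)
    node_metric = [0]   # path metric at each depth along the current path
    next_rank = 0       # which ranked branch to try at the current node
    while len(path) < depth:
        d = len(path)
        forward = None
        if next_rank < 2:
            b = ranked(d)[next_rank]
            forward = node_metric[d] + metric(d, b)
        if forward is not None and forward >= T:
            first_visit = node_metric[d] < T + delta   # simplified first-visit test
            path.append(next_rank)
            node_metric.append(forward)
            next_rank = 0
            if first_visit:
                while forward >= T + delta:            # tighten the threshold
                    T += delta
        elif path and node_metric[len(path) - 1] >= T:
            next_rank = path.pop() + 1                 # move back, try next-best branch
            node_metric.pop()
        else:
            T -= delta                                 # stuck: loosen the threshold
            next_rank = 0
    bits = [ranked(d)[r] for d, r in enumerate(path)]
    return bits, node_metric[-1]

# toy channel model: a branch agreeing with the received bit earns +1, else -3
received = [1, 0, 1]
bits, m = fano_search(3, lambda d, b: 1 if b == received[d] else -3, delta=2)
print(bits, m)   # [1, 0, 1] 3
```

On this easy example the best path is found without any backtracking; noisier metrics would exercise the back-move and loosening branches as well.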
The decoding of a convolutional code with known channel parameters can be viewed as a tree-search problem with the optimal solution provided by the Viterbi algorithm [32], a breadth-first, fixed-complexity algorithm. The Fano algorithm is known to perform near-optimal decoding of convolutional codes with significantly lower average complexity than the Viterbi algorithm.

Figure 5.1: Flow chart of the Fano algorithm

5.2 The Synchronous Design

This section describes the efficient normalization scheme used to optimize the algorithm, our architecture at the register-transfer level, and statistics of the chip.

5.2.1 Normalization and its benefits

The basic idea behind normalization [60] is to change the point of reference (e.g., from the origin of the tree to a current node under consideration). Normalization is often necessary to prevent hardware overflow/underflow. Interestingly, in traditional communication algorithms, such as the Viterbi algorithm, normalization often yields significant performance and area overhead that hardware designers generally avoid by using slightly larger bit-widths and modulo arithmetic [11]. In contrast, we show that
In particular, we normalize our variables in such a way as to make to current node’ s metric always equal to zero. This is equivalent to subtracting the current node’ s metric from every variable in the algorithm, which does not change the overall behavior of the algorithm. The advantages of this type o f normalization in the Fano algorithm is as follows. 1) Additions involving the current metric (i.e., during the threshold check) are removed and comparisons with the current metric (i.e., during the first visit check and threshold tightening steps) reduce to a 1-bit sign check. 2) The normalization of the next threshold (subtracting the current node’ s metric from it) can be done by the ALU that compares the threshold with the next metric, and thus consumes negligible additional energy. 3) Lastly, the normalization enables us to work with numbers with smaller magnitudes that can be represented with fewer bits. 5.2.2 Registet-Transfer Level D esign The register-transfer level architecture [60] is illustrated in Figure 5.2. The Threshold Adjust Unit (TAU) is shown in more detail, but still with some of the details omitted to simplify the schematic. At each clock cycle, the best and next best branch metrics are both calculated using data that is stored in memory. (See [69] for more details regarding the branch metric computation.) The threshold check unit compares the error metric with the current threshold to determine if a forward move can be performed and simultaneously speculatively calculates two normalized next thresholds, the first assuming a forward move will be taken and the second assuming the threshold m ust be loosened (by subtracting A from T). 95 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Based on the above results, either the move will be made and the pre-computed threshold will be stored or the threshold T will be loosened, all in one clock cycle. 
Additional clock cycles are needed to compute tightening the threshold if (i) a forward move is made, (ii) the first visit check is passed, and (iii) the pre-computed tightened threshold is not in the range of Δ. Fortunately, with reasonable choices of Δ, computer simulations suggest that these additional cycles of tightening are rarely needed. Similar speculative execution allows us to perform a look/move back in one clock cycle.

Figure 5.2: RTL architecture of the synchronous Fano algorithm (general synchronous architecture and the Threshold Adjust Unit (TAU))

The register-transfer-level architecture shown in Figure 5.2 is controlled by the finite state machine (FSM) illustrated in Figure 5.3. Three states, states 2-4, make up the main algorithm. In each of these states, the branch metric unit computes the needed selected branch metric using data that is stored in the sequence memory. Depending on control bits from the FSM (not shown), the selected branch metric is the one associated with either the best or the worst branch. In either case, the corresponding input bit is sent to the decision memory where, in the case the branch is taken, it is used to update the selected path. In state 2, the machine looks forward, moves forward if possible, and, if necessary, performs one step of threshold tightening. More specifically, after the selected branch metric is computed, the FSM performs a threshold check to see if the machine can move forward.
That is, ALU3 computes T minus the selected branch metric and the FSM examines the most significant bit. If the sign bit is a 1, the branch metric is no smaller than T and the threshold check passes. Otherwise, the threshold check fails. Meanwhile, ALU1 and ALU2 speculatively compute T+Δ and T+Δ minus the selected branch metric, respectively. These values, along with θ, a state variable shown in Figure 5.1, allow the FSM to determine whether the first visit check passes. That is, the first visit check passes if and only if θ=0, or if T+Δ is positive, or T+Δ minus the selected branch metric is positive. Based on the above results, the FSM acts in one of three ways. 1) The threshold check passes and a forward move is performed, but the first visit check fails, so that the NextState is set to state S2, in preparation of another look forward. 2) Both the threshold check and the first visit check pass, in which case the FSM moves to state S3. 3) The threshold check fails and the FSM moves to state S4 in preparation of a look/move backward. In the case of 1), the threshold register is updated with T minus the selected branch metric, computed by ALU3. In the case of 2), on the other hand, the threshold is updated with the tighter threshold T+Δ, computed by ALU1, whereas in the case of 3) the threshold register remains unchanged. In state S3, the FSM checks whether a subsequent tightening is needed (by computing and checking the sign of Δ+T). Simultaneously, it speculatively performs a
W ait f o r n e w d a t a SO 1 ReglstEr a nd w w RAH initialization Done SI S5 S2: Look/Move Forward and Tighten If If {CiflrentStste ==S2) Then If (Selected8ranchMetric>=T) Then Move FtMiward;T =T'BranchMetric; If (First Visit) Then T=T+A; NeKtState=S3, ElsB NextState =S2; // No tightening required Else NextState=S4 // Must look back 1 S3; Tighten or Look/Move 54: Look/Move Back If (CurrentStatE==S3) Then if ( ( T+Delta)<0) Then Exlra Tightening ; elsBif (SetectKiBranchMetric>=T) Then Move FfM-wwd; T=T-Se!ectsdBranchMetric; If (First Visit) Then T=T+A; NextState =S3; Else NextState =S2; / / No f^htenlng req. Else NextStalB=S4 } // Must look back If {CurrentState ==S4) Then If( (-8ackBranchMelrk:>=T) ) Then Move Back; T =T+BackBranchMetric; If (from w /tM -st node) Then NextState =S4; Else Set from worst node flag; NextState =S 2; Else Loosen ThreshoW; NextState=S2; Figure 5.3; Finite State Machine describing the RTL threshold check (by checking whether the Branch Metric is no smaller than T) which is needed in the event that the threshold need not be immediately tightened (i.e., in the event that tightening o f the threshold requires only the one addition o f A perform ed in state S2). If tightening is required, the NextState is set to state S3. For the case where no immediate tightening is needed, the FSM performs the same m ove/look forward/tightening/next-state operations as in state S2. State S3 is entered when the threshold check fails in either state S2 ot state S3. In state 84, a look backward is performed and, if possible, a backward move is made and the threshold is updated with the re-normalized threshold. Both the look baclcward and 99 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. re-normalization are performed through ALU3 by adding T and the selected (backward) branch metric. 
Specifically, the look backward check is satisfied if and only if the negative selected branch metric is greater than or equal to the threshold, i.e., the result of the ALU3 operation is negative, and the re-normalized threshold is precisely the output of ALU3. If a backward move is performed and it originated from a worst node, indicated via an additional FSM flag, NextState is set to state S4 in preparation of another look backward. Alternatively, NextState is set to state S2 in preparation of a look forward to the next best node, controlled by a LookNextBest flag that is not shown to simplify exposition. If the backward look fails, on the other hand, the threshold is updated with a loosened threshold, speculatively computed by ALU1, and NextState is set to state S2. The key feature of the speculative control strategy is that each forward move typically takes only one clock cycle, with negligible performance overhead associated with the first visit check or tightening. In particular, with reasonable choices of Δ, computer simulations suggest that additional cycles of tightening are rarely needed.

5.2.3 Chip Implementation

The chip [60] supports a packet length of N=128. The depth of the search tree, which also includes 7 tail bits, is thus 135. It supports a rate 1/2 convolutional code (i.e., n=2) with generator polynomials 1+D+D^2+D^5+D^7 and 1+D^3+D^4+D^5+D^6+D^7. For this prototype, we assumed the chip would have fixed branch metrics B(0)=2, B(1)=-7, and B(2)=-16, requiring 5 bits to represent. These metrics are ideal for the SNR range of 1 < Eb/N0 (dB) < 3. In practice, they would be dynamically adjusted when the estimated channel SNR is outside this region, which may require an extra bit.

We used automatic placement and routing tools with a combination of synthesized and manually laid-out components in the 0.5um HP14B CMOS process.
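To make the speculative evaluation concrete, the following Python sketch models one look-forward step of the FSM described above, using the chip's fixed branch metrics B(0)=2, B(1)=-7, B(2)=-16. It is an illustrative software model, not the thesis RTL: the value of Δ, the function and variable names, and the flag encoding are assumptions.

```python
# Illustrative model of one speculative look-forward step (Section 5.2.2).
# DELTA and all names are assumptions for the sketch, not from the thesis RTL.
DELTA = 8                      # assumed threshold quantum
B = {0: 2, 1: -7, 2: -16}      # fixed branch metrics B(number of bit errors)

def look_forward(T, received, predicted, delta_flag):
    """Evaluate the threshold check and first visit check for one branch.

    T          -- current threshold
    received   -- 2-bit received branch, e.g. (1, 0)
    predicted  -- 2-bit predicted branch from the code generator
    delta_flag -- stand-in for the state variable of Figure 5.1 (0 or 1)
    """
    errors = sum(r != p for r, p in zip(received, predicted))
    metric = B[errors]
    # ALU3: the threshold check passes iff the metric is no smaller than T
    # (in hardware, the sign bit of T - metric).
    threshold_ok = metric >= T
    # ALU1/ALU2 speculatively provide T+DELTA and T+DELTA-metric so the
    # first visit check can be evaluated in the same cycle.
    first_visit = (delta_flag == 0) or (T + DELTA > 0) or (T + DELTA - metric > 0)
    if not threshold_ok:
        return T, "S4"                 # look/move backward next
    if first_visit:
        return T + DELTA, "S3"         # forward move with tightening
    return T - metric, "S2"            # forward move, no tightening
```

For example, with T = 0 and an error-free branch, `look_forward(0, (1, 0), (1, 0), 1)` tightens the threshold by Δ and selects state S3; a two-error branch fails the threshold check and selects state S4.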
The layout has an area of 1.2mm by 1.8mm. PowerMill was used to estimate the performance of the design. At a 1.5V power supply the design successfully operated at 15MHz, and at 3.3V it successfully operated at 100MHz.

Chapter 6

6. The Asynchronous Fano

Deeper analysis of the Fano algorithm shows that the operation of the algorithm can be divided into two regions: the Error Free Region and the Error Region. In the Error Free Region, the algorithm moves forward while the received bits from the sender are error free and match the expected bits. In this region of operation the un-normalized threshold is incremented by a constant value, namely the value given for an error-free branch of the tree. If the threshold value is known at the time the algorithm enters the Error Free Region, then the next value of the threshold can be calculated. The normalized threshold, however, stays in the range -Δ < T < 0 and rotates through a finite number of values in a pre-determined order. Consequently, instead of calculating the threshold values explicitly, a pointer to a lookup table containing these pre-determined values is incremented. When an error is encountered, the design enters the Error Region, where the current value of the threshold is accessed from the lookup table and the full algorithm is applied in order to determine whether to move forward, move backward, or loosen the threshold. The algorithm stays in the Error Region until a node in the search tree is reached for the first time and the move was a forward move, at which point the algorithm moves back into the Error Free Region. The algorithm continues until the end of the tree by alternating between the error-free and the error regions.
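The rotation of the normalized threshold can be illustrated with a short sketch. Assuming, for illustration only, Δ = 8 and an error-free branch metric of B(0) = 2 (the values from Section 5.2.3), and an assumed normalization convention of keeping T in (-Δ, 0], the normalized threshold cycles through a small fixed set of values, which is exactly what makes a lookup-table pointer sufficient:

```python
# Illustrative sketch of the Error Free Region threshold rotation.
# DELTA and B0 are taken from Section 5.2.3 for concreteness; the
# normalization convention (range (-DELTA, 0]) is an assumption.
DELTA, B0 = 8, 2

def step(t_norm):
    """Advance the normalized threshold by one error-free branch."""
    t = t_norm + B0
    return t if t <= 0 else t - DELTA   # re-normalize into (-DELTA, 0]

# Enumerate the rotation: the values repeat in a pre-determined order,
# so a pointer into this table replaces explicit threshold arithmetic.
table = [0]
while (nxt := step(table[-1])) != table[0]:
    table.append(nxt)
print(table)   # [0, -6, -4, -2] with these constants
```

In the Error Free Region the design then only increments a pointer modulo the table length; the full threshold value is fetched from the table only when an error is encountered.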
For high SNR applications most of the received packets have little to no errors; therefore most of the decoding process consists of reading the data from the memory, comparing it to the predicted data, and writing the decision to the memory, and involves little to no multi-bit additions/subtractions/comparisons due to loosening or tightening the threshold. This fact motivates a two-block architecture that is specifically designed to handle the two different operating regions of the algorithm efficiently.

6.1 The Asynchronous Fano Architecture

The proposed asynchronous architecture, shown in Figure 6.1, localizes the Error-Free region in a small block that is highly optimized [48]. In particular, the Branch Metric Unit (BMU) is partitioned into a Skip Ahead Unit optimized for the Error Free Region, and a Threshold Adjust Unit and Branch Metric Calculator that are active only in the Error Region and have implementations analogous to the synchronous version. The data received by the decoder via the Transmitted Input Data channel are stored in the Received Memory. The fast Skip Ahead Unit requests data from the Received Memory in 8-word chunks via the Previous/Next channel, where each data word is, for the (7,1,2) code, two bits wide. As the Skip Ahead Unit decodes the code and moves forward in the tree, it locally stores its decisions. Every 8 decisions are sent to the Decision Memory via the Last 8 Decisions channel. When an error is encountered, the Skip Ahead Unit may need to go back in the tree to explore different branches by requesting previous decisions from the Decision Memory, which arrive on the Previous 8 Decisions channel. The data flow between the Decision Memory and the Skip Ahead Unit is controlled via the Write/Read channel.
[Figure 6.1: RTL architecture of the asynchronous implementation. The general asynchronous architecture comprises the Skip-Ahead Unit (SAU), Main FSM, Branch Metric Calculator, Threshold Adjust Unit (TAU), counter, and lookup table, connected by channels including Skip Input Command, Next/Previous, Start of Next Packet, End of Packet, Previous 8 Decisions, Last 8 Decisions, Write/Read, IncPointer, and Lookup.]

In the Error Free Region, the received bits are read from the Received Memory and decoded in the Skip Ahead Unit. The resulting decisions are then sent to the Decision Memory, and the SAU increments the look-up table pointer via the IncPointer channel. In this region, the Main FSM, Branch Metric Calculator, and the TAU are inactive. When an error is encountered, the SAU informs the Branch Metric Calculator via the Error channel and also sends it the received branch bits and the predicted branch bits calculated using the previous decisions and the convolutional code. Depending on the move commanded by the Threshold Adjust Unit via the LFB (look forward best), LB (look backward), LFNB (look forward next best), and LFBTE (look forward best until error) channels, the Branch Metric Calculator calculates and compares the branches, selects the appropriate one, and sends it to the TAU with additional information indicating whether the move originated from a worse branch and whether the branch had any errors (via the additional BmuErr and BmuFwn channels). The first time the TAU is accessed after an error has occurred, the TAU reads the normalized threshold from the look-up table and updates the threshold value. The TAU is implemented analogously to the synchronous version and is responsible for deciding to move forward, move backward, or adjust the threshold T.
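The division of labor just described, with the fast SAU handling the Error Free Region and the TAU/BMU handling the Error Region, can be summarized by the following control-loop sketch. This is a purely illustrative software analogue: the hardware uses asynchronous channels rather than a sequential loop, and the function names here are hypothetical.

```python
# Software analogue of the SAU/TAU alternation: the fast path only compares
# and moves forward; the full Fano step runs only in the Error Region.
def decode(predicted_ok, full_fano_step, depth):
    """predicted_ok(node)   -- True if received bits match the prediction
    full_fano_step(node)    -- one TAU/BMU iteration; returns (new_node,
                               back_to_sau) where back_to_sau is True on a
                               first-visit forward move"""
    node, region = 0, "error-free"
    while node < depth:
        if region == "error-free":
            if predicted_ok(node):
                node += 1                  # fast skip-ahead move
            else:
                region = "error"           # hand control to the TAU/BMU
        else:
            node, back_to_sau = full_fano_step(node)
            if back_to_sau:                # first visit of an error-free node
                region = "error-free"
    return node

# One injected error at node 3, resolved in a single full step:
assert decode(lambda n: n != 3, lambda n: (n + 1, True), 6) == 6
```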
Upon deciding a move, the relevant information is sent to the SAU and a new command is issued to the Branch Metric Calculator. Finally, when a new error-free node is reached for the first time, the TAU issues the LFBTE command, stores the normalized threshold, updates the pointer to the look-up table, and resumes operation in the fast SAU via the Back To Skip-Ahead channel. The operation switches back and forth between the SAU and the TAU until all the data is decoded. Upon reaching the end of the tree, the Decision Memory sends out the decoded data. The fact that the asynchronous circuit has no global clock allows the architecture to be naturally divided into two blocks, each operating at its ideal speed, that communicate only when and where needed via the inter-block asynchronous channels.

6.2 The Skip-Ahead Unit

A high level implementation of the SAU is shown in Figure 6.2. The core of the SAU is the Error Detector, which compares the predicted branch bits with the received branch bits and stores the decision. To operate at full rate, the memories must keep up with writing/reading one data word per decoding cycle. As the memory capacity increases, this becomes a difficult task, and for this reason we have opted to use shift registers that act as caches for the bigger memories. In particular, the Fast Data Register stores 8 words from the Received Memory, and the Fast Decision Register acts as an 8-word read/write cache for the Decision Memory. When the Received Memory sends an 8-word packet to the Fast Data Register, the Received Memory speculates that the SAU will not encounter any errors and will move forward, and thus prepares to send a new set of data. This cache structure allows the larger memory to run at 1/8 the speed of the SAU.
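The 1/8 ratio can be seen in a small software analogue of the Fast Data Register (illustrative only; the real register is a PCHB shift-register structure, and the class and field names here are invented):

```python
# Software analogue of the 8-word Fast Data Register: the slow Received
# Memory is accessed once per chunk while the SAU consumes word by word.
CHUNK = 8

class FastDataRegister:
    def __init__(self, memory):
        self.memory = memory             # the slow Received Memory
        self.base = 0
        self.words = memory[:CHUNK]      # speculatively fetched first chunk
        self.memory_accesses = 1

    def read(self, index):
        if not (self.base <= index < self.base + CHUNK):
            self.base = (index // CHUNK) * CHUNK
            self.words = self.memory[self.base:self.base + CHUNK]
            self.memory_accesses += 1    # one slow access per 8 fast reads
        return self.words[index - self.base]

reg = FastDataRegister(list(range(32)))
data = [reg.read(i) for i in range(32)]
assert reg.memory_accesses == 4          # 32 word reads, only 4 chunk fetches
```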
The same motivation applies to the use of the Fast Decision Register, with the exception that it is a read/write register. Both of the registers have an associated controller to request and send data to their respective memories.

[Figure 6.2: Detailed implementation of the Skip-Ahead Unit, comprising the Controller, Fast Data Register, Error Detector, Fast Decision Register, and Skip Ahead Logic, with channels including IncPointer, IncCount, Received Branch Bits, Previous 8 Decisions, Up/Down, Move Forward/Backward, Last 8 Decisions, Branch Bits, and Predicted Branch Bits.]

The most recent decisions in the search tree, which always reside in the Fast Decision Register, are sent to the Code Generator, which predicts the values of the new branch bits. The predicted branch bits are compared to the received ones in the Error Detector. If there is a match, indicating that there is no error, the decision is stored in the Fast Decision Register, an internal counter and the pointer in the look-up table are incremented, and new data are requested from the shift registers via the Move Forward/Backward, Up/Down, IncCount, and IncPointer channels. If there is no match, then an error has been encountered. The predicted and received branch bits are sent to the Branch Metric Calculator and control of the shift registers is transferred to the TAU. The critical loop in the Error Free Region consists of the Fast Shift Register, the Error Detector, the Fast Decision Register, and the Code Generator. For high SNR operation, most of the time the decoder operates in the Error Free Region; therefore our goal is to achieve high speed in this region by optimizing the circuit. However, if and when the circuit encounters an error, it enters the Error Region and the critical path consists of the Fast Shift Register and the Convolutional Code Generator serving data to the slow BMU.
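The Code Generator and Error Detector pair can be modeled directly from the generator polynomials given in Section 5.2.3. The sketch below is an illustrative software model (the tap convention, i.e. which end of the history is most recent, is an assumption):

```python
# Illustrative model of branch-bit prediction for the (7,1,2) code with
# generator polynomials 1+D+D^2+D^5+D^7 and 1+D^3+D^4+D^5+D^6+D^7.
G1 = (0, 1, 2, 5, 7)
G2 = (0, 3, 4, 5, 6, 7)

def predict_branch(decision, history):
    """history[k-1] is the decision made k steps ago (7 previous decisions)."""
    bits = [decision] + list(history)    # bits[t]: input bit t steps ago
    return tuple(sum(bits[t] for t in taps) % 2 for taps in (G1, G2))

def no_error(received, decision, history):
    """Error Detector: a match means the hypothesized branch is error free."""
    return received == predict_branch(decision, history)

# Hypothesized bit 1 with an all-zero history predicts the branch (1, 1):
assert predict_branch(1, [0] * 7) == (1, 1)
assert no_error((1, 1), 1, [0] * 7)
```

In the hardware, each XOR of taps is realized by the tree of 3-input XOR gates mentioned in Section 6.6, and a mismatch hands control to the BMU.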
In the Error Region, the operation is the same as in the synchronous version, consisting of a number of sequential operations. In this region the speed is expected to be comparable to the synchronous case.

6.3 The Memory Design

Since the chip supports a packet length of only 135 bits (128 data and 7 tail bits), we have opted to design the main data memory blocks of the Received and Decision memories using standard PCHB templates. However, we introduced unacknowledged tri-state buffers on the data bus to efficiently allow multiple drivers of the bus. This is typical in synchronous design, but does introduce some minor timing assumptions not typical of PCHB-based designs. We also used standard place and route tools for the physical design of the memories for faster design time, at the expense of more area and power consumption.

[Figure 6.3: Implementation of the Received Memory. Blocks of 8-packet words each drive the data bus through tri-state buffers, under control of the Received Memory Controller and the Next/Previous channel.]

In particular, as depicted in Figure 6.3, the received memory consists of n blocks where each block can hold 8 words. For the (7,1,2) convolutional code each word is 2 bits. The blocks are FIFOs implemented with PCHBs. At any time only one of the tri-state buffers is enabled, allowing only one of the blocks to send its data. The Fano algorithm is a sequential tree search algorithm; therefore the SAU accesses the memory sequentially via the Next/Previous channel. The Received Memory Controller responds to a request by enabling a preceding or succeeding tri-state buffer and sending new data. The buffer captures the new data and sends it to the requesting unit.
The timing assumption for correct operation is that the delay from the Next/Previous channel through the Received Memory Controller and the selected tri-state buffer should be less than the delay from the Next/Previous channel to the output buffer. Moreover, the output buffer should only latch its input when the enabled tri-state buffer's outputs have changed and stabilized. The decision memory has a similar structure; however, since it is a read/write memory, each of the blocks can be accessed individually to read from or to write to.

6.4 The Fast Data and Decision Registers

The fast data register is implemented using two 8-word, 1-bit shift registers, as shown in Figure 6.4. The register consists of 8 conditional-input, conditional-output 1-bit memory pipeline stages. Depending on the command, Cmd, it either shifts forward by receiving new data from InF and sending the old to OutF, shifts backward by receiving data from InB and sending the old to OutB, or loads 8 words in parallel from the main memory. The parallel load command overwrites the old data tokens inside each stage. The command channel Cmd should go to all of the stages; however, to prevent the use of a big C-tree to generate the Cmd acknowledgement signal, the Cmd signal is broadcast with a tree of copy buffers. Although this solution reduces the load on the Cmd channel compared to copying it to all stages directly, it increases the critical loop delay of the algorithm. The fast decision register is implemented similarly.
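The three commands of such a stage chain can be captured in a small behavioral model (a software sketch under assumed shift directions, not the PCHB implementation):

```python
from collections import deque

# Behavioral model of one 8-stage fast shift register: Cmd selects a
# forward shift, a backward shift, or a parallel load from main memory.
class FastShiftRegister:
    def __init__(self, depth=8):
        self.stages = deque([0] * depth, maxlen=depth)

    def apply(self, cmd, data=None):
        if cmd == "forward":             # new word in via InF, oldest out via OutF
            out = self.stages[-1]
            self.stages.appendleft(data)
            return out
        if cmd == "backward":            # word back in via InB, out via OutB
            out = self.stages[0]
            self.stages.append(data)
            return out
        if cmd == "load":                # InP: overwrite all stage tokens at once
            self.stages = deque(data, maxlen=len(self.stages))

reg = FastShiftRegister()
reg.apply("load", [1, 2, 3, 4, 5, 6, 7, 8])
assert reg.apply("forward", 9) == 8
assert list(reg.stages) == [9, 1, 2, 3, 4, 5, 6, 7]
```

A `deque` with `maxlen` mimics the token-overwrite behavior of the parallel load; the real design instead broadcasts Cmd through copy buffers to 8 conditional pipeline stages.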
[Figure 6.4: Implementation of a 1-bit fast shift register, with 8-packet words loaded in parallel. Legend: Cmd = command, InF = forward input, InB = backward input, OutF = forward output, OutB = backward output, InP = parallel input. The Cmd channel is broadcast to the stages through a tree of copy buffers.]

6.5 Simulation Results and Comparison

The core layout of the chip, designed in TSMC 0.25um CMOS technology, is illustrated in Figure 6.5. Nanosim simulations on the extracted layout show that the circuit runs at 450MHz, consumes 32mW at 25°C, and has an area of 2600um x 2600um = 6.76mm^2. The asynchronous chip runs about 2.15X faster than its synchronous counterpart. However, it occupies 5X the area. This is partially due to the fact that both of the memories, which occupy half the chip area in the asynchronous chip, are implemented with PCHBs. Lastly, the design consumes 1/3 the power of its synchronous counterpart.

[Figure 6.5: Layout of the asynchronous Fano, showing the Decision Memory, Received Memory, Threshold Adjust Unit, Branch Metric Calculator, and Skip-Ahead Unit.]

Figure 6.6 (a) below shows the post-layout simulation results for the circuit operating in the Error Free Region. Since the Fast Data and Decision Registers can only hold 8 words, once the data held by the Fast Data Register is consumed, a new set of data is requested from the main Received Memory. This request and data transfer causes a slight delay, which can be observed in the waveforms as a slight gap every 8 pulses. Since there are no errors in the Error Free Region, the nofail_f signal used to indicate the encounter of an error is never asserted; instead the nofail_t signal, which indicates that there are no errors, is asserted by the detection logic. On the other hand, as shown in Figure 6.6 (b), in the Error Region, as errors are encountered the decoder moves back and forth to find the correct path. This can be observed in the assertion of the shiftb (shift back) and nofail_f signals.

[Figure 6.6: a) Error-Free and b) Error Region operation waveforms. In (a), 8 bits are decoded with no errors, always shifting forward; in (b), a total of 128 bits are decoded, with backward moves whenever an error is encountered.]

6.6 Skip-Ahead Unit with RSPCHB

The Skip-Ahead Unit has been re-implemented with the RSPCHB to investigate whether the leaf-level advantages of the RSPCHB template carry over to the system level. The critical loop of the SAU is shown below in Figure 6.7. It consists of the previously illustrated Fast Shift Register, Fast Decision Register, Error Detect, and Code Generator, where the Code Generator unit is implemented with a tree of 3-input XOR gates. In addition to these blocks, there is the XOR_SPLIT unit, which sends the comparison of the actual received branch bits with the estimated branch bits either to the BMU or to the ERROR-DETECT unit. In particular, if the SAU is operating in the Error Free Region then the comparison result is sent to the ERROR-DETECT, otherwise to the BMU.

[Figure 6.7: The critical loop of the Skip-Ahead Unit, comprising the XOR_SPLIT, ERROR-DETECT, FILTER, and MERGE units, the Fast Decision Register, and the XOR tree of the Code Generator, with the Decision_bit and noError channels and connections to and from the BMU.]
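The system-level behavior of a loop like the one in Figure 6.7 follows standard token-ring arithmetic: the achievable cycle time is bounded both by the local stage cycle time and by the loop forward latency divided by the number of circulating tokens. The sketch below is illustrative; the per-stage latency of 2 transitions and the RSPCHB figures are assumptions chosen to match the discussion in this section.

```python
# Token-ring throughput bound: the loop cycle time is the larger of the
# local stage cycle time and the loop forward latency per token.
def loop_cycle_time(stage_cycle, latency_per_stage, stages, tokens):
    forward_latency = stages * latency_per_stage
    return max(stage_cycle, forward_latency / tokens)

# PCHB loop: an 18-transition stage cycle, 9 stages of 2 transitions of
# forward latency each, and one token -- a balanced loop at 18 transitions.
assert loop_cycle_time(18, 2, 9, 1) == 18

# With ~20% faster stages but forward latency improved only ~14%
# (illustrative numbers), the forward latency becomes the bottleneck:
assert round(loop_cycle_time(14.4, 1.72, 9, 1), 2) == 15.48
```

The max() structure of the bound is the point: once the forward-latency term dominates, speeding up individual stages no longer improves loop throughput proportionally.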
The ERROR-DETECT unit, depending on whether there is an error, either makes a decision, sends it to the FILTER via the Decision_bit channel, and informs the BMU that there are no errors, or informs the BMU that there is an error. The FILTER acts as a bit consumer and consumes the data on the Decision_bit channel when it is garbage. The MERGE unit sends the data coming from the FILTER to the Fast Decision Register if in the Error Free Region, and otherwise forwards the decision data sent by the BMU. This critical loop consists of 9 stages, and therefore the time it takes one token to travel around the loop, namely the loop forward latency, is 18 transitions. This is equal to the cycle time of some of the pipeline stages in the loop; thus the loop is balanced. We have replaced every PCHB stage in the SAU with an RSPCHB equivalent and added request breakers where necessary. The re-implemented SAU ran at a throughput 14% higher than the implementation using the PCHB. This is lower than the 20-21% higher throughput achieved at the leaf cell level. While the individual pipeline stages that were replaced did see a cycle time improvement of 20%, the forward latency of the loop only improved by 14%. In other words, in the re-implemented design, the bottleneck is now the forward latency. The lesson learned here is that the leaf-level advantages of cells do not necessarily carry over directly to the system level, but rather depend on the pipeline structure, the number of stages in the loop, and the number of tokens running in the loop.

Chapter 7

7. An Asynchronous Semi-Custom Physical Design Flow

The general design flow that the USC Asynchronous Design Group has refined was already covered in the introduction of this thesis. This chapter focuses on the last parts of the flow, discussing our contributions in the areas of gate-level and physical design.
7.1 Physical Design Flow Using Standard CAD Tools

One of the biggest obstacles to designing asynchronous circuits today is the lack of CAD tools specifically targeted at the design of such chips. However, it is still possible to complete a fairly complex chip in a reasonable amount of time using the standard CAD tools used for synchronous design. Figure 7.1 below illustrates the flow [22]. There is no difference in the initial specification step between synchronous and asynchronous design, since a specification typically describes the expected functionality (Boolean operations) of the designed block, as well as the delay times, the silicon area, and other properties such as power dissipation. Usually, the design specifications allow considerable freedom to the circuit designer on issues concerning the choice of a specific circuit topology, individual placement of the devices, the locations of input and output pins, and the overall aspect ratio (width-to-height) of the final design. The actual implementation of the asynchronous circuit starts at the schematic level. The top-level circuit or design is hierarchically decomposed until the design consists of a netlist of leaf cells. If a leaf cell library exists, then the automatic place and route tool can generate the layout using this library. Otherwise the leaf cells can be further decomposed into gates, where the gate-level netlist can be mapped to a gate library.
[Figure 7.1: Physical design flow using standard CAD tools. Specification; Schematic (Virtuoso, Synopsys); Simulation and Analysis (HSPICE/Nanosim); Verilog netlist (.v); Place & Route (Silicon Ensemble); LVS & DRC (Virtuoso, Dracula); Chip Assembly; Chip Fabrication. The asynchronous library provides symbol, schematic, functional, layout, and abstract cell views.]

Depending on the final design size, either the whole design can be automatically placed and routed using the P&R tool, or the design can be partitioned into smaller blocks and each block placed and routed separately. This allows for better control over the layout for performance. Once the whole design is laid out and Design Rule Check (DRC) is completed, a Layout-vs.-Schematic (LVS) check must be performed to ensure that the layout is the same as the schematic. This step is followed by extraction of the layout for post-layout spice simulation. We have used the Dracula tool from Cadence for this step. The extracted netlist accurately represents the laid-out transistor dimensions as well as the wiring resistance and capacitance. Depending on whether the post-layout simulation achieves the desired performance and power requirements, the top-level design might have to be changed and the whole step repeated. The architectural and leaf cell design steps of the physical design flow proposed in this thesis are illustrated below in Figure 7.2.

[Figure 7.2: Asynchronous circuit design flow proposed, covering both gate and leaf cell based physical design flow. Architecture design produces a high-level schematic, which is decomposed into either a gate-level netlist (technology-mapped to a gate library) or a leaf cell netlist (technology-mapped to a leaf cell library), followed by physical place and route.]

The high-level schematic is developed in C and Verilog and is used to describe the specification of the design. The high-level schematic is hierarchically implemented by decomposing the design to the lowest-level communicating blocks, namely the PCHB leaf cells. In the micro-architecture step, the designer can choose to implement the architecture with various methods, ranging from template-based fine-grain pipelines using delay-insensitive cells to components relying on bounded delays with no pipelining at all. The asynchronous Fano has been implemented with fine-grain pipelining using PCHB templates. Slack optimization in consideration of performance is also completed in this step. At the end of the micro-architecture design there are two possible options. One option is to continue the decomposition and generate a leaf cell design. The leaf cell design will depend on the template used (PCHB, RSPCHB, LP3/1, HC, ...). The next step is to generate a gate-level netlist of the whole circuit, just as in synchronous design. The gate library, consisting of static and dynamic gates, will be mapped to the netlist, and the design can be laid out using standard place and route tools. The other option is to generate a leaf cell netlist rather than going any further, and to use a leaf cell library. The leaf cell library would be mapped to the netlist and the automatic place and route would be done at the leaf cell level rather than the lower gate level. This option would probably yield denser circuits with better performance, since the leaf cells would be optimized and laid out using more of a full-custom approach, although even automatic place and route can be applied to generate the leaf cells. Choosing the first option and applying place and route directly on a gate netlist can lead to a number of undesired effects. One of them is a less dense circuit, since rather than sharing area and optimizing leaf cells, the leaf cells will be implemented with discrete gates. Another issue is that the handshaking circuits might not be as close to the
dynamic functional evaluation circuit when the place and route is applied to the gate netlist rather than the leaf cell netlist, thereby affecting performance. We have used the Virtuoso Schematic Editor from Cadence as a schematic entry tool to design the PCHB-based leaf cells. All of the decomposition was also done using this tool. Initially, only the functional and symbol views of the dynamic and static gates needed in the design were created and added to the asynchronous cell library. The functional description of a dynamic circuit used as a buffer is shown below in Figure 7.3.

module Dynamic_BUFFER_Function (nBUF0, nBUF1, A0, A1, BUFe, en, BUF1, BUF0);
  output nBUF0, nBUF1, BUF1, BUF0;
  input  A0, A1, BUFe, en;
  reg    nBUF1, nBUF0, BUF1, BUF0, temp;
  initial begin temp = 0; end
  parameter D1 = 10; // unit delay 1
  parameter D2 = 20;
  always @(BUFe or A0 or A1 or en) begin
    if (BUFe == 1 && en == 1 && temp == 0) begin
      if (A1 == 1 && A0 == 0) begin
        nBUF1 <= #D1 0; nBUF0 <= #D1 1;
        BUF1  <= #D2 1; BUF0  <= #D2 0;
        temp = 1;
      end
      else if (A1 == 0 && A0 == 1) begin
        nBUF1 <= #D1 1; nBUF0 <= #D1 0;
        BUF1  <= #D2 0; BUF0  <= #D2 1;
        temp = 1;
      end
    end
    else if (BUFe == 0 && en == 0) begin
      nBUF1 <= #D1 1; nBUF0 <= #D1 1;
      BUF1  <= #D2 0; BUF0  <= #D2 0;
      temp = 0;
    end
  end
endmodule

Figure 7.3: The functional description of a dynamic buffer

Once the design was completed and its correctness had been verified at the behavioral level, the schematic (transistor) views of the cells were implemented for spice simulation. The transistor-level view of the dynamic buffer is shown below in Figure 7.4.
[Figure 7.4: The transistor-level view of a dynamic buffer, with cross-coupled pull-down stacks driven by A1, A0, en, and BUFe producing nBUF1 and nBUF0, followed by output inverters producing BUF1 and BUF0.]

For spice simulation we have used Nanosim from Synopsys. The layout views were created once we were confident that the design worked as expected at the transistor level. The layout view of the dynamic buffer is shown below in Figure 7.5.

[Figure 7.5: The layout view of a dynamic buffer]

One important aspect of designing cells for dynamic logic is charge sharing and transistor sizing. After a number of test simulations on individual cells, we decided to use 8X for the size of the output transistors and 2X for the pull-down transistors. The staticizer inverters were set to approximately 1/10 the strength of the pull-down transistors to balance reliability of operation against speed. The other aspect of reliable operation is charge sharing. Unlike the schematic in Figure 7.4, if the nBUF1 and nBUF0 signals were generated using the A, en, and BUFe signals as a stack of three transistors in series, there would be the possibility of the internal dynamic nodes nBUF1 and nBUF0 losing their value due to charge sharing. This scenario could occur if A and en were asserted high, turning on their respective transistors, while BUFe was still asserted low. To prevent this problem, we have opted to use a widely known solution of doubling the pull-down logic and cross-coupling it, as illustrated in Figure 7.4. To reduce the load on the automatic place and route tool and to meet the performance of the circuit, we partitioned the top-level design into a number of blocks, as shown in Figure 6.5.
The place and route, which was performed using Silicon Ensemble from Cadence, was not timing driven, to show that a QDI-based asynchronous circuit will work no matter what the delays are, as long as the isochronic fork assumption is met. Figure 7.6 below is a snapshot illustrating the cell placement of the counter block. The picture is zoomed in to the lower left corner of the design for clarity.

Figure 7.6: Cell placement in Silicon Ensemble

Figure 7.7: Routed Counter block with Silicon Ensemble

Each block was streamed back into the Virtuoso Layout Editor for DRC and LVS checks against its transistor-level netlist. The LVS check also generates an extracted netlist of the design for spice simulation. A short sample of the extracted netlist is shown below. The flattened netlist consists of two parts: the transistor connections and the extracted capacitances.

* CADENCE/LPE SPICE FILE
* DATE : 5-JUN-2003
* MOS XTOR PARAMETERS FROM : 7MOSXREF
.GLOBAL VDD! GND!
.SUBCKT INC2 DATA REQ ACK NRST4 L0 L1
* CORNER ADJUSTMENT FACTOR = 0.0000000
MM2-XI60-XI36 XI36-A NET0432 VDD! VDD! PCH L=0.24U W=2.80U AD=1.04P
+ PD=3.54U AS=1.88P PS=6.94U NRS=0.079 NRD=0.079
MM3-XI60-XI36 XI36-A NR<6> VDD! VDD! PCH L=0.24U W=2.80U AD=1.04P
+ PD=3.54U AS=1.88P PS=6.94U NRS=0.079 NRD=0.079
MM7-XI60-XI36 XI36-XI60-NET029 NET0432 XI36-A GND! NCH L=0.24U W=1.20U
+ AD=0.24P PD=1.60U AS=0.44P PS=1.94U NRS=0.183 NRD=0.167
MM7-XI60-XI36-1 685 NET0432 GND! GND! NCH L=0.24U W=1.20U AD=0.24P
+ PD=1.60U AS=0.80P PS=3.74U NRS=0.183 NRD=0.167
MM1-XI59-3 NET72 XI59-NET35 VDD! VDD! PCH L=0.24U W=2.50U AD=0.93P
+ PD=3.24U AS=1.65P PS=6.32U NRS=0.088 NRD=0.088
* TOTAL # OF MOS TRANSISTORS FOUND : 2018
C1 NET77 GND! 8.00421E-15
C2 NET209 GND! 1.06917E-14
C3 NET188 GND! 1.16892E-14
C4 NET121 GND! 1.34065E-14
C5 NET215 GND! 1.02445E-14
C583 XI35-XI56-NET016 GND! 6.91710E-17
C584 XI29-XI50-XI50-NET016 GND! 1.85150E-17
C585 XI30-XI50-XI50-NET016 GND! 4.84647E-17
* TOTAL # OF CAPS FOUND : 585
* COMMENTED : 2
.ENDS

Figure 7.8: Extracted netlist of a block

The layout of the whole design is shown in Figure 6.5. All of the blocks have been individually placed and routed. However, the routing between the blocks was done manually.

Chapter 8

8. Conclusion and Future Work

We have presented an asynchronous, template-based methodology that facilitates hierarchical design using standard asynchronous channel protocols, removing the need for complicated hazard-free logic synthesis, and naturally providing fine-grain pipelines with high throughput. Our contributions are in the gate-level design, architectural implementation, and physical design flow of asynchronous circuits. At the gate level we have introduced seven different templates, which range from being quasi-delay-insensitive (QDI) to having timing assumptions in order to achieve higher speeds. In particular, we have presented the new high-speed RSPCHB and RSPCFB QDI pipeline templates, including the non-linear dynamic pipeline implementations for forks, joins, and more complex configurations in which channels are conditionally read and/or written.
Timing analysis and HSPICE simulation results demonstrate that our new RSPCHB achieves a ~20% throughput improvement over its PCHB counterpart and our new RSPCFB achieves a ~40% throughput improvement over its PCFB counterpart. We have also introduced five new pipeline templates that achieve higher speeds with modest to aggressive timing assumptions. In particular, researchers from Columbia University proposed several high-speed dynamic-logic pipeline templates based on Williams' well-known PS0 pipeline, which is an optimized version of Caltech's PCHB in which the input completion detector is removed and a timing assumption is added to assure correct operation. The Columbia pipelines, which also carry PS0's timing assumption, were introduced for linear datapaths. We have extended these Columbia pipelines to handle non-linear datapaths. Two sets of templates arise from adapting each of the LPsr2/2 and LP3/1 pipelines, and one set of templates arises from adapting the HC pipelines. Timing analysis and HSPICE simulation results demonstrate that forks and joins can be implemented with a ~5% to 20% performance penalty over linear pipelines. All pipeline configurations have timing margins of at least two gate delays, making them a good compromise between speed and ease of design.

In the area of architectural design, we presented the design and implementation of an asynchronous sequential decoder that achieves performance close to full-custom designs with design times that approach commercially available ASIC standard-cell flows. The Fano algorithm is a tree search algorithm that achieves good performance with a low average complexity at a sufficiently high signal-to-noise ratio (SNR) and is a low-power alternative to the more conventional Viterbi decoder [32], [11].
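To make the tree-search idea concrete, the toy Python sketch below decodes a rate-1/2, constraint-length-3 convolutional code with a stack-based sequential search, a simpler cousin of the Fano algorithm (Fano revisits nodes using a moving threshold instead of keeping an explicit stack, but both explore the same code tree). The generators, metric constants, and message are illustrative choices, not taken from the thesis decoder:

```python
import heapq

G = (0b111, 0b101)  # rate-1/2, K=3 generator polynomials (7 and 5 octal)

def parity(x):
    return bin(x).count("1") & 1

def encode(bits):
    """Convolutionally encode a list of info bits into channel bits."""
    state, out = 0, []
    for b in bits:
        state = ((state << 1) | b) & 0b111
        out += [parity(state & g) for g in G]
    return out

def decode(received, n_bits):
    """Stack-algorithm tree search with a Fano-style branch metric."""
    # Per channel bit: +0.5 on agreement, -4.5 on disagreement, so a
    # mostly-correct long path overtakes short perfect prefixes while
    # every extra mismatch is heavily penalized.
    heap = [(0.0, (), 0)]            # (-metric, decoded prefix, encoder state)
    while heap:
        neg_m, path, state = heapq.heappop(heap)
        if len(path) == n_bits:      # best frontier node is a full path: done
            return list(path)
        seg = received[2 * len(path): 2 * len(path) + 2]
        for b in (0, 1):             # extend the best partial path both ways
            s = ((state << 1) | b) & 0b111
            branch = [parity(s & g) for g in G]
            m = -neg_m + sum(0.5 if c == r else -4.5
                             for c, r in zip(branch, seg))
            heapq.heappush(heap, (-m, path + (b,), s))

msg = [1, 0, 1, 1, 0]
chan = encode(msg)
chan[3] ^= 1                         # inject a single channel-bit error
assert decode(chan, len(msg)) == msg
```

With one flipped channel bit the search still recovers the transmitted bits, and, as in Fano decoding, the amount of work grows with the noise level rather than being fixed per bit as in a Viterbi decoder, which is the source of its low average complexity at high SNR.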
Compared to a previously designed synchronous Fano decoder [60], the asynchronous version consumes 1/3 the power, runs at 2.15 times the speed, and has 5X the area, assuming standard process normalization [10]. The design has novel features in its architecture and micro-architecture and illustrates the application of a commercially standard design flow to asynchronous design.

In the area of physical design flow, we presented a flow and methodology for the design of QDI pipeline template-based circuits with standard schematic entry, simulation, and physical design tools, demonstrating that the benefits of asynchronous design can be achieved with a largely commercially standard flow. We demonstrated the advantages of this methodology with the implementation of the Fano algorithm. As a result of this research, we have also made publicly available an asynchronous gate library targeted for TSMC 0.25 µm technology.

In conclusion, this thesis demonstrates the viability of asynchronous design to achieve very high performance with a standard back-end flow. While it provides a framework for high-speed asynchronous ASIC development, there is still much room for future work. While gate-level pipelining yields high speed, it clearly costs area. There is a need to study this area cost in more detail and to explore templates that require less area at the cost of performance and possibly a few additional timing assumptions. These would be ideally suited for those units in large systems that do not need to run at full throughput. Other areas that need additional CAD work include simulation and synthesis. In the area of simulation, exploration of the recently proposed high-level Verilog simulation support for channel-based architectures, in which the detailed channel handshaking need not be specified for each cell, is quite promising [56].
Finally, in the area of synthesis, there is also a need for performance-aware high-level and logic synthesis that target channel-based designs, including the area of pipeline optimization [31].

9. References

[1] J. B. Anderson and S. Mohan. Sequential coding algorithms: A survey and cost analysis. IEEE Trans. on Communications, COM-32, pp. 169-176, Feb. 1984.
[2] A. Bardsley, D. A. Edwards. The Balsa Asynchronous Circuit Synthesis System. FDL 2000, 4-8 September 2000.
[3] P. A. Beerel. CAD Tools for the Synthesis, Verification, and Testability of Robust Asynchronous Circuits. PhD thesis, Stanford University, 1994.
[4] P. A. Beerel, C. J. Myers, T. H. Meng, "Covering Conditions and Algorithms for the Synthesis of Speed-Independent Circuits". IEEE Transactions on CAD, pp. 205-219, March 1998.
[5] P. A. Beerel, S. Kim, P. Yeh, and K. Kim. Statistically optimized asynchronous barrel shifters for variable length codecs. In International Symposium on Low Power Electronics and Design, pp. 261-263, August 1999.
[6] W. Belluomini, C. J. Myers, H. P. Hofstee. Verification of delayed-reset domino circuits using ATACS. In Proc. of Advanced Research in Asynchronous Circuits and Systems, 1999, pp. 3-12.
[7] M. Benes, S. M. Nowick, and A. Wolfe. A fast asynchronous Huffman decoder for compressed-code embedded processors. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 43-56, 1998.
[8] K. Berkel and A. Bink. Single-track handshaking signaling with application to micropipelines and handshake circuits. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 122-133. IEEE Computer Society Press, March 1996.
[9] K. Berkel, F. Huberts, and A. Peeters. Stretching quasi delay insensitivity by means of extended isochronic forks. In Asynchronous Design Methodologies, pp. 99-106.
IEEE Computer Society Press, May 1995.
[10] K. Bernstein, K. M. Carrig, C. M. Durham, P. R. Hansen, D. Hogenmiller, E. J. Nowak, N. J. Rohrer, "High Speed CMOS Design Styles", pp. 337-345, 1998.
[11] P. J. Black. Algorithms and Architectures for High-Speed Viterbi Decoding. PhD thesis, Stanford University, 1993.
[12] E. Brunvand. Parts-R-Us: A Chip Aparts. Technical Report CMU-CS-87-119, Carnegie Mellon University, May 1987.
[13] T. A. Chu, Synthesis of Self-timed VLSI Circuits from Graph-Theoretic Specifications. M.I.T. Tech. Rep. MIT/LCS/TR-393, June 1987.
[14] W. S. Coates, J. K. Lexau, I. W. Jones, S. M. Fairbanks, and I. E. Sutherland. FLEETzero: an asynchronous switching experiment. In Proc. of ASYNC, 2001, pp. 173-182.
[15] F. Commoner, A. W. Holt, S. Even, and A. Pnueli. Marked directed graphs. Journal of Computer and System Sciences, 5:511-523, 1971.
[16] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno and A. Yakovlev. Petrify: a tool for manipulating concurrent specifications and synthesis of asynchronous controllers. IEICE Transactions on Information and Systems, vol. E80-D, no. 3, March 1997, pp. 315-325.
[17] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, E. Pastor, and A. Yakovlev. Decomposition and technology mapping of speed-independent circuits using Boolean relations. IEEE Transactions on Computer-Aided Design, vol. 18, no. 9, September 1999.
[18] A. Davis, B. Coates, K. Stevens, The Post Office Experience: Designing a Large Asynchronous Chip. In Proceedings of the 26th Annual Hawaii International Conference on Systems Sciences, vol. I, pp. 409-418, 1993.
[19] Jo C. Ebergen. Translating Programs into Delay-Insensitive Circuits. PhD thesis, Dept. of Math. and C.S., Eindhoven Univ. of Technology, 1987.
[20] S. B. Furber, D. A. Edwards, and J. D. Garside. AMULET3: a 100 MIPS asynchronous embedded processor. In Proc.
International Conf. Computer Design (ICCD), September 2000.
[21] S. B. Furber and J. Liu. Dynamic logic in four-phase micropipelines. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems. IEEE Computer Society Press, March 1996.
[22] M. Ferretti, R. O. Ozdag, P. A. Beerel. High Performance Asynchronous ASIC Back-End Design Flow Using Single-Track Full-Buffer Standard Cells. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, April 2004.
[23] J. D. Garside, S. B. Furber, and S. H. Chung. AMULET3 revealed. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 51-59, April 1999.
[24] S. Hauck. Asynchronous Design Methodologies: An Overview. Proceedings of the IEEE, vol. 83, no. 1, pp. 69-93, January 1995.
[25] D. Harris, M. A. Horowitz. Skew-tolerant domino circuits. IEEE Journal of Solid-State Circuits, vol. 32, no. 11, pp. 1702-1711, Nov. 1997.
[26] H. P. Hofstee, S. H. Dhong, D. Meltzer, K. J. Nowka, J. A. Silberman, J. I. Burns, S. D. Posluszny, O. Takahashi. Designing for a gigahertz [guTS integer processor]. IEEE Micro, vol. 18, no. 3, pp. 66-74, May-June 1998.
[27] International Technology Roadmap for Semiconductors, 1999 Edition, http://public.itrs.net/files/1999_SIA_Roadmap/Home.htm
[28] J. Kessels and P. Marston. Designing asynchronous standby circuits for a low-power pager. Proceedings of the IEEE, vol. 87, no. 2, pp. 251-261, February 1999.
[29] J. Kessels, A. Peeters. The Tangram framework: asynchronous circuits for low power. Proceedings of the ASP-DAC 2001, pp. 255-260, 2001.
[30] H. Kim and P. A. Beerel. Relative timing based verification of timed circuits and systems. In Proc. International Workshop on Logic Synthesis, June 1999.
[31] S. Kim,
P. A. Beerel, "Pipeline Optimization for Asynchronous Circuits: Complexity Analysis and an Efficient Optimal Algorithm", ICCAD, November 2000.
[32] S. Lin and D. J. Costello, Jr. Error Control Coding: Fundamentals and Applications. Prentice Hall, Englewood Cliffs, N.J., 1983.
[33] A. M. Lines. Pipelined asynchronous circuits. Master's thesis, California Institute of Technology, 1996.
[34] A. J. Martin. Compiling communicating processes into delay-insensitive VLSI circuits. Distributed Computing, vol. 1, no. 4, pp. 226-234, 1986.
[35] A. J. Martin. The limitations to delay-insensitivity in asynchronous circuits. In William J. Dally, editor, Advanced Research in VLSI, pp. 263-278. MIT Press, 1990.
[36] A. J. Martin, "Programming in VLSI: From Communicating Processes to Delay-Insensitive Circuits", in UT Year of Programming Institute on Concurrent Programming, C. A. R. Hoare, Ed. MA: Addison-Wesley, 1989, pp. 1-64.
[37] A. J. Martin. Synthesis of asynchronous VLSI circuits. In J. Straunstrup, editor, Formal Methods for VLSI Design, chapter 6, pp. 237-283. North-Holland, 1990.
[38] A. J. Martin, A. Lines, R. Manohar, M. Nystroem, P. Penzes, R. Southworth, and U. Cummings. The design of an asynchronous MIPS R3000 microprocessor. In Advanced Research in VLSI, pp. 164-181, September 1997.
[39] C. E. Molnar, T. P. Fang, F. U. Rosenberger, Synthesis of Delay-Insensitive Modules. In Proceedings of the 1985 Chapel Hill Conference on Advanced Research in VLSI, pp. 67-86, 1985.
[40] T. Murata, Petri Nets: Properties, Analysis and Applications. Proceedings of the IEEE, vol. 77, no. 4, pp. 541-580, 1989.
[41] C. J. Myers, Asynchronous Circuit Design, John Wiley and Sons, July 2001.
[42] C. Myers. Timed circuits: A new paradigm for high-speed design. In Proc. of Asia and South Pacific Design Automation Conference, February 2001.
[43] T. Nanya, A. Takamura, M. Kuwako, M.
Imai, T. Fujii, M. Ozawa, I. Fukasaku, Y. Ueno, F. Okamoto, H. Fujimoto, O. Fujita, M. Yamashina, and M. Fukuma. TITAC-2: A 32-bit scalable-delay-insensitive microprocessor. In Symposium Record of HOT Chips IX, pp. 19-32, August 1997.
[44] T. Nanya, Y. Ueno, H. Kagotani, M. Kuwako, and A. Takamura. TITAC: Design of a quasi-delay-insensitive microprocessor. IEEE Design & Test of Computers, vol. 11, no. 2, pp. 50-63, 1994.
[45] S. M. Nowick, D. L. Dill, Automatic Synthesis of Locally-Clocked Asynchronous State Machines. In Proceedings of ICCAD, pp. 318-321, 1991.
[46] S. M. Nowick, D. L. Dill, Synthesis of Asynchronous State Machines Using a Local Clock. In Proceedings of ICCD, pp. 192-197, 1991.
[47] S. M. Nowick, K. Y. Yun, and P. A. Beerel. Speculative completion for the design of high-performance asynchronous dynamic adders. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 210-223. IEEE Computer Society Press, April 1997.
[48] R. O. Ozdag, P. A. Beerel. A Channel Based Asynchronous Low Power High Performance Standard-Cell Based Sequential Decoder Implemented with QDI Templates. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, April 2004.
[49] R. O. Ozdag, P. A. Beerel. High-Speed QDI Asynchronous Pipelines. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, April 2002.
[50] R. O. Ozdag, M. Singh, P. A. Beerel, S. M. Nowick. High-Speed Non-Linear Asynchronous Pipelines. DATE '02, March 2002.
[51] S. Palnitkar. Verilog HDL: A Guide to Digital Design and Synthesis. Prentice Hall, 1995.
[52] A. M. G. Peeters. Single-Rail Handshake Circuits. PhD thesis, Eindhoven University of Technology, June 1996.
[53] Private communications with Andrew W. Lines, 2001.
[54] S. Rotem, K. Stevens, R. Ginosar, P. Beerel, C. Myers, K.
Yun, R. Kol, C. Dike, M. Roncken, and B. Agapiev. RAPPID: An asynchronous instruction length decoder. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 60-70, April 1999.
[55] S. Schuster, W. Reohr, P. Cook, D. Heidel, M. Immediato, and K. Jenkins. Asynchronous interlocked pipelined CMOS circuits operating at 3.3-4.5 GHz. In IEEE ISSCC Digest of Technical Papers, pp. 292-293.
[56] A. Seifhashemi, H. Pedram, "Verilog HDL, Powered by PLI: a Suitable Framework for Describing and Modeling Asynchronous Circuits at All Levels of Abstraction", 40th DAC, June 2003.
[57] C. L. Seitz. System timing. In Carver A. Mead and Lynn A. Conway, editors, Introduction to VLSI Systems, chapter 7. Addison-Wesley, 1980.
[58] M. Singh and S. M. Nowick. Fine-grain pipelined asynchronous adders for high-speed DSP applications. In Proc. of IEEE Computer Society Annual Workshop on VLSI, Orlando, FL, April 2000, pp. 111-118.
[59] M. Singh and S. M. Nowick. High-throughput asynchronous pipelines for fine-grain dynamic datapaths. In Proc. of ASYNC, 2000, pp. 198-209.
[60] S. K. Singh, P. Thiennviboon, R. O. Ozdag, S. Tugsinavisut, P. A. Beerel and K. M. Chugg, "Algorithm and Circuit Co-Design for a Low-Power Sequential Decoder," Proc. Asilomar Conf. on Signals, Systems and Computers, Oct. 1999.
[61] K. Stevens, R. Ginosar, and S. Rotem. Relative timing. In Proc. International Symposium on Advanced Research in Asynchronous Circuits and Systems, pp. 208-218, April 1999.
[62] I. E. Sutherland. Micropipelines. Communications of the ACM, vol. 32, no. 6, pp. 720-738, June 1989.
[63] I. Sutherland and S. Fairbanks. GasP: a minimal FIFO control. In Proc. of ASYNC, 2001, pp. 46-53.
[64] H. Terada, S. Miyata, and M. Iwata. DDMP's: Self-timed super-pipelined data-driven multimedia processors. Proceedings of the IEEE, vol. 87, no. 2, pp. 282-296, February 1999.
[65] J. T. Udding.
A formal model for defining and classifying delay-insensitive circuits. Distributed Computing, vol. 1, no. 4, pp. 197-204, 1986.
[66] S. H. Unger, Asynchronous Sequential Switching Circuits. New York, NY: Wiley-Interscience, 1969.
[67] T. E. Williams. Self-Timed Rings and their Application to Division. PhD thesis, Stanford University, June 1991.
[68] T. E. Williams and M. A. Horowitz. A zero-overhead self-timed 160ns 54b CMOS divider. In ISSCC Digest of Technical Papers, 1991, pp. 98-296.
[69] J. M. Wozencraft and I. M. Jacobs. Principles of Communication Engineering. John Wiley and Sons, 1965.
[70] K. Y. Yun, P. A. Beerel, V. Vakilotojar, A. E. Dooply, and J. Arceo. The design and verification of a high-performance low-control-overhead asynchronous differential equation solver. IEEE Transactions on VLSI Systems, vol. 6, no. 4, pp. 643-655, December 1998.
[71] K. Yun, D. Dill, Automatic Synthesis of 3D Asynchronous State Machines. In Proceedings of ICCAD, pp. 576-580, 1992.