INFORMATION TO USERS

This manuscript has been reproduced from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type of computer printer. The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrough, substandard margins, and improper alignment can adversely affect reproduction. In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion. Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand corner and continuing from left to right in equal sections with small overlaps. Each original is also photographed in one exposure and is included in reduced form at the back of the book. Photographs included in the original manuscript have been reproduced xerographically in this copy. Higher quality 6" x 9" black and white photographic prints are available for any photographs or illustrations appearing in this copy for an additional charge. Contact UMI directly to order.

UMI
A Bell & Howell Information Company
300 North Zeeb Road, Ann Arbor, MI 48106-1346 USA
313/761-4700  800/321-0600

MICROPARALLEL PROCESSORS

BY

BARTON JOSEPH SANO

A DISSERTATION PRESENTED TO THE FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE
DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

MAY 1995
COPYRIGHT 1995 BARTON JOSEPH SANO

UMI Number: 9621624
Copyright 1996 by Sano, Barton Joseph. All rights reserved.
UMI Microform 9621624. Copyright 1996, by UMI Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.
UMI, 300 North Zeeb Road, Ann Arbor, MI 48103

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This dissertation, written by ..........................., under the direction of ........ Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Dean of Graduate Studies
Date: March 28, 1995
DISSERTATION COMMITTEE

Acknowledgments

I am happy to acknowledge my family and friends for their support. Without their love, understanding, and dedication I would not have enjoyed my time while in school, nor would I have finished this work.

Research is never an isolated endeavour. Indeed, my greatest intellectual satisfaction stems from working with bright, creative, and industrious individuals. In this respect I have been blessed with some of the most meaningful interactions while at the University of Southern California working on the BAM and SLAM projects. Specifically, my interactions with Dr. Alvin Despain have quite literally changed my view of research and development. He has enhanced both the depth and clarity of my vision for what is possible. I will always be indebted to Dr. Despain for investing so much time and energy in me. He is at once a gifted scholar, researcher, and teacher, generous with his ideas, time, and resources.
My education occurred not in classrooms but in learning from my fellow research assistants, or "partners in pain." Thus, I am also indebted to the friendship and dedication to work that Apoorv Srivastava, Mike Shenoy, Patrick Law, Chi-Ying Tsui, Yong-seon Koh, Amarpreet Chadha, and too many others to name here have given me. Their work appears here in many forms. I have made sure to credit their work that has been modified for my own purposes. In this respect I rest on the shoulders of some fantastic VLSI circuit designers.

This research was sponsored by ARPA contract J-FBI-91-194 and used machine resources under NSF Infrastructure Grant CDA-8722788.

Table of Contents

Abstract
Chapter 1  Introduction
  1.1 Motivation
  1.2 Integrated Circuit Trends
  1.3 Processor Performance Metric
  1.4 Trends in Processor Performance
  1.5 Challenges
  1.6 The Question
  1.7 Thesis
  1.8 Contributions
  1.9 Dissertation Outline
Chapter 2  Exposing and Exploiting Instruction-Level Parallelism
  2.1 Introduction
  2.2 Chapter Outline
  2.3 The Basics of Instructions and Blocks
  2.4 Software Techniques to Expose ILP
    2.4.1 Scheduling Across Basic Blocks
    2.4.2 Scheduling Within Basic Blocks
  2.5 Hardware Mechanisms to Exploit ILP
  2.6 16-Fold Way: A Microparallel Taxonomy
  2.7 Static vs. Dynamic Behavior
  2.8 Processor Examples
    2.8.1 The ELI-512: An SSSS Processor
    2.8.2 The ZS-1: An SDSD Processor
    2.8.3 The IBM 360/91: An SDDS Processor
    2.8.4 The DEC-21064: An SDSS Processor
    2.8.5 The DISC: An SSDS Processor
    2.8.6 The TORCH: An SSSD Processor
    2.8.7 The Metaflow: An SSDD Processor
    2.8.8 The HPS: An SDDD Processor
    2.8.9 The PIPE: A DSSD Processor
    2.8.10 The Cyclone: A DDSS Processor
    2.8.11 The MISC: A DDSD Processor
    2.8.12 The Multiscalar: A DSDD Processor
    2.8.13 The Y-Pipe: A DDDS Processor
  2.9 Additional Processor Instances
  2.10 Postulated Processor Architectures
    2.10.1 The Wrench: A New DSSS Processor
    2.10.2 The Sling-Shot: A New DSDS Processor
    2.10.3 The Hyperscalar: A New DDDD Processor
  2.11 Conclusion
Chapter 3  Processor Performance Estimations
  3.1 Introduction
  3.2 Chapter Outline
  3.3 Simulation Methodology
  3.4 Candidate Processors
  3.5 Processor Architectures
    3.5.1 SSSS Processor
    3.5.2 SSDD Processor
    3.5.3 DSDD Processor
    3.5.4 SSSS and SSDD Simulator
    3.5.5 DSDD Simulator
    3.5.6 Architectural Parameters
  3.6 Cost-Performance Model
    3.6.1 Mapping From Abstract to Pipeline Stages
    3.6.2 Mapping From Pipeline Stages to Circuit Modules
    3.6.3 Benchmarks
  3.7 Simulation Results
    3.7.1 SSSS and SSDD Performance
    3.7.2 Cache Effects
    3.7.3 Functional Unit Configurations
    3.7.4 Reservation Station and Reorder Buffer
    3.7.5 DSDD Performance
    3.7.6 Effects of Branch Prediction
  3.8 Conclusions
Chapter 4  Processor Cost Estimations
  4.1 Introduction
  4.2 Chapter Outline
  4.3 Comparison to Other VLSI Models
  4.4 Circuit Implementation
  4.5 Outline of Pipeline
  4.6 Standardized Delay
  4.7 Cache Memory
    4.7.1 Arrangement of Cache Cells
    4.7.2 Decoding Delay
    4.7.3 Word Line Delay
    4.7.4 Bit Line Sensing
    4.7.5 Write Driving
    4.7.6 Combined Cache Timing
    4.7.7 Area Estimations
  4.8 Register File
    4.8.1 Decode and Drive Delay
    4.8.2 Read and Write Delay
    4.8.3 Register File Area
  4.9 Reservation Stations
  4.10 Functional Units
  4.11 Reorder Buffer
  4.12 Bypass Unit
  4.13 Summary of Circuit Modules
  4.14 Die Area Estimation
  4.15 Conclusions
Chapter 5  Validation of Area-Time Model
  5.1 Introduction
  5.2 Chapter Outline
  5.3 Cache Memory
  5.4 Decode and Word Line Drive
  5.5 Write and Read Timing
    5.5.1 Cache Layout
    5.5.2 Register File
    5.5.3 Register File Layout
  5.6 Adder
  5.7 Fabricated Test Structures
  5.8 Conclusions
Chapter 6  Conclusions
  6.1 Introduction
  6.2 SSSS and SSDD Processors
  6.3 DSDD Processor
  6.4 Validating the Thesis
  6.5 Lessons Learned
  6.6 Effects of Branch Prediction
  6.7 Summary of Research Contributions
  6.8 Future Research Directions
Appendix A  IPC of Processors
  A.1 Table Description
Appendix B  CMOS Technology
Appendix C  Memory Physical Layout
  C.1 Register File Cells
  C.2 Cache Memory Cells
References

List of Tables

Table 1  Four Levels of Modeling
Table 2  Variations of Static Behavior
Table 3  Variations of Dynamic Behavior
Table 4  Microparallel Processor Instances
Table 5  List of Other Microparallel Processors
Table 6  Instruction Scheduling Policies
Table 7  Architectural Parameters
Table 8  Benchmark Programs
Table 9  Delay of Characteristic Devices
Table 10  SRAM Cell Characteristics
Table 11  Register File Cell Characteristics
Table 12  Width of VLSI Modules
Table 13  Data Path Delay and Dimensions
Table 14  Results of 1.0 um CMOS Test Structure
Table 15  Candidate Processors for Validation
Table 16  SSSS-W8-P1
Table 17  SSSS-W8-P2
Table 18  SSSS-W8-P4
Table 19  SSSS-W8-P8
Table 20  SSSS-W4-P1
Table 21  SSSS-W4-P2
Table 22  SSSS-W4-P4
Table 23  SSSS-W2-P1
Table 24  SSSS-W2-P2
Table 25  SSDD-W8-P1
Table 26  SSDD-W8-P2
Table 27  SSDD-W8-P4
Table 28  SSDD-W8-P8
Table 29  SSDD-W4-P1
Table 30  SSDD-W4-P2
Table 31  SSDD-W4-P4
Table 32  SSDD-W2-P1
Table 33  SSDD-W2-P2
Table 34  DSDD
Table 35  HP 1.0 um (CMOS26B) Transistor Parameters

List of Figures

Figure 1  Two Technology Trends
Figure 2  Microparallel (a) and Micropipeline (b) Execution
Figure 3  Trends in MIPS and MHz
Figure 4  Abstract Design Space of Processors
Figure 5  Uniprocessor (a), Decoupled (b) and Multiprocessor (c) Architectures
Figure 6  Exposing and Exploiting ILP
Figure 7  A Code Fragment (a) and its CFG (b)
Figure 8  CDG of Program Fragment
Figure 9  Microarchitectures Under Consideration
Figure 10  Processor Stages
Figure 11  16-Fold Way Taxonomy
Figure 12  IBM RS/6000
Figure 13  SIMP
Figure 14  XIMD
Figure 15  Ideal Processor Model Simulator
Figure 16  Mapping of Abstract Processor Models
Figure 17  Candidate Processor Classifications
Figure 18  SSSS Block Diagram
Figure 19  SSDD Block Diagram
Figure 20  DSDD Block Diagram
Figure 21  Superscalar Simulator (Ssim)
Figure 22  Block Diagram of Processor
Figure 23  Pipeline Description with Circuit Modules
Figure 24  Pipeline of VLIW
Figure 25  IPC of SSDD and SSSS Processors
Figure 26  Cache Parameters
Figure 27  Cache Architecture vs. Hit Rate
Figure 28  Instruction Distribution
Figure 29  IPC vs. Configurations
Figure 30  Reservation Station
Figure 31  Reorder Buffer
Figure 32  IPC for Reorder Buffer & Reservation Stations
Figure 33  A Ring of Four Execution Units
Figure 34  IPC vs. DSDD Configurations
Figure 35  SSDD (a) and DSDD (b) with Branch Prediction
Figure 36  Alternative Implementations
Figure 37  Complementary (a) and Dynamic (b) NAND Gate
Figure 38  Dynamic Logic with TSPC Latches
Figure 39  Timing of Each Phase
Figure 40  Latch (a), AND (b) and OR (c) Designs
Figure 41  Four Stage Driver Design
Figure 42  Detailed Timing of Cache
Figure 43  Circuit Schematic of Cache
Figure 44  1 and 2 Port SRAM Memory Cells
Figure 45  Memory Cell Layout
Figure 46  Alternative Floor Plans for Cache Memory
Figure 47  Schematic of Decoders
Figure 48  Wdln Delay vs. Block Size
Figure 49  Btln Delay vs. Set Size
Figure 50  Combined Cache Timing
Figure 51  Cache Area
Figure 52  Detailed Timing of Register File
Figure 53  Circuit Schematic of Register File
Figure 54  Register File Decode Delay
Figure 55  Reading from Multiple Ports
Figure 56  Register File Read Delay
Figure 57  Register File Area
Figure 58  Reservation Station Timing
Figure 59  Reservation Station Entry Format
Figure 60  Reservation Station Circuits
Figure 61  Reduction of Carry Logic
Figure 62  CLA in MODL
Figure 63  Buffer Entry Format
Figure 64  Reorder Buffer Entries
Figure 65  Reorder Buffer Timing
Figure 66  Bypass Bus Within Data Path
Figure 67  Timing of Circuit Modules
Figure 68  Future Technology (a) Scaled to Current Technology (b)
Figure 69  Floor Plan of W4E4 Architecture
Figure 70  Decoder and Word Line Driver
Figure 71  Timing Simulation for Decoding
Figure 72  Timing Simulation for Memory Access
Figure 73  Cache Layout
Figure 74  Register File Write
Figure 75  Register File Read
Figure 76  Register File Layout
Figure 77  Adjusted IPC of SSDD and SSSS Processors
Figure 78  Adjusted DSDD Processor
Figure 79  Microparallel Cost-Performance Lattice
Figure 80  Alternative Path to Center
Figure 81  Adjusted SSSS vs. SDDD, Both with Prediction
Figure 82  Adjusted DSDD with Prediction
Figure 83  3-Port Register File Cell
Figure 84  6-Port Register File Cell
Figure 85  12-Port Register File Cell
Figure 86  1-Port Cache Cell
Figure 87  2-Port Cache Cell
Figure 88  4-Port Cache Cell

Abstract

Microparallel processors are instruction set processors capable of simultaneously executing multiple instructions on parallel resources in the same clock cycle. This general category includes processors referred to as VLIW, superscalar, and decoupled architectures. But which microparallel processor architecture best exploits instruction-level parallelism? The thesis of this dissertation is that a decoupled processor with dynamic execution and retirement stages provides the best performance for a microprocessor implementation. This processor architecture is selected and analyzed from within a comprehensive microarchitecture taxonomy, called the "16-Fold Way." The criteria for selecting the best architecture are based upon both the cost and performance of the design. Trace simulation provides an estimation of cycle counts for a suite of symbolic programs, while an integrated circuit CMOS cost model is used for the area-time estimation. Symbolic computation is chosen as the benchmark for performance because it poses the greatest challenges to exposing and exploiting instruction-level parallelism. CMOS technology is chosen because it provides a good balance of high circuit integration and speed of operation. The results of this cost-performance model validate the thesis that dynamic execution and retirement behaviors have moderate cost but contribute substantially to performance. In the best case, the cost-adjusted performance increase is 56% over a design without these dynamic behaviors. Decoupling the fetch stage also increases performance by as much as 59% by adding another execution unit and connecting the two together with a multi-ported cache memory. These execution units are then free to pursue multiple control flows based upon control dependence analysis of the object code of a program. More importantly, splitting a microparallel processor into separate execution units provides an efficient means to balance the width and multiplicity of the execution units. Without this balance, the performance of a microparallel processor is dominated by either the internal bypassing delay or by the central communication mechanism provided by an on-chip cache memory. Thus the best microparallel processor architecture is the "Dynamic Double Duo," a processor decoupled into two dual-issue execution units, both with dynamic instruction scheduling mechanisms.

Chapter 1: Introduction

1.1 Motivation

Microprocessors* now influence a wide range of computing environments.
They are commodity parts used as the central processing unit of personal computers, workstations, large-scale shared-memory multiprocessors [97], and even massively parallel supercomputers [21]. This proliferation has come about partially because of their superior execution performance relative to other forms of computers. For example, the increase each year in microprocessor performance, for non-scientific computation, has been greater than that of traditional minicomputers, mainframes, and even vector supercomputers [39]. With such performance leverage and wide market presence, the characteristics of a single-chip microprocessor are of critical importance.

In the design of a microprocessor, the most important considerations are performance and cost. Performance is usually measured or modeled in terms of the total clock cycles it takes to execute a set of benchmark programs. The cost of a microprocessor, on the other hand, has two definitions. The first is the material cost, in proportion to die area, packaging, testing, and yield. These measures have direct bearing on the monetary cost of the processor. The more die area a design requires, the fewer the number of die per wafer. This results in higher production costs as well as higher testing and packaging costs [40]. Although this definition of cost is very important from an economic standpoint, we are more interested in measuring the implementation cost of a microprocessor design in terms of its effect on the performance and feasibility of the design. This is an indirect measure of a processor's implementation. It will be measured in terms of the resulting cycle-time and silicon area. The silicon area of a single die is a finite resource that must be managed and optimized within the context of processor and memory architectures. As a result, this optimization directly affects performance.

* A microprocessor is a complete instruction set processor on a single VLSI chip. Other forms of processors are divided onto multiple VLSI chips.

1.2 Integrated Circuit Trends

The relationship between microprocessor performance and integrated circuit technology is complex but tractable. To understand this relationship it is first important to see the trends in integrated circuits that have the most bearing on processor architecture and its performance. By far, the most important trend in MOSFET technology is the relentless decrease in the minimum feature size of transistors. This is due primarily to advanced lithography techniques. Figure 1 relates the two trends: feature size on the left and die size on the right.

[Figure 1: Two Technology Trends. Log-scale plots of minimum feature size (um) and die edge length (cm/side) versus year, 1970-2000, for microprocessors from the i4040, i8086, Z80, 68020, and i80386 through the PowerPC-604.]

For the microprocessors indicated, the feature size has shrunk from about 10 microns to presently about 0.5 microns. Circuit density is an O(n^2) function of feature size, so this increases the number of devices per die considerably. Concurrently, the maximum edge-length of a die has also increased, from about 0.3 cm to 1.5 cm. This trend also has an O(n^2) effect on the number of devices per die, but not because of circuit density. In this case the individual devices remain the same size, but with an increase in area, more devices can be placed on a single die. Taking the product of these two trends, the result has been about a three orders-of-magnitude increase in the number of devices per die: forty times improvement coming from feature size, and twenty-five times coming from edge-length.
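As a quick sanity check on the combined effect of the two trends just described, the short Python sketch below (my own illustration, not part of the dissertation; the two gain factors are simply the ones quoted above) multiplies the per-trend improvements to confirm the roughly three orders-of-magnitude growth in devices per die.

    # Illustrative only: combine the per-trend device-count gains quoted in the text.
    feature_size_gain = 40    # improvement attributed to shrinking feature size
    edge_length_gain = 25     # improvement attributed to growing die edge length
    total_gain = feature_size_gain * edge_length_gain
    print(total_gain)         # 1000, i.e. about three orders of magnitude per die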
The benefits of smaller transistors are not only an increase in circuit density, but an increase in switching speed as well [17]. Unfortunately, this speed increase is only an O(n) function of feature size rather than the O(n^2) of circuit density. This asymmetric improvement means that future designs that use smaller feature sizes must optimize for speed in order to effectively utilize the circuit density; otherwise all that is gained is higher integration. Integration alone does have its benefits, however. Perhaps the greatest benefit stems from the ability to have as much bandwidth between processor elements and memory as desired. This reduces the communications overhead between storage and computing devices [72] and can also reduce the overall clock skew of synchronous systems [3]. Both of these can be major impediments to achieving high performance VLSI implementations. Also, an increase in die area means specialized circuitry (e.g. tagged data paths [41]) can be included on a single die to enhance processor performance. This expands the design space of alternative processor organizations that can be implemented.

1.3 Processor Performance Metric

The net effect of these integrated circuit trends is an increase in microprocessor performance. A metric for processor performance has been recognized as a function of the number of instructions to execute (Instrs), the average number of instructions per clock cycle (IPC), and the cycle-time [40]. Performance is then measured as inversely proportional to the time it takes to execute a program:

    Performance ∝ IPC / (Instrs × CycleTime)    (1.1)

In the quest for higher performance, microprocessors are pushed to execute more instructions per cycle and with a shorter cycle-time. By optimizing a processor's microarchitecture, we can increase the number of instructions executed per clock cycle by employing microparallel and micropipeline mechanisms (Figure 2). Microparallel mechanisms allow the simultaneous execution of multiple instructions on parallel resources, reducing the effective number of cycles to process the instructions [47]. Micropipelining also reduces the number of cycles to process instructions, by overlapping the execution of instructions on segments of a processor [53]. Although the latency increases with micropipelining, the overall cycle-time is shortened because the individual segments perform fewer functions and therefore run faster than a non-pipelined processor.

[Figure 2: Microparallel (a) and Micropipeline (b) Execution. F = Fetch, D = Decode, E = Execute, R = Retire.]

It is obvious from the performance equation (1.1) that executing fewer instructions for the same task will also increase performance. This entails an interaction between architecture and compiler technologies well beyond the scope of this analysis. However, the topic of software techniques to expose instruction-level parallelism is briefly described in Chapter 2.
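To make the metric concrete, here is a minimal worked example (the instruction count, IPC values, and cycle times are invented for illustration and are not taken from the dissertation) showing that, for a fixed instruction count, doubling IPC or halving the cycle time each double this measure of performance.

    # Performance ~ IPC / (Instrs x CycleTime): the inverse of total execution time.
    def performance(instrs, ipc, cycle_time_ns):
        execution_time_ns = instrs * cycle_time_ns / ipc
        return 1.0 / execution_time_ns

    base = performance(instrs=1_000_000, ipc=1.0, cycle_time_ns=10.0)  # scalar at 100 MHz
    wide = performance(instrs=1_000_000, ipc=2.0, cycle_time_ns=10.0)  # microparallel: 2 IPC
    deep = performance(instrs=1_000_000, ipc=1.0, cycle_time_ns=5.0)   # micropipelined: 200 MHz
    print(wide / base, deep / base)   # both idealized changes yield 2.0x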
1.4 Trends in Processor Performance

Over the last few decades, micropipelining and microparallel mechanisms have been incorporated into microprocessors to exploit instruction-level parallelism. The result has been roughly a 50% performance improvement per year. Figure 3 tracks the clock frequency (MHz) and the number of instructions executed per second (MIPS) for microprocessors designed for workstations.

[Figure 3: Trends in MIPS and MHz, 1980-2000 (log scale), with regions labeled Pipelined and Superscalar & Superpipelined.]

The first single-chip microprocessors used in workstations were developed during the early 1980s. They contained considerable control storage for microcoded instructions that required multiple-cycle execution, and they operated at modest frequencies by today's standards. The number of pipeline segments used in these processors was up to three stages. Next, processors with streamlined instruction set architectures were introduced during the late 1980s. The simplicity of the instructions accommodated a micropipeline of up to five stages and could effectively execute an instruction every cycle [51]. A natural extension of this organization is to segment the processor into pipelines of eight to ten stages in length. These superpipelined [49] processors operate at clock frequencies in excess of 100 MHz. Neglecting memory effects, the MIPS rating of a superpipelined processor is only slightly below its MHz rating, because of the single-cycle nature of the instructions.

Superscalar [4] processors represent a departure from micropipelined architectures. These processors are distinguished by their capacity to issue and execute several instructions in a single cycle. The result is that the MIPS rating is now outpacing the MHz rating. Until recently, the maximum number of instructions executed per cycle was two, for the DEC-21064 [19]. This is represented in Figure 3 with the small number and arrow next to the microprocessor. Presently the highest for any processor is the Power2 architecture [84], which can issue up to six instructions in parallel, but it needs six chips on a multi-chip module to achieve this throughput. The maximum number of instructions issued per cycle for any single-chip microprocessor is four instructions in parallel; this distinction goes to the PowerPC-604 [107]. This trend of increasing issue-width capability of microparallel processors will continue as long as silicon technology advances, because more functional units, buses, and memory units can be arranged on a single die. Thus, from the figure it seems that microparallel processors, with some degree of micropipelining, will soon dominate high performance microprocessor applications.
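The widening gap between the two curves in Figure 3 follows directly from the relation MIPS ≈ IPC × MHz. The sketch below uses assumed, illustrative IPC and frequency values (not measurements from the figure) to show why a superpipelined processor's MIPS rating sits just under its clock rating while a superscalar processor's can exceed it.

    # MIPS rating as a function of clock frequency and sustained IPC (illustrative values).
    def mips(mhz, ipc):
        return mhz * ipc

    print(mips(mhz=150, ipc=0.9))   # superpipelined: 135 MIPS, slightly below its 150 MHz
    print(mips(mhz=100, ipc=2.5))   # superscalar: 250 MIPS, well above its 100 MHz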
1.5 Challenges

Unfortunately, much of the potential performance of these microprocessors goes unrealized. One study estimates the instructions executed per cycle for the RS/6000 to be at best 2, half its peak rating of 4 IPC [85]. When memory delay is taken into account, the processor performance decreases further, to less than 1 IPC. This ineffectiveness stems primarily from insufficient instruction-level parallelism within the basic blocks of a program. The scarcity of independent instructions limits the potential speedup for both micropipeline and microparallel execution [49]. Another reason is that processors are often intolerant of variable memory delays and of resource conflicts. Memory latency is variable or unpredictable because of caches and virtual memory management. The resource conflicts arise because the computation is often non-uniform with respect to the frequency and pattern of the instructions executed [48].

Another challenge stems from the implementation of the processor's data path, as there exists an inverse relationship between the number of instructions issued in parallel and the clock frequency of a processor [47]. For example, a 200 MHz processor can issue two instructions in parallel [19], while a processor capable of issuing four instructions can only run at 100 MHz [107]. This is mainly due to the complexity of decoding and issuing multiple instructions in parallel, as well as communicating the computed data values among the parallel resources. Although there is a relationship between issue width, design complexity, and the resulting clock frequency, it is not always a simple linear relationship as the example given tends to indicate.

1.6 The Question

Given these technological advances and architectural challenges, the question of this dissertation is formulated as: What microparallel architecture best exploits instruction-level parallelism?

We classify the architectural options into two general categories: microparallel and micropipeline. The first category implies an increase in a processor's ability to issue and execute multiple instructions in parallel while maintaining as fast a clock frequency as possible. The second category implies a longer and faster pipeline and is thus optimized solely for clock frequency. The degree of pipelining can be increased to an extreme case where each stage of the processor effectively has one gate delay, the smallest unit of computation possible in a technology. Technically speaking, there is no limit to the degree of microparallelism or how wide a processor can be. There are, however, practical limitations on the width. The extent to which we can increase the pipelining or parallelism within a processor is illustrated in Figure 4. The X and Z axes represent the degree of parallelism and pipelining within a processor, respectively. Ideally the resulting performance will be proportional to the product of the two variables (i.e. performance = pipelining × parallelism). In this investigation, we will fix the degree of pipelining for a given microparallel architecture. The width or degree of microparallelism, however, will be varied along with the behavior of the individual stages of the pipeline. In Figure 4 this is represented as a plane of fixed pipeline depth (e.g. N stages), with the width growing from a single-issue processor to W words wide. Unfortunately, because of the many challenges stemming from implementation and the microarchitecture, the performance seldom reaches the ideal available instruction-level parallelism (ILPmax).

[Figure 4: Abstract Design Space of Processors. Axes: degree of microparallelism, degree of micropipelining, and performance; a plane of fixed pipeline depth is highlighted.]
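A toy version of the design space of Figure 4 (my own sketch; the pipeline depth and the ILPmax ceiling are assumed numbers, not results from the dissertation): ideal performance grows with the product of pipelining and parallelism, but saturates once it reaches the instruction-level parallelism the program actually exposes.

    # Idealized model: performance = pipeline depth x issue width, capped by ILPmax.
    def ideal_performance(pipeline_depth, issue_width, ilp_max):
        return min(pipeline_depth * issue_width, ilp_max)

    # Fix the pipeline depth, as in this investigation, and grow the issue width.
    for width in (1, 2, 4, 8, 16):
        print(width, ideal_performance(pipeline_depth=4, issue_width=width, ilp_max=32))
    # Performance climbs with width (4, 8, 16, 32) and then flattens at the ILP ceiling.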
One option to improve performance with microparallel mechanisms consists of continuing with a superscalar organization and simply increasing the issue width to eight or ten instructions. This would require some support for dynamic, microparallel operations to efficiently execute instructions in parallel. But, as pointed out, the internal complexity of decoding instructions within superscalar processors impedes future performance because it adversely affects cycle-time.

A second option is to build a Long Instruction Word (LIW) processor. This design promises a simple and regular model of execution [25]. Because of this simplicity, the clock frequency should be higher for a given issue width when compared to superscalar designs. But the problem with these designs is that the compiler is forced to schedule instructions at compile-time and cannot take into account the actual run-time behavior of the program.

A third option involves a decoupled form of microparallel processing, shown in Figure 5b. This option consists of two or more independent execution units tightly coupled together by a multi-ported cache memory. Cooperating instruction streams are assigned to different execution units that communicate data through a common memory interface. A similar organization has been proposed with the concept of processor coupling [52]. In this case, parallel data paths are fed a statically scheduled wide word of potentially different threads (processes) that are dynamically reordered by a scoreboard [88].

[Figure 5: Uniprocessor (a), Decoupled (b) and Multiprocessor (c) Architectures. P = processor, COP = coprocessor, EU = execution unit, $ = cache memory, M = main memory.]

The evolution of decoupled processors can be observed in Figure 5. Starting from the left, a simple uniprocessor configuration is shown with a processor (P) along with its coprocessor (COP), cache memory ($), and main memory (M) [36]. The middle figure shows how a decoupled processor is slightly different from a coprocessor organization: a coprocessor relies upon the main processor for control, whereas in a decoupled architecture the execution units are autonomous, with each unit fetching its own instruction stream. Advances in integrated circuit technology allow for both an increasing number of execution units on a single die and a large multi-ported cache memory. These units can communicate via a multi-ported cache, with coherence maintained across the multiple ports. This is in contrast to the last figure, which illustrates a shared-memory multiprocessor system with separate caches that uses a bus mechanism for cache coherence. An example of this kind of processor is an experimental single-chip multiprocessor implementation with two dual-issue VLIW processors integrated with separate caches [67].

1.7 Thesis

Given these general categories of processors as alternatives, the thesis of this dissertation is: Existing microparallel architectures can be represented by a sixteen-fold taxonomy. In terms of this taxonomy, the best way to build a microprocessor is with decoupled fetch, reordering execution, and speculative retirement stages.

To determine the best processor architecture, it is first important to provide a comprehensive taxonomy that describes all of the existing microparallel processors. This dissertation provides such a taxonomy, used for a comparative analysis, called the 16-Fold Way [73]. This microarchitecture taxonomy provides sixteen distinct classifications. Thirteen classifications have had representative designs built or proposed for them. The other three have not heretofore been investigated.
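For reference, the sixteen classifications can be enumerated mechanically. The sketch below assumes, based on the stage names used in Figure 2, that the four letters of a classification label denote the fetch, decode, execute, and retire stages, each either static (S) or dynamic (D); the three classes postulated in Section 2.10 are flagged, and the remaining thirteen correspond to designs that have been built or proposed.

    from itertools import product

    # Assumed interpretation: one S/D flag per pipeline stage gives 2^4 = 16 classes.
    stages = ("fetch", "decode", "execute", "retire")
    postulated = {"DSSS", "DSDS", "DDDD"}    # the three new classes of Section 2.10

    for flags in product("SD", repeat=len(stages)):
        label = "".join(flags)
        status = "postulated" if label in postulated else "built or proposed"
        print(label, status)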
Throughout this investigation, the criteria for selecting the best architecture are based upon performance characterized by a benchmark suite of symbolic computation [37][103]. This metric of performance is determined from simulations by calculating the number of instructions executed per cycle along with a corresponding clock frequency. The processor architectures all utilize basically the same streamlined instruction set architecture [51] and advanced compiler technology [33], and are implemented with the same circuit technology [102]. To estimate the clock frequency and cycle count, a cost-performance model will be used to measure the design alternatives.

This model can be partitioned into four levels of abstraction. These four are illustrated in Table 1, along with pointers to the chapters in which each level is covered. The levels provide a successive refinement of detail, ranging from high-level abstract processor descriptions to the low-level physical layout of circuitry. Bridging the gap between these levels of abstraction are mappings from one level of design to a more specific implementation. For example, the table shows a static behavior for a fetch stage at the first level. At this level, the feature modeled is a statically (in-order) fetched stream of instructions. Thus, one program counter is assumed that provides one flow of control. To implement this behavior of one stage requires some specification of the pipeline model for the processor. This is the type of detail for the second level. In the case of the fetch stage, the number of cycles it takes to access an instruction cache is used. A two-cycle cache is used because pipelined caches are employed throughout the processor to achieve a high clock frequency. With a pipeline description and the basic behavior of a processor defined, trace simulations can be performed to estimate the cycle count.

Table 1: Four Levels of Modeling

  I.   Abstract Stages (Chapter 2): Describes the abstract behavior of the stages of a processor (e.g. a static fetch stage).
  II.  Pipeline Stages (Chapter 3): Describes the organization of the processor pipeline and the latency of each stage (e.g. a 2-cycle instruction fetch).
  III. Circuit Modules (Chapter 4): Describes the basic structure and circuit operations of each module of the pipeline stages (e.g. the design of the instruction cache).
  IV.  Physical Layout (Chapter 5): Describes the physical layout and timing of the custom cells of the modules (e.g. an inverter layout).

The next level of abstraction is the circuit module level. The example in the table describes a CMOS implementation of the two-cycle cache memory. The construction of this device requires a small number of library cells that are arranged to implement a specific memory capacity and organization. The final and lowest level of detail describes the physical layout and actual timing of the library cells.

1.8 Contributions

The contributions of this dissertation include a taxonomy that categorizes a great variety of existing microparallel processors. This taxonomy assists in systematically exploring a processor design space to compare microparallel processors. This dissertation also presents a cost model for processors implemented in a library of CMOS circuits that is then coupled with a performance model to predict a measure of performance. This cost-performance model has not been proposed before and describes the effects of VLSI implementation on microparallel performance. Lastly, this dissertation determines an efficient microparallel organization and implementation based upon a specific pipeline design and benchmark suite.

1.9 Dissertation Outline

The dissertation consists of six chapters:
• Chapter 1 introduces the topic, question, and thesis of the dissertation.
• Chapter 2 outlines related work on software techniques for exposing instruction-level parallelism and describes a processor taxonomy named the 16-Fold Way.
• Chapter 3 narrows the design space of microparallel processors to three candidate processor organizations, as specified by the taxonomy. This chapter also develops a pipeline model of these three processor architectures to estimate a cycle count for each.
• Chapter 4 uses the pipeline model of Chapter 3 for a VLSI cost model of the processors. It specifically estimates the effects that VLSI implementation has on cycle-time and silicon area.
• Chapter 5 validates the VLSI cost model with measurements from circuit simulations extracted from layout and from test structures fabricated in a scalable 1.0 um CMOS technology.
• Chapter 6 concludes the dissertation by combining the cycle-count and cycle-time results to validate the thesis, and describes future research opportunities.

Chapter 2: Exposing and Exploiting Instruction-Level Parallelism

2.1 Introduction

To build a high performance processor you must both expose and exploit instruction-level parallelism (ILP). Principally the compiler, with its global view of the computation, has the greatest potential for exposing instruction-level parallelism. It can accomplish this task by emitting independent instructions, with software techniques ranging from loop transformations to register allocation. Hardware can then exploit this parallelism with separate resources during run-time. As an example, Figure 6 shows a simple computation compiled into assembly language consisting of independent add instructions. If the hardware provides two single-cycle adders, then a reduction in the effective number of clock cycles can be achieved when compared to a serial execution of the computation.

[Figure 6: Exposing and Exploiting ILP.
Computation: E = (A+B) - (C+D)
Assembly language: (1) add r1, r2, r2; (2) add r3, r4, r4; (3) sub r2, r4, r4
Data flow graph: A and B feed instruction (1), C and D feed instruction (2), and both results feed instruction (3), which produces E.]
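The following sketch (my own Python illustration of the point Figure 6 makes, not code from the dissertation) greedily schedules the three instructions onto two single-cycle adders: instructions (1) and (2) share no data dependence, so they issue together and the sequence finishes in two cycles instead of three.

    # Greedy scheduling of Figure 6's sequence onto two single-cycle functional units.
    # Each entry is (name, destination register, source registers); only true data
    # dependences are checked, which is all this small example needs.
    program = [
        ("add(1)", "r2", ("r1", "r2")),   # r2 = A + B
        ("add(2)", "r4", ("r3", "r4")),   # r4 = C + D
        ("sub(3)", "r4", ("r2", "r4")),   # r4 = (A + B) - (C + D) = E
    ]

    done = set()
    cycle = 0
    while len(done) < len(program):
        cycle += 1
        issued = []
        for i, (name, dest, sources) in enumerate(program):
            if name in done:
                continue
            # Ready if no earlier, still-unfinished instruction writes one of our sources.
            ready = all(prev_dest not in sources
                        for prev_name, prev_dest, _ in program[:i]
                        if prev_name not in done)
            if ready and len(issued) < 2:    # two functional units available per cycle
                issued.append(name)
        done.update(issued)
        print(f"cycle {cycle}: {', '.join(issued)}")
    # Prints: cycle 1: add(1), add(2)   then   cycle 2: sub(3)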
This implies that within a program, conditional branch instructions delimit basic blocks along with entry points into the block. Graphically, the flow of control for a program can be represented with a control flow graph (CFG) as in Figure 7. An arc in an CFG that connects two basic blocks describes the control dependence, analogous to data dependence information within a DFG. In the example below the basic blocks, bb2 and bb3, are dependent upon a conditional branch at the end of bbl. These conditional branches inhibit execution within a processor’s pipeline because they interrupt the flow of instructions while the condition is being evaluated. A means to mitigate this effect on a processor is shown in Figure 7b. It involves predicting the outcome of a branch and speculatively executing 14 past the basic block boundaries. In the figure, the predicate p = true is predicted to be the outcome so that bbl and bb2 can be processed or scheduled as a single unit. This means control flow of instructions does not have to stop at the branch. But, if the prediction is incorrect, then the processor must backup to the branch and execute bb3. This can be performed with either software or hardware. In the case of software, compensation code [24][25] can undo speculative computation. At the other extreme, a hardware mechanism that restores an entire saved processor state can be employed [44]. bbl iffeH hen { fbtel } else { fbbJl i . I bb41 (a) Figure 7: A Code Fragment (a) and its CFG (b) A control dependency graph (CDG) can be derived from the CFG and indicates if basic blocks are independent [23]. In Figure 7 basic blocks, bb2 and bb3, are conditionally dependent upon the branch of bbl. But, all paths in the CFG ultimately lead to bb4. This implies that bb4 is actually control independent of the other basic blocks and can be executed in parallel. This is shown in Figure 8 with bb4 only dependent upon the entry of the program fragment. entry PC2 PC exit bbl bb3 bb4 bb2 Figure 8: CDG of Program Fragment 15 A DFG illustrates how a multiple control flow processor can pursue paths simultaneously. First a DFG is constructed from the basic block structure of a program. At compile-time these blocks can be arranged in different areas of instruction memory so that independent program counters (i.e. PCI and PC2) can access them without interference from the other portion of the program. However, synchronization is needed at the entry and exit of the parallel section so that the original serial semantics is preserved. 2.4 Software Techniques to Expose ILP The goal of software techniques for microparallel processors is to translate a program, as far as possible, into independent instructions. If the instructions that are generated are dependent, the next best thing is to schedule or order the instructions so that the hardware will not stall. The primary concern here is due to memory latency or resource conflicts. The memory latency is complicated by the fact that caches are used to buffer memory requests. This leads to variable memory delay. The resource conflicts are from register reuse along with functional unit and bus conflicts within the processor. Many of the methods for generating independent instructions are assembly code transformations performed at the back-end of the compiler. These are describes as techniques that schedule instructions across basic block boundaries. The main idea is to merge the blocks along a predicted path of conditional branches. 
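As a rough illustration of this idea, the sketch below follows a predicted path through the control flow graph of Figure 7 and collects the blocks into one enlarged scheduling region. The block names, the prediction map, and the stopping rule are simplifying assumptions, and a real scheduler would also insert compensation code for the off-trace path.

    cfg = {                        # successors of each basic block (Figure 7)
        "bb1": ["bb2", "bb3"],     # conditional branch ends bb1
        "bb2": ["bb4"],
        "bb3": ["bb4"],
        "bb4": [],
    }
    prediction = {"bb1": "bb2"}    # predicted successor at each conditional branch

    def merge_along_predicted_path(start):
        """Collect blocks reached by following predicted or unconditional edges."""
        region, block = [], start
        while block is not None and block not in region:
            region.append(block)
            succs = cfg[block]
            if block in prediction:        # follow the predicted arm of the branch
                block = prediction[block]
            elif len(succs) == 1:          # fall through an unconditional edge
                block = succs[0]
            else:                          # exit block or unpredicted branch
                block = None
        return region

    print(merge_along_predicted_path("bb1"))   # ['bb1', 'bb2', 'bb4']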
By enlarging the basic blocks the likelihood of the instructions being independent of each other increases. But, once a block can not be enlarged further, the instructions within the block are rearranged by techniques that take into consideration the latency and availability of functional units or memory resources. At this point, the problem is similar to a scheduling problem constrained by limited resources. In general this problem is NP-complete [31]. Thus heuritics must be employed to obtain near optimal results. 2.4.1 Scheduling Across Basic Blocks There are at least four techniques to consider that either merge basic blocks or schedule instructions across block boundaries. The set consists of trace scheduling [24], loop unrolling [47], software pipelining [12] and percolation scheduling[29]. This set of 16 techniques is by no means exhaustive, because there are many other important compile time techniques for code optimization. But an attempt has been made here to highlight those that are related to microparallel processors. Trace scheduling [24], the first technique, crosses basic block boundaries to increase instruction-level parallelism. This technique requires multiple passes for the optimization process. First the program with a representative data set is executed and profiled. Frequently executed paths that connect basic blocks in the CFG are identified and sent to the scheduler as one enlarged block of code. Compensation code is then added in case predicted branches are wrong. In Figure 7, if the predicted path was taken then bbl and bb2 could be merged and scheduled as one. But in the case when the predicate (p) is false, then the side effects of bb2 have to be compensated for and bb3 can then be executed. Loop unrolling [47] is a technique used to enlarge the body of loops. The key idea is to combine multiple iterations of a loop into a single iteration. This exposes potentially independent operations across iterations for scheduling and reduces the overhead of updating induction variables and executing branches. Software pipelining [12] is another means to transform loop constructs. In this case, the pattern of instructions and their data flow dependencies within the loop body are scheduled without regard to the iteration. Instructions from different iterations can be placed together as long as it reduces the over all execution time. This requires a prologue and epilogue code to be placed before and after the new loop kernel to start the loop and end it. Percolation Scheduling [29] also uses a CFG and applies code transformations to move instructions across block boundaries. These transformations preserve the semantics of the program independent of the outcome of the branches. For example, if an instruction in bb4, of Figure 7, is independent of any instruction in bb2 or bb3, then it can percolate up into bbl. 2.4.2 Scheduling Within Basic Blocks Once the alternatives to merge basic blocks have been exhausted, the instructions within a block can be scheduled to reduce the execution time based upon the latency and 17 availability of resources within a processor. The scheduling techniques for this problem are also well known. They include list scheduling [1], greedy scheduling [47], first- come first-served [16] and branch-and-bound [100] techniques. 2.5 Hardware Mechanisms to Exploit ILP At the gross organizational level the language of computer architecture abounds with the usage of Flynn’s structural taxonomy of four classifications; SISD, SIMD, MISD and MIMD [27]. 
His taxonomy characterizes how instruction and data streams interrelate as well as how processors are organized to achieve various levels of parallelism. For example, his term SIMD refers to an organization “Single-Instruction stream Multiple-Data stream” where in a computer executes one instruction stream but each instruction applies an operation on a set of data items. A classic example of this organization is the CRAY-1 with its vector operations. The primary purpose of this section is to describe the various ways instruction-level parallelism is exploited at the next lower level of implementation, the microarchitecture level underlying the classifications of Flynn’s taxonomy. Although superpipeline and superscalar processors similarly exploit instruction- level parallelism [49], this section concentrates on processors capable of multiple- instruction issue or execution per cycle identified as microparallel processors. This general processor category includes VLIW, superscalar, as well as decoupled organizations PC MU PC PC FU issue FU MU I-Memory I-Memory I-Memoiy VLIW Superscalar Decoupled Figure 9: Microarchitectures under consideration 18 Here a VLIW architecture provides parallel, tightly-coupled functional or memory units and completely relies upon the compiler to explicitly schedule instructions or components of the “long word1 ” for execution. A superscalar organization differs from a VLIW in that it provides dynamic instruction scheduling, either in the form of out-of-order issue or execution. The third organization is decoupled and typically it has multiple instruction streams each allowed to flow at rates determined by separate functional units. There are several proposals for classifying the architectures of microparallel processors [11][26][57]. None satisfactorily classify all of the above microparallel organizations. Additionally, these classification schemes are primarily based on specific implementations. This can limit the design space to existing microarchitectures instead of pointing to new alternative organizations. As a result, we describe the 16-Fold Way, a taxonomy based on the abstract behavior of the processing steps rather than on a particular structure or implementation. 2.6 16-Fold Way: A Microparallel Taxonomy All three of the above microparallel organizations can be described by the behavior of a pipeline of four operational stages at the microarchitecture level consisting of a fetch (F), decode (D), execute (E) and retire (R) stage. Figure 10 illustrates this simple pipeline model. The first stage fetches from memory instructions placed in order by the compiler and produces a stream of instructions. The next stage then decodes these instructions, gathers operands or possibly symbolic values for the operands and dispatches the instructions to the execute stage. The execution stage is a collection of parallel and possibly pipelined functional units and produces arithmetic as well as memory access results. These results are finally committed in the retire stage updating the state of the processor. Although the diagram below shows a single stream of instructions, we can have multiple instructions in a single stream entering and exiting a stage. 1 A long word is a format for an instruction that contains multiple operations to be execution in parallel, each in distinct fields of the instruction. 
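As a small illustration of the long-word format described in the footnote above, the sketch below packs several operations into one word with one field per unit. The four slot names and the nop filler are assumptions made for illustration and do not correspond to any particular machine discussed here.

    NOP = ("nop",)

    def make_long_word(branch=NOP, integer=NOP, load=NOP, store=NOP):
        """Pack up to four independent operations into one VLIW-style long word."""
        return {"branch": branch, "integer": integer, "load": load, "store": store}

    # The compiler fills only the slots it can prove independent; the remaining
    # slots stay as nops and their functional units idle for that cycle.
    word = make_long_word(integer=("add", "r2", "r1", "r2"),
                          load=("ld", "r5", "0(r6)"))
    print(word)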
19 (processor state) (I-memory) ...1312II Figure 10: Processor Stages Conceptually the micropipeline processes the stream of instructions at each stage in the order as initially fetched from memory. In contrast, a microparallel execution processor might transform or permute the instruction stream at any stage. This distinction between in-order and out-of-order processing is a useful behavioral abstraction that can describe all four operational stages and will be denoted as static and dynamic respectively. Thus, any stage that can change the number or order of the instructions with respect to the input of the stage will be considered dynamic. These two behavioral values (static vs. dynamic) and four variables (FDER) provide sixteen different ways to describe the behavior as well as the unique capabilities of a particular processor organization. The taxonomy is organized in a “K-map” for ease of comparison and positions the fully static processor classification (i.e. SSSS) in the upper left hand comer. This processor class is influenced entirely by the program order specified by the compiler. The categories with a dynamic stage represent processors with some capability to dynamically alter the processing order of the instructions. For instance, the processor classifications in the bottom two rows all have a dynamic fetch stage, while the middle two rows have a dynamic decode stage. For the other two stages, the right most two columns represent processors with a dynamic execution stage, while the middle two columns represent dynamic retirement processors. 20 ER ) \ SS SD DD DS s s SSSS SSSD SSDD SSDS SD SDSS SDSD SDDD SDDS DD DDSS DDSD DDDD DDDS DS DSSS DSSD DSDD DSDS Figure 11: 16-Fold Way Taxonomy 2.7 Static vs. Dynamic Behavior To elaborate on the differences between static and dynamic behaviors we have collected some examples of each. There are at least two variations of static behavior and seven dynamic operational behaviors. The static variations are named lock-step and synchronized, while the dynamic are called reorder, dispose, expand, compact, split, merge and decoupled. Although an icon for some variations include a very small block diagram of a stage, the behavior is not necessarily dependent upon this implementation. Rather these examples are used to demonstrate how others have successfully implemented the various behaviors. Lock-step . 1 3 1 2 II J n L, 1312II . 1 3 1 2 1 1 1 1 3 1 2 1 1 Synchronized .UI2I1-*T d T-* I3I2II .131211. + ..1 3 1 2 1 1 Table 2: Variations o f Static Behavior 21 The two static behaviors are quite simple (Table 2). For a given static stage the input and the output should have the same number of instructions and should be in the same basic order. For the lock-step variation, multiple instructions per cycle enter and exit a stage coupled together as a single word. This behavior is traditionally used throughout VLIW processors. In contrast a synchronized version maintains only the relative timing between otherwise independent instruction streams, in this case with queues. This implies that individual streams can be stalled when waiting for a value. The first dynamic behavior, reorder, is a simple permutation within an instruction stream and requires some form of buffering to allow instructions to be reordered (Table 3). This behavior is useful in executing instructions out-of-order to compensate for data dependencies or to reorder instructions back into a precise order. 
The dispose dynamic variation is used primarily to support speculative processing in the form of either fetch, decode or execution. For this behavior, a stage accepts instructions and conditionally throws away those associated with useless or mispredicted computation. Reorder -*J|| E — *//i3i2 Dispose ... 13<g)ll— • D -*...1311 t s Split / Merge __j s - i n Decoupled | F [ — i ► ...1 3 1 3 1 1 | T } - » 1 3 1 2 1 1 Expand / Compact 1,1-^, ...i3i2il ^*1... 1 1 - O b ’ Reorder + Dispose .*I3@^-* R -*...1113 tiff Table 3: Variations of Dynamic Behavior In the second row above, the split behavior accepts instructions and sorts them by class (e.g. integer, floating-point, or memory) potentially buffering them before issuing 22 to decoupled streams. Alternatively, the merge behavior can be used either at the decode or retirement stage to merge separated streams back into a single instruction stream. The next component variation, decoupled, is used to denote the behavior of two completely autonomous streams. For example, in the fetch stage this requires separate program counters or at the issue and retirement stages this means there are separate register file images, one for each stream. The last variation can expand a single instruction (II) into a set of internal operations (il, i2 and i3) which then perform specialized micro operations possibly in parallel. Typically this occurs at the issue stage where instructions are decoded. A compact behavior, shown along side the expand version, assembles these micro-operations back into the original form. The last box in Figure 3 illustrates how two component behaviors can be combined into one stage. This example of a “reorder + dispose” variation can be used at the retire stage to support both speculative execution as well as reordering of instructions into a precise order. Although these two static and seven dynamic component variations alone can theoretically lead to 94 possible processor configurations not all variations are compatible. For example, reordering the instruction stream at the fetch stage, before decode, makes little sense unless we consider an associative matching of data to instructions as in a data flow execution model. But with different combinations of component behaviors there still exists a vast number of combinations. Therefore it is important to maintain a manageable number of classifications otherwise details can easily overwhelm the original intent of the taxonomy. Indeed, to his credit, Flynn’s taxonomy has survived so long because of its simplicity in describing complex computer systems with only four classifications. With a similar intent, but at a lower level of abstraction, this section sets forth a taxonomy to understand the microparallelism found within the individual processors of a system with sixteen classes. 2.8 Processor Examples With the taxonomy now defined, this section sorts a collection of processors that have been built or proposed by their closest behavioral FDER microparallel classification. Each processor’s dynamic behavior is briefly described in the text that 23 follows. A small block-diagram is also provided in Table 4 and serves as a central reference point for many of the processors. It is an expanded version of Figure 11. 
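The naming scheme of the taxonomy can be summarized with a short sketch: each of the four stages (fetch, decode, execute, retire) is labeled S for static or D for dynamic, giving the sixteen classes. The small example dictionary simply mirrors two entries of Table 4 below and is illustrative only.

    from itertools import product

    classes = ["".join(bits) for bits in product("SD", repeat=4)]
    assert len(classes) == 16              # the sixteen FDER classifications

    examples = {"SSSS": "ELI-512", "SDSD": "ZS-1"}   # two entries from Table 4

    def classify(fetch, decode, execute, retire):
        """Build the FDER label from per-stage behaviors ('static' or 'dynamic')."""
        return "".join("D" if stage == "dynamic" else "S"
                       for stage in (fetch, decode, execute, retire))

    label = classify("static", "dynamic", "static", "dynamic")
    print(label, "->", examples.get(label, "no example listed"))   # SDSD -> ZS-1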
Table 4: Microparallel Processor Instances (classification and closest example)
SSSS: ELI-512        SSSD: TORCH        SSDD: Metaflow       SSDS: DISC
SDSS: DEC-21064      SDSD: ZS-1         SDDD: HPS            SDDS: IBM 360/91
DDSS: Cyclone        DDSD: MISC         DDDD: Hyperscalar    DDDS: Y-Pipe
DSSS: Wrench         DSSD: PIPE         DSDD: Multiscalar    DSDS: Sling-Shot

2.8.1 The ELI-512: An SSSS Processor
The first classification instance to consider is the ELI-512 [25], which is a VLIW architecture. All four of its operational stages process, in a lock-step manner, a single stream of long-word instructions without altering the order. This results in an SSSS behavior classification. The obvious advantage of this style of processor is the simplicity of the hardware design. This can lead to a very wide instruction width in excess of ten operations per word. However, a VLIW processor does require considerable compile-time optimization to compensate for the rigid scheduling constraints, including aggressive loop unrolling and trace or percolation scheduling.

2.8.2 The ZS-1: An SDSD Processor
After instructions are statically fetched by a single unit, various dynamic scheduling behaviors can be employed to enhance the performance of a processor. One such behavior is when the instruction stream is logically split. An instance of this category is the ZS-1 [80], a processor with a split issue stage. The static fetch stage of the ZS-1 processes a single stream of paired instructions. The issue stage then separates the instructions based upon their type. This serial-to-parallel transformation alters the relative ordering of instructions and is thus considered dynamic. Because the now separated streams cooperate in performing a single task, there is a need to communicate between them. Thus in the ZS-1 the execute stage utilizes architectural queues to synchronize and pass data values between the streams. The advantage of this style of processing is that variable memory access latency is compensated for when there are sufficient independent ALU and memory operations. This is because the streams have the freedom to "breathe" by contracting or expanding. This is unlike the lock-step operation of an SSSS processor, which can stall all of the stages. At the end of the ZS-1, the separated streams are merged back into a single logical stream, so the retirement stage is also dynamic.

2.8.3 The IBM 360/91: An SDDS Processor
The IBM 360/91 [89] is a significant processor because it is one of the most well known examples with the capability to dynamically reorder instructions at the execution stage. This processor fetches in static order, but disposes of speculatively fetched instructions at the decode stage and executes out-of-order with the help of reservation stations associated with each functional unit. This elaborate associative storage mechanism allows a partially decoded instruction to reside in a station until its operands appear on a common data bus that connects to the result registers of all the functional units. Once all operands of an instruction are gathered, it is executed and sent to the retire stage to be processed in static order, disregarding the original fetch ordering.

2.8.4 The DEC-21064: An SDSS Processor
This classification contains many of the microparallel processors identified. One instance of this dynamic-decode class is the DEC-21064 microprocessor [19].
It has numerous features to support microparallelism. But, the only abstract dynamic behaviors to consider is its branch history table and scoreboard mechanism lumped together within the decode stage. The branch history table is similar in function to the IBM 360/9l ’s decode stage. This stage disposes of instructions speculatively fetched but are no longer of use, while the scoreboard mechanism is used to issue only instructions that have all of their operands ready. For correctly fetched and ready instructions, processing at the execute and retire stages is in-order or with a static behavior, unlike the dynamic-reordering execute stage of the IBM 360/91. 2.8.5 The DISC: An SSDS Processor A rather obscure instance of this class is the DISC [94]. This processor falls into this classification because after fetching and decoding in-order, its execute stage processes with a “forward and backward” routing network that circulates instructions until they can be executed. In the DISC, data dependencies are specified within the instructions themselves in the form of cycle-count tags and are dynamically maintained by the individual functional units in a distributed manner. This means that a functional unit knows when it is “safe” to execute an instruction regardless of the state of other units, with the exception of variable memory latency. After instructions finally exit the execute stage they are retired without modification in a lock-step manner representing a static behavior. 2.8.6 The TORCH: An SSSD Processor When only a dynamic retirement stage disposes of instructions that have been statically processed, it influences the data state or programmer’s view of the processor. This dynamic behavior can support the processor’s ability to translate control dependencies into data dependencies, also known as speculative execution. An instance 26 of this behavior is the TORCH processor [78][81]. Its fetch, issue and execute stages are considered static in this study. The dynamic retire stage, however, disposes of instructions by selectively deleting useless instructions that are part of speculative computation. This dynamic behavior allows the compiler to enlarge basic blocks via branch prediction, boosting performance by exposing more instructions to a global scheduling algorithm, while at the same time providing an efficient run-time mechanism to delete the instructions. 2.8.7 The Metaflow: An SSDD Processor The unfortunate effect of out-of-order execution is that a processor can no longer guarantee precise interrupts without reordering the side effects of instructions back into the original fetched order [47]. To accomplish this task a reorder dynamic behavior can be placed at the last stage. This operation retires instructions according to the order specified by the program counter giving the programmer the illusion of sequential execution along with the benefit of parallel processing. A processor that provides this behavior along with out-of-order execution is the Metaflow [71]. This processor model actually implements a dynamic execute and retire stage in one unit called the DRIS which is a combination reservation station and reorder unit. Unlike the other processors, speculative computation can be in the process of executing and retiring so both these stages are augmented with a dispose variation to throw instructions out of the DRIS. 2.8.8 The HPS: An SDDD Processor One of the more ambitious efforts to exploit instruction-level parallelism is the High Performance Substrate (HPS). 
This processor model provides restricted data flow execution and over the years the HPS processor model has evolved to support a number of microparallel features for a variety of instruction sets. In this case the instruction set of the VAX architecture is a good example [70]. To support the multitude of addressing modes and interesting instruction semantics, the HPS can expand complex instructions into primitive “nodes,” or micro-operations issued to reservation stations called “node tables.” After executing in a dynamic execute stage, the micro-operations are compacted back into an image of the original instructions and reordered into a precise interrupt model. 27 The main advantage of this style of dynamic expand and compact behavior is code compatibility along with microparallel execution. The cost of this compatibility is the considerable hardware to exploit both the intra-instruction and inter-instruction level parallelism. As such, the HPS is an extremely aggressive example of a static fetch and dynamic decode, execute and retire processor. While at the other end of the spectrum, a fully static VLIW processor typifies a relatively low cost alternative relying more upon the compiler to expose low-level parallelism. 2.8.9 The PIPE: An DSSD Processor The other main class of processors in this taxonomy involve some form of decoupled fetching. The first processor to implement a truly decoupled fetch stage was the PIPE [34]. It utilized a dynamic fetch and retirement stage resulting in a DSSD classification. It has the unique distinction of supporting two completely independent instruction pipelines synchronized at the decode and execute stages with architectural queues. The two streams are fetched from separate code regions and indexed by independent program counters. Each instruction stream has its own register file in the decode stage. This is also where branch outcomes are passed between the streams to synchronize global branch points. The execute stage has the same queue mechanism as the ZS-1, allowing ALU operations to overlap with memory operations. There are advantages however, to the PIPE architecture over the ZS-1 including the fact that completely separate streams can simplify the design of each instruction pipeline and if one stream is stalled at the fetch stage the other is free to continue processing. The challenge for PIPE is to efficiently balance the computation across multiple streams while keeping the communication between the cooperating pipelines to a minimum [101]. 2.8.10 The Cyclone: An DDSS Processor An interesting example of the DDSS classification is the Cyclone [42], a combination of a decoupled fetch and merging decode stage. Like the PIPE, this processor has two separate program counters and is thus considered a decoupled fetch processor. However, each PC fetches a portion of the same program with one PC periodically jumping ahead of the other. The issue stage dynamically pairs together the 28 resulting two dynamic streams forming a single merged instruction stream sometimes throwing away instructions that were speculatively fetched. The instruction pairs are then processed in-order while in the execute and retire stages result in a DDSS behavior. Not all pairs of instructions are allowed to execute in the Cyclone, because of the limited arrangement and number of functional units in the data path. 
But, by optimizing for the most frequently occurring instruction pairs, the Cyclone’s control path is able to efficiently utilize the data path and other processor resources. 2.8.11 The MISC: An DDSD Processor The processor instance of this class is the MISC [90] and is best described as a message-passing PIPE. Instead of passing computed predicates to the other pipelines, all information is sent via the queues at the execute stage, including simple movement of data. Although this will increase the bandwidth needed to synchronize the pipelines and distribute the control flow computation, it can simplify the implementation of communication between the streams. This is helpful for the MISC as it extends the PIPE from two instruction streams to the four used by the MISC. When the MISC is in a “vector” mode, the processor resembles an DDSD behavior as shown in Table 4. In this mode instructions are conditionally expanded depending on a vector length register. The pipelines cooperate by passing data to caches and other pipelines via queues. Although, the MISC also has an alternate mode issuing instructions in a sentinel manner [90] with the issue stages synchronized (i.e. a DSSD), it has such a unique dynamic and decoupled behavior that it is placed in this class. 2.8.12 The Multiscalar: An DSDD Processor The instance of this classification is named the Multiscalar [30], because it consists of multiple copies of superscalar processors interconnected in a ring topology with communication queues. The distinguishing features of this processor include its decoupled fetch, synchronized decode, decoupled execution and dispose retirement stages. As with the MISC, the legacy of this processor is easily observed when compared to the PIPE, as in Table 4. In fact, the PIPE can be converted into a Multiscalar by simply deleting the synchronizing queues of the execution stage and adding a dynamic dispose behavior in the retirement stage. 29 The basic idea o f the Multiscalar is to follow multiple flows of control in a speculative manner. The control flows can sometimes be control dependent. So, the compiler attempts to eliminate this situation by using the control-dependence graph (CDG) to identify independent tasks. Data dependencies can also exist between the instruction in each processor. The correct order of execution is maintained with the data queue interconnecting the distributed register files. Unlike the PIPE however, the queues between the processors also pass the results of speculative computation. These results are thrown away in the retirement stage if the branch predictions are incorrect. A processor architecture similar to the Multiscalar is the POPE [7]. This processor executed a unification operation in parallel on a collection of tightly-coupled execution units2 interconnected in a similar manner as the Multiscalar. Buffers between the execution units hold the register images of the unification operations. The POPE pursues multiple control flows o f unification operations in a speculative manner with decoupled execution stages. Thus it too is a DSDD processor. 2.8.13 The Y-Pipe: A n DDDS Processor The last variation on decoupled architectures separates the pipelines down to a dynamic execution and static retirement stage and is called the Y-Pipe [56] so named because it resembles the letter “Y.” This DDDS behavior supports zero-cycle branch delays by duplicating the fetch and decode resources eagerly executing one instruction down both paths of a conditional branch. 
Execution then proceeds down only one path after the conditional is resolved. This can conceptually be done by deleting the unneeded instructions from the streams at the execute stage and only retiring useful computation.

2.9 Additional Processor Instances
The above processor instances are of course not alone in their respective classifications. Indeed, there are many commercial and experimental processors that have microparallel capability. Table 5 below lists some of these processors along with the closest classification.

2 These were called processors in the POPE architecture.

Table 5: List of Other Microparallel Processors
CDC-6600 [88]: SDSS        Pentium [10]: SDSS
Multiflow [14]: SSSS       Power 601 [107]: SDSD
IBM RS/6000 [6]: SDDD      Power 603 [108]: SDSD
i860 [105]: SSSS           Power 604 [109]: SDDD
i960CA [106]: SDSS         SIMP [65]: SDDD
iWARP [13]: SSSS           Super-SPARC [9]: SDSS
LIFE [82]: SSSD            Swordfish [62]: SDSS
MC68060 [35]: SDSS         T9000 [28]: SSSS
MC88110: SDSD              XIMD [98]: DDSS

Most of the processors have only one or two dynamic stages, except for the RS/6000, SIMP and XIMD. These are described below in detail. Some of the more recent VLIW processors include the Multiflow, i860, iWARP and T9000. Although the LIFE processor is also called a VLIW, it is not a fully static processor because its guarded execution is similar to the TORCH's speculative execution mechanism. For processors with two to four issue widths, modest amounts of dynamic behavior are provided with their branch prediction or out-of-order issue capability.

The IBM RS/6000 deserves special recognition because it was one of the first "superscalar" microprocessors to be introduced. Its dynamic behavior comes from a split and dispose decode unit followed by a decoupled execute and merge retire stage. This corresponds to an SDDD classification. The decode stage disposes of mispredicted branches and also separates and buffers instructions in small queues. The decoupled execution stages (i.e. branch, floating-point and integer units) are then free to process instructions independently, although they do not have to communicate with queues as in the ZS-1.

Another interesting microparallel processor is the Single Instruction Stream / Multiple Instruction Pipelining processor, or SIMP. This processor fetches a block of four instructions at a time and distributes them to independent instruction pipelines while activities are coordinated through a centralized register file. Within each pipeline, instructions are buffered and allowed to execute out-of-order. But only when all four instructions from the same fetched block are completed can the entire block be retired. This requires instructions to be gathered and reordered back into the original fetch order (i.e. merge and reorder). The SIMP also has speculative execution support that allows instructions to be executed and conditionally deleted. This is why it is shown below with an additional dispose retirement behavior.

Figure 12: IBM RS/6000

Figure 13: SIMP

The final established processor to consider is the XIMD, a hybrid between an SIMD and MIMD processor. It contains separate fetch stages for independent instruction streams. But all of these streams are logically merged to access a central register file image, unlike the PIPE which has completely separate register files for each stream.
Figure 14: XIMD When the XIMD is executing in an SIMD mode all fetch units access the instruction memory with the same pattern, simulating the behavior of a VLIW. In a MIMD mode the fetch units diverge while following independent threads of computation. For either mode, however, the resultant dynamic instruction stream is merged and processed lock step and in-order. The result is a static behavior for the execute and retire stages. This demonstrates how a processor can straddle two classifications. In this case, the XIMD can operate both with a SSSS and a DDSS processor behavior. 2.10 Postulated Processor Architectures The previous sections provide established processor instances that fit a particular behavioral classification. This section fills the remaining unexplored “boxes” by combining various dynamic stages together to form three new processors and postulates their features. 33 2.10.1 The Wrench3: A New DSSS Processor One problem with the PIPE is with exceptions because a programmer must compensate for the autonomous retirement of separate instruction streams. Furthermore if both pipelines are used to resolve an exception and restart execution, then both also need to context switch. To handle this we can synchronize the streams at the retirement stage with “guard” instructions that will check for traps and holdup the retirement of instructions until the both guard instructions are retired at the same time. This transforms the decoupled retirement stage of the PIPE into a synchronization stage and therefore an DSSS behavior. This behavior guarantees points in the execution so that the streams will be synchronized. This simplifies the context switch operations. 2.10.2 The Sling-Shot4: A New DSDS Processor The Y-Pipe is inadequate in that only one branch is eagerly processed by the duplicated resources. This simpliftes the control and buffering of instructions but it does not fully utilize the potential of speculative processing down two threads of computation or pursuing multiple-control flows [59]. To enable multiple-control flows, the decode stages of the Y-Pipe are synchronized. This means each stream is free to predict branches down different paths yet still communicate with a predicate queue. This queue holds the outcomes of the branch instructions for each stream. There is, however, the possibility that portions of each predicted path are incorrect. This is because branch prediction is used on each thread and the resolution of the branches might take several cycles. During this time other branches are processed. This situation will require rather sophisticated bookkeeping of the branches, possibly increasing the penalty of restarting the fetch stage down the correct path. 2.10.3 The Hyperscalar: A new DDDD Processor The final processor to consider is a hardware intensive processor capable of performing speculative computation through the execution stage, but without the compilation techniques of the TORCH. Thus the object code can have serial semantics and the processor can dynamically allocate a different pipeline of the processor for every 3So named because a wrench fixes a PIPE. ^Resembles a sling-slot in Table 3. 34 possible outcome of a branch. The illustration in Table 4 only shows two pipelines, but conceptually there can be any number of these pipelines. Each pipeline can operate independently, replicating execution at branch points and processing all paths until the retire stage where only one path is committed to the processor state. 
Certainly, this is unrealizable as a single-chip microprocessor. But the intent is clear; to burden the hardware with the task of dynamically exploiting instruction-level parallelism while still executing existing object code. 2.11 Conclusion In this chapter both software techniques and hardware mechanisms are covered for exposing and exploiting instruction-level parallelism. To further understand the hardware mechanisms we proposed a two valued, four variable microparallel taxonomy, called the 16-Fold Way. The result is sixteen different processor microarchitecture classifications. We categorized established processors into thirteen of these sixteen different classes. Each processor is defined in terms of its behavior during the four stages of instruction processing; fetch, decode, execute and retire. The behavior can be one of nine component behaviors; lock-step, synchronize, reorder, dispose, expand, compact, split, merge and decoupled. These behaviors are broadly classified as either static or dynamic with some processors containing a combination. From Table 4 we can see that the fetch stage is either a single unit or a pair of units. In the first case this stage fetches a single stream of instructions, occasionally fetching down one predicted path. In the second case it can be a decoupled (dynamic) stage that either fetches multiple cooperating streams or eagerly fetches multiple paths within a single program. At the decode stage the diversity of dynamic behavior varies the most, consisting of all static and dynamic behaviors except compacting. For the execute stage, the figure reveals that for established processor categories (i.e. white boxes) the most common dynamic behavior is reordering. For the last stage, retiring of instructions generally requires some behavior to support either speculative execution by disposal or precise interrupts with reordering or synchronization. This microparallel taxonomy illustrates the great diversity of processors capable of executing instructions in parallel. Each classification holds an abundant variety of 35 processor organizations that have yet to be investigated. This chapter has outlined examples for each of the sixteen classifications but none are meant to be representative. Rather they are a first attempt at understanding what capabilities these classes contain. Another intent of this taxonomy is to provide as a means for postulating new designs. As new microparallel processors are proposed, this taxonomy can be used to systematically compare and evaluate these designs as will be done in the following chapter. 36 Chapter 3: Processor Performance Estimations 3.1 Introduction This chapter evaluates the relative performance of the microparallel processors described by the 16-Fold Way taxonomy. The goal of this chapter is to explore this very large design space to determine the highest performance processor. The design space is first narrowed down to a small number of general candidate processor classifications within the 16-Fold Way taxonomy. To accomplish this a study concerning the limits on control-flow parallelism is related to the taxonomy. This study measures the performance of six different abstract processor models instead of the sixteen classifications as in the taxonomy. However, these abstract models can be mapped to general classifications yielding a performance approximation of each. 
On the basis of this analysis, three of these general classifications are selected from the corresponding six abstract models and refined into specific processor architectures1 with resource constraints. Of the three, two have support for speculative execution. The difference between these two is that one is a single flow of control architecture while the other executes multiple flows of control. The third architecture has a single flow of control with no speculative support or any dynamic, instruction-scheduling mechanisms. This simple processor architecture serves as a base model for comparison purposes. Of all of the architectures it achieves the lowest performance rating but has the least complexity. Another goal of this chapter is to measure the relative contribution of dynamic execution and speculative retirement. Ultimately, this will help in determining if these two features are effective in terms of both cost and performance. The processor 1 Here processor architecture refers to the basic structure of a processor. This includes the distinctions made by the 16-Fold Way taxonomy along with the specific resources (e.g. number of ports to memory) within a processor. 37 architecture without these features serves as a base architecture indicating the relative increase in performance that each feature has on a single control flow processor. Then to determine if decoupling the processor’s architecture into separate execution units is effective we will compare the single and multiple control flow architectures but without varying the dynamic execution and retirement stages. Refinement of the specifications for the processor architectures is done for the benefit of detailed trace simulations. However, these simulations assume that a pipeline description is available for each processor architecture including the cycle-count latency and operations of the processors. Thus, another goal of this chapter is to provide this pipeline description based upon integrated circuit modules from a VLSI implementation. Although these circuits are not fully described in this chapter, they are initially presented here to simplify describing the cost-performance model in a later chapter. 3.2 Chapter Outline The structure of this chapter is as follows: The next section describes the simulation tools used to determine the cycle-count estimations of the processor architectures under investigation. Section 3.4 relates a study of abstract processor models to the taxonomy to determine which processor classifications are the best candidates for further investigation. Section 3.5 refines these classifications to specific processor architectures along with the simulation tools used to estimate the instructions per cycle (IPC) metric. Section 3.6 introduces the cost-performance model of this investigation and shows how IPC and cycle-time are related to get an adjusted measure of the performance. The results of the trace simulations for the processor architectures are presented in Section 3.7, first for the single control flow and later for the multiple control flow processor architectures. Finally conclusions are made in the last section. 3.3 Simulation Methodology With trace driven simulation a program’s execution is monitored by either intrusive code or execution tracing while running a benchmark program. The resulting dynamic stream of instructions and memory references then feed into a separate program along 38 with a description of a processor’s architecture to determine the rate of instructions per cycle. 
A total of three simulators are used to estimate the performance of the architectures. All simulators use the object code of the same instruction set architecture [51] generated from the same compiler technology [33]. The three simulators are named Dsim, Ssim and Mutltiscalar2. The difference between these simulators is that Dsim gives only a rough approximation of the performance estimated with both compile-time analysis and run-time mechanisms. The Ssim and Multiscalar simulators provide more detailed analysis. Ssim is a superscalar processor simulator by Johnson [47]. It will model the performance of single flow of control processor architectures. The Ssim also provides abundant parameters for the memory hierarchy and functional unit configurations. Multiple control-flow performance is determined by the Multiscalar simulator by Franklin [30]. This latter simulator has fewer parameters, but actually emulates the instructions within the user portion of a program and simulates the operating systems calls. As an instruction-level simulator, Multiscalar is slightly different from the other trace simulator, Ssim. Thus we will separate the results of the two simulators and make comparisons when needed. 3.4 Candidate Processors The first task of this chapter is to narrow the processor design space to a small set of classifications that have a potential for high performance. To accomplish this we consider a related study provided by Lam [59]. This study is used to estimate the relative performance of some of the classifications within the 16-Fold Way taxonomy. The study uses the Dsim simulator that determines the ideal execution of a program assuming a small set of hardware mechanisms. In particular it analyses the performance of two hardware mechanisms; speculative execution (SP) and multiple flows of control (MF). To exploit multiple flows of control, a static analysis is first performed on the benchmark object code to determine the control flow dependence between basic blocks. 2 Technically the author of the simulator named it "mips." This leads to confusion when describing the simulations, performance and instruction set architecture. So it is referred to here as the Multiscalar. 39 This information is represented in a similar manner as the control flow dependence graph (CDG) of Figure 8. This information is then used during the run-time simulation to determine if an instruction is independent of other instructions. The analysis technique numbers all basic blocks by a unique control flow identifier [IS]. Instructions within the blocks also inherit these identifiers and are then ordered by data flow dependences. Thus both intra-thread and inter-thread data dependencies are maintained even if such dependencies are specified through registers or memory addresses. The Dsim simulation environment is shown below in Figure 15. The benchmark program (prog) is modified with intrusive monitoring code by a program named pixie. The resulting object code runs exactly as the original even with respect to file input and output. This modified program image is then feed to an analysis tool called Presim. Preliminary processing is performed once per benchmark program and is used to determine the control dependence graph for each program and gain profile information for static branch predictions. Each model of a processor architecture then takes a separate run of the trace simulator, Dsim. Pixie -*( prog.pixie! 
Figure 15: Ideal Processor Model Simulator (the instrumented program image, control dependence graph, input data and processor model feed Dsim, which produces the cycle-count results)

Dsim runs the modified program as a child process. The input data set is then fed to this child process along with a description of the processor configuration. The output of the simulation is the estimated cycle-count of the program under the scheduling constraints of the different abstract processor models. The models defined for Dsim are combinations of hardware mechanisms for speculative execution (SP) and multiple flows of control (MF) along with the control flow dependence (CD) software technique. Not all of the combinations of hardware and software options are meaningful. For example, a multiple control flow processor is useless without control flow analysis. Thus, the MF and MF-SP configurations are not investigated because they require a CD analysis. The resulting six abstract processor models are described below in Table 6. They are arranged in descending order of scheduling constraints: the first has the most constraints of the six while the last one has the least amount of instruction-scheduling constraints.

Table 6: Instruction Scheduling Policies
Base: An instruction cannot execute until the immediately preceding branch in the trace is resolved. Branches are executed in order, one per cycle.
CD: An instruction cannot execute until its control dependence branches are resolved. Branches execute in order, one per cycle.
SP: An instruction cannot execute until the immediately preceding mispredicted branch in the trace is resolved. Branch instructions must wait for all preceding mispredicted branches.
CD-MF: An instruction cannot execute until its control dependence branches are resolved. Multiple branches execute in parallel and out-of-order.
SP-CD: An instruction cannot execute until its mispredicted control dependence branches are resolved. A branch instruction must wait for all preceding mispredicted branches.
SP-CD-MF: An instruction cannot execute until its mispredicted control dependence branches are resolved. Multiple branches execute in parallel and out-of-order.

These six abstract processor models are mapped to classifications of the 16-Fold Way taxonomy to obtain a first-order approximation of processor performance. We assume that processor architectures that are instances of a given classification will have a performance no greater than that indicated. Unfortunately, this lumps many processor architectures together. But as a first-order approximation this mapping suffices as a selection process. The Base processor is mapped to the fully static classification because it represents the simplest processor architecture. It has no speculative execution or multiple control flow support, although it can benefit from control flow analysis, as can all of the processor classifications. Speculative execution support encompasses all of the processor classifications that have dynamic retirement. This is shown in the middle two columns of Figure 16 with the SP, SP-CD and SP-CD-MF models. The multiple flow of control (MF) models map to any processor with a dynamic (decoupled) fetch stage with separate and independent program counters.
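To make the difference between two of these policies concrete, the sketch below computes the earliest cycle at which an instruction could execute under the Base and SP constraints of Table 6, given a toy trace. The trace format and the single-constraint model are illustrative and are not how Dsim is implemented.

    def earliest_issue(trace, index, policy):
        """Earliest cycle instruction `index` may execute under Base or SP."""
        earliest = 0
        for is_branch, mispredicted, resolve_cycle in trace[:index]:
            if policy == "Base" and is_branch:
                # every preceding branch must be resolved first
                earliest = max(earliest, resolve_cycle + 1)
            elif policy == "SP" and is_branch and mispredicted:
                # speculation: only mispredicted branches block progress
                earliest = max(earliest, resolve_cycle + 1)
        return earliest

    # (is_branch, mispredicted, resolve_cycle) for three earlier instructions
    trace = [(True, True, 3), (False, False, 0), (True, False, 7)]
    print(earliest_issue(trace, 3, "Base"))   # 8: waits for the last branch
    print(earliest_issue(trace, 3, "SP"))     # 4: only the mispredicted branch blocks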
Figure 16: Mapping of Abstract Processor Models (estimated speedups arranged on the 16-Fold Way map: Base 2, CD 2, SP 7, SP-CD 13, CD-MF 7, SP-CD-MF 40)

The numbers in the boxes represent the estimated speedup over a scalar or single-issue processor model. Although all MF processors have well over six times improvement, the SP-CD-MF category includes extremely high performance of almost forty times improvement. In fact this general class of processor has the highest performance of all the models. The next highest performance processor model is the SP-CD with thirteen times improvement potential. However, there are many assumptions made in this study, including the fact that there are an unbounded number of single-cycle functional and memory units. Any number of control flows can be simultaneously pursued and data dependencies in control flow arguments are ignored.

3 This Base processor refers to the name of the abstract machine of the Dsim simulator.
4 These speedup values have been rounded to the nearest integer.

3.5 Processor Architectures
As can be seen from these assumptions, the study only isolates the effects that control flow has on instruction-level parallelism. Despite this fact, the study is adequate to filter out some of the processor classifications. It shows that most of the processor models have relatively low performance when compared to the speculative and multiple control flow processor architectures and thus should not be pursued. This includes the SSDS, SDSS and SDDS classifications because they can only realize a speedup of 2 in this ideal environment. This can be attributed to the low levels of instruction parallelism within individual basic blocks, since these models cannot speculatively cross basic block boundaries to execute instructions. Similarly, the DDSS, DDDS, DSSS and DSDS classifications only realize a speedup of 7. Although these are respectable performance improvements, they are inadequate when compared to the other processor categories. Thus two candidates to investigate further are the SSDD and DSDD classifications because they are included in the two highest performance classifications. Processor architectures that are instances of these classifications are illustrated below in Figure 17 along with the SSSS instance. The differences between these processor architectures are highlighted in the figure. The fully static processor does not have a reordering execution stage or a speculative retirement stage as the SSDD processor does. The difference between the SSDD and the DSDD is that the fetch stage is decoupled in the DSDD and the decode stage now has a synchronize behavior. The classification names SSSS, SSDD and DSDD will also be used to identify processor architectures. These identifiers are used instead of the terms VLIW, superscalar and Multiscalar; the latter terms are actually instances of the classifications.

Figure 17: Candidate Processor Classifications (the SSSS, SSDD and DSDD pipelines)

To bound the performance for these different processor architectures, more trace simulation is performed but with both detailed and restricted models. Also sensitivity analysis is performed with respect to the issue-width, cache capacity, number of memory ports to the caches and the size of the instruction window. With these three architectures now selected, the next sections describe the basic operations of these processors.
This includes the functional unit and instruction scheduling resources that implement the dynamic behavior of the processors as well as the architectural parameters used as inputs to the detailed simulations. 3.5.1 SSSS Processor The first processor to define is the fully static processor or basically a Very Long Instruction Word (VLIW6) processor. Unfortunately, no sophisticated compiler technology is used to expose instruction-level parallelism or schedule the instructions to accommodate the parallel resources. It processes instructions in a lock-step or in-order manner. As such it represents a lower bound on the performance of a SSSS processor that maintains in-order execution throughout the stages. 6 Technically this should be called a Large Instruction Word (LIW) processor, because it executes a relatively small (< 10) instructions per cycle. 44 Bypass Figure 18: SSSS Block Diagram For this SSSS processor, a group of instructions each W-words wide are fetched from an on-chip cache and are decoded together as a group (Figure 18). Operands are read from the register file or bypassed during the next stage for register read and instruction decode. Because there are W instruction words the register file must supply at most 2W operands per cycle. Although it has been shown that frequently fewer than 2W unique register specifiers are required per cycle [47]. Unfortunately, the complication of selecting the correct register output and then driving data to a functional unit requires too much overhead for a high frequency clock. Also, the goal is not to stall for any condition other than perhaps a cache miss. This means that an issue-width of W=2 implies 4 read-ports and 2 write-ports to the register file. After instructions are decoded and issued in-order they are dispatched to the execution stage where the functional and memory units reside. There are four categories of instructions that the processor can execute in parallel; branch (Bm), integer (INT), load (LD) and store (ST). No floating point operations are in the benchmark suite, so no resources are allocated for this category. Each functional unit executes the instructions in-order. Once instructions are finished in the execution stage they are simply retired. Additional parameters are used to defined the number of ports (P) to the data cache along with the total cache capacity measured in Kilobytes (K). For the instruction cache, 45 one single W-wide port is assumed. This port supplies the entire group of instructions per cycle. For the purposes of this study misalignment of instructions within the cache block is ignored. 3.5.2 SSDD Processor The SSDD or superscalar model contains two dynamic mechanisms to accommodate the execution of instructions. The behavioral description of this processor includes a reordering execute unit and a speculative retirement unit (Figure 19). The basic categories of instructions executed in parallel remain the same as with the SSSS configuration. The difference being that instructions are reordered in the execution stage. Also instructions that have been predicted to execute by the branch prediction mechanism are allowed to execute and bypass their data values into the pipeline. A reordering write-back stage then temporarily holds them until the branch outcome is resolved. Upon mispredictions the reorder buffer is flushed of incorrectly predicted instructions. Because of these additional hardware mechanisms, two more parameters are needed for the SSDD architecture than for the SSSS. 
These are the reorder buffer size (RO) and the reservation station capacity (RS) per functional unit.

[Figure 19: SSDD Block Diagram]

All busses to the register file, reorder unit and bypass are wide enough to accommodate the parallel execution of instructions. This is similar to the HPS design [44] but different from the original Tomasulo algorithm, which provided only one common data bus (CDB) to bypass operands [89].

3.5.3 DSDD Processor

For the multiple control-flow processor, separate execution units (EU) are provided to pursue the different threads of computation. A processor is then a collection of these execution units, each varying in issue width (W). The block diagram in Figure 20 is similar to the SSDD design except that the register files of the different units are connected together with a queue mechanism. These queues communicate data values between the different threads of computation. Each unit contains a buffered version of the computation of another unit. This buffered version holds the results of speculative computation until they can be committed to the register file.

[Figure 20: DSDD Block Diagram — two execution units sharing a fetch unit and the data cache]

Only two execution units are shown in the figure, but this DSDD architecture can be scaled to an arbitrary number of execution units interconnected in a ring topology. Each execution unit holds the partial execution of the previous unit until the results are committed. As with the SSSS or SSDD architectures, the individual execution units of the Multiscalar issue a relatively small number of instructions per cycle. But a number of parallel execution units are interconnected on a single chip, resulting in an aggregate throughput equal to the product of the number of execution units (E) and the width (W) of each unit.

3.5.4 SSSS and SSDD Simulator

The simulator used for the single control-flow processors (i.e. SSSS and SSDD) is derived from a trace simulation environment from Johnson [47]. This simulator is similar to the ideal processors of Lam, except that it does not need to consult a control-flow dependence graph during execution, because it pursues only one control flow. The basic simulator tasks are the same. A benchmark program is modified with intrusive code and run as a child process along with its input (Figure 21). This generates a trace of instructions that is fed into a routine modeling how the target processor architecture would execute them. Along with the cycle-count tally, other run-time statistics can be gathered, including the cache hit rate, functional unit utilizations and the distribution of instructions issued in parallel.

[Figure 21: Superscalar Simulator (Ssim) — an instrumented benchmark produces a trace that drives the processor model and yields results]

A slight modification was made to the simulator so that the start-up code of the Prolog execution was not included. This start-up code downloads the symbol table and initial stack images. Normally this operation would be performed by the UNIX loader, but for the benchmark programs ported to the MIPS platform it is performed by the application program itself.
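Before turning to the Multiscalar simulator, the trace-driven flow just described for Ssim can be summarized in a short sketch. This is an illustrative outline only, not the actual simulator; the record format and the toy cache model are assumptions made for the example.

    # Minimal sketch of a trace-driven timing measurement in the style of Ssim.
    # The trace is assumed to be an iterable of (opcode, address) records produced
    # by the instrumented benchmark running as a child process.
    def simulate(trace, miss_penalty=12, block_size=16):
        """Tally cycle count and cache statistics for a toy single-issue model."""
        cycles = instructions = mem_refs = hits = 0
        resident_blocks = set()              # unbounded toy cache, for illustration only
        for opcode, address in trace:
            instructions += 1
            cycles += 1                      # one cycle per instruction in this sketch
            if opcode in ("LD", "ST"):
                mem_refs += 1
                block = address // block_size
                if block in resident_blocks:
                    hits += 1
                else:
                    cycles += miss_penalty   # fixed 12-cycle miss service latency
                    resident_blocks.add(block)
        return {"IPC": instructions / cycles,
                "hit_ratio": hits / mem_refs if mem_refs else 1.0}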
3.5.5 DSDD Simulator

The simulator of the DSDD or Multiscalar differs from the single control-flow simulator in that it does not modify the object code of the benchmark program. Instead, the object code is partitioned by a task generator at run time and executed by the Multiscalar simulator. The task generator demarcates speculatively executed threads of computation for each execution unit. In essence, the task generator predicts branch outcomes and allocates a control flow to an execution unit. If the prediction is correct, computation is allowed to proceed and a new control flow can be allocated to an available execution unit. If the prediction is incorrect, the thread's computation results are thrown away by invalidating the buffered version of the computation. Scheduling of instructions within the execution units is unconstrained up to the specified issue width. Unfortunately, this simulator does not support the various cache architectures and capacities that Ssim does, so the results presented are for a fixed cache capacity and organization.

3.5.6 Architectural Parameters

Although the classifications of the 16-Fold Way uniquely identify the processor architectures, many secondary architectural features remain to be considered. Table 7 summarizes these architectural parameters for the simulation environment, consisting of a combination of the taxonomy classification and the pipeline parameters. The pipeline parameters consist of the issue width of the processor (W), the number of ports to the caches (P), the capacity of the on-chip caches (K), the number of execution units (E), the capacity of the reorder buffer (RO), the capacity of the reservation stations (RS) and finally the functional unit configurations. Some of the parameters apply to only one or two of the processor architectures; this is indicated by the processor column, which lists the applicable classifications.

Parameter                   Processor             Values
Classification (FDER)       -                     {SSSS, SSDD, DSDD}
Issue Width (W)             SSSS, SSDD, DSDD      {W1, W2, W4, W8}
Ports (P)                   SSSS, SSDD, DSDD      {P1, P2, P4, P8}
Cache Capacity (K)          SSSS, SSDD            {4kB, 8kB, 16kB, 32kB, 64kB, 128kB, 256kB}
Execution Units (E)         DSDD                  {E1, E2, E4, E8}
Reorder Buffer (RO)         SSDD                  {RO8, RO16, RO32, RO64, RO128, RO256}
Reservation Stations (RS)   SSDD                  {RS2, RS4, RS8}
Functional Units            SSDD                  {[1,1,1,1], ..., [8,8,8,8]} (varying the number of functional units in the execution stage)

Table 7: Architectural Parameters

There are too many of these parameters to search exhaustively for an optimal combination, so only a subset of the permutations is explored in this study. The effect of a given parameter is isolated by fixing all other parameters and varying that one over the ranges indicated above (a short sketch of this sweep strategy is given below). For example, when measuring the effect that memory ports have on an architecture, the maximum values of the cache capacity, issue width, etc. are used for the SSDD configuration. This measures the greatest performance potential for a given feature. Based upon these measurements we can then propose a smaller range of options for a given architecture instead of all the permutations at once. This narrows the design space considerably and also indicates how much each parameter contributes to performance.

3.6 Cost-Performance Model

With the basic processor architectures, parameters and simulators now defined, the next crucial part of the evaluation is the implementation of the processor architectures. This involves defining the pipeline of each processor in terms of integrated circuit modules and determining how the design of these circuits influences the pipeline design.
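The one-at-a-time sweep strategy of Section 3.5.6 can be stated compactly. The dictionary below mirrors Table 7, and run_simulation is a placeholder standing in for a call to the trace simulator; this is a sketch of the methodology, not the tooling actually used.

    # Illustrative one-at-a-time sensitivity sweep over the Table 7 parameters.
    PARAMETERS = {
        "W":  [1, 2, 4, 8],                      # issue width
        "P":  [1, 2, 4, 8],                      # ports to the data cache
        "K":  [4, 8, 16, 32, 64, 128, 256],      # cache capacity in kB
        "RO": [8, 16, 32, 64, 128, 256],         # reorder buffer entries (SSDD only)
        "RS": [2, 4, 8],                         # reservation station entries (SSDD only)
    }
    MAX_CONFIG = {name: values[-1] for name, values in PARAMETERS.items()}

    def sensitivity_sweep(run_simulation):
        """Vary one parameter over its range while holding all others at their maximum."""
        results = {}
        for name, values in PARAMETERS.items():
            for value in values:
                config = dict(MAX_CONFIG, **{name: value})
                results[(name, value)] = run_simulation(config)   # e.g. returns an IPC value
        return results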
50 For the purposes of this study there are two influences that circuit implementation has on a processor architectures. The first is the number of clock cycles or latency of a particular pipeline stage requiring it to execute an instruction. For example, a cache memory can require a number of cycles to perform a memory access and this number can vary depending upon how aggressively the circuits of the memory are partitioned. Given this information, the second influence that integrated circuit implementation has on the processor performance is on the cycle-time of the clock. This is also dependent upon how a module is partitioned along with the architectural parameters for a processor (e.g. W, P or K). This effect is measured in terms of the length of the clock period. In this section, we discuss only the first effect of clock cycles per processor stage. In Chapter 4 we will investigate how the architectural parameters effect the cycle-time as well as the silicon area required to implement the architectures. Starting with a block diagram of Figure 22, we can decompose a generic processor into its component modules. This includes a data and control path, an instruction and data cache and an interface to an external portion of the memory hierarchy. The data path is the portion of a processor containing the functional units and the processor state in the register file, along with partial execution of instructions in staging latches. The data path is normally under the direction of a control path. Within this control path is the opcode pipeline, instruction decoders and other circuits that perform the processing of control flow. The cache memory, for both data and instructions, provide large capacity on-chip storage while the external interface module is a collection of pads, control circuity and wiring to connect the internal data busses to the off-chip storage. 51 Control Path External Interface External Memory Figure 22: Block Diagram of Processor The cost-performance model for this study uses four of these basic modules, including the entire data path, data cache, instruction cache and a portion of the control. This is because the data path and caches are usually involved in the critical timing path to process instructions and thus has the greatest influence on the cycle-time. Processing the control flow instructions and exception can also impact the cycle time. For example, the time it takes to calculate a condition code value and propagate this information to a branch offset unit is another candidate critical path. Exceptions or traps also are notorious for complicating a pipeline design due to the amount of state information that must be saved. To fully model these artifacts of processor design would require complete implementation, well beyond the scope of this analysis. Portions of these critical operations are included in the cycle count and cycle-time model. For example, the adder calculating the carry out bit is modeled that can be used for a branch adder. So although this model is incomplete, it does capture most of the critical timing issues involved with VLSI implementation of a instruction set processor. 3.6.1 Mapping From Abstract to Pipeline Stages Chapter 2 partitioned this generic processor architecture into four abstract stages; fetch, decode, execute and retire. These abstract stages describe the policy for statically or dynamically scheduling the instructions within each stage but have no notion of the time required to perform these operations. 
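The two implementation effects just described, extra clock cycles per stage and a longer clock period, combine multiplicatively into execution time. A minimal sketch of that book-keeping, with made-up numbers purely for illustration:

    def execution_time_ns(instruction_count, ipc, clock_period_ns):
        """Execution time once both cycle count (Chapter 3) and cycle time (Chapter 4) are known."""
        cycles = instruction_count / ipc
        return cycles * clock_period_ns

    # Hypothetical design points: a higher-IPC configuration can still lose overall
    # if its extra hardware stretches the clock period.
    small = execution_time_ns(1_000_000, ipc=1.4, clock_period_ns=5.0)   # about 3.6e6 ns
    large = execution_time_ns(1_000_000, ipc=2.1, clock_period_ns=6.0)   # about 2.9e6 ns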
To determine these times, each abstract stage requires one or more pipeline stages built from circuit modules. The mapping from the abstract processor stages to circuit modules is illustrated in Figure 23, using the SSDD or superscalar architecture as an example.

[Figure 23: Pipeline Description with Circuit Modules — I. abstract stages (fetch, decode, execute, retire); II. pipeline stages; III. circuit modules (memory array row decoders and column read/write, register file read and write decoders, functional unit)]

In this figure, the abstract fetch stage (F) is mapped to the pipeline stages of an instruction cache access (I1, I2). This represents the time required to fetch and propagate the opcodes to the control path and the register specifiers to the data path. The next abstract stage, decode (D), is mapped to two clock cycles. The first is a register file (RF) access in which operands are read and instructions are decoded. Instructions and operands are then dispatched to the parallel functional units during the bypass (BP) stage, so named because a complete cycle is needed to bypass operands from the various functional units within the execution stage during this time. The superscalar dynamic-reorder and execution stage (E) requires four pipeline stages. The first cycle is dedicated to instruction queueing and dispatch from the reservation station (RS). The next cycle is used to perform a memory offset addition (EX). The last two cycles are for the pipelined data cache memory access (D1, D2), which has essentially the same timing as the instruction cache. The dynamic retirement stage consists of a reorder buffer (RO) stage and finally a register write-back (WB) stage. The reorder buffer accepts instructions from the parallel functional units and reorders them back into program order before the results are committed to the register file image. Both require a clock cycle in the pipeline.

Both the SSDD and DSDD processor architectures follow a similar mapping and are the most complex because they have the longest pipeline structure. The SSSS or VLIW architecture does not have a reordering execution or retirement stage, so these circuit modules and their corresponding pipeline stages are left out of that architecture. The SSSS pipeline is shown in Figure 24.

[Figure 24: Pipeline of VLIW]

3.6.2 Mapping From Pipeline Stages to Circuit Modules

At the next lower level of implementation in Figure 23, pipeline stages are mapped to the integrated circuit modules that perform their functions. Each pipeline stage roughly corresponds to one circuit module, with the exception of the caches, which require two cycles per module. For the fetch stage a pipelined SRAM module is used to implement the instruction cache. For the decode pipeline stage there are two modules to consider: a register file and a bypass unit. The register file is assumed to be a 32-bit by 32-word multi-ported SRAM. The bypass unit is a bus structure with drivers connecting the functional units and data cache memory back to the inputs of the execution stage. For the execution stage three types of modules are used to model the pipeline timing. The first is the reservation station, used to implement the dynamic reorder behavior. The next module is a functional unit.
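The Figure 23 mapping can also be written out as data. The stage mnemonics follow the figure (I1/I2 instruction cache, RF register read, BP bypass, RS reservation station, EX execute, D1/D2 data cache, RO reorder, WB write-back); the SSSS variant below simply drops the RS and RO stages, which is my reading of the text, so treat this as a sketch rather than a definitive pipeline description.

    # Abstract stage -> pipeline stages for the SSDD (superscalar) mapping of Figure 23.
    SSDD_PIPELINE = {
        "Fetch":   ["I1", "I2"],              # two-cycle pipelined instruction cache access
        "Decode":  ["RF", "BP"],              # register read/decode, then operand bypass
        "Execute": ["RS", "EX", "D1", "D2"],  # reservation station, offset add, data cache
        "Retire":  ["RO", "WB"],              # reorder buffer, then register write-back
    }

    # The SSSS (VLIW) pipeline omits the dynamic reorder and retirement modules.
    SSSS_PIPELINE = {
        "Fetch":   ["I1", "I2"],
        "Decode":  ["RF", "BP"],
        "Execute": ["EX", "D1", "D2"],
        "Retire":  ["WB"],
    }

    ssdd_depth = sum(len(stages) for stages in SSDD_PIPELINE.values())   # 10 pipeline stages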
In Figure 23 this functional unit is indicated with an adder symbol, because an addition operation is typically the time-critical module among the functional units. The last module of the execution stage is the data cache memory. For modularity this design is the same one used in the instruction cache, so it too requires two cycles for a cache access. The last two pipeline stages map the retirement operation to a reorder buffer. It contains the look-ahead state of execution currently in the pipeline [47][79], while the register file of the decode stage contains the in-order image of completed instructions. The in-order and look-ahead images are combined using an associative search to find the most recent version of a register element. The last circuit module associated with the retirement pipeline stage is the final write into the register file. This is the same module used in the decode stage, but for a write operation.

3.6.3 Benchmarks

The last item to cover before presenting the simulation results is the benchmark suite. The suite of benchmark programs includes a subset of the SPECint92 programs [103] as well as symbolic computation represented by Prolog execution [37]. A listing of the programs is provided in Table 8. The SPECint92 programs were selected because of the length of their execution traces; most of the Prolog programs have comparatively short execution times and correspondingly small temporal and spatial cache locality, so an attempt was made to increase the size of the benchmarks to represent realistic processor behavior in workstation environments. The Prolog or symbolic computation was chosen because it offers a more challenging form of integer computation than the SPECint92 programs. Prolog execution typically includes a higher frequency of branch instructions, which implies that basic blocks tend to be smaller. With smaller basic blocks there are fewer instructions per block, which lowers the expected ILP. This will be illustrated in the following sections.

Name              Suite       K Cycles     Description of Benchmark
circuit           Prolog      6,157        Logic circuit synthesis
eqntott           SPECint92   1,401,518    Logic to truth table format
espresso          SPECint92   1,163,860    Minimizes boolean functions
fast_mu           Prolog      376          A fast version of the MU-Math problem
qsort             Prolog      339          Quicksort of 50 integers
queens_8          Prolog      429          All solutions to the eight queens problem
query             Prolog      492          Query of a population database
sendmore          Prolog      3,005        Crypto-arithmetic problem
simple_analyzer   Prolog      1,378        Mode analyzer of the Aquarius Compiler
tak               Prolog      3,593        Takeuchi function evaluation
xlisp             SPECint92   1,963,110    All solutions to the eight queens problem
zebra             Prolog      4,545        Solution to a simple word problem

Table 8: Benchmark Programs

3.7 Simulation Results

Before delving into the performance results it is important to return to the critical questions of this chapter. Ultimately, we want to see whether speculatively executing and reordering instructions within a control flow is important to performance. Additionally, we want to determine whether decoupling the fetch and decode stages of a processor can increase performance. To answer these questions this section presents the results of the trace simulations, varying the dynamic stages in question for the processors executing the suite of benchmarks. First the results of the single control-flow processors are reviewed. Each parameter value is varied for each architectural feature.
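The performance metric used throughout the remaining sections is instructions per cycle, and speedups are reported relative to a single-issue baseline running the same trace. The arithmetic, with invented cycle counts purely for illustration:

    def ipc(instruction_count, cycle_count):
        return instruction_count / cycle_count

    def speedup(candidate_cycles, baseline_cycles):
        """Speedup of a candidate configuration over the scalar baseline on the same trace."""
        return baseline_cycles / candidate_cycles

    # Hypothetical cycle counts for one benchmark trace of 2,000,000 instructions.
    baseline_cycles = 2_000_000                    # scalar model near one instruction per cycle
    ssdd_cycles = 850_000
    print(ipc(2_000_000, ssdd_cycles))             # about 2.35 IPC
    print(speedup(ssdd_cycles, baseline_cycles))   # about 2.35x over the scalar baseline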
Instructions per cycle (IPC) is measured to determine each feature's relative contribution to performance. Particular attention is given to the features that yield substantial performance improvement, in order to identify critical architectural parameter values that can be assessed in terms of cycle-time and silicon-area costs in the next chapter. Next, the multiple-flow-of-control architecture (i.e. DSDD) is presented for different parallel configurations. Less effort has been made to characterize the possible variations of individual execution units within the Multiscalar architecture. For example, each execution unit could be specialized for a particular control flow, which might mean more ALUs in one execution unit than in another. Instead the architecture is uniform, with homogeneous execution units. This provides a simple execution model for a compiler; otherwise the compiler would be burdened with the very difficult task of allocating computation based upon different resources within each execution unit. From a cost standpoint a processor with heterogeneous units could be optimized and thus have a better performance/cost ratio, but these variations are beyond the scope of this investigation.

3.7.1 SSSS and SSDD Performance

The four most important architectural parameters common to the SSSS and SSDD processors are the on-chip cache capacity, issue width, functional unit configuration and number of memory ports to the data cache. These features are interdependent. For example, cache capacity is important because if the caches are too small, effective memory latency dominates execution time to the point where additional resources contribute very little to overall performance. With sufficient cache capacity, however, other features such as the issue width can contribute to performance. These relationships are illustrated in Figure 25. The figure combines the SSSS and SSDD performance results for ease of comparison; they are also tabulated in Appendix A. The figure contains a total of six families of curves covering the two processor architectures. Each architecture is run with three issue-width values (W2, W4 and W8), and within these runs a family of curves consists of a number of memory port configurations up to the total issue-width capacity. So for an issue width of four (W4) we vary the number of ports from P1 to P4. Although some resolution is lost by growing the capacity by powers of two, sufficient information is gained to determine effective values of the parameters.

[Figure 25: IPC of SSDD and SSSS Processors — IPC versus cache capacity for the W2, W4 and W8 issue widths and port configurations; Wn = n words issued, Pn = n ports]

All of the processors can issue any combination of instructions up to a total capacity of W instructions per cycle. The separate instruction and data caches each block upon misses and are direct-mapped with a block size of 4 words. The service latency of a cache miss is fixed at 12 processor cycles per miss for these simulations. This delay should give ample time to perform an external memory translation and an access to a second-level, off-chip cache. The branch unit in this study has perfect prediction.
This is so that we can isolate the effects of each of the parameter independent of the prediction strategy used. This also tends to give an optimistic performance rating for these processors but uniformly so. Since perfect branch prediction is assumed this experiment indicates a best case scenario of what the dynamic execution stage can achieve in performance improvement 58 as well as for the other architectural parameters. Later the results of a separate simulation run is presented with a hardware branch predictor. This run measures the effects that speculative execution has on performance. From the figure, several observations and comparisons can be made between the different processor architectures as well as their architectural features. Firstly, the SSDD processor consistently has higher performance than the SSSS processor for a given issue-width, port capacity and cache size. However, this performance gap narrows as the issue-width widens. Secondly, the cache has the greatest impact on SSDD performance for the range of caches tested. The increase in performance is a much as a 2.5 speedup going from a 4kB cache to a 256kB cache. Thirdly, the SSDD processor gains very little benefit from an issue-width greater than 4 words, while the SSSS processor can gain as much as 20% in IPC going from the W4 to the W8 configuration. Lastly, for all of the processor configurations, two memory ports to the data cache practically covers all of the performance benefit of parallel memory operations. As was mentioned before, the difference in performance between the SSDD and SSSS processors stems primarily from the dynamic execution stage and how effective each processor is at using an instruction window. For the SSSS processor this window is limited by the width of the processor. Where as for the SSDD the instruction window to as much as 256 instructions. This instruction window is maintained through a combination of the reorder buffer and reservation station entries. These latter entries allows the SSDD processor to execute independent ALU and branch operations while a cache miss is being serviced. Since this feature is not in the SSSS processor it must suspend operations until the miss is serviced. The fact that caches play such a dominate role in performance in not surprising because of the 12 cycle overhead per cache miss. The majority of the memory references can be captured with a cache for 64kB for the SSDD processor. But for the SSSS processor architecture it seems that only an eight instruction wide processor benefits from such large caches. What is surprising is the lack of performance improvement from multiple ports to the memory. Symbolic computation as represented by Prolog execution in this 59 investigation has previously demonstrated a high frequency o f memory operations [20]. With this compiler [33] the memory references and port characteristics are similar to the SPECint92 programs. This issue will also be addressed in the section on functional unit configurations. 3.7.2 Cache Effects Since the capacity of on-chip caches has such a significant effect on performance this section investigates the architecture of these caches. Specifically we measure the effect on performance by varying set associatively (A), block size (B) and set size (S) where the capacity is expressed a sK = A xB xS . A simplified block diagram is shown below in Figure 26. 
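Written out, the capacity relation is K = A x B x S (associativity times block size times number of sets), and the hit ratios reported next come from replaying the memory-address trace against different organizations. A minimal LRU set-associative model of that measurement is sketched below; it is illustrative only and is not the simulator used in the study.

    from collections import OrderedDict

    def hit_ratio(addresses, capacity_bytes, block_bytes=16, assoc=1):
        """Replay a byte-address trace against an LRU set-associative cache model."""
        num_sets = capacity_bytes // (block_bytes * assoc)    # S = K / (A * B)
        sets = [OrderedDict() for _ in range(num_sets)]
        hits = 0
        for addr in addresses:
            block = addr // block_bytes
            victim_set = sets[block % num_sets]
            if block in victim_set:
                hits += 1
                victim_set.move_to_end(block)                 # refresh LRU position
            else:
                if len(victim_set) >= assoc:
                    victim_set.popitem(last=False)            # evict least recently used
                victim_set[block] = True
        return hits / len(addresses)

    # Example: a direct-mapped 4 kB cache with 16-byte blocks, the smallest design point.
    # ratio = hit_ratio(trace_addresses, capacity_bytes=4 * 1024)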
[Figure 26: Cache Parameters — address divided into tag, index and offset; capacity K = A x B x S]

We limit the cache capacity to a range of 4 kB to 256 kB because current microprocessors incorporate separate instruction and data caches ranging from 4K to 16K bytes of SRAM. Future technology could support a 256 kB cache, but only if roughly a quarter of a four-square-centimeter chip is devoted to the caches alone7. So it is feasible that a microprocessor could use a 256K cache on a single die. It is more likely that smaller first-level caches will be backed up by large unified caches [50] in multi-programming environments; however, for the purposes of this study we assume one large first-level cache.

7. This estimate results from the area-time cost model of Chapter 4. It is based upon a CMOS process technology with a 0.25 micron gate length.

We describe the results of the trace simulations in terms of a hit ratio independent of the processor's architecture. The ratio is based upon the memory addresses generated from the trace simulation. The range of hit ratios shown in Figure 27 is limited to 75% to 100% in order to magnify the differences between the architectures. The graph on the left shows how increasing the block size of a direct-mapped cache (A1) results in a higher hit ratio. Ranging from 16 to 128 bytes per block, the graph shows that a block size of 64 captures most of the memory locality, with a modest amount of prefetching from the larger block sizes.

[Figure 27: Cache Architecture vs. Hit Rate — hit ratio (75-100%) versus capacity (kB); left: block sizes B16-B128 for a direct-mapped cache; right: associativity; Bn = n bytes/block, An = n-way associative]

Set size is also varied as the capacity is increased from 4K to 256K. The largest set size corresponds to the smallest block size and associativity: a cache with 16 bytes per block (B16), an associativity of one (A1) and a 256K-byte capacity has the largest number of sets, 16K. The effect of varying the associativity is shown in the right-hand graph. Very little is gained beyond a set associativity of two because of the relatively large capacity of the caches, although going from a direct-mapped cache to an associativity of two does gain roughly 3% in hit ratio. Thus block size plays the dominant role in determining a higher hit ratio for these caches.

These results are important for the integrated cost model because the architecture of large-capacity caches influences the overall cycle-time of the processor. It is obvious from the previous section that larger caches offer better performance if the cycle-time stays the same. The challenge is to optimize the architectural parameters to achieve both a high hit ratio and a low cycle-time. The next chapter describes specifically how each feature affects the clock frequency and the performance. For now we describe the caches only in terms of their total capacity and assume a direct-mapped cache with a block size of 16 bytes.

3.7.3 Functional Unit Configurations

For microparallel processors another issue is the configuration of the functional and memory resources. The simulation results of the previous sections assume there are as many resources as needed for a W-word-wide processor (i.e. W ALUs or W branch units). An exception is that the number of memory ports can be less than the word width.
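In the configuration notation used below, a design point is a count of units per instruction class, written [branch, ALU, load, store]. As a small illustration (the helper and names are invented for this sketch), a fetched group can issue together in one cycle only if no class demands more units than are configured:

    from collections import Counter

    # Units available per instruction class: [Bm, ALU, LD, ST] written as a dictionary.
    CONFIG = {"Bm": 1, "ALU": 2, "LD": 1, "ST": 1}

    def can_issue_together(instruction_classes, config=CONFIG):
        """True if a decoded group fits within the per-class functional unit counts."""
        demand = Counter(instruction_classes)
        return all(count <= config.get(cls, 0) for cls, count in demand.items())

    print(can_issue_together(["ALU", "ALU", "LD", "Bm"]))   # True:  2 ALUs, 1 LD, 1 Bm available
    print(can_issue_together(["ALU", "ALU", "ALU", "ST"]))  # False: only 2 ALUs configured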
Obviously not all of these units are needed, given the relatively low IPC ratings. This raises the next question: what execution resources are needed for the data path of the SSSS and SSDD processors? To answer this we measure the performance of only the SSDD architecture, because it has the highest IPC rating and thus should require the greatest execution resources to sustain high instruction throughput.

As a first approximation of the number and type of resources needed, the distribution of instruction types is shown in Figure 28 for each benchmark, along with the average distribution on the far right. An interesting fact mentioned earlier is that symbolic computation has a higher percentage of branch instructions than the integer benchmarks: on average the SPECint92 benchmarks consist of 22% branch instructions while the Prolog programs average 33%. One culprit behind the higher percentage is that Prolog is a language without type declarations. Because data types are not declared, they must be determined dynamically at run time before execution can proceed. Although the compiler can infer the types of many procedure arguments [33], not all can be determined. The result is that the percentage of ALU operations for the entire suite is slightly lower than for an integer program such as eqntott or espresso [103].

[Figure 28: Instruction Distribution — fraction of branch, ALU, load and store instructions per benchmark, with the suite average at the far right]

If this distribution were uniform we could allocate processor resources according to frequency of occurrence; that is, for every load and store unit we should have roughly two ALU and two branch units. Unfortunately this assumption is incorrect. In reality instruction classes tend to follow burst patterns. For example, when a procedure is called a stack frame is usually pushed onto a control stack to save state information, generating a burst of store operations. Similarly, a procedure return causes a burst of load operations to restore the procedure's state. So although the distribution is helpful in describing the general resources required, some form of validation should be performed for different configurations to ensure that the data and control paths are utilized fully.

Completely characterizing the different permutations would require an exhaustive search. Instead we present three different simulation runs. Two of the runs isolate the contribution of each type of resource. This is indicated in the figure with either all ones and the functional unit variable (e.g. [1,A,1,1]) or with all eights and the functional unit variable (e.g. [8,A,8,8]). The first case measures the effect of having only one of each unit except for the ALU; this variable is varied to measure its relative contribution to performance, and the same is done for each of the other functional and memory units. In the second case we perform a similar experiment but with eight units of each category. The third run measures the effect of increasing all the resources at the same time (i.e. [W,W,W,W]).

We start with a minimal processor of one branch, ALU, load and store unit (i.e. [1,1,1,1]). All other architectural parameters are at their maximum (e.g. W8 and 256K). This processor configuration can execute 1.9 instructions per cycle. If we increase one resource at a time we get the curves in the middle of Figure 29. The curves reveal that adding one ALU has the greatest performance advantage, which is not surprising considering that ALU operations occur most frequently. However, the next highest performance curve is for the store unit.
This is surprising because it has the lowest average frequency of occurrence. The figure also shows that adding more branch units or load units have little or no effect on performance with only one adder and one store unit. IPC 2.70 t ■ -------[W,W,W,W] 2.50 -• [4,4,4,4] [8,8,8,8] [1.A.1.1] 2.30 \ - - 2.10 ■> ‘ [1,1,1,81 [8,8,8,S] 1.90 K [l,1 ,1 ,1 ] 1 ------ [8,8,L,8] 1.70 ■ ■ O [8,A,8,8] X [B,8,8,8] # Resources 1.50 1 2 4 8 Figure 29: IPC vs. Configurations At the other extreme starting with eight of each resource class (i.e. [8,8,8,8]) we get similar results. That is if we had to omit a functional or memory unit the last type we want to omit is the ALU because with only one of these and eight of the others the 64 performance suffers the most. Similarly, the store, branch and load units follow the order specified by the minimal configuration. So for the benchmark programs traced the processor configurations should follow a simple guideline: Have as many ALU's as possible followed by store, branch and load units respectively. Luckily, this guideline is relatively easy to follow because an ALU is one of the simplest and least costly processor resources to include. However, the store and branch units have considerable overhead associated with them. For the store unit, an additional port to memory is needed. This additional port complicates both the memory cell and overall cache design. Some of these design complexity issues concerning the cache are developed in the next chapter along with how these configurations effect the time spent for operand bypassing. The branch processing unit also requires considerable hardware because it must supply a comparison unit, branch offset unit and control logic (i.e. prediction logic). 3.7.4 Reservation Station and Reorder Buffer The last two architectural features to measure are the buffers used to reorder the instruction sequence in the execution and retirement stages. For the execution stage, reservation stations (Figure 30) provide the synchronization mechanism that allow instructions to await operands and be dispatched. These instruction buffers are distributed one per functional or memory unit and connect to the output busses of the other parallel functional units. Combining all of the reservations stations together represent the total capacity of the decoded instructions awaiting execution. Figure 30: Reservation Station These reservation stations are relatively small buffers compared to the reorder unit or register file. They contain only up to eight entries each. The required depth of the result busses \ ALU 65 station is dependent upon the processing rate of the functional unit and the number of instructions dispatched to the station. Because the functional and memory units all can execute an instruction per cycle in a pipelined manner and there are as many units as the processor can issue. The second feature measured in this section is the reorder buffer (Figure 31). This mechanism stores speculative execution or the look-ahead state of the processor [79]. Conceptually this buffer is a circular queue with head and tail pointers. Entries between the pointers are considered valid. During the instruction decode stage, the next available buffer entry is allocated to the instruction. The tail pointer is used as the next entry address into the buffer and also serves as the tag for the instruction. This tag is used to match the results when the instruction completes execution. 
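The circular-queue behavior just described, allocating an entry at the tail during decode, using the entry index as the result tag, and committing from the head once the result is valid, can be sketched behaviorally. This is an illustrative model with invented names, not the hardware design:

    class ReorderBuffer:
        """Behavioral sketch of a circular reorder buffer (illustrative only)."""

        def __init__(self, size=64):
            self.entries = [None] * size       # each entry: {"dest": reg, "value": ..., "done": bool}
            self.head = self.tail = self.count = 0

        def allocate(self, dest_reg):
            """Allocate the tail entry at decode; the entry index serves as the result tag."""
            assert self.count < len(self.entries), "buffer full: decode must stall"
            tag = self.tail
            self.entries[tag] = {"dest": dest_reg, "value": None, "done": False}
            self.tail = (self.tail + 1) % len(self.entries)
            self.count += 1
            return tag

        def write_result(self, tag, value):
            """Match a completing instruction to its entry by tag."""
            self.entries[tag].update(value=value, done=True)

        def commit(self, register_file):
            """Retire completed entries from the head into the register file, in program order."""
            while self.count and self.entries[self.head]["done"]:
                entry = self.entries[self.head]
                register_file[entry["dest"]] = entry["value"]
                self.head = (self.head + 1) % len(self.entries)
                self.count -= 1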
[Figure 31: Reorder Buffer — results from the instruction data path flow through the bypass into the reorder buffer and then into the register file]

When an entry at the head of the buffer contains a valid result and does not have an exception, it is committed to the register file. If an exception exists, issuing is stopped and preparation is made to handle the exception. At this point the register file is in a precise state, so context switching is relatively easy and fast. The contents of the reorder buffer must be merged with the contents of the register file; otherwise the results of speculatively executed instructions that have not yet been committed are not properly forwarded to the instructions in the pipeline. This merging complicates the bypassing operation because operands can now be found in one of three places: the register file, the data path or the reorder buffer. It complicates the associative matching of the register specifiers and also means the bypass unit has to accept inputs from the reorder buffer as well.

The simulation results are shown in Figure 32, for the reorder buffer on the left and the reservation stations on the right. For the reorder buffer the results are similar to another study on available instruction-level parallelism [93], in which a buffer size of 128 captured 91% of the attainable parallelism and a size of 32 captured 50%. For this configuration a buffer size of 64 captures well over 90% of the performance potential.

[Figure 32: IPC for Reorder Buffer and Reservation Stations — IPC versus reorder buffer entries (RO, left) and reservation station entries (RS, right)]

For the reservation stations, the figure on the right indicates that having just 4 entries achieves close to the maximum performance. Even a station of 2 entries achieves roughly 50% of the performance potential.

3.7.5 DSDD Performance

For the DSDD classification the Multiscalar [30] simulator is used to estimate the performance of several configurations with varying issue width and number of execution units. All of the configurations in this section were run with the same perfect branch prediction as the single control-flow processors, along with a 256-Kbyte central data cache. An example of a multiple-execution-unit processor architecture is shown in Figure 33, with a ring topology of four execution units. The communication between these units is unidirectional and queued. Each unit is capable of executing between one and eight instructions per cycle depending upon the configuration. Each execution unit, however, can access only one port of the central data cache. This means a configuration with 4 execution units needs 4 ports even if each unit is only 2 instructions wide. Each unit performs the computation of a separate task of the same program. The units communicate results to neighboring execution units either with a register image in the queue or with global data structures in the data cache. For example, a program may contain a loop. A single task can be allocated for each loop iteration and execution can proceed in a manner similar to software pipelining, with one iteration on each execution unit. The MISC has a similar allocation strategy while in a vector processing mode [90].
A more interesting allocation strategy is to assign separate control-independent basic blocks to each unit and pass the speculative computation to the neighbor. Because the queues between the execution units can buffer these results the net effect is a reordering buffer that is distributed about the processor. Both of these techniques are used to exploited instruction-level parallelism in the simulator of this processor classification. Execution Unit P0 PI P2 P3 Data Cache Figure 33: A Ring of Four Execution Units Figure 34 presents the performance of four different configurations with increasing issue-width plotted against the number of executions contained is a processor. For example, a W2E2 data point represents a processor with two execution units each 68 capable of issuing two instructions words in parallel. Likewise a W4E2 data point represents a processor with two, four-word wide executions units. IPC W8 W4 W2 (W 4E 2) 3 - - (W 2E 2) W1 Execution Units E1 E2 E4 E8 Figure 34: IPC vs. DSDD Configurations The performance estimations are slightly higher than for the Ssim at the single execution unit point. This can be seen with the W8E1 configuration. Its performance is slightly higher than the Ssim’s simulation for the basic processor architecture. The difference in performance stems from the methods of simulation. The Multiscalar simulator is technically an instruction-level simulator. This means it models the actual operations of the trace by executing the instructions on a simulated data and control path. Normally, this would imply that is more accurate than a trace simulator. Unfortunately, the Multiscalar has different scheduling assumptions for the reorder buffer. It assumes an infinite sized buffer for example. The result is that the performance estimated by the Multiscalar is optimistic compared to the Ssim for the single control flow architectures. If we assume that it is uniformly optimistic across all of the configurations, we can still draw conclusions concerning the relationship between the speedup gains from wider execution units and an increase in the number of execution units. 69 If we first fix the number of execution units and view the increase in performance from varying the width then a general observation can be made. That is increasing the width from one to two words results in the greatest gain. Additionally after four instructions wide little is gained in performance. This is similar to the results of the single control flow architecture. In that case the number of execution units is fixed at one (i.e. E l) and we varied the width from two to eight. If the number of execution units is varied the results indicate that the greatest gain is from decoupling one unit into two units. This results in a speedup of 1.69. After this the speedup begins to taper off going from two units to four and from four to eight respectively. 3.7.6 Effects of Branch Prediction All of the simulation runs thus far have been with perfect branch prediction to measure the upper bound on.the performance of the single and multiple control flow architectures along with the architectural parameters applicable to the pipeline stages of the processors. To understand the effects of branch prediction of these architectures, the simulations were again run with their respective prediction strategies. The dynamic branch prediction unit of the single control flow processor is a branch target buffer of associatively 4 with 2,048 entries. For the benchmarks, this prediction unit has an accuracy of 87%. 
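A rough, hedged sketch of how such an accuracy figure can be measured: replay the branch trace against a small table of 2-bit saturating counters. The organization below is simplified (no tags, targets or associativity, so it only approximates the 4-way, 2,048-entry buffer described above), and all names are invented for the example.

    def prediction_accuracy(branch_trace, entries=2048):
        """Fraction of correctly predicted branches for a table of 2-bit saturating counters.

        branch_trace is a sequence of (pc, taken) pairs; target addresses and
        associativity conflicts are ignored in this sketch.
        """
        counters = [2] * entries                      # initialize to weakly taken
        correct = 0
        for pc, taken in branch_trace:
            index = (pc >> 2) % entries
            predicted_taken = counters[index] >= 2
            correct += (predicted_taken == taken)
            counters[index] = min(3, counters[index] + 1) if taken else max(0, counters[index] - 1)
        return correct / len(branch_trace)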
The Multiscalar uses a slightly different notion of prediction for tasks and but uses a similar mechanism that stores a short history of the branches that needs to be predicted. The accuracy of this unit for the benchmarks is slightly higher at 89%. The effect of the prediction inaccuracy on the performance of these processors is shown below in Figure 35. The SSDD and DSDD processors are measured with the maximum configurations for the other architectural parameters. 70 L JSSD O (b) IPC Wn = n words En = n Exc.units Pn = n Ports 2.50 2.00 'W 4 1.50 1.00 r.W l 0.50 0.00 E 1 E 2 E4 E8 W 2 W 4 W 8 Figure 35: SSDD (a) and DSDD (b) with Branch Prediction The results indicate that for a single control flow processor the performance is lower across all the configurations compared to the runs with perfect prediction. However, the IPC value of the SSDD is still well above the SSSS architecture despite the branch prediction effects. This is because the SSSS and SSDD architectures both suffer from similar a loss in the instruction-level parallelism available to the processor. For the DSDD processor the results are slightly different. Although all of the configurations experience a similar loss in performance the processors with four or more execution units suffer the most. Basically this is saying that the task prediction strategy used by the Multiscalar can do little more than two control flows at a time. Past this very little is gained. In effect the tapering off point now has been shifter from about eight execution units to two. Even with these effects the decoupled architecture with speculative execution support achieves the highest performance with an W8E2 configuration at just over 2 IPC while the best that a single control flow processor can do is estimated conservatively by the Ssim simulator at 1.4 IPC and optimistically at 1.7 IPC by the Multiscalar simulator (i.e. E1W8). 3.8 Conclusions This chapter has selected three general classifications from the 16-Fold Way taxonomy as high performance candidate processors. The evidence that two of these candidates have the highest performance comes from a related study on speculative 71 execution of multiple flows of control. The idealized estimations of these abstract classifications are 13 times and 40 times speedup over a scalar processor. However the abstract processor models do not include overhead for communication and synchronization between threads of computation. Instructions are scheduled so that only the data-flow or true data dependences are maintained. But normally a program requires some form of stack manipulation for book keeping purposes. These complications are ignored by the idealized study but not by the trace simulations. Thus the ideal speedups are extremely hard to realize when we define specific processor architectures for the classifications and start to limit the hardware resources available. The results from the second set of trace simulations agree with the first study that an DSDD processor should have higher performance than an SSDD. However the difference is far narrower than first suggested. This points out a very important moral of this chapter: There is a significant difference between available and attainable instruction-level parallelism. The study on limits of control flow on instruction-level parallelism is a experiment in available instruction-level parallelism much like the early VLIW studies showing an ILP of 99 instructions per cycle for scientific, data parallel computation [24]. 
Unfortunately, the attainable performance is even much lower for the resource constrained DSDD. This is potentially the fault of the abstract machine model8 because the Multiscalar simulator starts with assembly language code compiled to a sequential or single flow of control processor architecture. Then during run-time of the simulation the instruction-level parallelism exposed. Increased instruction-level parallelism and higher performance should be attainable with a parallelizing compiler. Such a parallelizing compiler would partition the control flow and data sets differently from a compiler optimized for sequential processing. Despite this flaw the Multiscalar does have a considerable performance advantage over single control-flow processor like the SSDD. This supports the thesis that we should make the processor decoupled. But * ■ Here an abstract machine model is defined as a processor architecture along with its associated compiler. 72 because of branch or task prediction inaccuracy it makes little sense to have more than two or four execution units. For the single-control flow processors the most important architectural feature is the cache capacity. The next set of features depend upon the type of processor. Certainly some basic rules become evident. First, that only one port to memory is really needed but a slight advantage does come from a second port. Secondly, instruction window size is the next most important feature. For the SSDD processor this is measured in terms of the reorder capacity while for the SSSS processor the same effect is had by increasing the issue-width. For the functional unit configurations it was shown that adding another ALU is probably the simplest means of improving execution throughput and that surprisingly having another store unit is the next best resource to add. In this chapter adding more architectural features always increased performance because the cycle-time was maintained constant. Thus we have only covered half of the story. The issues of design cost and performance will be quantified in the next chapter with an integrated circuit model of the processor stages to determine if the thesis is still valid in the face of implementation details. 73 Chapter 4: Processor Cost Estimations 4.1 Introduction This chapter describes the cost portion of the cost-performance model. It estimates the cycle-time and silicon area of a processor implemented as a single chip. The model follows a regular, structured design methodology for building VLSI systems [63]. This methodology involves the hierarchical construction of complex computing structures from a library of common circuit cells and modules. The model utilizes the fact that form follows function. The very regularity that goes into the construction of functional units, memory modules, data paths and ultimately the entire processor can be used to predict the size and speed of the design. The previous chapter described several trace simulation studies that estimated the cycle count of a processor given a description of its classification, pipeline description, and other architectural parameters (Table 7). The first goal of this cost model is to adjust these performance measurements with a modeled clock frequency. This frequency is proportional to design complexity. For the purposes of this chapter, complexity is measured by the amount and type of hardware resources incorporated into the processor architecture. 
This hardware is in terms of the capacity or organization of the cache memory along with the issue or execution width of the processor similarly defined by the parameters to the trace simulators of the previous chapter. In general, larger caches, more ports to memory, or wider issue widths lower the clock frequency. But with fewer of these hardware resources, the results of Chapter 3 show that a processor requires more clock cycles for a given benchmark program. So there is a balance to be struck between clock frequency and design complexity. The purpose of this chapter is to find such a balance. 74 The second goal of this chapter is to estimate the total silicon area needed to implement the features that sustain high performance execution. Under ideal circumstances, an entire wafer can be used to implement a complete computer system. This is illustrated below in Figure 36. On the left, a number of circuit modules are arranged on a wafer-sized die. Such a configuration could contain the processor, cache, main memory, and possibly the I/O interfaces. Unfortunately, wafer scale integration is unrealistic due, in part, to the redundancy of circuit modules required to achieve acceptable yields [5]. This redundancy is used for routing together tested circuit modules to form a complete working system. But, this routing effort can adversely effect on-chip (wafer) communication delays and little can be done to utilize the defective regions of the wafer. Circuit Module Wire Substrate Wafer Scale MCM Figure 36: Alternative Implementations Single Chip (Microprocessor) An alternative is to incorporate multi-chip module (MCM) technology. This technology effectively increases the silicon area budget by interconnecting multiple, bare die onto a small, high-performance wiring substrate. But this technology has its draw backs too. Most notably the problem of permanently affixing known good die to the substrate. Additionally, MCM interconnection between the die is still considerably slower than on-chip communications [5]. The last single chip alternative is the simplest. Though restrictive in total area allowed, it provides a uniform substrate to fabricate both high speed interconnections and transistors. Thus this chapter assumes a single chip implementation. This simplifies the communications overhead and overall floor plan of the chip. There is no need to 75 further partition the design into separate die as with the MCM technology o r provide redundant circuit modules as with wafer-scale technologies. 4.2 Chapter Outline This chapter is organized as follows: Section 4.3 contrasts the cost model of this chapter with other integrated circuit models. Section 4.4 defines the CMOS circuit style and characteristic delays for primitive circuits used in the construction of the modules. The next six sections describe the construction of the cache, register file, functional unit, reservation station, reorder buffer, and bypass unit respectively. For each of these circuit modules the latency and silicon area are modeled in individual subsections. Section 4.11 combines all of the modeled latency and area estimations of the modules into one area time model of the different processor architectures. Section 4.12 concludes the chapter with issues concerning the cost model. 4.3 Comparison to Other VLSI Models Models for integrated circuits are either analytic or empirical. An analytic, integrated circuit model describes the order-of-magnitude for the area and the latency of computing structures [64]. 
It does not yield absolute values (e.g. millimeters square or nanoseconds) for a given circuit implementation. Its usefulness is from the rnapping it defines from VLSI systems to a mathematical formalism. In contrast, the model presented in this chapter is empirical because it uses the actual layout of circuits as input to the model. In this regard, it is similar to study on area- efficient adders [96] using a simple area model and SPICE simulation for timing. These types of direct measurements suffice for small circuit modules within the processor because they usually do not change as a function of the architectural parameters. Larger circuit modules, like memory arrays, do change with these parameters so they must be modeled. For these circuit modules the area and latency can be calculated with primitive cells used to construct them along with the manner in which the cells are arranged. These primitive cells include of the memory cells, word line buffers, and bit line buffers. Other empirical models based upon a particular circuit structure do exist for single and multi-ported SRAM [92] [76]. However, these studies are slightly different from the model presented here. Both of these other models are for single cycle not pipelined 76 memories and the latter study on multi-ported memory uses BiCMOS instead of CMOS technology. Also, the analysis of previous multi-ported memory design is based upon an obsolete means of sensing or reading the bit lines. This significantly effects the total latency of the memory. 4.4 Circuit Implementation To understand the operation and timing of the processor pipelines and the circuit modules it is important to first cover the logic style used. The majority of this logic is derived from a dynamic logic style [95]. This logic style is contrasted in Figure 37 with a static or complimentary CMOS implementation of a NAND gate along with its dynamic version. The clock symbol ( < > „ ) within the gate denotes that it is dynamic and the n-subscript indicates that an n-logic block is used. In contrast to a static, complementary CMOS implementation, a dynamic logic gate requires only one logic block to implement the desired function. In the figure below, an n-logic block is used where the connplementary version requires both an n-block and p-block. a b Z a n-logic block Z (a ) (b) Figure 37: Complementary (a) and Dynamic (b) NAND Gate With dynamic logic it is convenient to view a clock cycle as two phases; one when the clock is high and the other when it is low. The dynamic implementation uses these phases to first pre-charge and then evaluate the function on opposite phases. During the pre-charge phase, the output of the n-logic block is driven to a logic level of one when the clock ( < ( > ) is at a low phase. This is accomplished with both an output pre-charge transistor (ml) and an isolation transistor (m4) that stops the current flow through the n- logic block during pre-charge. Evaluation occurs during the high-phase of the clock when the pre-charge transistor is turned off and the isolation transistor is turned on. This allows the inputs of the gate to conditionally discharge the output of the gate. Dynamic logic is very useful in high speed circuitry when appropriately sized NMOS transistors are used for the logic block (m2 and m3). A modified version of this circuit style has been successfully employed for both high speed functional units [61], memory designs [1] as well as for an entire data path [45]. Domino logic [95] is an alternate form of dynamic logic. 
These style of logic gates are interconnected in series with intervening buffer inverters. The evaluation of the gates is analogous to a row of dominos falling over with each gate triggering the evaluation of the next gate. Dynamic domino logic has the advantages over complementary implementations, as does dynamic logic in general, of lower input capacitance and smaller area resulting in higher speed. But a disadvantage of dynamic logic is that it is clocked. This results in timing constraints that must be maintained when interconnecting dynamic circuits. Also dynamic circuits have a dead phase when the circuit is pre charging meaning operations must be overlapped so that time is not wasted. This later problem is solved with a strategy of using a True-Single Phase Clock and special TSPC [102] latches allowing for both phases of a clock to be used for logic evaluation. In Figure 38, a design using TSPC logic and latches results in a pipeline where both phases of the clock are in use. The inverters normally used to buffer the domino gate are modified into dynamic TSPC latches indicated with the arrow in the boxes. An arrow pointing downwards indicates an n-latch device that allows data to flow through during a high phase and latching during a low going edge. Alternatively, an upward arrow indicates a p-latch flowing through during a low phase and latching at a high going edge. Both of these latches utilize the capacitance within the circuit layout to store the value of the latch and so must be cycled to refresh the information. There are techniques to modify these latches into pseudo-static latches so that this refresh is not necessary [19]. 78 S y -out n-logic p-logic n-latch n-logic ^ precharge n-latch p-latch (»&b)+c («M »)+ c out Figure 38: Dynamic Logic with TSPC Latches The resulting timing of the circuit is also shown above. In the first low phase the n- logic block is pre-charging. During the next high phase this logic block evaluates and the n-latch allows the value to pass through and settle at the next p-logic block. During this time the p-logic is in predischarge mode. When the next low phase occurs the p- logic block evaluates and the computed value is propagated through the p-latch. The dynamic latches hold the inputs to the alternating logic gates stable during evaluation. Every phase of the clock is fully utilized in this type of design making it an attractive implementation strategy for pipelined designs. 4.5 Outline of Pipeline With the underlining logic implementation now defined, this section introduces the phase timing diagram of the entire processor pipeline. This section serves as a central reference point for all of the circuit modules and complete pipeline timing. For the purposes of this chapter, one pipeline stage consists of one clock cycle or equivalently 79 two clock phases. The target phase delay time for this investigation is 2.S ns. This value was chosen because the time critical circuit modules of the data paths that have a similar delay value. The pipeline stages of this chapter are different from the abstract stages of the 16- Fold taxonomy. Those classification stages deal primarily with the instruction scheduling behavior of a stage not its implementation (Figure 23). It was also shown in Chapter 3 that one pipeline stage is roughly equivalent to a circuit module, except for the two stage cache memory module. 
The dynamic logic implementation, as described in the previous section, provides an efficient means of exploiting practically every phase of every cycle. This distributes work evenly over the entire pipeline. Because practically every phase is utilized, the clock frequency of the entire pipeline is determined by the maximum latency of any one phase. As an example, Figure 39 illustrates the timing of the pipeline of the SSDD processor architecture. Each phase has been numbered and is referenced in the following sections to assist in locating it in this diagram. Each of these phases will be described in later sections that explain the operation and implementation of the modules. The first three phases of the pipeline are for the instruction cache access. This requires an address decode (Dec), word line drive (Dr) and bit line read (Rd). The next phase is used to propagate or drive (Dr) the instruction to the data and control paths. Once an instruction is passed to the data path, the register specifiers can be decoded (Dec) for the register file access (Rd) during the next two phases. The next pipeline stage is the bypass stage. Since full bypassing is assumed for all of the processor configurations, the arrows on the bottom of the timing diagram indicate when operands are forwarded to the bypass stage. There are four stages that can bypass a temporary value back to the bypass unit. These consist of the execution stage, the two cache access stages and the reorder buffer stage. Bypass occurs during the eighth phase, leaving ample time to compare the operand specifiers and control the bypass busses from these sources. All results must be available for bypass during the next high phase to be written into the reservation stations. This is also true for the next three high phases (14, 16, and 18) because during this time the results still have not been written into the register file. The next two phases are for the reservation station stage. These consist of a phase that writes (Wr) the instructions and operands into the station. The unit then matches and dispatches (Disp) the instruction to the functional unit for execution when all of the operands are available.

Figure 39: Timing of Each Phase

The first phase of the execution stage (EX) is used to evaluate the adder for memory offsets or for other functional unit evaluations. During bypass phase 12, the memory address is driven to the data cache decoders for a sequence similar to the instruction cache timing. Three phases are again required to access the cache and a fourth is used to bypass the results to the data path during phase 16. The next two phases are for the reorder buffer. This buffer unit first writes the results from the execution stage into a priority queue in one phase. Then during phase 18 the data is available for bypass. The reorder buffer then writes the results at the head of the reorder buffer into the register file during phase 19. Once the results are written into the register file the corresponding instruction is considered retired.

4.6 Standardized Delay

The delay of each phase in the pipeline will ultimately be expressed in terms of micron-nanoseconds per phase (um-ns/phase). This metric takes into consideration both the fabrication technology and the pipeline structure of the processor. A value of 1 um-ns/phase means that a 1.0 um CMOS technology will require 1 nanosecond per phase of the clock.
Since there are two phases per clock period, this implies that the clock frequency is 500 MHz (i.e. a 2 ns clock period). If the CMOS technology is scaled down then the um-ns/phase value is correspondingly scaled down. This is a first-order approximation of the effect that technology scaling [17] has on a processor's clock frequency. Two categories of circuits are used to describe the delay: switching and signal. Delay is standardized to a gate model for switching delay and a lumped RC delay model for signal delay. This is done to simplify the analysis and enable modeling of different module configurations without extensive circuit-level simulation. Switching delay typifies the latency of combinational logic. This is the case for adders and memory decoders that have a large number of transistors tightly packed together with relatively short interconnections between them. To measure this type of delay, dynamic gates are used to model logic functions (i.e. AND, OR, and XOR). The delay of these gates varies, so different gates will be used depending upon the type of implementation (e.g. NOR-NOR or NAND-NAND). Two of these devices are shown below in Figure 40 with the size of the transistors indicated in lambda (λ). (Lambda is a length unit used in scalable CMOS designs. Its value depends upon the processing technology measured in microns; for example, the technology used in this study has λ = 0.5 micron.) Also shown is a design used for a staging latch in the data path and memory modules. Unlike the combinational logic modules, bus structures and memory lines are dominated by wire or signal propagation delay. These structures interconnect a number of input gates and output drivers distributed over a rather large area. They require a measure of both the gate and wiring capacitance to determine the delay time, rather than just the time it takes the individual gates to switch. To model this type of delay a standard buffer design is used for driving large capacitive loads.

Figure 40: Latch (a), AND (b) and OR (c) Designs

Driving a large capacitive load in a minimal time can be performed by staging a series of inverters, each with a larger-sized set of pull-up and pull-down transistors. This is shown in Figure 41. The ratio of the transistor sizes between stages is commonly referred to as the tapering factor (β) of the inverters. This tapering factor has been analytically derived as e [63], meaning that each successive inverter stage should be e times larger.

Figure 41: Four Stage Driver Design (δ = 0.33 ns delay per stage)

Recently, this has been reinvestigated for a specific CMOS technology [38]. The result of this study is that a tapering factor of 3.75 is actually best when considering the parasitic effects of the transistors. Since it is sometimes difficult to scale transistors to non-integer values, a value of β = 4 was chosen for the buffer design. As long as this tapering factor is maintained, an accurate measure of the total delay (τ) can be estimated for a given load capacitance (C_load) in terms of a standardized β-sized stage delay (δ) and the capacitance of a minimum-size inverter (C_g). This is expressed as:

τ = δ · log_β(C_load / C_g)    (4.1)

This formula is valid if the load is a power of β (i.e. β·C_g, β²·C_g, β³·C_g, etc.) and if the interconnect resistance (R_w) is small compared to the resistance of the driving transistor.
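As a concrete illustration of eq. 4.1, the sketch below computes the delay of a tapered buffer from the standardized stage delay. The stage delay (δ = 0.33 ns) and tapering factor (β = 4) are the values quoted above; the function name, the example load, and the minimum-inverter capacitance default are assumptions of mine used only for illustration.

```python
import math

def buffer_delay(c_load_ff: float, c_g_ff: float = 10.0,
                 delta_ns: float = 0.33, beta: float = 4.0) -> float:
    """Eq. 4.1: total delay of a beta-tapered inverter chain.

    tau = delta * log_beta(C_load / C_g), valid when C_load is a power of
    beta times C_g and the wire resistance is small versus the driver.
    c_g_ff (minimum inverter input capacitance) is a placeholder value.
    """
    stages = math.log(c_load_ff / c_g_ff, beta)
    return delta_ns * stages

# Driving a load 256x (= 4^4) the minimum inverter capacitance takes four
# stages, i.e. 4 * 0.33 ns = 1.32 ns in the 1.0 um technology.
print(buffer_delay(256 * 10.0))   # -> 1.32
```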
But when buffers are used to drive long wires interconnecting a large number of gate loads (e.g. the word lines of a cache), the interconnect resistance can be greater than that of the transistors. For example, in the 1.0 um technology used for this study, a 10,000 λ wire has a resistance of 170 Ω, while the 192 λ nMOS transistor in the figure above has a resistance of 125 Ω. Thus, to model the transmission line effects of the wires, the interconnect resistance is also included in the model. Starting with a lumped RC model of the transistor, wire and load, we can express the delay as in eq. 4.2. Distributing the capacitance variable gives the expression in eq. 4.3. The lumped RC delay can then be approximated as in eq. 4.4.

time = (R_g + R_w) · C_load    (4.2)
time = R_g·C_load + R_w·C_load    (4.3)
time ≈ (V_dd·C_load)/I_ds + R_cell·C_cell·n(n+1)/2    (4.4)

The first term of eq. 4.4 is calculated in terms of the supply voltage (V_dd) and the saturation current (I_ds) derived from the technology file. This is a function of the process technology and will change as the devices are scaled. The capacitance is again the load on the last-stage inverter. The second term of eq. 4.4 is a discrete version of the delay of a transmission line built from n cells, each with a characteristic cell resistance and capacitance. This form of the RC delay is used to calculate the final stage delay for large buffers driving highly capacitive, long wires. This discrete form follows naturally from the structure of memory arrays and highly loaded busses.

The characteristic delays of both the gates and the buffer are measured from CAzM [22] simulations for a 1.0 micron (drawn) technology. The details of the process technology can be found in Appendix B. Although this technology can scale to smaller dimensions, the basic circuit structures for this model will remain the same. No attempt is made to optimize the circuit topology or layout for any particular technology. The latencies of the characteristic devices are tabulated below:

Device       Delay (um-ns)
AND4         0.44  (average fan-in cost is 0.01 ns/input)
OR4          0.49  (average fan-in cost is 0.02 ns/input)
XOR2         0.44
LATCH        0.66
Buffer (δ)   0.33 per stage

Table 9: Delay of Characteristic Devices

4.7 Cache Memory

With the characteristic devices measured and the pipeline timing of the processor defined, this section describes the basic structure and operation of the on-chip, pipelined cache memory. This cache memory provides large storage capacity and high bandwidth, but at the expense of latency due to pipelining overhead. Chapter 3 describes the block diagram and architectural parameters for the cache model of this study. There are three parameters that will be varied to find an efficient cache organization. They consist of associativity (A), block size (B) and set size (S). Cache capacity (K) is the product of the three parameters (i.e. K = A × B × S). The trace-driven simulations of the previous chapter were run with capacities ranging from 4 KB to 256 KB. Given this range of cache capacity, a variety of organizations will be investigated in this section so that the overall cycle time is minimized, unless increases in cycle time can be compensated with a lower miss rate. Every phase of the clock is used by the cache for a specific task. This allows for a very high clock frequency. Unfortunately, it also means that the overall latency is higher than that of a single-cycle cache due to the latch overhead and pipeline constraints.
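The word line and bit line delays analyzed in the following subsections apply eq. 4.4: a saturation-current-limited driver term plus a discrete RC term for the line itself. The sketch below is a minimal illustration of that calculation; the supply voltage, saturation current, and example per-cell R/C values are placeholders of my own (the per-cell numbers are roughly the single-port values of Table 10 below), not values asserted by the dissertation.

```python
def line_delay(n_cells: int, r_cell_ohm: float, c_cell_ff: float,
               c_load_ff: float, vdd: float = 3.3, ids_ma: float = 5.0) -> float:
    """Eq. 4.4: delay (ns) of the final buffer stage driving a long line.

    First term:  V_dd * C_load / I_ds      (driver limited by saturation current)
    Second term: R_cell * C_cell * n(n+1)/2 (discrete RC transmission line made
                 of n identical cells hanging on the line)
    """
    driver_ns = vdd * (c_load_ff * 1e-3) / ids_ma              # fF*V/mA -> ns
    wire_ns = (r_cell_ohm * c_cell_ff * 1e-6) * n_cells * (n_cells + 1) / 2
    return driver_ns + wire_ns

# Example: a word line loaded by 128 single-port cells (~0.98 ohm, ~15.5 fF each).
print(line_delay(n_cells=128, r_cell_ohm=0.98, c_cell_ff=15.5,
                 c_load_ff=128 * 15.5))
```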
The four basic operations, shown in Figure 42, consist of a decode phase (1), word line drive (2), bit line drive or sensing (3) and a data drive to the data path (4). Two read operations are shown overlapping in execution, resulting in a read once per cycle. For reference, the numbers indicated in the drawing correspond to the numbered phases in Figure 39 for the instruction cache timing. However, the basic operations are the same for both the instruction and data caches.

Figure 42: Detailed Timing of Cache

Preceding this memory access are two phases that calculate the branch or memory offsets. During the first low phase, labeled ADD in the diagram, an offset adder is evaluated for the address generation. The resulting address is then driven during the next phase to the decoders, marking the beginning of the cache access. Since the address drive overlaps with the address generation phase, it is not counted in the two-cycle access of the cache. The SRAM cache is implemented with a standard six-transistor design. This means dual-rail lines (i.e. bit and its complement) are used for both read and write operations. Figure 43 shows the single-ported cache implementation used in this model. On the left side of the figure, log2(S) address lines are latched and driven to the decoders. These address lines are then latched and decoded during phase one of the timing diagram above. One row is selected, corresponding to a word line that is driven on the second phase.

Figure 43: Circuit Schematic of Cache

This enables access to a row, or cache block, by turning on the two access transistors of the memory cells. If the cycle is a write operation then the write drivers on top of the cache place the data value and its complement on the two bit lines. This forces the cross-coupled inverters to switch to the desired state. If the cycle is a read operation then the memory cell changes the bit lines. The current-sense amplifier detects this and latches the value by the following phase. The data read is then driven back into the data path during phase four. The physical effects of varying the architectural parameters are most evident in the word line and bit line capacitance (C_wdln and C_btln). This is because as we increase the block size (B), more memory cells are connected to each word line, increasing both its gate capacitance (C_gate) and wiring capacitance. Increasing the set size (S) has a similar effect on the bit lines because of diffusion capacitance (C_diff) and wiring capacitance. To a lesser extent, the capacitance within the memory cell at the cross-coupled inverters also increases with the number of ports (P), but this is negligible when compared to the interconnections. However, increasing the number of ports does increase the area of the memory array because of the wiring needed to get into and out of the memory cells. This can be seen in Figure 44 with a single and a dual port memory design. Each additional port adds two bit lines and a word line to the cell, doubling the wiring needed. A four port design has a similar increase over the dual port design. In addition, the size of the cross-coupled inverters must be increased so that a read disturbance is not allowed [95]. This requires a 2-fold increase in inverter size for the dual port design and a 4-fold increase in the quad-ported design. Fortunately, these transistors can often be placed underneath the wiring within the cell.
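A short sketch of how these per-cell contributions accumulate may help: it simply multiplies the per-cell word line and bit line loads (such as those tabulated in Table 10 below) by the number of cells each line crosses, so the effect of block size and set size on line capacitance can be compared. The helper name and the 8-bits-per-byte packing assumption are mine, used only for illustration.

```python
def line_capacitances(block_bytes: int, sets: int,
                      c_wdln_cell_ff: float, c_btln_cell_ff: float):
    """Total word line and bit line capacitance of one cache sub-array.

    Word line: one cell per bit in the block (block_bytes * 8 cells).
    Bit line:  one cell per set (row), since every row hangs an access
               transistor on the column.
    Per-cell values come from the SRAM cell characterization (Table 10).
    """
    c_wdln = block_bytes * 8 * c_wdln_cell_ff
    c_btln = sets * c_btln_cell_ff
    return c_wdln, c_btln

# Single-port cell (3.5 + 12.0 fF per cell on the word line, 1.7 + 7.0 fF on the
# bit line): a 32-byte block, 256-set organization.
print(line_capacitances(32, 256, 15.5, 8.7))   # -> (3968.0, 2227.2) fF
```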
The resulting layout of these cells is almost completely wire dominated. Appendix C catalogs the actual layout of these memory cells.

Figure 44: 1 and 2 Port SRAM Memory Cells

Table 10 summarizes the characteristics of these memory cells in terms of the cell area and the resistance and capacitance of the interconnect lines. It is evident from the table that cell area is linear with respect to the number of ports. This has a small effect on the wiring capacitance of the cells. For example, the bit line capacitance increases only slightly because of cell height. Similarly, the word line capacitance increases from the width expansion of the cells. In the table, the bit line and word line capacitance are separated into wire and gate loads because these devices might not scale linearly with the technology.

# Ports   Cell Area (λ, width x height)   C_wdln (fF, wire + gate)   R_wdln (Ω)   C_btln (fF, wire + gate)   R_btln (Ω)
1         28 x 40                         3.5 + 12.0                 0.98         1.7 + 7.0                  1.40
2         51 x 43                         7.2 + 12.0                 1.79         1.8 + 7.0                  1.51
4         65 x 76                         9.5 + 12.0                 2.28         3.2 + 7.0                  2.66

Table 10: SRAM Cell Characteristics

4.7.1 Arrangement of Cache Cells

The basic memory cells described above are arranged in a two-dimensional array to form the memory core. The basic layout and abutment strategy of the entire cache, including the core and the peripheral circuits, is shown below in Figure 45. The layout is similar to the schematic diagram of the cache in Figure 43. The address drivers are positioned in the lower right-hand corner. Address signals propagate from the drivers up to the decoders. These decoders connect to the word line drivers that select a row of memory within the core. Below the core is a port to memory consisting of write and read circuits. All of the cells are pitch-matched to the dimensions of the memory cell. In fact, the overall area of the cache basically depends upon the dimensions of the cross-coupled inverters along with the bit and word lines. This cell is measured in terms of its height and width. The height of the memory cell determines the minimum height of the decoders and word line drivers, while the memory cell's width determines the width of the read and write circuitry. The decoders and write drivers are free to expand in width to accommodate the small height pitch of the memory cell, while the read and write circuits can expand in height. Because of these constraints these peripheral circuits tend to be elongated in one dimension in the layout.

Figure 45: Memory Cell Layout

To extend this general layout strategy to the various cache configurations, the arrangement of the memory core and peripheral circuits must be modified. This is shown below in Figure 46. For example, there are at least three alternatives for constructing an 8 KB cache starting with a 4 KB cache. The first involves extending the four-word block size of a standard 4 KB cache to eight words. This is shown on the left of the figure. This requires the least modification because only the memory core is modified, along with the addition of more read and write circuits. The second option is to increase the set size, shown in the middle of the figure. This means retaining the same block size but extending the set size by a factor of two.
This modification requires both rearranging the memory core and increasing the decoders by one address line. The last option is to increase the associativity of the cache by two, effectively doubling its capacity. The arrangement in this model is to replicate the standard 4 KB module and connect the two modules together to an external bus. This is done to minimize the decode and word line delay.

Figure 46: Alternative Floor Plans for Cache Memory

These general layout strategies are used to calculate both the total area and the relative timing of the cache organizations. Given the memory cell dimensions from Table 10, along with the peripheral circuitry and the arrangement of the cells, an estimation can be made of the area. Similarly, the delay of the cache can be derived from Table 10 along with the characteristic delays of the devices in Table 9. The next sections describe how each phase of the cache is measured in terms of these characteristic delays.

4.7.2 Decoding Delay

The decoding delay involves selecting one row from a set (S). Logically this requires the implementation of a large fan-in AND gate for the decoder. In this model a two-level NOR-NAND structure is used, each level with a fan-in of four. Notice that this logic requires the address lines to the decoders to be in negative logic, which in turn requires an inverting latch for positive logic. Figure 47 illustrates the logic and circuit implementation of the decoders. The delay for this implementation is modeled as the sum of a NOR4 and a NAND4 delay. This results in an intrinsic delay of 0.93 ns.

Figure 47: Schematic of Decoders

The delay time for this design grows little as the set size (S) is increased. This is because any additional decoding done by the first-level NOR gates is evaluated in parallel with the other address lines. Eight additional lines (i.e. a 256-fold increase in set size) can easily be added to the decoders. The effect on the NAND gate is only two additional parallel pMOS transistors, highlighted in the figure above. From Table 9 this results in only a 0.04 ns increase in delay over a standard 256-set cache organization, because each additional input to an OR-structured gate with parallel transistors costs 0.02 ns.

4.7.3 Word Line Delay

The word line delay of the cache module depends upon the number of memory cells. Each memory cell contributes a capacitance and resistance to the word line as prescribed by Table 10. As the total capacitance of the word line increases, so does the time it takes to charge and discharge the line. Ideally, by increasing the number of stages, the time will grow logarithmically with capacitance; that is, if each additional inverter stage is β times larger than the previous stage. However, when transmission line effects are taken into account, the time it takes to drive a cache line grows linearly with block size. In addition, the relatively small height of the memory cells limits the inverter sizes that can fit in the decoder cells. Of the many options available, the left side of the graph in Figure 48 illustrates the time required for four driver designs. These drivers range from a three-stage buffer (Dr3) to a six-stage buffer (Dr6). A single memory port cell is assumed for a cache starting with 256 sets and a block size of 16 bytes.
The block size is increased up to 256 bytes while the set size remains the same. Thus the total capacity increases because of block size.

Figure 48: Wdln Delay vs. Block Size (um-ns/phase vs. block size; Drn = n-stage driver, Pn = n ports)

For drivers of four to six stages, the difference in time for accessing a block is actually quite small. The difference is in the area required for the buffers. Luckily, a five-stage driver can be pitch-matched to a memory cell with a width of 600 λ. A six-stage buffer will be roughly four times the area of a five-stage design, or equivalently 2,400 λ wide. This increase in width would displace 85 memory cells per set. Obviously, the better strategy is to use the five-stage driver because it is fast and small compared to the other designs. With a buffer driver selected at five stages, the graph on the right of Figure 48 shows how the port configurations affect the delay of the word line driving. For block sizes less than or equal to 32 bytes, the difference in delay is rather small. For example, for a block size of 16 bytes the difference in time between the configurations is less than 0.4 ns. Although these delays can affect the overall cycle time, the major difference is in block sizes of 64 and 128 bytes. In these cases the difference is nearly 4.8 ns and 14 ns, respectively. Thus, only a cache with a small block (16 or 32 bytes) can accommodate multiple ports without a significant penalty in increased cycle time.

4.7.4 Bit Line Sensing

The next phase of the pipelined cache is the memory read or write operation. Set size (S) affects this delay because of the corresponding increase in the number of access transistors on the bit line and the additional wiring needed for interconnection. Reading a large-capacity SRAM usually requires specialized sense amplifiers to reduce the delay. This is because a sense amplifier can detect a small voltage or current fluctuation on the differential bit lines during a read operation and therefore does not require full rail-to-rail swings. Voltage sense amp designs typically have a sensing time linearly proportional to the bit line capacitance [92]. Alternatively, current-mode sense amps offer lower latency over a large range of bit line capacitances. In fact, sensing time is practically constant for the bit line configurations considered in this study [8][75]. Fortunately, this delay is about 2.0 ns for this technology, well within the 2.5 ns target phase time.

4.7.5 Write Driving

Although reading the bit line is not a problem, writing into the memory cells is a potential problem. This is because to write a memory cell the bit lines must be driven to the power supply rails. This forces the memory cell to flip to the correct state. The writing delay is similar to the word line drive phase in that the bit line has about the same capacitance per memory cell as the word line. Figure 49 illustrates a graph similar to the word line driver case. A five-stage driver is used to propagate the signal on the bit lines. However, instead of increasing the block size, this figure relates increases in the set size to the delay time. Thus, to increase cache capacity, the number of rows is increased from 256 for a 4 KB cache to 1,024 for a 16 KB cache, keeping the block size fixed at 32 bytes.
With any larger set size configuration, the delay is so high that it makes no sense to consider a single cache organized with such a large set size. Instead, we will only consider set sizes in the range between 256 and 1,024 in the remainder of this investigation.

Figure 49: Btln Delay vs. Set Size (um-ns/phase vs. set size; Pn = n ports)

The results from the graph above indicate that adding one port does not significantly increase the access time with respect to bit line driving. This is because the height of the memory cell, and its associated bit line capacitance, does not change from a single- to a dual-port design. This is mainly an artifact of the layout of the memory cell. Adding two more ports does increase the height of the cell, so the access time does increase considerably for the larger set sizes. Irrespective of the number of memory ports, the access time for a set size greater than 1,024 is too high for consideration in the pipelined design.

4.7.6 Combined Cache Timing

With the delay of each stage of the cache now defined, we can calculate the overall phase delay of the cache for a given capacity. To do this we have to establish a simple heuristic to construct the cache. For each cache capacity, the organization with the minimal time is first selected. In case of a tie, the organization with the largest block size wins out. This is because the cache miss rate is usually lower for this organization and it requires the least amount of area. The cycle-time delay vs. cache capacity graph resulting from this heuristic is a rather jagged step function, illustrated below in Figure 50. The graph on the right is an expanded version of the one on the left for capacities ranging from 4 KB to 16 KB. The delay of the caches for any size greater than 32 KB becomes prohibitively large due to the transmission line effects of the word and bit lines. Thus, instead of increasing capacity with larger set and block sizes, a better strategy is to break the cache up into moderately sized blocks and interconnect these blocks to form larger caches.

Figure 50: Combined Cache Timing (um-ns/phase vs. capacity; Pn = n ports)

The block size is determined by finding the maximum block size for a cache whose cycle time is less than the critical cycle time in the pipeline, or roughly 2.5 ns. Although a 4 KB block easily fits this criterion, it requires too many 4 KB blocks to create the total capacity needed for most designs, which need 32 KB to 64 KB of total capacity. Thus an 8 KB block is chosen. This block size is selected because port configurations of P1, P2, and P4 are still within a few nanoseconds of the target cycle time. Also, from the previous trace simulations of Chapter 3, we know that four ports is the maximum number we will ever need. In a later section on bypassing, this 8 KB cache memory block will also be used in the analysis of the bypass unit delay. For now we will assume that the cache decode, word line drive, and bit line drive are set for this 8 KB block and will remain fixed. Only the number of blocks will be varied.
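The organization-selection heuristic just described is easy to state procedurally. The sketch below assumes a delay function built from the decode, word line, and bit line models of the preceding subsections (not reproduced here) and simply picks, for each capacity, the fastest organization, breaking ties toward the largest block size. The function names, the candidate ranges, and the calling convention are illustrative assumptions, not code from the dissertation.

```python
from itertools import product

def best_organization(capacity_bytes: int, ports: int, delay_fn):
    """Pick (associativity, block size, set size) for a given capacity using
    the chapter's heuristic: minimal phase delay first, ties broken toward the
    largest block size (lower miss rate, least area).

    delay_fn(assoc, block_bytes, sets, ports) -> um-ns/phase for one organization.
    """
    candidates = []
    for assoc, block in product((1, 2, 4), (16, 32, 64, 128)):
        sets = capacity_bytes // (assoc * block)
        # Keep only exact-capacity organizations within the 256-1,024 set range.
        if sets * assoc * block == capacity_bytes and 256 <= sets <= 1024:
            candidates.append((assoc, block, sets))
    # Sort key: delay ascending, then block size descending (negated).
    return min(candidates,
               key=lambda org: (delay_fn(*org, ports), -org[1]))
```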
4.7.7 Area Estimations

Given the construction of the cache from the previous section, along with the area of the memory and peripheral cells, we can now derive the total estimated area for the cache. The area is calculated for different block, set and associativity values. Again the range of caches is from 4 KB to 256 KB. Figure 51 plots the area, measured in millions of square lambda, and shows the area growth in terms of capacity for three of the port configurations: P1, P2 and P4. These estimations will be used in the global floor plans for the processor designs and give an approximation of the area consumed by the data and instruction cache units.

Figure 51: Cache Area (Mλ² vs. cache capacity for P1, P2 and P4; a reference line marks 1/16 of a chip)

4.8 Register File

The register file is accessed during the initial stages of instruction decode. The capacity of this storage unit is fixed at 32 words, but the number of read and write ports varies depending upon the issue width of the processor. The cost model assumes a 3-register-specifier instruction format (e.g. add r1, r2, r3) resulting from the load-store instruction set architecture [51]. Thus, in the worst case, every instruction decoded in parallel requires two read ports and one write port. Access is pipelined over two phases, with read and write operations alternating on opposite phases. Figure 52 shows the detailed timing of the register file. Unlike the cache memory, the register file is small enough in capacity to perform the address drive, decode and word line drive in one phase. The next phase is used for the memory access.

Figure 52: Detailed Timing of Register File

The register file also uses a slightly different memory cell than the cache to reduce the wiring needed for read ports. Figure 53 shows this memory cell with its cross-coupled inverters and an additional buffer inverter (B1). This small buffer decouples the cell from the bit lines during a read operation. This helps in two ways. First, the cross-coupled inverters do not have to be sized for read disturbances, making a write into the cell faster. Secondly, the buffer provides sufficient current-driving capability for multiple, single-rail read ports. This is in contrast with the cache memory cell, which requires a differential read operation to overcome the bit line capacitance. Thus additional register file read ports only require one more wire through the memory cell, reducing the overall area and wiring capacitance. The write ports still require dual-rail access to switch the memory cell. The overall construction of the register file is the same as the cache memory (Figure 45). Address decoders and word line drivers are pitch-matched to the height of the memory cells and are arranged on the sides of the memory core. At the bottom of the memory are the read and write circuitry, pitch-matched to the width of the memory cells. Thus area estimations for the register file are calculated in a similar manner as for the cache memory. The primitive cell characterization is also measured in a similar manner as the cache memory. This is provided below in Table 11, characterizing the area, resistance, and capacitance of a single bit of memory. A one-word or single-instruction-issue register file is included in the table because the Multiscalar or DSDD processor architecture uses this for a single-issue execution unit (i.e. W1).
Figure 53: Circuit Schematic of Register File

With a large number of read and write ports, the register file cell is as wire dominated as the cache memory cell. Just as with the cache memory, word lines and bit lines crowd the layout of the cells and ultimately determine the critical dimensions. This in turn determines the wiring capacitance and resistance of the lines. This is indicated in the table with separate values for the gate and wire capacitance.

# Words            Cell Area (λ)   C_wdln (fF, wire + gate)   R_wdln (Ω)   C_btln (fF, wire + gate)   R_btln (Ω)
1 Word (2R-1W)     82 x 83         6.7 + 6.0                  2.9          5.0 + 7.2                  2.9
2 Words (4R-2W)    102 x 112       9.2 + 6.0                  3.9          7.0 + 7.2                  3.6
4 Words (8R-4W)    178 x 175       14.7 + 6.0                 6.1          10.8 + 7.2                 6.2
8 Words (16R-8W)   358 x 324       24.0 + 6.0                 11.34        21.8 + 7.2                 12.5

Table 11: Register File Cell Characteristics

4.8.1 Decode and Drive Delay

For both a read and a write access to the register file, the first phase of operation is always dedicated to driving the address to the decoders, decoding the address, then driving the word line. The cost model estimates the time it takes to drive the address by summing the decoder gate input loads and the wiring capacitance. The worst-case gate load can be calculated from the largest AND gate input of Figure 40, and the address wiring can be determined from the register file height, which is set by the memory cells. The resulting drive time for the address is illustrated below in Figure 54.

Figure 54: Register File Decode Delay (um-ns/phase for 1W–8W configurations, split into address drive (Dr2), AND4 decode, and word line drive (Dr4))

The majority of the time spent in the first phase is dedicated to driving the word line of the register file. As the number of ports increases, only the word line drive is really affected by the increase in wiring stemming from the larger memory cells. For single-issue to quad-issue processor architectures the first phase takes less than 2.7 ns. Only the eight-issue architecture requires more time, at 3.25 ns.

4.8.2 Read and Write Delay

For the read and write access delay times, only the read delay is of major concern. The write delay time is below 2.5 ns per phase for all configurations of the register file, because each write port is allocated separate lines to the memory cell and the increase in height per cell affects the time only slightly. For the read delay, each additional port is connected to the same buffer inverter. In the worst case during a read operation this buffer inverter must drive the equivalent of 2W bit lines in parallel (Figure 55). This increases the read delay time unless larger-sized buffers are used within the memory cell. This is shown below in Figure 56, along with the write delay time, for the register file configurations with a driver of three stages. The buffer used in the 1W design is a three-stage driver (Rd-Dr3). This is the largest driver that can fit within the memory cell without expanding the pitch-match dimensions. If this size buffer is used throughout the configurations, then the delay can grow to as much as 6 ns for a read operation. On the other hand, for larger register file cells (e.g. 4W and 8W) a buffer size of four stages can be used because there is space underneath the bit lines for larger transistors.
Figure 55: Reading from Multiple Ports

The delay times for these devices are also shown below, illustrating that up to eight instructions can be independently read from the register file and still be within the time it takes to decode and drive the word lines (i.e. 3.25 ns).

Figure 56: Register File Read Delay (um-ns/phase for 1W–8W configurations; Rd = read, Wr = write, Drn = n-stage driver)

It should be noted that a simple voltage sense amp (i.e. a ratioed inverter) could be used to reduce the read delay time slightly. By precharging the bit lines and setting the ratio of the n and p transistors of an inverter, the bit line can quickly sense a low-going signal. However, this technique is not considered in this study because, with the resized buffer, the register file read delay is not in the critical timing path.

4.8.3 Register File Area

The area estimations for different register file configurations are shown below in Figure 57. The total area is indicated on the right side with the drawn line. On the left side the aspect ratio can be derived from the two bars for the height and width of the register file. The rectangular aspect ratio is maintained throughout the configurations, helping with the overall layout of the data path. This is because the width of the register file is pitch-matched to the functional units and reservation stations. As more instructions are issued from the register file, the width increases, leaving more room for the routing of bypass busses within the data path. This will be illustrated in a later section on the bypass unit.

Figure 57: Register File Area (height, width, and total area for 1W–8W configurations)

4.9 Reservation Stations

The reservation stations of the SSDD processor are relatively small, with fewer than eight elements each, and are derived from a node table implementation [91]. The basic operations follow those of the register file. Read and write operations alternate on opposite phases. The write operation is indicated in Figure 58. A precharge of the match circuitry occurs in the first phase (8). Tag matching and a write into the cell occur during the next phase (9). We can accomplish these latter two operations in one phase because the number of cells is limited to eight. This limit is used because the trace-driven simulations indicate that having more than eight reservation station elements is not very effective.

Figure 58: Reservation Station Timing

During the next phase (10) the control unit of the reservation station can read from the status bits and select one of the elements to dispatch an instruction to the functional unit. The status bits are not contained in the data reservation stations. Rather, they are maintained in a small independent store that allows simultaneous access to all of the status bits to determine when an entry is ready. An element is ready to execute when all of its operands have been written into the slots of an entry. The logical format of a reservation station entry is provided below. Since each instruction is dispatched to a specialized functional or memory unit, no opcode is needed. The v-bit of an entry defines when an entry is valid. The unique identifier (id) is used for matching the results in the reorder buffer. The dtag field identifies the destination register value.
The two ready bits (r1 & r2) are used to indicate when both of the operands have arrived for execution. In the two data fields, each source specifier has a slot for a 32-bit operand and a register tag value. The entry is 99 bits wide, with the control portion in bits 98-80 and the data portion in bits 79-0:

v (1) | id (8) | dtag (8) | r1 (1) | r2 (1) | tag1 (8) | src1 (32) | tag2 (8) | src2 (32)

Figure 59: Reservation Station Entry Format

The major reason for reservation station latency comes from the associative circuitry that matches the tags of the register specifiers. This operation selects the data to be written into the reservation station. This is indicated in Figure 60 with the match circuit. Each bypass bus requires an input port to the reservation station. Thus the number of functional units plays a major role in determining the number of ports to a reservation station and increases the complexity of matching the register specifiers (tags) associated with the bypass buses. This is because the model assumes each functional unit has its own bypass bus that communicates with every other functional unit (i.e. full bypassing). The associative matching time is no more than a gate delay, since the match line is a wired-OR and the tag bits are matched in parallel. The word line drive delay time is similar to the register file, because as the issue width of the execution unit increases, the memory cell write ports also increase.

Figure 60: Reservation Station Circuits

For the purposes of this model the memory cell dimensions, resistance, and capacitance characteristics are the same as the register file. The big difference is that the write time is very short, because the bit line load is at most half that of the register file (i.e. the reservation stations are shallow). Thus the timing is only 0.44 ns for the match, 0.50 ns for the word line drive and 0.50 ns for the write drive delay. The result is 1.40 ns for a write into the reservation station. The read operation requires a simple logical AND operation of the ready bits to pick a ready instruction. A priority circuit can also be implemented in two levels of logic to pick the oldest instruction to execute. This two-level logic should be no more than 0.90 ns. The word line drive in the reservation station is again 0.50 ns for this small buffer, and the read access is half the time it takes for the register file in the worst case, or 0.83 ns. The total read delay is then 2.23 ns for an eight-element reservation station. Since the basic reservation station memory cell is essentially the same as the register file cell, the area of the reservation station will be calculated in a similar manner as the register file. The only additional area results from the control logic. This logic is located on the side of the reservation station and should occupy very little space in comparison to the other modules, so it will not be considered in the model.

4.10 Functional Units

The adder is the most critical functional unit module of the data path. This is because it includes a long carry chain that must be evaluated in one phase. The adder design in this study utilizes a carry reduction technique [18]. This technique reduces the overall evaluation time of an arbitrarily sized adder by reducing the carry circuit's inputs, by two in this instance. This has two benefits. First, the carry look-ahead (CLA) circuit can be halved in area and the gate delay of the CLA is reduced by one.
By halving the area needed for the CLA there is an additional benefit of shorter communication lines for the propagate (p), generate (g) and carry (c) signals.

Figure 61: Reduction of Carry Logic

The technique involves a sequence of transformations that can be derived mathematically, but instead we will illustrate the technique in Figure 61. Starting with a simple carry chain implementation of an adder, the first transformation involves taking the dual function of every other bit of the adder. This is indicated with the arrows. This results in an equivalent implementation of the carry logic. From this implementation, alternating AND and OR gates of the carry logic can be combined, as indicated in the last row of the figure. This reduces the number of inputs to the carry generation logic. From this implementation, standard carry look-ahead techniques can be utilized to generate all carry bits of the adder for sum generation. An important circuit technique used in the adder design is multiple-output domino logic (MODL) [43]. This circuit technique allows shared logic sub-functions to be used as outputs from the domino logic. This sharing reduces the circuitry needed to produce the functions, resulting in high-speed designs. This can be applied to the carry look-ahead logic. Figure 62 illustrates a CLA implementation in a multiple-output dynamic logic style. As far as the cost model of this chapter is concerned, this circuit has the same delay as an AND5, since it has a 5-input AND gate embedded in it (i.e. five transistors in series).

Figure 62: CLA in MODL

Along with this CLA and the characteristic cells from the previous section, we can estimate the critical path delay of the adder. This includes one gate delay to generate the propagate and generate signals, three CLA delays to generate all of the carry signals of the adder, and a final XOR gate to calculate the sum bits. This results in roughly five gate delays, each modeled as an AND5 delay. This equals an evaluation time of 2.25 ns per phase.

4.11 Reorder Buffer

The reorder buffer is similar to the reservation station in that both temporarily hold instructions in a priority queue structure. Unlike the reservation station, the reorder buffer holds instructions that have completed execution but are awaiting the retirement of instructions preceding them in program order. The basic organization of the reorder buffer was first described in Chapter 3. In this section its circuit implementation is considered. Two complexities arise from the operations of the reorder buffer. First, within the buffer itself, matching a data result to its corresponding entry requires an associative match similar to that of the reservation station. Secondly, the data results within the buffer must be bypassed to the data path. This requires parallel access to all of the entries of the buffer. The format of a buffer entry is shown below in Figure 63. The identifier (id) is used to index into the buffer when an instruction arrives. The v-bit indicates when an entry is valid. The e-bit is used to denote an instruction that caused an exception during execution and flags the trap unit to start processing the exception. The dtag field is the register destination. Since register renaming is assumed in these processor architectures, this field can accommodate up to 256 register names.
The next two fields hold the data value waiting to be written into the register file along with the program counter (pc) value of the instruction. The latter is used to process the cause of an exception.

id (8) | v (1) | e (1) | dtag (8) | data (32) | pc (32)

Figure 63: Buffer Entry Format

When an instruction is decoded it is allocated an entry in the reorder buffer. This requires the program counter and identifier to be written into the buffer. After the instruction finishes execution it is presented to the reorder buffer. The identifier is used to index into the buffer, and the result is written into the entry with the corresponding program counter value. Valid instructions that have their results and are at the head of the buffer are retired. The contents of the buffer are then shifted, making room for additional instructions at the tail.

Figure 64: Reorder Buffer Entries

The circuit implementation of the reorder buffer uses cells common to the reservation stations. For example, the identifier (id) field is the same as the tag portion of the reservation station. The program counter field is merely a shift register. The difference is that each entry of the data field connects to a common result bus. The identifier field is searched for the entry with the oldest value, meaning that it should be retired. The corresponding data value is then written to the register file in the next cycle. The timing of this module can be calculated in terms of the standard cache and register file timing. In the worst case, the depth of the buffer is 64 entries. This capacity was chosen because increasing the capacity over 64 entries provides very little performance gain. For a write operation, the decode is an associative search that selects the entry to be written. This search has been estimated with a worst-case delay of two logic delays, or roughly 0.88 ns. The word line drive must also be done in this phase, but for only 32 bits (i.e. one data word). This will take less than 0.50 ns. The next portion of the write phase is allocated to driving the data value onto the result bus and writing into the buffer. For a 64-entry buffer this takes 0.99 ns. Adding these three operations results in a write phase time of 2.37 ns.

Figure 65: Reorder Buffer Timing

For a read operation, the entries are read from the head of the reorder buffer, requiring only a check of the valid and exception bits. The data is shifted from the head of the queue during this read time. This operation requires less than a gate delay because the time required is simply the clock-to-output delay of a register. The register file write operation requires a little more time because a data value must be written into the register file. However, this was described in the previous section and is not in the critical path of the register file timing for large multi-ported register files.

4.12 Bypass Unit

The last module to model is the bypass unit. In any pipelined processor architecture, instructions are in various states of execution. Although some instructions may be finished with the execution stage, their results may not have been written into the register file.
This implies that instructions currently reading from the register file during decode may not get the most recent version. The bypass unit solves this problem by selecting the most recent version of a data value from the execution units within the data path. To implement this bypass operation, the first three phases of instruction decode are allocated to comparing the source and destination register specifiers within the data path to determine the most recent version (phases 5, 6 and 7 of Figure 39). The bypass unit is really a collection of busses and drivers connected to the functional units, the register file, the reorder unit and the cache memory. To calculate the bypass delay we must consider the overall floorplan of the data path that interconnects all of the modules. Figure 66 shows a simplified floor plan for the SSDD processor architecture.

Figure 66: Bypass Bus Within Data Path

This figure illustrates how bypassing can be implemented with a bus structure. First, the instruction cache sends the register specifiers of an instruction to the register file for decoding. Normally, the data values from the register file are placed onto the operand busses (T1 & T2). However, if the most recent version of the register is currently being written into the reorder buffer, then the functional unit places the appropriate value on one of the two operand busses. This data value is then forwarded to the appropriate functional unit. The data cache unit is another potential source of operands, so it too must be able to place a value onto the bypass bus. The time it takes a value to settle on one of the operand busses is considered the critical time of the bypass operation. This can be calculated in a similar manner as the word or bit line delay of the cache memory. However, unlike the word or bit lines, each potential source to the operand bus couples a large driver to it. This driver has considerable diffusion capacitance because a four-stage driver is needed to propagate the operands the length of the bus. We assume a precharge-and-evaluate (i.e. dynamic) bus, which helps reduce some of this diffusion capacitance because only the n-transistor needs to be connected to the bus, along with one small precharge p-transistor. The wiring characteristics of the bus also come into consideration, for the same reason that the cache word or bit lines must be modeled. To calculate the length of the bus wire we assume that the data path is constructed in the form illustrated above in Figure 66. In this cost model, operand bus-throughs are provided within the circuit modules. Since these units are arranged in a linear pattern, the length of the bus can be determined by summing the lengths of all of the functional units, the reorder buffer and the memory drivers. The lengths of these modules are provided in Table 12.

Module             Width (λ)                Description
ALU                3,380 + 2*W drivers      Adder, barrel shifter, logic unit, 4 staging latches, and 2*W drivers
Memory Unit        1,300 + 2*W drivers      Adder and 2*W drivers
Reservation St.    450, 680, 1,300          Four-entry reservation station used for the dynamic execution stage at W2, W4, and W8 widths
Reorder Buffer     7,500                    Sixty-four entry reorder buffer used for the dynamic retirement stage
Adder              1,300                    32-bit adder
Barrel Shifter     1,300                    32-bit full barrel shifter
Logic Unit         300                      Logical AND, OR, XOR operations
Staging Latch      120                      Simple non-inverting latch
Driver             250                      Four-stage dynamic buffer driver

Table 12: Width of VLSI Modules
(All modules are 5,500 λ in height and have at least 7 operand bus-throughs.)

To simplify the analysis, some of the modules are grouped together to form larger units. For example, the memory unit requires an offset adder and drivers for operand bypass, but not a barrel shifter or logic unit. For the SSDD processor a functional unit consists of either an ALU or a memory unit along with a reservation station. The functional unit configuration of the processor can affect the bypass delay time both by adding to the driver output load and by increasing the wiring due to a longer data path. For the driver's load the culprit is the diffusion capacitance of each driver. This can be easily estimated for a four-stage driver with the minimum dimensions of the last buffer inverter stage, determined from Figure 41. The CMOS technology file then determines the area and fringe capacitance. The result is that each driver adds 200 fF to the operand bus, corresponding to one functional unit's load contribution to the bus. The wiring delay, resulting from the length of the operand bus, can be calculated if assumptions are made about the number of functional and memory units included in the data path. For this study, the issue width determines the number of units. We have left the execution of instructions unconstrained, meaning that for a W-wide processor architecture there exist W ALUs, load units, and store units. This is an excessive number of units, as was pointed out in Chapter 3, so we are taking a conservative view with regard to the execution stage. Based upon these assumptions about the number of functional and memory units for each processor, Table 13 summarizes the delay time allocated to the bypass stage for a single-ported 32 KB cache.

Processor   W2 delay, (H x W)   W4 delay, (H x W)   W8 delay, (H x W)
SSSS        2.61, (23 x 10)     3.87, (53 x 16)     8.84, (129 x 30)
SSDD        2.79, (26 x 10)     4.12, (54 x 16)     9.43, (138 x 30)

Table 13: Data Path Delay and Dimensions (delays in um-ns/phase; height x width in Kλ)

The difference between the SSSS and SSDD processors stems only from the height of the data paths. The SSSS processor does not require a reorder unit or reservation stations, so it is slightly shorter and therefore faster than the SSDD configuration. Otherwise the same number of units is assumed for each configuration. The dimensions of the data paths are also summarized in Table 13. The height is calculated by summing the heights of the functional units, memory units and reorder unit. The width is determined by the register file. It is assumed that the dimensions of the register file memory cell set the critical pitch width.

4.13 Summary of Circuit Modules

The three circuit modules that change the most with the architectural parameters are the register file, bypass unit and cache. The cache delay primarily depends on the capacity and the number of ports to the cache. To reduce the overall delay of the cache, we partitioned it into blocks, each of 8 KB, and interconnected them to the data busses for the bypass and retirement units. This minimizes the latency of large multi-ported caches. It was also shown that the bypass unit and register file delays depend upon the issue width of an execution unit. We can compare the growth rates of these units to determine the time-critical module for a particular processor architecture configuration. Figure 67 illustrates the timing of these three modules for a variety of processor issue widths and numbers of execution units.
The cache is fixed at a 32 KB capacity implemented with 8 KB blocks. For the single control flow architectures only the SSDD is shown (i.e. WnE1). For a simple single-issue execution unit (i.e. W1E1) the timing of the modules is relatively matched, with each module estimated to have a cycle time of between 4 and 5 ns. But as the number of execution units is increased (i.e. W1E2 to W1E8) the cache delay starts to dominate the cycle time, because a multi-ported cache is needed to interconnect the execution units. This situation represents a decoupled architecture growing in the number of execution units. For a dual-issue processor (i.e. W2E1) the bypass delay is the critical delay, because the internal bus structure for bypassing grows with word width. But at four execution units (i.e. W2E4) we transition to a situation where the cache access again is the critical path. The same is true for the four-issue processor (i.e. W4E1), except that the transition point is at eight execution units instead of four. In the case of the eight-issue processor the bypass delay dominates all execution unit configurations. This is because of the internal complexity of forwarding an operand between more functional and memory units.

Figure 67: Timing of Circuit Modules (cycle time in um-ns for the register file, cache, and bypass unit; Wn = n words, En = n execution units)

4.14 Die Area Estimation

This section combines the cache and data path models to calculate the total silicon area of a processor configuration. Although this area estimation lacks the control path and external interface circuitry, it can be used to judge the costs of a particular design. This section also bounds chip area to investigate the trade-offs involved with limited chip area. Since a single-chip implementation is assumed, we start by estimating what a state-of-the-art silicon technology will be within the next five years. Based upon historical trends, from Figure 1, we can infer that a chip or die size of 2 cm per side is possible. (An experimental DRAM chip has been fabricated that measures 4 cm² in area, or equivalently 2 cm per side [55].) The size of a MOSFET can also be inferred from the figure. This is about 0.25 microns by the year 2000. Unfortunately, the circuit library used in this investigation is built from a 1.0 micron feature size process technology. However, based upon scaling theory [17] we can accommodate the future technology by allowing for an artificially large die in present device sizes. Such a die will shrink as defined by the scaling factor (k), calculated as the ratio of the current and the future technology (i.e. k = 1.0 um / 0.25 um = 4). Thus in terms of the present CMOS technology used in the investigation, the die will be approximately 8.0 cm per side. In terms of lambda measurements, this equals 160,000 λ per side. Since lambda scales with technology, this latter measure will not change.

Figure 68: Future Technology (a) Scaled to Current Technology (b)

With this die area estimation, we can now fit the data paths and caches of the processor architectures onto a single die. As a first-order approximation, the data path outlined in this chapter is used as a rough approximation for the control path. Figure 69 illustrates four, four-wide execution units (i.e. W4E4), each with a 16 KB instruction cache and a 32 KB, quad-ported data cache.
The data cache module includes routing channels for the data and address busses to and from the cache and the data paths.

Figure 69: Floor plan of W4E4 Architecture (four W4 data paths with their control sections surrounding a central 32KB, quad-ported data cache, on a 160,000 λ per side die)

From this figure we can estimate that a four execution unit processor will fit on a single die in a future 0.25 μm CMOS technology. There is even some room left over for external interface circuitry and perhaps memory management hardware. But it does not seem likely that a second-level cache of any great size can be accommodated on the die, unless the central cache is reduced in capacity. Similarly, two eight-word execution units (i.e. W8E2) can fit on a single die, as can eight dual-issue units (i.e. W2E8). However, the total areas of these designs are different because the data paths grow at slightly different rates due to internal bypassing. Also, additional execution units require additional ports to the central data cache. This causes additional growth in the data cache because of the extra decoders and read/write circuits, as well as the additional drivers and wiring to the data paths.

4.15 Conclusions

This chapter outlines an area-time cost model based upon a library of CMOS cells. These cells are used in the construction of larger circuit modules that implement the functions of pipelined microparallel processor architectures. The circuit modules that influence the overall cycle time of the processor are shown to be either the bypass unit or the cache memory, depending upon the configuration of the processor's issue width and number of execution units. The other stages do change as a function of the architectural parameters, but to a lesser extent.

Another result from the cost model is that future silicon technologies (e.g. 0.25 μm) will increase the effective area of a single die enough to accommodate at least four quad-issue execution units. This is significant because performance increases with the number of execution units, so we will want to place as many computational units on-chip as the central cache memory can sustain.

Although the cost model takes into account many aspects of the operation of the circuit modules, there are inevitably aspects of an actual implementation that it does not model. Some of these discrepancies will be covered in the next chapter, where we compare the cost model estimations with actual circuit designs.

Chapter 5: Validation of Area-Time Model

5.1 Introduction

This chapter validates the integrated circuit cost model presented in Chapter 4. That model describes a processor in terms of the circuit modules within each pipeline stage. The timing of each module is described in terms of the resistance and capacitance of primitive cells, the logic implementation, and the driving capabilities of the CMOS buffers. Although the circuit model is based upon simulations of characteristic devices, a form of verification is needed to validate the assumptions of the model along with the results for large modules.

There are at least two means of validating a circuit model. One is to fabricate the devices being modeled and measure the actual response. This is an extremely time-consuming and costly proposition considering all of the configurations under investigation. Another option is to use circuit-level simulations. These simulations are traditionally used to analyze and optimize the actual layout of devices.
They can be quite accurate for small circuit topologies, but depend upon faithful extraction of the physical properties of the layout. This study adopts an approach which uses both simulation and fabrication to tune the model. First, circuit structures modeled as crucial to performance (i.e. in the critical path) are simulated as complete modules. A select set of these circuits, or components of the modules, are then fabricated and measured to validate the circuit simulations. The circuit-level simulator is CAzM [22], which uses a netlist extracted from the layout. The extraction process determines first-order effects from area capacitance along with second-order parasitics stemming from fringe effects [95]. Unfortunately, for the large modules (e.g. the cache module) the resistance is not extracted, as this would increase the simulation time considerably if all nodes in a layout were extracted.

The circuits used in this validation are derived from a full-custom layout design of a superpipelined processor [83] and are slightly different implementations from the ones presented in the cost model. Although these differences are minor, they do affect the timing of the modules. They are presented here to give insight into how circuit topologies can be optimized to reduce delay.

5.2 Chapter Outline

This chapter is partitioned by the modules described within the processor circuit model of the previous chapter. These consist of the cache memory, register file, and functional unit. Each section includes a description of the module along with a circuit schematic and timing simulation. The layouts of some of these designs have been collected in Appendix C. The last section concludes with results from the fabricated test structures and describes how these results affect the cost-performance model.

5.3 Cache Memory

This section presents circuit simulations of a data cache design. It is a single-port, direct-mapped, 4Kbyte cache with a block size of 4 words and a set size of 256 [54]. This corresponds to A = 1, B = 4, S = 256, P = 1, and K = 4K in terms of the cache architectural parameters. There are five results from the cost model to verify: the address drive, row decode, word line drive, read access, and write access. Measurements are gathered from the simulation of the entire cache module.

5.4 Decode and Word Line Drive

The circuits of the address decoder and word line driver are shown below in Figure 70. The decoder and driver are the same as specified in the circuit model with the exception of the latches. The driver is implemented in a slightly different manner than described in the model: the five-stage buffer is spread out over both sides of the n-latch. This effectively reuses the n-latch as one of the buffer stages and reduces the latch overhead.

Figure 70: Decoder and Word Line Driver (schematic: NAND/NOR decode stages, p-latch and n-latch, and the word line driver buffer chain)

The timing of the circuit in the 4Kbyte cache is shown below in Figure 71. Although the address drive (a) of the cache was not modeled, it is indicated here for reference. This address drive time is not in the critical path because its delay is always well below the bit line read or write delay; it has half the number of gate loads per address line as the bit line. The decode delay (b) is surprisingly long when compared to the model results.
This is because the model only describes the intrinsic delay of the decode logic, and not the latch and clock overhead or the difference in implementation. The timing shown below was monitored at the output of the p-latch (rsel). This latch causes a small delay of 0.4 ns. The clock overhead comes from the fact that the decoder uses the negative version of the clock, which is 0.3 ns behind the one shown in the figure. Another difference comes from the fact that the second level of the decoder is implemented in p-logic, which is three times as slow as n-logic for the same sized transistors. This means the second gate is really 0.99 ns. Including these overheads, the modeled time is 2.04 ns, which is still slightly faster than the observed 2.2 ns.

The driver delay for a five-stage driver and a four-word block should be 1.60 ns. The actual implementation shows 1.9 ns, but this too can be explained by the implementation. Instead of having a p/n latch followed by a five-stage driver, the implementation spreads the buffer drive over two phases, cycle-stealing time from the decoder phase. This means the driving of the buffers is initiated before the rising edge of the p-latch. This can be done if the buffer (logic) is fully static.

Figure 71: Timing Simulation for Decoding (CLK, address, decode, and word line waveforms; the observed decode delay is 2.2 ns)

As long as the word line value is stable at the n-latch by the falling edge of the drive phase, the word line drive can actually take longer than one phase of the clock. The timing constraint can then be stated as Ddecode + Dwdln < Cycle-Time instead of the more limited constraints of Ddecode < Phase and Dwdln < Phase. This technique can help for relatively short drive times, but it cannot compensate for delays in the tens of nanoseconds, as with block sizes of 128 bytes or more.

5.5 Write and Read Timing

The access times of a read and a write operation are provided below in Figure 72. The first cycle is a memory write operation. During the second (low) phase of the clock, the memory cell is selected and written. Driving the bit lines (a) takes approximately 2.0 ns before the memory cells are affected. After this, it takes a little less than 0.3 ns for the cell to flip to the desired state (b). The cost model estimates the bit line write operation for this design as 1.73 ns, slightly faster than the real design. The time it takes to flip the cell is not modeled, but can easily be incorporated. This cell delay does increase with additional ports to the cache memory cell because of the larger cross-coupled inverters required; for a first-order analysis, however, the 0.3 ns can suffice.

Figure 72: Timing Simulation for Memory Access (write, read, and read-disturb waveforms)

The read operation is also shown above, two cycles after the write operation. Again, during the low phase of the clock, the memory cell is selected and the access transistors start to pull one of the precharged bit lines low. It takes 1.2 ns before any observable voltage change occurs on the bit line (c). Complete sensing takes an additional 1.3 ns before the output of the sense amp switches to a full logic zero (d). The cost model does not model these operations separately. Instead, we assumed that a current sense amp can detect a memory cell value within 2.0 ns for this bit line configuration.
This implementation takes 0.3 ns more because it is slightly different from the references cited in the cost model.

An important side effect of a read operation is a read disturbance, which is also measured in Figure 72 (e). This occurs when the word line turns on the access transistors and the memory cell is exposed to the precharged bit lines (i.e. both are charged high). This means one of the cross-coupled inverters is momentarily fighting the stored charge of the bit lines. Since these inverters are sized correctly, they can overcome this disturbance of the memory cell, and ultimately it settles back to the original state. If the memory cell is not sized correctly, the cell will actually flip and the information will be lost.

5.5.1 Cache Layout

The basic layout of the cache follows the description of the previous chapter except for an interface to the external memory. In the actual design, a duplicate read and write port is added in the block diagram of the layout shown below in Figure 73. This extra port appears at the top of the illustration. The extra port reduces global routing congestion because the 16-byte-wide data bus does not have to traverse the length of the cache to get to the external bus. The area component of the cost model does not include this extra port; only the cache core area is calculated. Also not included in the model's estimation are the miss logic and power distribution lines. The miss logic consists of a small set of circuitry to compare the tags of the cache line and the target address, along with logic to determine cache state. This is a rather small block shown in the lower left portion of the figure. The power distribution is also shown to the left of the cache core. This overhead can be deleted depending upon the number of metal layers included in the process technology. For this design, only two metal layers were available and they were allocated for power and signal distribution, so the layout includes this overhead. For other designs which can use an alternate layer for power distribution, this overhead can be eliminated.

The internal cache core is roughly 13,000 λ x 5,900 λ. The overall area of the layout is 9,400 λ x 16,000 λ. The cost model area estimation for this design is 11,400 λ x 5,800 λ. The discrepancy between the layout and the model arises from the fact that power distribution is incorporated in the core, along with a precharge cell, which adds an overhead to the area of the design.

Figure 73: Cache Layout (9,400 λ x 16,000 λ overall; an external bus port to the second-level cache at the top, the memory array with decoders, the miss logic and address drivers, and the internal read/write ports to the data path at the bottom)

5.5.2 Register File

This section presents circuit simulations of a 32-bit by 32-word register file. The basic organization and implementation are the same as in the model. The timing diagram shown in Figure 74 is a register write operation. In this design, the address drive, word line decode, and word line drive are all part of the first phase (a) and take 2.13 ns. Also during the first phase, the bit lines are precharged. Then, during the next phase, the bit line is evaluated, which takes 1.20 ns. The model estimates the first phase to be 2.30 ns, which is slightly conservative in this case.
But for the bit line, the actual implementation is slightly slower: the model predicts this drive time to be 1.14 ns. Another discrepancy is the time it takes to flip the memory cell. This time is measured at 0.42 ns and corresponds to the delay from the bit lines through the access transistors to the register file memory cell.

Figure 74: Register File Write (CLK, decode, word line, bit line, and cell waveforms; 2.13 ns decode/word line phase, 1.20 ns bit line drive, 0.42 ns cell flip)

For the read operation, the measured time of the word line drive is the same as for the write operation, though the bit line timing is different. The measurement of this time is 1.30 ns. The model estimates this time to be 1.63 ns, which is a very conservative estimation.

Figure 75: Register File Read (CLK, decode, word line, and bit line waveforms; 2.13 ns decode/word line phase)

5.5.3 Register File Layout

The estimations for the layout of the register file turn out to be very optimistic when compared to the overall area of the actual design. There is a discrepancy for similar reasons as with the cache memory. The model only uses primitive cells as a means of estimating the dimensions of the entire design, but there are considerable overheads for power and ground routing in the register file. The large power supply busses help to reduce the IR drop on the lines and also serve as an area underneath which to place bypass capacitors. The model estimates the dimensions of this design to be 2,200 λ x 6,500 λ. The actual memory core is 2,500 λ x 7,800 λ. The overall dimensions are indicated in Figure 76.

5.6 Adder

The measurements of the previous sections describe memory modules of various configurations. This section describes the implementation of a high-speed adder design. The basic design follows the carry reduction strategy described in the previous chapter. Unfortunately, the details of this implementation are too involved for the scope of this investigation. However, we can use the measurements of the design as an upper bound on the evaluation time. The time to generate a carry-out bit for this adder design is 3.1 ns, which will be used as the critical path of the evaluation time.

Figure 76: Register File Layout (memory cell array with word line drivers, address drivers, and read/write circuits)

5.7 Fabricated Test Structures

A test structure was fabricated in the 1.0 μm CMOS technology to measure the accuracy of the circuit-level simulator. The test structure includes a current sense amp and the adder design described in this chapter. The current sense amp is a small, clocked, analog circuit which tests the signal response of an amplifier design connected to varying capacitance loads. This setup is a lumped-capacitance version of the bit lines. Since the sense amp is relatively immune to bit line resistance, this setup will suffice for the real bit line configurations. The adder is a full 32-bit adder with staging latches on the input and output to stage data values. A clock pulse is provided to the adder, which evaluates the addition and latches the result at the end of the clock pulse. The results for the sense amplifier and adder are provided below along with the CAzM measurements. The times indicated are the cycle times of the test circuit.

Table 14: Results of 1.0 μm CMOS Test Structure

Circuit            CAzM (ns)   Measured (ns)   Setup
Sense Amplifier    4.0         3.2             1.0 pF load
32-Bit Adder       3.1         1.9             Generating carry out

The result of the test structures is that CAzM is conservative for these circuit topologies.
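To put the comparison in Table 14 in perspective, a minimal sketch of the overestimate implied by those two rows (the numbers are taken directly from the table above):

    # Ratio of simulated to measured cycle time for the fabricated test structures
    # (values from Table 14).
    results = {
        "sense amplifier": (4.0, 3.2),   # (CAzM ns, measured ns)
        "32-bit adder":    (3.1, 1.9),
    }
    for name, (simulated, measured) in results.items():
        print(f"{name}: CAzM is {simulated / measured:.2f}x the silicon measurement")
    # sense amplifier: ~1.25x, 32-bit adder: ~1.63x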
However, it is premature to infer that all circuit simulations will be slower than the actual fabricated devices. These are small test structures which do not include the overheads of long communication wires. The results do raise the confidence level in the circuit simulations. If the actual devices ran much slower than the simulations, then we could not be as confident in the modeled delays. Now, at least, we can bound the times of the modules and infer that the actual modules should not run slower than the modeled delays.

5.8 Conclusions

This chapter provides measurements from circuit-level simulations and fabricated devices which are incorporated into the cost-performance model of this dissertation. Comparing the model's estimations and the simulation measurements reveals several discrepancies, but none large enough to void the usefulness of the cost model. Typically, the cost model is optimistic when compared to CAzM simulation for buses with many loads. For example, the cost model estimates the word lines of both the cache and the register file to be slightly faster than the simulations. However, judging from the fabricated test structures, CAzM simulation is conservative for the logic circuits and analog sense amplifiers tested.

Some of the other discrepancies between the model and the simulations arise because the cost model does not include all of the operations of the modules. For example, the register file write operation should include a 0.42 ns overhead for flipping the memory cell. Similarly, the smaller cache cell needs 0.32 ns to write into the cell. These details were omitted from the description in Chapter 4 because they can be insignificant when compared to the changes in module delay resulting from large variations in the architectural parameters. Lastly, the cost model is optimistic with respect to area estimations because it does not take into account the peripheral circuitry and the power and ground busses. However, the purpose of the area estimations is to provide a first-order approximation of the processor architectures that can fit on a single die along with the cache memory. In this respect, the cost model suffices for the architectures under consideration in this investigation.

Chapter 6: Conclusions

6.1 Introduction

This chapter summarizes the results of the cost and performance model. This includes combining the trace simulations of the SSSS, SSDD, and DSDD processor architectures with the cycle-time estimations from the integrated circuit model. When combined, these results establish the validity of the thesis that a decoupled fetch architecture with dynamic execution and retirement stages is the best microprocessor architecture. In fact, it is shown that a "Dynamic Double Duo" consisting of two dual-issue execution units is the best processor architecture. The chapter concludes with a summary of research contributions and related research issues encountered while performing this investigation.

6.2 SSSS and SSDD Processors

The purpose of contrasting the performance of the SSSS and SSDD processor architectures is to determine whether dynamic execution and retirement significantly contribute to the performance of single control flow architectures. If the cost vs. performance trade-off of these two features is favorable, then clearly they should be included in a processor as long as there is sufficient silicon area.
Taking future CMOS technology into consideration, it seems evident that there will be ample silicon area to support as many as four SSDD processors, each four words wide. So silicon area does not seem to be an issue. However, to determine the cost vs. performance trade-off with respect to cycle time, Figure 77 combines the results for the single control flow processors with the cost model estimations. As before, the figure relates the word width, cache capacity, and port configurations, except that a maximum of two ports is considered for the four-word and eight-word issue processors. Otherwise, the figure is arranged in a similar manner. The y-axis is now measured in terms of Billions of Instructions Per Second (BIPS). This is simply calculated as instructions per cycle (IPC) divided by the cycle time (ns). Both of these estimations take into account the architectural parameter values of the processor architecture.

Figure 77: Adjusted IPC of SSDD and SSSS Processors (BIPS vs. cache capacity for the W2, W4, and W8 configurations; Wn = n words, Pn = n ports)

The most important result from this figure is that all of the SSDD processor configurations have higher performance than the SSSS, even with the additional cycle-time costs of the dynamic execution and retirement stages. This implies that a single control flow processor can accommodate the reservation stations and a reorder buffer without losing their performance advantage. Another important result from the figure is the rapid decrease in performance with increasing cache capacity. To increase this capacity, additional 8Kbyte cache modules are placed on the bus to the data path, which increases the communication delay for loading and bypassing an operand. But the most surprising result is that a two-word processor now outperforms both the four- and eight-word processors. There is a slight advantage to the two-word configuration (0.25 BIPS) over the best of the four-word configurations (0.23 BIPS). The culprit behind this stems from the assumptions made for the bypass unit. A four-word issue processor has two more bypass buffer loads than a two-word processor. These two additional loads are equivalent to 16Kbytes of cache, which the two-word processor can accommodate with the slightly shorter cycle time that comes from its physically shorter data path. This additional cache capacity is what allows the two-word processor to outperform the four-word design. Not as surprisingly, the eight-word processor has much lower performance than both the two- and four-word issue processors, again because of the increase in bypassing loads from the functional units internal to the data path.

6.3 DSDD Processor

For the decoupled processor architectures, the effect of the cost model on performance is shown below in Figure 78. The processor configurations for this simulation are the same as in Chapter 3, except that a cache of 32Kbytes is assumed and performance is now measured in BIPS, as in the previous section. Without the effects of the cost model, adding more execution units to the processor resulted in a uniform increase in performance across all of the issue widths. With the cost model, it is still true that adding more execution units helps, but only up to four units. After this point the multi-ported cache is too slow and eliminates any performance advantage from decoupling the fetch.
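The adjusted performance numbers quoted in this chapter follow the BIPS definition above. A minimal sketch of the calculation; the IPC and cycle-time pair below is an illustrative placeholder, not a value taken from the tables:

    # Adjusted performance: BIPS = IPC / cycle-time(ns), i.e. billions of instructions per second.
    def bips(ipc, cycle_time_ns):
        return ipc / cycle_time_ns

    # Illustrative values only: an IPC of 1.5 with a 6.0 ns cycle gives 0.25 BIPS.
    print(bips(1.5, 6.0))   # 0.25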
Figure 78: Adjusted DSDD Processor (BIPS vs. number of execution units E1-E8 for the W1, W2, W4, and W8 issue widths)

What is also true is that the DSDD, or decoupled, architecture has higher performance than the SSDD, or single control flow, architecture. This can be seen from the fact that the maximum performance of the single execution unit configurations (i.e. W2E1) is lower than that of two or four execution units (i.e. W1E2, W2E2, and W4E4). Therefore, a decoupled architecture consistently has better performance than a single control flow processor despite the additional costs of global communication between the execution units.

6.4 Validating the Thesis

The key result from the trace simulations for the single control flow architecture is that only a small-capacity reservation station and reorder buffer are needed to achieve a considerable advantage over a processor without these features. The cost model has also shown that these two features do not impact the cycle time significantly for the buffer sizes needed. This is because other stages in the pipeline require more time than these buffers, and so they come almost for free, save for a slight increase in silicon area and bypass delay over the fully static processor (i.e. SSSS). Thus, because the dynamic execution and retirement features cost very little in cycle time yet help significantly in reducing the cycle count, they should be included in single control flow processor architectures and also in the individual execution units of the multiple control flow architectures. Decoupled fetch has also been shown to improve performance significantly despite the central communication overhead of a multi-ported cache. From the results of the Multiscalar simulations with perfect branch prediction and the cost model, the best microparallel architecture is a DSDD processor composed of four execution units, each four words wide.

6.5 Lessons Learned

To prove the thesis, we have shown results from two different simulation environments and one empirical cost model. Unfortunately, this has resulted in a rather fragmented presentation. To provide a single comparison of the SSDD and DSDD processors along with the cost model, we can illustrate the various configurations of microparallel processors in the form of a lattice. (Technically the lattice is inverted, because the base element is at the top.) Within this lattice, a processor is assumed to have dynamic execution and retirement stages, because they have been proven to be an efficient means of improving performance.

Figure 79: Microparallel Cost-Performance Lattice (each vertex WnEm is annotated with its adjusted performance in BIPS and its cycle time normalized to W1E1; n = words, m = execution units; levels 0 through 6 double the aggregate issue width at each step)

A processor configuration is represented by a vertex in Figure 79. Each vertex within the lattice contains a description of the word width and number of execution units for a given processor (e.g. W2E1 is a single, two-word-wide processor).
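The structure of this lattice is easy to reproduce mechanically. The sketch below enumerates the vertices level by level; it only generates configurations and does not attempt to reproduce the BIPS or cycle-time annotations of Figure 79.

    # Enumerate the microparallel lattice: level j holds every WnEm with n*m = 2**j,
    # where n (issue width) and m (execution units) are powers of two up to 8.
    def lattice(max_level=6):
        levels = []
        for j in range(max_level + 1):
            level = []
            for a in range(j + 1):
                width, units = 2 ** (j - a), 2 ** a
                if width <= 8 and units <= 8:
                    level.append(f"W{width}E{units}")
            levels.append(level)
        return levels

    for j, level in enumerate(lattice()):
        print(f"Level {j}: {' '.join(level)}")
    # Level 0: W1E1 ... Level 2: W4E1 W2E2 W1E4 ... Level 6: W8E8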
We can double the potential instruction throughput either by increasing the width or by increasing the number of execution units of a processor. This is accomplished by descending the lattice one vertex at a time, going either to the left or to the right. Going to the left means that we add twice as many execution units and connect them to a central cache memory, as described previously. Going to the right means that we keep the same number of execution units and increase the issue width of each unit. This implies that each level of the lattice has the same aggregate instruction throughput, and that level j has a potential throughput of 2^j instructions per cycle.

The highlighted path descending within the figure marks the way to achieve the highest performance for symbolic computation on microparallel processors. Notice that we take a circuitous route. This illustrates that getting to the highest performance processor does not always imply decoupling the fetch stage or increasing the issue width; rather, it depends upon the circumstances of the configuration. Sometimes it is better to increase the issue width rather than duplicate the execution units, although ultimately the higher-performance processors are all decoupled.

A design principle resulting from this figure is that the processors located at the center of the lattice have the lowest cost with respect to cycle time. In fact, one reason why we take such a path is that it has the lowest relative processor cycle time. This cycle time, which is attached to the lower right of each vertex, is normalized to the single-issue, single execution unit processor (i.e. W1E1). For example, at Level 2, a W2E2 configuration has a cycle time which is 1.25 times slower than a W1E1 processor, but it is the lowest of all three designs at that level and therefore has a distinct cost advantage over the other designs at that level. This is true for the W4E4 configuration as well. This implies that the best microparallel processor is one with equal width and multiplicity. It is a secondary matter which path is taken (i.e. left-right vs. right-left); all that matters is that we return to the center. This holds for different implementations of the cache and its capacity. For example, if we implement the cache as one large circuit module instead of as a collection of banks, then the number of ports to the data cache dominates the cycle time. In that case it is better to first increase the width of the processor and then increase the number of execution units, because that path has the least cycle-time cost. The reason behind this stems from the fact that as we descend the lattice, either the number of ports to the central cache memory or the number of words issued increases. From the cost model, we know that these are the two architectural parameters which influence the cycle time of the processor. It is therefore the balance between these two parameters which dictates the centralist position.

Figure 80: Alternative Path to Center (the upper levels of the lattice with each vertex annotated by number of words, number of ports, and normalized cycle time)

6.6 Effects of Branch Prediction

Unfortunately, the validation of the thesis uses an argument with perfect branch prediction. The question is whether the results remain the same and whether the thesis is still valid in the face of imperfect predictions. For the single control flow architectures, the combined effects of the cost model and imperfect branch prediction are shown in Figure 81.
The results indicate that the overall performance is lowered, but the basic results remain the same. That is, for a single control flow, a two-word-wide architecture is best, and the dynamic behavior of the SSDD still benefits the architecture enough to overcome any performance loss due to cycle-time degradation.

Figure 81: Adjusted SSSS vs. SSDD, both with prediction (BIPS vs. issue width W2, W4, W8)

For the decoupled architectures, we have a similar result. The best architecture is still a decoupled one, but instead of the W4E4 configuration being the best allocation of resources, a W2E2 (i.e. a Dynamic Double Duo) now outperforms the rest. This can be explained by the fact that with imperfect branch prediction, the instruction-level parallelism is reduced. This is evident from the trace simulations alone. Because of this reduction in available ILP, the four-word-wide configurations lose performance relative to the two-word-wide execution units due to internal bypassing overheads. Now the two-word-wide processors are more cost effective and thus have higher adjusted performance.

Figure 82: Adjusted DSDD with prediction (BIPS vs. number of execution units E1-E8; Wn = n words, Em = m execution units)

From the results with imperfect branch predictions, we can now make a modified statement concerning the thesis. Fundamentally, the architecture with the greatest cost leverage is one with equal width and multiplicity, because the internal bypassing and cache communications are balanced and at a minimum (Figure 80). Whether the "best" architecture should be a W2E2 or a W4E4 depends upon the accuracy of the branch predictions. Currently, with a branch prediction accuracy of roughly 90%, a W2E2 is best because the amount of available ILP can sustain this configuration. But as the accuracy increases, the W4E4 configuration becomes more attractive because there is sufficient ILP to sustain the four-word-wide data paths. If for some reason we could magically increase the ILP within the control flows even further, say with compile-time techniques, then at some point even the W8E8 becomes attractive. It is, however, highly unlikely, from the indications of the trace simulation and the cost lattice, that a fringe architecture (i.e. W1E8 or W8E1) will outperform such a balanced design, because of the loss in performance due to the longer clock cycle. Thus, in all cases a decoupled fetch architecture, balanced in width and multiplicity, will outperform a single control flow architecture.

6.7 Summary of Research Contributions

The three main contributions of this dissertation are: first, the development of the 16-Fold Way processor taxonomy; second, a cost-performance model which combines estimations of silicon area, cycle time, and instruction throughput; and third, an adjusted cost-performance measurement of the three processor architectures. The microparallel taxonomy is simple, yet expressive enough to describe a wide array of processors, along with the component static and dynamic behavior of the processor stages. Although the classification scheme is abstract, we can derive some conclusions about the relative performance of the individual classifications. This was demonstrated with the aid of a related study on control-flow parallelism. Also, the classifications can be refined into processor architectures for detailed trace simulation and analysis.
The second contribution of this dissertation provides a mapping between these processor architectures and circuit modules. This mapping also defines an intermediate pipeline description of each processor stage's cycle count. The silicon area-time, or cost, model describes a specific pipeline structure for high-frequency operation implemented in a dynamic logic CMOS style. Measurements are provided in a 1.0 μm technology, but they can be scaled to future technologies with minimal modification to the model. The third contribution consists of estimations, based upon the cost-performance model, which indicate that we could construct a two execution unit, dual-issue processor capable of an instruction throughput just over 210 MIPS in 1.0 μm CMOS.

6.8 Future Research Directions

During the course of this investigation, a number of questions arose that unfortunately could not be answered by this study. These issues could possibly be solved, but not within the time or space of this investigation. They are provided here for motivation of future work and for reference.

The first question is: How accurate is trace simulation for processors? This general method can be a relatively fast and accurate means of estimating the cycle count of a simple pipelined processor. But accuracy and correctness come into question when complex dynamic behaviors are introduced into a processor, along with parallel execution of instructions. Unfortunately, this dissertation could not completely validate the performance results by comparing execution with actual superscalar implementations. It is possible, however, to validate some of the superscalar configurations by executing benchmarks on representative processor architectures. For example, three configurations to test are:

Table 15: Candidate Processors for Validation

Processor Name   Configuration [W, K, P, RS, RO]
DEC 21064        [2, 8, 1, -, -]
Intel Pentium    [2, 8, 2, -, -]
PowerPC 604      [4, 16, 1, 2, 16]

Time did not permit such a validation of the Ssim simulator for these various processors. One problem is the difference between the MIPS instruction set architecture and those of all of these processors. A more acceptable method might be to normalize these examples to a generic load-store instruction set architecture, and then build a universal trace simulator that schedules only these generic instructions, so that new processors can be validated as they are built. For this study, it was felt that results from a highly quoted simulator (i.e. Ssim by Johnson) could be scrutinized more effectively, and it provides a common platform for discourse and comparison. So instead of developing such a simulator, which is a sizeable task, it was more important to expand the study with a VLSI cost model.

The Multiscalar, or multiple flow of control, processor poses a more fundamental problem. Neither a complete processor nor its compiler is currently available, so validating this architecture against an actual system is not really possible at this time. Another problem with this study is that the trace simulation environment starts with object code for a sequential processor (e.g. MIPS compiled with gcc). It would be more accurate to start with a (micro)parallelizing compiler that would produce different object codes for the individual execution units. Then we could trace simulate various processor configurations along with the compiler. Although the PIPE [101] had such a compiler, it lacked techniques for speculative execution or branch prediction.
The MISC [90] is another dynamic fetch architecture, but it too is not fully developed and only preliminary results exist. So the Multiscalar is the best alternative to date. This will change, because the decoupled fetch processors have a great opportunity to improve performance if they are developed with the memory bottleneck in mind.

The second question is: What effects does compiler technology have on this study? For a single control flow processor, established compiler techniques of instruction scheduling, register allocation, and speculative memory operations for parallel execution could all increase performance. Some can even reduce the performance advantage of hardware-assisted execution. For example, prefetching data could reduce the miss penalty so that smaller caches could be used. Also, advanced instruction scheduling could reduce the need for reservation stations, because compile-time analysis could determine operand availability and schedule accordingly. Lastly, boosting [81] is a minimal-hardware, advanced-software technique which could replace the reorder buffer's complexity and lessen the advantage of the dynamic execution stages.

The last question is: What other processors exist in the dynamic fetch region? Although we have identified some characteristics of the processors that could exist in these unexplored classifications (i.e. DDDD, DSSS, DSDS), this does not imply that they are the only ones to explore. Indeed, it seems that the variety of microparallel processor architectures expands with each new generation of microprocessors. Some are merely renditions of established mainframes on a single chip. But others now incorporate novel cache buffers and more elaborate instruction scheduling mechanisms [71]. A conjecture of this dissertation is that the 16-Fold Way taxonomy completely encapsulates microparallel processors. However, it will take further investigation of the multiple control flow, or dynamic fetch, processors to substantiate this claim.

Appendix A: IPC of Processors

A.1 Table Description

The following tables are the instructions per cycle (IPC) estimates for the processor architectures and benchmarks described in Chapter 3. For the single control flow processors (e.g. SSSS and SSDD), data and instruction cache sizes of 4K to 256K are used for each benchmark program. The harmonic mean (hm) is calculated in the right-most column of each table. The last table is for the multiple control flow processor architecture (i.e. DSDD).
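A minimal sketch of the harmonic mean calculation used for the hm column follows. As a check, the eleven 4K-cache IPC values in the first row of Table 16 below come out to roughly 0.78, matching the table.

    # Harmonic mean of per-benchmark IPC values, as used for the "hm" column.
    def harmonic_mean(values):
        return len(values) / sum(1.0 / v for v in values)

    # First row (4k) of Table 16, SSSS-W8-P1:
    ipc_4k = [0.60, 1.05, 1.07, 0.58, 1.05, 0.58, 0.87, 0.58, 1.51, 0.69, 0.88]
    print(round(harmonic_mean(ipc_4k), 2))   # 0.78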
Table 16: SSSS-W8-P1 circ eqnt espr fast qsor quer send simp tak xlis zebr hm 4k 0.60 1.05 1.07 0.58 1.05 0.58 0.87 0.58 1.51 0.69 0.88 0.78 8k 0.76 1.19 1.17 0.69 1.33 0.71 1.26 0.66 1.55 0.84 1.17 0.95 16k 1.12 1.31 1.23 0.94 1.38 1.29 1.50 0.80 1.57 0.93 1.24 1.16 32k 1.41 1.46 1.34 1.28 1.42 1.29 1.58 0.99 1.59 0.99 1.29 1.30 64k 1.41 1.58 1.50 1.28 1.42 1.35 1.58 1.23 1.59 1.15 1.29 1.38 128k 1.41 1.66 1.68 1.30 1.47 1.37 1.58 1.39 1.60 1.19 1.29 1.43 256k 1.41 1.69 1.71 1.30 1.47 1.37 1.58 1.55 1.60 1.24 1.29 1.46 Table 17: SSSS-W8-P2 circ eqnt espr fast qsor quer send simp tak xlis zebr hm 4k 0.62 1.09 1.09 0.61 1.07 0.59 0.90 0.59 1.53 0.72 0.91 0.81 8k 0.79 1.24 1.20 0.73 1.35 0.73 1.33 0.68 1.57 0.89 1.24 0.98 16k 1.22 1.36 1.27 1.01 1.41 1.35 1.60 0.82 1.59 0.99 1.31 1.22 32k 1.57 1.52 1.38 1.42 1.44 1.35 1.70 1.03 1.61 1.06 1.37 1.37 64k 1.57 1.63 1.55 1.42 1.44 1.41 1.70 1.28 1.61 1.25 1.37 1.46 128k 1.58 1.72 1.75 1.45 1.49 1.43 1.70 1.46 1.62 1.29 1.37 1.52 256k 1.58 1.75 1.77 1.45 1.49 1.43 1.70 1.64 1.62 1.35 1.37 1.55 142 Table 18: SSSS-W8-P4 circ eqnt espr fast qsor quer send simp tak xlis zebr hm 4k 0.62 1.10 1.10 0.61 1.07 0.60 0.92 0.59 1.54 0.72 0.93 0.81 8k 0.80 1.25 1.21 0.74 1.36 0.73 1.37 0.68 1.58 0.90 1.27 0.99 16k 1.24 1.37 1.28 1.02 1.41 1.36 1.66 0.82 1.61 1.00 1.34 1.23 32k 1.62 1.53 1.39 1.44 1.44 1.36 1.76 1.03 1.62 1.08 1.41 1.39 64k 1.62 1.64 1.57 1.44 1.44 1.42 1.76 1.29 1.63 1.26 1.41 1.48 128k 1.63 1.73 1.76 1.47 1.50 1.45 1.76 1.48 1.63 1.31 1.41 1.54 256k 1.63 1.76 1.79 1.47 1.50 1.45 1.76 1.66 1.64 1.37 1.41 1.57 Table 19: SSSS-W8-P8 circ eqnt espr fast qsor quer send simp tak xlis zebr hm 4k 0.62 1.10 1.10 0.61 1.07 0.60 0.92 0.59 1.54 0.72 0.93 0.81 8k 0.81 1.25 1.21 0.74 1.36 0.73 1.37 0.68 1.58 0.90 1.27 1.00 16k 1.25 1.37 1.28 1.02 1.41 1.37 1.66 0.82 1.61 1.00 1.35 1.23 32k 1.62 1.53 1.39 1.44 1.44 1.37 1.76 1.04 1.62 1.08 1.41 1.39 64k 1.62 1.64 1.57 1.44 1.45 1.43 1.76 1.30 1.63 1.27 1.41 1.49 128k 1.63 1.73 1.76 1.47 1.50 1.45 1.76 1.48 1.63 1.31 1.42 1.54 256k 1.63 1.76 1.79 1.47 1.50 1.45 1.76 1.66 1.64 1.37 1.42 1.57 Table 20: SSSS-W4-P1 circ eqnt espr fast qsor quer send simp tak xlis zebr hm 4k 0.55 0.84 0.87 0.54 0.89 0.54 0.74 0.53 1.17 0.58 0.92 0.69 8k 0.69 0.94 0.94 0.63 1.07 0.64 1.01 0.60 1.19 0.69 1.02 0.81 16k 0.97 1.04 0.98 0.81 1.11 1.07 1.15 0.71 1.21 0.75 1.05 0.96 32k 1.18 1.16 1.05 1.05 1.13 1.07 1.19 0.86 1.22 0.80 1.09 1.06 64k 1.18 1.25 1.15 1.05 1.13 1.11 1.19 1.03 1.22 0.91 1.09 1.11 128k 1.18 1.32 1.26 1.07 1.17 1.12 1.19 1.14 1.23 0.94 1.09 1.15 256k 1.18 1.35 1.27 1.07 1.17 1.12 1.19 1.24 1.23 0.98 1.09 1.16 143 Ihble 21: SSSS-W4-P2 circ eqnt espr fast qsor quer send simp tak xlis zebr hm 4k 0.57 0.87 0.88 0.56 0.90 0.54 0.75 0.54 1.19 0.60 0.98 0.71 8k 0.72 0.97 0.95 0.66 1.09 0.65 1.03 0.61 1.22 0.71 1.08 0.83 16k 1.05 1.07 0.99 0.86 1.13 1.11 1.17 0.73 1.24 0.78 1.12 1.00 32k 1.29 1.19 1.07 1.14 1.15 1.11 1.22 0.89 1.25 0.83 1.17 1.10 64k 1.29 1.28 1.17 1.14 1.15 1.15 1.22 1.07 1.25 0.95 1.17 1.16 128k 1.29 1.35 1.29 1.16 1.18 1.17 1.22 1.20 1.26 0.98 1.17 1.20 256k 1.29 1.38 1.30 1.16 1.18 1.17 1.22 1.31 1.26 1.03 1.17 1.22 Table 22: SSSS-W4-P4 circ eqnt espr fast qsor quer send simp tak xlis zebr hm 4k 0.57 0.87 0.88 0.56 0.90 0.55 0.76 0.55 1.19 0.60 0.99 0.71 8k 0.73 0.97 0.95 0.66 1.09 0.65 1.04 0.62 1.22 0.72 1.09 0.84 16k 1.06 1.07 0.99 0.87 1.13 1.11 1.19 0.73 1.24 0.79 1.13 1.00 32k 1.31 1.19 1.07 1.15 1.15 1.11 1.24 0.90 1.25 0.84 1.18 1.11 64k 1.31 1.28 1.18 1.15 1.15 1.15 1.24 1.08 1.25 0.96 1.18 
1.17 128k 1.31 1.35 1.29 1.17 1.18 1.17 1.24 1.21 1.26 0.99 1.18 1.21 256k 1.31 1.38 1.31 1.17 1.18 1.17 1.24 1.33 1.26 1.03 1.18 1.23 Table 23: SSSS-W2-P1 circ eqnt espr fast qsor quer send simp tak xlis zebr hm 4k 0.46 0.59 0.63 0.45 0.71 0.45 0.56 0.46 0.92 0.48 0.68 0.55 8k 0.56 0.65 0.67 0.51 0.82 0.53 0.70 0.51 0.94 0.56 0.73 0.63 16k 0.76 0.71 0.70 0.64 0.84 0.81 0.77 0.59 0.95 0.61 0.75 0.73 32k 0.89 0.78 0.74 0.77 0.85 0.81 0.79 0.69 0.96 0.64 0.77 0.78 64k 0.89 0.83 0.79 0.77 0.85 0.83 0.79 0.80 0.96 0.70 0.77 0.81 128k 0.89 0.87 0.86 0.78 0.87 0.84 0.79 0.86 0.96 0.72 0.77 0.83 256k 0.89 0.88 0.87 0.78 0.87 0.84 0.79 0.92 0.96 0.75 0.77 0.84 144 Ibble 24: SSSS-W2-P2 circ eqnt espr fast qsor quer send simp tak xlis zebr hm 4k 0.48 0.61 0.64 0.47 0.73 0.46 0.57 0.47 0.97 0.49 0.70 0.57 8k 0.58 0.68 0.68 0.53 0.85 0.54 0.72 0.52 0.98 0.57 0.75 0.65 16k 0.79 0.74 0.70 0.67 0.87 0.83 0.79 0.60 1.00 0.62 0.77 0.75 32k 0.94 0.82 0.75 0.82 0.88 0.83 0.81 0.71 1.00 0.65 0.79 0.81 64k 0.94 0.88 0.80 0.82 0.88 0.85 0.81 0.83 1.01 0.72 0.79 0.84 128k 0.94 0.92 0.87 0.83 0.90 0.86 0.81 0.90 1.01 0.74 0.80 0.87 256k 0.94 0.93 0.88 0.83 0.90 0.86 0.81 0.96 1.01 0.77 0.80 0.88 Table 25: SSDD-W8-P1 circ eqnt espr fast qsor quer send simp tak xlis zebr hm 4k 0.71 1.24 1.49 0.66 1.41 0.66 1.06 0.67 2.72 0.79 1.10 0.95 8k 0.95 1.46 1.69 0.82 1.98 0.85 1.85 0.78 2.78 1.00 1.54 1.21 16k 1.62 1.65 1.80 1.19 2.09 1.77 2.42 0.98 2.82 1.12 1.64 1.58 32k 2.36 1.92 1.98 1.85 2.12 1.77 2.65 1.30 2.84 1.21 1.74 1.86 64k 2.36 2.12 2.18 1.85 2.13 1.86 2.65 1.74 2.85 1.43 1.74 2.01 128k 2.38 2.27 2.55 1.90 2.24 1.89 2.65 2.09 2.86 1.48 1.74 2.11 256k 2.38 2.32 2.59 1.90 2.24 1.89 2.65 2.46 2.86 1.55 1.74 2.16 Table 26: SSDD-W8-P2 circ eqnt espr fast qsor quer send simp tak xlis zebr hm\ 4k 0.74 1.36 1.56 0.72 1.46 0.67 1.14 0.69 2.75 0.85 1.18 1.00 8k 1.01 1.62 1.77 0.90 2.08 0.87 1.99 0.81 2.80 1.09 1.69 1.29 16k 1.82 1.82 1.89 1.38 2.19 1.94 2.67 1.01 2.83 1.24 1.82 1.72 32k 2.79 2.12 2.08 2.37 2.22 1.94 2.96 1.36 2.84 1.34 1.94 2.04 64k 2.79 2.32 2.30 2.37 2.21 2.05 2.96 1.85 2.85 1.63 1.94 2.22 128k 2.81 2.46 2.63 2.45 2.34 2.10 2.96 2.26 2.86 1.69 1.95 2.35 256k 2.81 2.49 2.66 2.45 2.34 2.10 2.96 2.71 2.86 1.78 1.95 2.40 145 Table 27: SSDD-W8-P4 circ eqnt espr fast qsor quer send simp tak xlis zebr hm 4k 0.75 1.41 1.58 0.73 1.47 0.68 1.15 0.70 2.75 0.87 1.21 1.02 8k 1.02 1.67 1.80 0.91 2.09 0.88 1.99 0.81 2.80 1.13 1.76 1.31 16k 1.87 1.89 1.93 1.40 2.20 1.98 2.67 1.02 2.83 1.29 1.90 1.76 32k 2.91 2.18 2.12 2.41 2.23 1.98 2.96 1.38 2.85 1.40 2.04 2.09 64k 2.91 2.37 2.34 2.41 2.22 2.10 2.96 1.89 2.85 1.72 2.04 2.28 128k 2.93 2.51 2.64 2.50 2.35 2.15 2.96 2.31 2.86 1.79 2.04 2.40 256k 2.93 2.54 2.68 2.50 2.35 2.15 2.96 2.78 2.86 1.87 2.04 2.46 Table 28: SSDD-W8-P8 circ eqnt espr fast qsor quer send simp tak xlis zebr hm 4k 0.75 1.43 1.58 0.73 1.48 0.68 1.15 0.70 2.75 0.87 1.22 1.02 8k 1.03 1.69 1.81 0.91 2.11 0.88 1.99 0.82 2.80 1.14 1.77 1.32 16k 1.88 1.90 1.93 1.40 2.20 1.99 2.67 1.03 2.83 1.31 1.92 1.77 32k 2.94 2.20 2.12 2.42 2.23 1.99 2.96 1.38 2.85 1.42 2.05 2.10 64k 2.94 2.39 2.34 2.42 2.23 2.11 2.96 1.90 2.85 1.75 2.05 2.29 128k 2.96 2.52 2.65 2.50 2.36 2.17 2.96 2.32 2.86 1.82 2.06 2.42 256k 2.96 2.55 2.68 2.50 2.36 2.17 2.96 2.79 2.86 1.91 2.06 2.48 Table 29: SSDD-W4-P1 circ eqnt espr fast qsor quer send simp tak xlis zebr hm 4k 0.70 1.24 1.49 0.67 1.38 0.66 1.10 0.67 2.65 0.79 1.36 0.96 8k 0.95 1.46 1.69 0.82 1.92 0.84 1.85 0.78 2.70 1.00 1.56 1.21 16k 1.63 1.65 1.79 1.19 2.00 1.76 2.42 0.98 2.74 
1.12 1.64 1.57 32k 2.36 1.91 1.97 1.85 2.03 1.76 2.64 1.29 2.75 1.21 1.74 1.84 64k 2.36 2.11 2.18 1.85 2.03 1.85 2.64 1.73 2.76 1.43 1.74 1.99 128k 2.37 2.26 2.54 1.89 2.13 1.89 2.65 2.07 2.77 1.48 1.74 2.09 256k 2.37 2.31 2.58 1.90 2.13 1.89 2.65 2.44 2.77 1.55 1.74 2.14 146 Table 30: SSDD-W4-P2 circ eqnt espr fast qsor quer send simp tak xlis zebr hm 4k 0.74 1.36 1.55 0.72 1.42 0.67 1.12 0.69 2.68 0.85 1.18 0.99 8k 1.00 1.61 1.77 0.90 1.99 0.87 1.93 0.80 2.72 1.09 1.69 1.28 16k 1.80 1.82 1.88 1.37 2.06 1.92 2.56 1.01 2.74 1.24 1.82 1.70 32k 2.75 2.11 2.07 2.32 2.09 1.92 2.82 1.35 2.76 1.34 1.94 2.01 64k 2.75 2.31 2.29 2.32 2.09 2.03 2.82 1.83 2.77 1.63 1.94 2.19 128k 2.76 2.45 2.62 2.40 2.20 2.08 2.82 2.22 2.77 1.69 1.94 2.30 256k 2.76 2.49 2.66 2.40 2.20 2.08 2.82 2.64 2.77 1.77 1.94 2.36 Table 31: SSDD-W4-P4 circ eqnt espr fast qsor quer send simp tak xlis zebr hm 4k 0.75 1.41 1.57 0.72 1.43 0.68 1.13 0.69 2.68 0.87 1.21 1.01 8k 1.01 1.67 1.79 0.90 2.00 0.87 1.93 0.81 2.72 1.13 1.75 1.30 16k 1.82 1.88 1.91 1.37 2.08 1.96 2.56 1.01 2.75 1.29 1.89 1.73 32k 2.80 2.17 2.10 2.34 2.11 1.96 2.82 1.36 2.76 1.40 2.03 2.05 64k 2.80 2.36 2.33 2.34 2.11 2.08 2.82 1.85 2.77 1.71 2.03 2.23 128k 2.82 2.50 2.64 2.42 2.22 2.13 2.82 2.25 2.77 1.77 2.03 2.35 256k 2.82 2.53 2.67 2.42 2.22 2.13 2.82 2.69 2.78 1.86 2.03 2.40 Table 32: SSDD-W2-P1 circ eqnt espr fast qsor quer send simp tak xlis zebr hm 4k 0.64 1.18 1.31 0.61 1.01 0.61 0.82 0.60 1.48 0.77 1.24 0.84 8k 0.81 1.34 1.46 0.71 1.24 0.77 1.16 0.68 1.48 0.98 1.35 1.00 16k 1.20 1.45 1.53 0.98 1.26 1.53 1.37 0.80 1.48 1.10 1.42 1.23 32k 1.56 1.58 1.64 1.38 1.26 1.53 1.44 0.99 1.48 1.18 1.49 1.38 64k 1.56 1.65 1.73 1.38 1.26 1.53 1.44 1.22 1.48 1.39 1.49 1.45 128k 1.56 1.69 1.76 1.41 1.30 1.55 1.44 1.37 1.48 1.42 1.50 1.49 256k 1.56 1.70 1.77 1.41 1.30 1.55 1.44 1.51 1.48 1.46 1.50 1.51 147 Table 33: SSDD-W2-P2 circ eqnt espr fast qsor quer send simp tak xlis zebr hm 4k 0.65 1.25 1.34 0.61 1.01 0.61 0.82 0.60 1.48 0.82 1.25 0.85 8k 0.82 1.41 1.49 0.71 1.24 0.77 1.16 0.68 1.48 1.05 1.35 1.02 16k 1.20 1.51 1.57 0.98 1.26 1.53 1.37 0.80 1.48 1.18 1.42 1.25 32k 1.56 1.63 1.67 1.38 1.26 1.53 1.44 0.99 1.48 1.26 1.49 1.40 64k 1.56 1.67 1.74 1.38 1.26 1.53 1.44 1.22 1.48 1.49 1.49 1.46 128k 1.56 1.71 1.77 1.41 1.30 1.55 1.44 1.37 1.48 1.51 1.50 1.50 256k 1.56 1.72 1.78 1.41 1.30 1.55 1.44 1.51 1.48 1.54 1.50 1.52 Table 34: DSDD circ fast qsor quee quer semi send simp tak zebr W1E1 0.75 0.75 0.75 0.75 0.74 0.74 0.74 0.75 0.75 0.75 W1E2 1.37 1.33 1.32 1.34 1.32 1.36 1.36 1.36 1.39 1.34 W1E4 2.25 2.15 2.15 2.19 2.05 2.21 2.28 2.25 2.29 2.12 W1E8 3.14 2.96 2.98 3.08 2.69 3.00 3.27 3.08 3.33 2.68 W2E1 1.19 1.10 1.09 1.15 1.08 1.20 1.18 1.17 1.19 1.17 W2E2 2.18 2.08 2.09 2.18 2.00 2.21 2.17 2.15 2.22 2.09 W2W4 3.55 3.68 3.75 3.88 3.32 3.53 3.65 3.54 3.72 3.21 W2E8 4.85 5.08 5.21 5.45 4.28 4.67 5.05 4.73 5.29 3.98 W4E1 1.39 1.42 1.43 1.53 1.34 1.24 1.45 1.41 1.30 1.25 W4E2 2.47 2.66 2.67 2.85 2.48 2.04 2.59 2.54 2.39 2.18 W4E4 3.90 4.64 4.69 4.99 4.09 3.59 4.25 4.04 3.95 3.36 W4E8 5.11 5.84 5.94 6.42 4.96 4.41 5.62 5.06 5.51 4.21 W8E1 1.49 1.64 1.64 1.72 1.55 1.46 1.58 1.55 1.42 1.29 W8E2 2.61 3.11 3.07 3.25 2.84 2.41 2.81 2.78 2.47 2.29 W8E4 4.03 5.39 5.43 5.69 4.82 3.89 4.45 4.38 4.24 3.53 W8E8 5.17 6.63 6.87 7.21 5.98 5.54 5.63 5.41 5.99 4.31 148 Appendix B: CMOS Technology Table 35: HP-l.Oum (CMOS26B) Transistor Parameters Name W/L N-Channel P-Channel Units v,h 1.5/1.0 0.692 -0.885 V Vth(Vds = 0.05 V) 9.0/1.0 0.670 -0.869 V Idss(Vgds = 5V) 3.75 -1.74 mA K p 45.0 
-18.0 uA/V2 Delta Length 0.234 0.283 um Delta Width 0.367 0.598 um SPICE TRANSISTOR MODELS: .MODEL nfet NMOS LEVEL=3 PHI=0.600000 TOX=1.7800E-08 XJ=0.2000000U TPG=1 + VTO=0.7144 DELTA=7.1240E-01 LD=1.4430E-07 KP=1.1087E-04 + UO=571.5 THETA=1.2510E-01 RSH=1.0990E+01 GAMMA=0.5728 + NSUB=3.7200E+16 NFS=4.9990E+12 VMAX=1.9750E+05 ETA=3.5650E-02 + KAPPA=1.1810E-01 CGDO=4.1991E-10CGSO=4.1991E-10 + CGBO=3.8469E-10 CJ=1.3857E-04 MJ=0.6776 CJSW=4.081 IE-10 + MJSW=0.306226 PB=0.800000 * Weff = Wdrawn - Delta_W * The suggested Delta_W is 2.9700E-07 .MODEL pfet PMOS LEVEL=3 PHI=0.600000 TOX=1.7800E-08 XJ=0.2000000U TPG=-1 + VTO=-0.9002 DELTA=4.7700E-01 LD=1.0310E-07 KP=3.4454E-05 + UO=177.6 THETA=1.6570E-01 RSH=1.0390E+01 GAMMA=0.4895 + NSUB=2.7170E+16 NFS=5.0000E+12 VMAX=3.6930E+05 ETA=7.4960E-02 + KAPPA=8.9950E+00 CGDO=3.0002E-10 CGSO=3.0002E-10 + CGBO=4.1395E-10 CJ=5.9085E-04 MJ=0.4997 CJSW=8.9271E-11 + MJSW=0.038310 PB=0.850000 * Weff = Wdrawn - Delta_W * The suggested Delta_W is 3.7360E-07 149 Appendix C: Memory Physical Layout C.1 Register File Cells There are four cell types used in the register file design. They consist of an address decoder, word line driver, memory cell, and read/write circuitry. This appendix illustrates the layout of the most crucial register file cell, the memory cell. The other circuits are important but the memory cell determines the pitch matching dimension and clearly illustrates the wiring complexity that occurs with multi-ported designs. The three figures that follow are of the Wl, W2, and W4 register file designs. This translates to a 3-port, 6-ports and 12-port design respectively. All of the dimensions and electrical characteristics can be found in Chapter 4 for these and the cache memory cells. For all of the memory cells, the word lines are the horizontal (metal one) wires and the bit lines are the vertical (metal two) wires. Figure 83: 3-Port Register File Cell 150 Figure 84: 6-Port Register File Cell Figure 85: 12-Port Register File Cell C.2 Cache Memory Cells There are six cell types used in the cache memory design. They consist of an address driver, address decoder, word line driver, memory cell, read sense amp, and write driver. Only the memory cell and word line driver are illustrated. This is because they best illustrate the pitch matching constraints that occur when a small memory cell is designed. A 1-Port, 2-port, and 4-port memory cell is illustrated. Just as with the register file the dominate factor in designing these cells are the wiring over head to get in to and out of the cross-coupled inverter. 152 Figure 86: 1-Port Cache Cell Figure 87: 2-Port Cache Cell Figure 88: 4-Port Cache Cell References T. Adams and H. Tomg, "A Comparison of List Schedules for Parallel Processing Systems," Communications of the ACM, Vol. 17, Dec. 1974. J. Alowersson, "A CMOS Circuit Technique for High-Speed RAMs," Department of Computer Engineering, Lund University, 1994. M. Afghahi and Christer Svensson, "Performance of Synchronous and Asynchronous Schemes for VLSI Systems," IEEE Transaction on Computers, Vol 41., No. 7, July 1992. TY. Agawala and J. Cocke, "High Performance Reduced Instruction Set Processors," IBM T.J. Watson Research Center, Technical Report #55845, March 1987. H. Bakoglu, "Circuits, Interconnections, and Packaging for VLSI," Ad dison Wesley Publishing Company, 1990. H. Bakoglu and T. Whiteside, “RISC System/6000 Hardware Over view,” IBM RISC System Technology, 1990. J. 
Beer, "Concepts, Design, and Performance Analysis of a Parallel Prolog Machine," The Technical University of Berlin, Ph.D. Thesis, 1987. T. Blalock and R. Jeager,"A High-Speed Clamped Bit-Line Current- Mode Sense Amplifier," IEEE Journal of Solid-State Circuits, Vol. 26, No. 4, April 1991. B. Case, “Superscalar Techniques: SuperSPARC vs. 88110,” Micro processor Report, Vol. 5, No. 22, Dec 4th, 1991. B. Case, “Intel Reveals Pentium Implementation,” Microprocessor Re port, Vol. 7, No. 4, March 29th, 1993. P. Chang et al., “IMPACT: An Architectural Framework for Multi-In- struction-Issue Processors,” Proceedings o f the 18th Annual Interna tional Symposium on Computer Architecture, May 1991 A. Charlesworth, "An Approach to Scientific Array Processing: The Architectural Design of the AP-120B/FBS 164 Family," IEEE Comput er,Vol. 14, Sept. 1981. [13] R. Cohn et al., “Architecture and Compiler Tradeoffs for a Long In struction Word Microprocessor,” The Third Conference on Architectur al Support for Programming Languages and Operating Systems, April 1989. [14] R. R Colwell et al., “A VLIW Architecture for a Trace Scheduling Compiler,” Second International Conference on Architectural Support fo r Programming Languages and Operating Systems, October 1987. [15] R. Cryton et al., "An Efficient Method of Computing Static Single As signment Form," 16th Annual ACM Symposium on Principles of Pro gramming Languages, Jan. 1981. [16] S. Dasgupta and J. Tartar, "The identification of maximal parallelism in straight line microprogramming," IEEE Transactions on Computers, Vol. C-25, Oct. 1976. [17] R. Dennard et al., "Design of Ion-Implanted MOSFET’s with Very Small Physical Dimensions," IEEE Journal o f Solid-State Circuits, Vol. CS-9, No. 5, Oct. 1974. [18] A. Despain, Private Communications. [19] D. W. Dobberpuhl et al., “A 200 MHz 64-b Dual-Issue CMOS Micro processor,” IEEE Journal of Solid-State Circuits, Vol. 27, No. 11, Nov. 1992. [20] T. Dobry, "A High Performance Architecture For Prolog," Kluwer Ac ademic Publishers, 1990. [21] L. Dowd, "The Cray T3D: from fringe to forefront. (Cray Research Inc.’s supercomputer based on DEC’s Alpha AXP 21064 processor)" DEC Professional, Vol. 12, no. 12, Dec 1993 [22] D.J. Erdman et al., "CAzM: A Numerically Robust, Table-Based Cir cuit Simulatior," Microelectronics Center o f North Carolina, Techni cal Report TR89-23, 1989. [23] J. Ferrante et al.,"The Program Dependence Graph and Its Use in Op timization," ACM Transactions on Programming Languages and Sys tems, July 1987. 156 [24] J. A. Fisher, ‘Trace Scheduling: A Technique for Global Microcode Compaction,” IEEE Transactions on Computers, Vol. C-30, NO. 7, July 1981. [25] J. A. Fisher, “Very Long Instruction Word Architectures and the ELI- 512,” Proceedings of the 10th Annual International Symposium on Computer Architecture, June 1983. [26] J. A. Fisher and B. Ramakrishna Rau, “Instruction-Level Parallel Pro cessing,” Science, Sept. 13th, 1991. [27] M. J. Flynn, “Some Computer Organization and Their Effectiveness,” IEEE Trans, on Computer, vol. c-21, no. 9, Sept. 1972. [28] R. Forsyth et al., “T9000 - Superscalar Transputer,” Hot Chips III Pre sentation, Aug 1991. [29] C.C. Foster and E.M. Riseman, "Percolation of Code to Enhance Par allel Dispatching and Execution," IEEE Transactions on Computers, Vol. C-21, Dec. 1972. [30] M. Franklin, "The Multiscalar Architecture," Ph.D. Thesis, University o f Wisconsin - Madison, 1993. [31] M. Garey and D. Johnson, "Computers and Intractability," W . H. 
Free man and Company, New York, 1979. [32] P. P. Gelsingeri et al., "Microprocessors circa 2000," IEEE Spectrum, October 1989. [33] T. Getzinger, "Abstract Interpretation for Compile-Time Analysis of Logic Programs," University o f Southern California, Ph.D. Thesis, ACAL-PR-93-09. [34] J. Goodman et al., “PIPE:A VLSI Decoupled Architecture,” Proceed ings of the 12th Annual International Symposium on Computer Archi tecture, June 1985. [35] L. Gwennap, “Motorola Details Plan to Extend 68K Line,” Micropro cessor Report, Vol. 6, No. 15, Nov 1992. [36] P. M. Hanson, "Coprocessor Architectures for VLSI," Ph.D. Thesis, University ofCalifomia-Berkeley, Report No. UCB/CSD 88/466, Nov. 1988. 157 R. Haygood, "The Berkeley Benchmark Suite," University of Califor nia Technical Report. N. Hedenstiema and K. Jeppson, "Comments on the Optimum CMOS Tapered Buffer Problem," IEEE Journal o f Solid-State Circuits, Vol. 29, No. 2, Feb. 1994. J. Hennessy and N. Jouppi, "Computer Technology and Architecture: A n Evolving Interaction," IEEE Computer, Sept. 1991. J. Hennessy and D. Patterson, "Computer Architecture A Quantitative Approach," Morgan Kaufmann Publishers, Inc., 1990. B.Holmer, "Fast Prolog with an Extended General Purpose Architec ture," The 17h Annual International Symposium on Computer Archi tecture,May 1990. R.W. Horst et al., “Multiple Instruction Issue in the NonStop Cyclone Processor,” Proceedings o f the 17th Annual International Symposium on Computer Architecture, May 1990. I. Hwang and A. Fisher, "Ultrafast Compact 32-bit CMOS Adders in Multiple-Output Domino Logic, IEEE Journal of Solid-State Circuits, Vol. 24, No. 2, Feb. 1989. W. Hwu, "Exploiting Concurrency to Achieve High Performance in a Single-chip Microarchitecture," University of Califomia-Berkeley, Ph.D. Thesis, Report No. UCB/CSD 88/398, Jan. 1988. B. Irissou, "Design Techniques of High-Speed Datapaths," University o f California Berkeley, Report No. UCB/CSD 93/748, Nov. 1992. W. Jaffe et a/.,“A 200 MFLOP Precision Architecture Processor,” Hot Chips IV Presentation, Aug 1992. M. Johnson, "Superscalar Microprocessor Design," Prentice Hall, En glewood Cliffs, New Jersey, 1991. N. Jouppi, "The Nonuniform Distribution of Instruction-Level and Ma chine Parallelism and Its Effect on Performance," IEEE Transactions Computers, Vol. 38, No. 12, Dec. 1989. [49] N. P. Jouppi and D. W. Wall, “Available Instruction-Level Parallelism for Superscalar and Superpiplined Machines,” The Third Conference on Architectural Support for Programming Languages and Operating Systems, April 1989. [50] N. Jouppi and S. Wilton, "Tradoffs in Two-Level On-Chip Caching," Proceedings o f the 21th Annual International Symposium on Computer Architecture, May 1994. [51] G. Kane, "mips R200 RISC Architecture," Prentice Hall, Englewood Cliffs, NJ. 1987. [52] S. Keckler and W. Dally, "Processor Coupling: Integrating Compile Time and Runtine Scheduling for Parallelism," Proceedings o f the 19th Annual International Symposium on Computer Architecture, June 1992. [53] P. Kogge, "The Architecture of Pipelined Computers," Hemisphere Publishing Corporation, 1981. [54] Y. Koh, "Construction of SLAM Cache System," University of South ern California, ACAL Technical Report, 1994. [55] H. Kotani et al., "A 256 Mb DRAM with 100 MHz Serial I/O Ports for Storage of Moving Pictures," 37th International Solid-State Circuits Conference, 1994. [56] M. Knieser and C. 
Papachristou, “Y-Pipe: A Conditional Branching Scheme Without Pipeline Delays,” The 25th Annual International Symposium on Microarchitecture, December 1992. [57] M. Kuga et al., “DSNS (Dynamically-hazard-resolved, Statically- code-scheduled, Nonuniform Superscalar): Yet Another Superscalar Processor Architecture,” Computer Architecture News, vol. 19, no. 4, June 1991. [58] M. Lam, “Software Pipelining: An Effective Scheduling Technique for VLIW Machines,” ACM SIGPLAN ‘88 Conference on Programming Language Design and Implementation, 1988. [59] M. Lam and R. P. Wilson, “Limits of Control FLow on Parallelism,” Proceedings of the 19th Annual International Symposium on Computer Architecture, June 1992. 159 [60] P. Law, "High-Speed CMOS Design," Aug 1992. [61] H. Lindkvist and P. Andersson, "Techniques for Fast CMOS-based Conditional Sum Adders," ICCD, Oct 1994. [62] R. Marko and M. Beck, “National’s Swordfish A Superscalar with DSP,” Hot Chips III Presentation, Aug 1991. [63] C. Mead and L. Connway, "Introduction to VLSI Systems," Addison Wesley Publishing Company, 1980. [64] C. Mead and M.Rem, "Cost and Performance of VLSI Computing Structures," IEEE Journal o f Solid-State Circuits, Vol. SC-14, No. 2, April 1979. [65] K. Murakami et al., “SIMP (Single Instruction stream/ Multiple in struction Pipelining): A novel High-Speed Single-Processor Architec ture,” Proceedings o f the 16th Annual International Symposium on Computer Architecture, May 1989. [66] M. Nakajima et al., “OHMEGA: A VLSI Superscalar Processor Archi tecture for Numerical Applications” Proceedings o f the 18th Annual International Symposium on Computer Architecture, May 1991. [67] H. Hanawa, "On-chip Multiple Superscalar Processors with Secondary Cache Memories," ICCD, 1991. [68] A. Nicolau, “Percolation Scheduling: A Parallel Compilation Tech nique” CS Technical Report TR 85-678, Cornell University, Ithaca NY, May 1985. [69] J. Osterhout et al., "Magic Tutorial #l:Getting Started," University of California, Berkeley, Magic Documentation, Sept. 1990. [70] Y . Patt etal., “Run-Time Generation of HPS Microinstructions From a VAX Instruction Stream,” Micro 19 Workshop, New York, Oct. 1986. [71] V. Popescu et al., “The Metaflow Architecture,” IEEE Micro, June 1991. [72] K. Saraswat and F. Mohammadi, "Effect of Scaling of Interconnections on the Time Delay of VLSI Circuits," IEEE Transactions on Electronic Devices, Vol. ED-29, No. 4, April 1982. 160 [73] B. Sano and A. Despain, "The 16-Fold Way: A Microparallel Taxono my," The 26th International Symposium on Microarchitecture, Dec. 1993. [74] K. C. Saraswat and F. Mohammadi, "Effects of Scaling of Interconnec tion on the Time Delay of VLSI Circuits," IEEE Transactions on Elec tronic Devices, Vol. ED-29, No. 4, April 1982. [75] E. Seevinc et al.,'"Current-Mode Techniques for High-Speed VLSI Cir cuits with Applications to Current Sense Amplifiers for CMOS SRAM’s," IEEE Journal o f Solid-State Circuits, Vol. 26, No. 4, April 1991. [76] A. Silburt et al.,"A 180 MHz 0.8-um BiCMOS Modular Memory Fa cility of DRAM and Multiported SRAM," IEEE Journal of Solid-State Circuits, Vol. 28, No. 3, March 1993. [77] A. Singhal, “A High Performance Prolog Processor with Multiple Functional Units,” Proceedings of the 16th Annual International Sym posium on Computer Architecture, May 1989. [78] M. Smith, M. Lam, and M. Horowitz, “Boosting Beyond Static Sched uling in a Superscalar Processor,” Proceedings of the 17th Annual In ternational Symposium on Computer Architecture, June 1990. [79] J.E. 
Smith and A. Pleszkun,"Implementationof Precise Interrupts in Pipelined Processors," Proceedings o f the 12th Annual International Symposium on Computer Architecture, June 1985. [80] J.E. Smith, “Dynamic Instruction Scheduling and the Astronautics ZS- 1,” IEEE Computer, pp. 21-35, June 1989. [81] M. Smith, M. Horowitz, and M. Lam, “Efficient Superscalar Perfor mance Through Boosting,” Fifth International Conference on Archi tectural Support for Programming Languages and Operating Systems, September 1992. [82] G. A. Slavenburg et al., “The LIFE Family of High Performance Single Chip VLlWs,” Hot Chips III Presentation, 1991. [83] A. Srivastava and A. Despain, "Prophetic Branches: A Branch Tech nique Architecture for Code Compaction and Efficient Execution," The 26th International Symposium on Microarchitecture, Dec. 1993. 161 [84] P. Statt, "Power2 Takes the Lead - For Now," Byte Magazine, Jan. 1994, pg. 77. [85] C. Stephens et al., “Instruction Level Profiling and Evaluation of the IBM RS/6000,” Proceedings o f the 18th Annual International Sympo sium on Computer Architecture, May 1991. [86] S. Tanaka et al., "A 120 MHz BiCMOS Superscalar RISC Processor," IEEE Journal of Solid-State Circuits, Vol. 29, No. 4, April 1994. [87] A. Taylor, "LIPS on a MIPS: Results from a Prolog Compiler for a RISC," Logic Programming: Proceedings o f the 7th International SRI International Artificial Intelligence Center. October 1983. [88] J.E. Thorton, “Design of a Computer-The Control Data 6600,” Scott, Foresman and Co., Glenview EL 1970. [89] R.M. Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” IBM J. Research Development, vol 11. pp.25-33, Jan. 1967. [90] G. Tyson, M. Farrens, and A. Pleszkun, “MISC: A Multiple Instruction Stream Computer,” The 25th Annual International Symposium on Mi croarchitecture, December 1992. [91] G. A. Uvieghara, “SOVAR: Smart Memories for Out-of-order Execu tion VLSI Architectures,” University of California, Berkeley, UCB/ ERL M89/42, April 1989. [92] T. Wada et al. "AN Analytic Access Time Model for On-Chip Cache Memories," IEEE Journal o f Solid-State Circuits, Vol. 27, No. 8, Au gust 1992. [93] D. Wall, "Limits of Instruction-Level Parallelism," DEC Western Re search Laboratory, Palo Alto, Research Report, 93/6. [94] L. Wang and C. Wu, “Distributed Instruction Set Computer Architec ture,” IEEE Trans, on Computers, vol. 40, no. 8, Aug. 1991. [95] H. Weste and K. Eshraghian, "Principles of CMOS VLSI Design A Systems Perspective," Addison-Wesley Publishing Company, 1985. [96] B. Wei and C. Thompson, "Area-Time Optimal Adder Design," IEEE Trans, on Computers, vol. 39, no. 5, May. 1990. 162 [97] [98] [99] [100] [101] [102] [103] [104] [105] [106] [107] [108] [109] L. Wirbel, "HP/Convex, IBM stir superserver field," Electronic En gineering Times, no. 789, March 21 1994. A. Wolfe and J. P. Shen, “A Variable Instruction Stream Extension to the VLIW Architecture,” The Fourth Conference on Architectural Sup port fo r Programming Languages and Operating Systems, April 1991. A. Wolfe and R. Boleyn, "Two Ported Cache Alternatives for Super scalar Processors," The 26th Annual International Symposium on Mi croarchitecture, December 1993. S. Yau et al.,"On Storage optimization of horizontal microprograms," 11th Annual Microprogramming Workshop, 1978. H. Young, “Evaluation of a Decoupled Computer Architecture and the Design of a Vector Extension,” Computer Sciences Technical Report #603, University of Wisconsin - Madison, July 1985. J. 
Yuan and Christer Svensson, "High-Speed CMOS Circuit Tech niques," IEEE Journal of Solid-State Circuits, Vol. 24, No. 1, February 1989. _, SPECINT92 and SPECFP92 Benchmark Suite. -, “64-Bit Microprocessor Programmers Reference Manual,” Intel Corporation, Mt Prospect IL, 1990. -,“80860 User’s Manual,” Intel Corporation, Santa Clara CA, 1989. -.“80960CA User’s Manual,” Intel Corporation, Santa Clara CA, 1989. _, "PowerPC 601 RISC Microprocessor Technical Summary," Motoro la Corp., Rev 1., MPC601/D. PowerPC 603 RISC Microprocessor Technical Summary," Motorola Corp., Rev 1., MPC603/D. -, "PowerPC 604 RISC Microprocessor Technical Summary," Motorola Corp., Rev 1., MPC604/D. 163