ADAPTIVE DYNAMIC THREAD SCHEDULING FOR SIMULTANEOUS MULTITHREADED ARCHITECTURES WITH A DETECTOR THREAD

by

Chulho Shin

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)

December 2002

Copyright 2003 Chulho Shin

UMI Number: 3093914. Copyright 2003 by Shin, Chulho. All rights reserved. UMI Microform 3093914, ProQuest Information and Learning Company, 300 North Zeeb Road, P.O. Box 1346, Ann Arbor, MI 48106-1346. This microform edition is protected against unauthorized copying under Title 17, United States Code.

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90089-1695

This dissertation, written by CHULHO SHIN under the direction of his dissertation committee, and approved by all its members, has been presented to and accepted by the Director of Graduate and Professional Programs, in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY.

Date: December 18, 2002

Director

Dissertation Committee
Chair

Dedication

To Annie

Acknowledgements

My foremost gratitude goes to my advisor, Prof. Jean-Luc Gaudiot. Throughout the program, he has always been my last resort in many ways: academically, financially and spiritually. Whenever I was bewildered, he was always by my side, ready to give a helping hand. His incessant encouragement kept my life alive throughout the program.

I also want to thank Prof. Doug Ierardi and Prof. Massoud Pedram for kindly serving as members of the dissertation guidance committee. Without their reviews of the dissertation and priceless comments, this dissertation would not have been completed.

I want to thank all the colleagues I had during the program: Moez Ayed, Jim Burns, Yung-Syau Chen, Jerry Cheng, Halima Elnaga, Stephen Jenks, Dongsoo Kang, Jungyup Kang, Chinhyun Kim, Hiecheol Kim, Seong-Won Lee, Wen-Yen Lin, Wonwoo Ro, Hung-Yu Tseng, Jongwook Woo, Namhoon Yoo and Dae-Kyun Yoon. For many years, I really enjoyed discussions with these colleagues. In recent months especially, I enjoyed discussions with Seong-Won on the areas where our research interests overlap. I also deeply thank Wonwoo and Jungyup for the time and effort they kindly spent proofreading this dissertation and giving valuable comments.

I cannot thank Mindspeed Technologies (formerly Conexant Systems Inc.) enough for its Educational Reimbursement Program, which financially helped me successfully complete the program. While I was working at Mindspeed Technologies, Keith Bindloss, Alan Taylor and Dan Pettyjohn helped me greatly in many ways, especially by allowing me to rearrange my work hours so that I could invest time in this dissertation.
Without their encouragement and interest, which kept me awake, this dissertation would not have been what it is.

Last but not least, I cannot thank my parents enough for their selfless support in so many ways over so many years. Without their support, this dissertation could not even have been started.

Contents

Dedication
Acknowledgements
List of Tables
List of Figures
Abstract

1 Introduction
  1.1 Simultaneous Multithreading
  1.2 Performance of SMT Processors
  1.3 Research Goals
  1.4 Contributions
  1.5 Organization of Dissertation

2 Background Research
  2.1 Taxonomy of Architectural Technologies
    2.1.1 Main Architectural Technologies
    2.1.2 Performance Enhancement Technologies
  2.2 Simultaneous Multithreading (SMT) and Chip Multiprocessing (CMP) Architectures
    2.2.1 Chip Multiprocessing (CMP)
    2.2.2 Simultaneous Multithreading
  2.3 Related Work
    2.3.1 Hardware Thread Scheduling for SMT
    2.3.2 Software-Oriented Processor Control
    2.3.3 OS-level Job Scheduling for SMT Processors

3 Adaptive Dynamic Thread Scheduling
  3.1 Overview
  3.2 Hardware Thread Scheduling in SMT
  3.3 Adaptively Switching Fetch Policies
  3.4 Elements of the Adaptive Dynamic Thread Scheduling with the Detector Thread
    3.4.1 Per-Thread Status Indicators
    3.4.2 Thread Control Flags and Thread Selection Unit
    3.4.3 Detector Thread
    3.4.4 Why Detector Thread?
  3.5 Implementation of Adaptive Dynamic Thread Scheduling with the Detector Thread
    3.5.1 Software Overhead of Adaptive Dynamic Thread Scheduling with the Detector Thread
    3.5.2 Fetching Instructions from the Detector Thread

4 Detector Thread
  4.1 Overview
  4.2 The Detector Thread
  4.3 Software Architecture of the Detector Thread
    4.3.1 Pseudo Code of the Detector Thread
    4.3.2 Determination of Threshold Values
    4.3.3 Determination of Next Fetch Policy
      4.3.3.1 Underlying Premises
      4.3.3.2 Type 1 Heuristic
      4.3.3.3 Type 2 Heuristic
      4.3.3.4 Type 3 Heuristic
      4.3.3.5 Type 4 Heuristic
    4.3.4 Identifying Clogging Threads
    4.3.5 Changing Dispatch Policies
  4.4 Applications of the Detector Thread
    4.4.1 Efficient Job Scheduling
    4.4.2 Low Power Consumption
    4.4.3 Program and Data Prefetch
    4.4.4 Speculation Control

5 Methodology
  5.1 Our Simulator
  5.2 Benchmark Applications and Their Combinations
  5.3 Fetch Policies Modeled in Simulation
  5.4 Modeling the Detector Thread's Behavior

6 Experimental Results
  6.1 The Need for Adaptive Dynamic Thread Scheduling in Simultaneous Multithreading
    6.1.1 Performance of Various Fetch Policies
    6.1.2 Pseudo Fetch Policy - Max
    6.1.3 Accuracy of the Pseudo Fetch Policy Max
  6.2 The Optimal Threshold Values and Policy Determination Heuristics

7 Conclusions

8 Future Work

Reference List

List of Tables

2.1 Comparison of hardware space complexity [7]
5.1 Simulator configuration
5.2 Various combinations of applications
5.3 Various fetch policies tested

List of Figures

1.1 Different combinations of threads
1.2 More frequent thread control opportunities available in the SMT architecture
2.1 SSMT organization [13]
3.1 When does ICOUNT not work well?
3.2 Detector thread hardware
3.3 Hardware implementation of the detector thread
3.4 Fetching from the normal threads and the detector thread
4.1 States a thread can assume in an SMT processor
4.2 The hardware thread scheduling works on the jobs that have been selected by the system job scheduler
4.3 A pool of jobs from which the job scheduler selects every scheduling quantum
4.4 Software architecture of the detector thread
4.5 The framework of a detector thread in pseudo code (abridged)
4.6 How does the threshold value affect the number and quality of switchings?
4.7 Type 1 heuristic for determination of a new fetch policy
4.8 Type 2 heuristic for determination of a new fetch policy
4.9 Type 3 heuristic for determination of a new fetch policy
5.1 Pipeline stages modeled in SimpleScalar and SimpleSMT
6.1 How often does a fetch policy become the best performer?
6.2 How often does a fetch policy become the worst performer?
6.3 Throughput of the virtual fetch policy Max in each interval
6.4 IPC of Max and average IPC of various other fetch policies
6.5 Effect of the threshold value on switch occurrence and quality
6.6 Effect of the threshold value and policy determination heuristic on throughput (average of all mixes)
6.7 Effect of the threshold value and policy determination heuristic on throughput (for the hiI-hiM-f mix only)
6.8 Effect of the threshold value and policy determination heuristic on throughput (for the mxI-mxM-mx mix only)

Abstract

Simultaneous Multithreading (SMT) attempts to attain high resource utilization by allowing instructions from multiple independent threads to coexist in a processor and compete for shared resources. Earlier studies, however, have shown that the performance of a realistic SMT architecture begins to saturate as the number of hardware contexts increases beyond four or five.

In this work, we attempt to prove our contention that a single fixed hardware thread scheduling strategy cannot provide optimal results for the various unpredictable thread combinations it faces. We propose an approach that partially schedules threads in software, in the form of a detector thread, at nominal hardware and software cost. Our approach offers the capability to adaptively switch thread scheduling policies depending on situations that are likely to vary constantly in a hard-to-predict manner.
We show that there is much room for performance improvement for the adaptive dynamic thread scheduling approach. The results we have obtained by simulating a realistic SMT architecture show that no single fetch policy is the best solution for more than 50% of the total time. We show that significant performance improvement for SMT with eight contexts can be attained by allowing switching of fetch policies. We propose a software architecture for the detector thread and evaluate various heuristics for determining better fetch policies, which constitute the core function of the software. We found that, under a restrictive configuration, our approach could outperform a fixed scheduling policy by up to 30%.

Chapter 1
Introduction

1.1 Simultaneous Multithreading

Simultaneous Multithreading (SMT), or the multithreaded superscalar architecture [20, 54, 88, 87, 27, 36], was introduced to remedy the limited utilization of wide-issue superscalar architectures. Increasing the instruction issue rate by widening the processor pipelines could not attain high utilization, even when advanced technologies such as non-blocking caches and intelligent branch prediction are employed, mainly because of the limited instruction-level parallelism inherent in applications. As shown in [91, 50, 92], most applications by themselves seldom offer inherent instruction-level parallelism (ILP) of more than 7 instructions per cycle, even with the most advanced techniques and the most optimistic assumptions. It was shown that exploiting thread-level parallelism (TLP) on top of a multiple-issue platform, in the form of the SMT architecture, can solve the problem of limited instruction-level parallelism by filling the pipeline slots that would otherwise go unused with instructions from multiple independent threads [88].

Tullsen et al. explained under-utilization using two terms: vertical and horizontal waste. Vertical waste is generated in the pipelines of a superscalar processor when slots are left unused because an entire stage becomes idle without accepting any instructions. For instance, when no instructions can be fetched at all, or no instructions can be allocated to functional units for execution in a cycle, one cycle's worth of vertical waste results. Horizontal waste occurs when a pipeline stage is only partially allocated. For instance, if only five instructions are issued to the instruction queues while the maximum capacity is eight instructions, we incur three instructions' worth of horizontal waste. The two main causes of vertical and horizontal waste are inter-instruction dependencies (data- or resource-related) and long-latency operations [88].
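As a rough illustration of how these two kinds of waste could be tallied in a cycle-level simulator, the sketch below counts them for an 8-wide issue stage. The counters and the function are our own illustrative assumptions; they are not taken from the simulator used in this dissertation.

    /* Hypothetical per-cycle accounting of issue-stage waste for an
     * 8-wide machine.  "issued" is the number of instructions issued
     * in the current cycle. */
    #define ISSUE_WIDTH 8

    static unsigned long vertical_waste;    /* slots lost in fully idle cycles        */
    static unsigned long horizontal_waste;  /* slots left empty in partially used cycles */

    void account_issue_waste(int issued)
    {
        if (issued == 0)
            vertical_waste += ISSUE_WIDTH;            /* whole stage idle: vertical waste   */
        else
            horizontal_waste += ISSUE_WIDTH - issued; /* stage partially filled: horizontal */
    }

With this accounting, the example above (five instructions issued out of a possible eight) adds three slots of horizontal waste, while a cycle in which nothing issues adds eight slots of vertical waste.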
With multiple threads available, the vertically-wasted slots can be filled with instructions from other threads that are independent of the instructions of interest; that is, TLP is exploited to eliminate vertical waste in the pipelines. Resource conflicts and latencies can cause vertical pipeline waste as well, and they too can be hidden by exploiting TLP. Horizontal pipeline waste appears when fetch fragmentation occurs due to conditional branches and part of the fetched instructions are never executed. With multiple threads available, such horizontal waste can also be eliminated by filling the unused slots with instructions from another independent thread. Consequently, an SMT processor can sustain higher utilization than a wide-issue superscalar [77, 88].

It was shown that the PowerPC 620 reached a throughput of 0.96 to 1.77 IPC (instructions per cycle) [18] with a maximum issue rate of 4. Tullsen et al. [88] showed that even an 8-issue Alpha-like superscalar architecture failed to sustain 1.5 instructions per cycle. On the other hand, they also showed that with simultaneous multithreading, where multiple contexts are supported and instructions from multiple threads may be issued to the instruction queues in the same cycle, a maximum of 2.5 times the throughput of an equivalent 8-issue superscalar architecture was achieved. The increase in utilization comes at the cost of added hardware circuitry: multiple register files, program counters, a per-thread retirement mechanism and additional wiring are needed on the processor. In each cycle, instructions can be selected from multiple threads to fill pipeline slots.

The low resource utilization of wide-issue superscalar processors comes from functional unit and memory access latencies, inter-instruction dependencies, resource conflicts, and mispredicted branches. The SMT architecture can, without the heavy overhead of context switching, eliminate stalls incurred by the instructions inside a thread by issuing instructions from other processor-resident threads to fill idle slots in the wide-issue pipelines. Because threads are independent, the only interactions between instructions from different threads are resource conflicts, where two or more instructions need the same resource.

The SMT architecture retains the code compatibility advantage of the superscalar architecture. Of course, compiler optimization targeted at the SMT architecture should boost the performance of applications originally developed for the base superscalar architecture with a common instruction set architecture. We view the SMT architecture as a platform where multiple independent applications can be executed simultaneously to attain higher throughput; it is not designed to parallelize or multithread a single application for higher performance. Consequently, our work does not rely heavily on the success of compiler technology, though compilers can be helpful. For parallel applications consisting of threads that belong to a single multithreaded program, the performance of the whole application can be greatly affected by the quality of the programming or the compiler. However, our work is mostly focused on multiprogrammed or multi-user environments, where the combinations of threads that an SMT processor faces vary significantly over time (e.g., from a uniform combination of compute-intensive applications to a "colorful" mixture of compute-intensive and memory-intensive applications). The compiler or the programmer of each application is not prescient of what other applications it will be executed with, or of how it will interact with them.
As a result, it becomes indispensable to adopt a more intelligent and more dynamic thread scheduling capability in order to sustain high throughput.

The properties of multiprogrammed and multithreaded workloads are very different. The goal for a multithreaded workload is to achieve higher performance for a single application; the goal for a multiprogrammed workload is to achieve higher throughput [23]. If a group of threads remains together for the whole lifetime of each thread in the group, simply achieving high throughput is not enough, because periods of under-utilization will appear during that lifetime anyway. However, for a multi-user system where scores of jobs are waiting to be executed, or a web server in which thousands of threads are created and killed every minute, improving throughput is directly related to performance, because more user requests or more jobs complete in a shorter time, resulting in better "satisfaction."

An SMT processor will not face only entirely multiprogrammed workloads (Figure 1.1e) or entirely multithreaded workloads (Figure 1.1a). As shown in Figure 1.1b-d, there can be mixtures of multithreaded applications and independent threads. That is, when the number of contexts required to run a multithreaded application is smaller than the number available, new multithreaded applications or independent applications can occupy the remaining contexts. This implies that static scheduling alone is not sufficient.

[Figure 1.1: Different combinations of threads. For eight hardware contexts, the thread groupings range from (a) one application with eight threads, through (b) 4+4, (c) 3+3+2 and (d) 2+2+3+1, to (e) eight independent single-threaded applications.]

Effectively handling independent threads is increasingly important, and not just in the context of SMT. Multiple independent threads are easier to find now than in the past: even on a small PC, multiple independent threads are readily available with the advent of operating systems that support multitasking and multithreading. On web server hosts, threads (or processes) are generated as servlets by a web server at a rate of hundreds or even thousands per minute [37]. Even on a small PC, a user may browse the internet while listening to music and running a spreadsheet while her email software periodically checks for new mail; in the meantime, the operating system may keep checking the battery and I/O activity.

When the set of applications running on an SMT processor happens to consist exclusively of memory-intensive threads, even an SMT processor can experience low throughput as long as the set does not change. SMT tolerates long memory access latencies only up to a limit: if all threads are waiting for data to arrive, there is nothing more SMT can do to hide the latency. We believe there is a way to sustain high throughput by carefully selecting which threads to fetch from during the next cycle; in the previous example, instructions from threads with fewer memory accesses can be chosen first. Alternatively, the system job scheduler can be notified so that a different set of threads is brought into the processor.
1.2 Performance of SMT Processors

Studies have shown that when the number of threads allowed in a processor grows beyond four, performance saturates, and in some cases even degrades, because of limitations in the shared instruction queue, fetch throughput, or contention in the L2 cache [87, 35]. These studies attempted to overcome the saturation by finding a better fetch mechanism or by increasing the number and availability of the resources that would otherwise become bottlenecks (such as register files and instruction queues). It was also shown that increasing the size of the caches can help delay the saturation point in terms of the number of threads.

Unfortunately, such remedies do not work in all cases, because their effectiveness depends mostly on the properties of the application mixture. For instance, increasing the size of the caches cannot help when the set of applications currently in the system does not make frequent memory accesses or has too large a working set. Likewise, giving priority to the threads that had fewer branches during prior execution might not be sufficient to resolve the bottleneck when the branches of the applications running in the system are not hard to predict, which is often the case for applications dominated by loops.

We believe that a single fixed thread scheduling policy that performs better than others "on average" cannot deliver the performance we desire in SMT architectures that support more than four thread contexts. Instead, we claim that adding adaptive dynamic thread scheduling may be the only way to significantly improve the performance of SMT processors for multiprogrammed workloads and to prevent their performance from saturating or degrading as the number of threads increases.

Figure 1.2 shows why an adaptive thread scheduling approach is more feasible in SMT. The arrows point to the moments at which the processor can decide which threads to choose for fetch or dispatch. As the figure shows, in SMT these decisions can be made every cycle. This implies that the SMT architecture has more windows of opportunity in which effort can be invested to improve performance. In an SMT processor, at critical points throughout the pipeline, the threads whose instructions proceed to the next stage must be selected. Something similar happens in a single-threaded processor when it is interrupted while executing a process: a system process is eventually brought into the processor and attempts to make the best selection among the ready processes [5]. In a single-threaded processor this selection happens only at context-switch boundaries, whereas in an SMT architecture it can be made every cycle.

[Figure 1.2: More frequent thread control opportunities available in the SMT architecture, comparing multithreading with a single context, multithreading with multiple contexts, and SMT.]

With adaptive dynamic thread scheduling, when a change in the system environment is detected, the fetch policy for the next interval is decided and put into effect. However, implementing multiple fetch policies and decision-making algorithms in hardware could entail high overhead.
In this dissertation, we propose a detector thread approach that lowers the hardware requirement and makes use of unused pipeline slots to run the decision-making algorithms and fetch policies. This approach has the additional advantage that thread scheduling can be modified even after the chip has been produced, because the detector thread is programmable. The detector thread can also help alleviate the overhead of the system job scheduler by shortening the scheduler's stay in the processor and analyzing information before the job scheduler needs it.

1.3 Research Goals

We agree with the claim [20] that SMT will be one of the major platforms for fully exploiting tomorrow's billion-transistor real estate. One major issue, though, as mentioned above, is that as the number of threads that can run simultaneously in an SMT processor increases, performance saturates and employing more contexts returns no further advantage. Observed saturation points are usually between four and six threads [76, 63]. As stated earlier, our contention is that this saturation occurs because only one fixed hardware thread scheduling policy is used, no matter what kinds of application sets coexist in the SMT processor.

ICOUNT [87] superseded all other fetch policies in average performance, and it has been used as the single fixed hardware thread scheduling standard. It is true that, in general, ICOUNT best accounts for what is going on in SMT pipelines. Since it gives priority to the threads that have fewer instructions in the early stages of the pipeline (the decode and rename stages and the instruction queues), the instruction window is used in a balanced manner. Also, since it gives more opportunities to the threads whose instructions drain through the pipeline more rapidly, the pipeline is used more efficiently [87].

Put another way, ICOUNT gives lower priority to the threads with more instructions in the pipeline. As the number of in-flight instructions belonging to a thread increases, the probability that an instruction will find instructions it depends on increases exponentially. Consequently, it is imperative that each thread resident in an SMT processor keep as few instructions in flight as possible, so that its instructions have a better chance of being surrounded by independent instructions. The extreme case would be a single thread occupying the whole pipeline and exploiting zero thread-level parallelism.

However, it has also been observed that there are cases where alternative fetch policies work better than ICOUNT. That is why Tullsen et al. stated that a "weighted combination" of ICOUNT and BRCOUNT might work best; they were, however, concerned that it would increase hardware complexity. While ICOUNT is the scheduling policy that works best on average and in a collective manner, it does not address problems as directly as other policies such as BRCOUNT and MISSCOUNT do. Suppose the set of applications in the SMT processor at a given moment consists of four control-intensive applications and four other applications, and that the four control-intensive applications (with many conditional branches) are currently experiencing high branch misprediction rates.
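As a minimal sketch of the count-based priorities discussed here, the following shows how an ICOUNT-style selection could be expressed. The structure, field names and function are illustrative assumptions of ours and are not taken from the SMT implementations cited above; a BRCOUNT-style policy would be identical except that the comparison is keyed on the per-thread count of unresolved branches.

    /* Illustrative per-thread counters an SMT front end might maintain. */
    struct thread_state {
        int active;    /* thread currently holds a hardware context                 */
        int icount;    /* instructions in decode, rename and the instruction queues */
        int brcount;   /* unresolved conditional branches in the pipeline           */
    };

    /* ICOUNT: prefer the active thread with the fewest in-flight
     * instructions; for BRCOUNT, compare brcount instead of icount. */
    int pick_thread_icount(const struct thread_state t[], int nthreads)
    {
        int best = -1;
        for (int i = 0; i < nthreads; i++) {
            if (!t[i].active)
                continue;
            if (best < 0 || t[i].icount < t[best].icount)
                best = i;
        }
        return best;   /* -1 if no thread is ready to fetch */
    }

In the scenario continued below, a BRCOUNT-style comparison would deprioritize the four control-intensive threads and thereby throttle their wrong-path fetches.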
Then, the processor will suffer from slots wasted on wrong-path instructions of the control-intensive applications, while the other four threads are inhibited from exploiting the pipeline resources. In this specific case, if BRCOUNT (or even BPMISSCOUNT) had been used, the four control-intensive threads would get fewer chances to be fetched. Consequently, the number of instructions from the control-intensive threads would decrease while the number of instructions from the other four threads would increase, resulting in balanced use of the instruction window and other resources. In terms of instruction counts there may be an imbalance if BRCOUNT is used; however, the effect of imbalance in branch counts may be more significant than the effect of imbalance in instruction counts, and if that is the case, BRCOUNT will perform better than ICOUNT.

The first research goal is to find out whether there is room for improvement over the ICOUNT fetch policy, which is known to perform better on average than all other fetch policies. If little room were available for improvement, our approach would not be able to enhance performance. Chapter 6 shows that sufficient room (about 30%) is available for performance improvement.

The second goal is to investigate the issues involved in implementing the adaptive dynamic thread scheduling with the detector thread, in order to validate the idea. We also evaluate the cycle budget to see whether enough idle slots are available for executing the detector thread's instructions, and we design the software framework for the detector thread on that basis.

The next goal is to evaluate the heuristics for determining a new fetch policy for the next scheduling quantum upon detection of low throughput.

We use the SimpleSMT simulator for realistic simulation of the SMT architecture [52]. Applications from SPEC 2000 are used as our workloads. For multiprogrammed workloads, we form thirteen different mixtures (each with eight applications) based on three factors: single-application throughput, memory footprint, and floating-point versus integer.

1.4 Contributions

We present a novel idea of adaptively switching fetch policies depending on varying system states. We challenge the idea of having one fixed fetch policy that outperforms the alternatives on average, and we show that in SMT such adaptive scheduling can be done at low hardware and software cost. With this idea, we propose a way to delay the saturation point in terms of the number of threads. Using an assisting thread to help improve the performance of multiprogrammed workloads is a new approach; earlier studies focused mainly on improving the performance of a single application by helping it with additional helper threads.

1.5 Organization of Dissertation

This dissertation is organized as follows. In Chapter 2, the main architectural technologies and performance enhancement technologies are surveyed, and related work is presented. The idea of adaptive dynamic thread scheduling is presented in Chapter 3. Details of the detector thread software and the heuristics it exploits are presented in Chapter 4.
Our evaluation methodology and how we formed our workloads from SPEC 2000 are explained in Chapter 5. The presentation and analysis of experimental results are covered in Chapter 6, conclusions are given in Chapter 7, and future work is discussed in Chapter 8.

Chapter 2
Background Research

2.1 Taxonomy of Architectural Technologies

In this section, we review existing technologies and ideas in computer architecture in order to position our work within the field. One of the major goals of recent computer architecture research is how to efficiently exploit the growing on-chip real estate, estimated to reach the level of one billion transistors by 2010 [69, 46]. We classify the technologies for achieving this goal into two groups: main architectural technologies and performance enhancement technologies.

2.1.1 Main Architectural Technologies

Main architectural technologies are those that require a major overhaul of most aspects of a system, including hardware design, the operating system, the compiler and programming languages. The technologies that fall into this category include the wide-issue superscalar, VLIW, fine-grain multithreading, CMP (chip multiprocessing) and SMT (simultaneous multithreading).

The wide-issue superscalar architecture is simply an effort to extend existing superscalar architectures to a higher issue rate, such as six, eight or more instructions per cycle. For this effort to be successful, a larger number of functional units and a more efficient instruction scheduling design are necessary. Even though this architecture preserves binary compatibility between architecture generations, advances in the compiler are required to exploit the extended power available in the widened pipelines.

VLIW [22, 14, 61, 19, 49, 16] architectures attempt to extract instruction-level parallelism using a compiler. The compiler schedules instructions statically while taking inter-instruction dependencies into account. One of the VLIW architecture's advantages is its relatively low hardware requirement, because instruction scheduling is mostly done by the compiler. Another advantage is that it allows low-power designs and higher clock rates, because it demands lower hardware complexity.

Multithreaded or multiple-context architectures are comprehensively surveyed and explained in [16, 90]. One of the most interesting fine-grained interleaved multithreading architectures is the Cray MTA, which evolved from the earlier multithreaded architectures HEP and Horizon [90, 1, 85, 78]. It was designed to combine pipelined VLIW and interleaved multithreading. Each VLIW packet consists of three operations, and 128 threads are supported to hide latencies of up to 128 cycles. Each of the three operations must use one of three units: memory, arithmetic/logical, or arithmetic/logical/branch.

Both the SMT architecture and chip multiprocessors (CMPs) attempt to get around the problem of limited instruction-level parallelism by introducing thread-level parallelism. The main difference lies in the level of sharing: an SMT processor allows all threads to share most resources, while a chip multiprocessor isolates most resources within each of its processing elements.
An SMT processor will perform better than a CMP system when a "massively parallel" application is running, because it has more resources that can be exploited by a single thread. A CMP has the advantage that it requires less global wiring between contexts, and thus it is easier to manufacture with low power consumption and higher clock rates. Studies in this area are discussed in Section 2.2.

One study surveyed and classified various SMT architectures [47]. It proposed a novel classification approach based on three parameters: the number of threads that can be fetched in a cycle, the number of instructions available for issue, and the number of operations per instruction. It compared processor coupling [41], the aforementioned multithreaded superscalar architecture proposed by Hirata et al. [36], the LCM architecture proposed in [26], the M-Machine [21] and the SMT architecture [88, 87]. The study concluded that higher throughput can be achieved by increasing the number of threads that can be fetched at a time, the number of instructions per thread, or both. The classification method clearly identified differences between the various multithreaded architectures.

2.1.2 Performance Enhancement Technologies

Performance enhancement technologies do not require significant effort in almost every aspect of a system; they often require effort in only one or two aspects. For example, the cache hierarchy is classified in this group because it does not "directly" affect compiler or operating system issues and can be treated entirely as a hardware design issue, although it can be assisted by the compiler or the operating system for better performance. In addition to better caches, branch prediction, trace caches, value prediction and processor-in-memory (PIM) belong to this category.

PIM techniques were introduced to address the "processor-memory performance gap," or "memory wall," by putting the memory closer to the processing core. By physically placing the processor adjacent to the memory on the same chip, the approach attempts to deliver high memory bandwidth and reduced memory latency compared to the conventional processor-memory relationship. Though the idea of putting logic and memory into one chip is not new, until recently the fabrication technology was not ready to allow acceptable yield while meeting the desired combination of speed and memory density [8].

PIM techniques have gained tremendous attention from data-intensive domains such as multimedia, not only for their latency and bandwidth advantages but also because these applications require the capacity to handle large volumes of data [17]. Multiple efforts have been made to develop next-generation computer architectures using PIM techniques. Patterson et al. proposed the IRAM architecture [71]. Chong et al. proposed the RADram architecture using Active Pages [64]. Kogge et al. proposed another PIM architecture for the HTMT (Hybrid Technology Multithreaded) machine [45]. Torrellas et al. proposed the FlexRAM architecture [40].
The idea behind all the PIM architectures, as opposed to memory-in-processor architectures, is that the on-chip memory capacity is greatly increased by using DRAM technology instead of much less dense SRAM memory cells [46]. Problems with PIM approaches include the area and power cost of increasing bandwidth to the DRAM core, and applications whose footprint exceeds the size of the allocated DRAM. As the technology advances, PIM architectures could become one of the more attractive performance enhancement technologies. For optimized performance, though, compiler technologies specialized for specific PIM architectures seem necessary. PIM technology combined with a multiple-context processing core might also be a powerful platform for high-performance computing in the future.

Value prediction is another idea proposed to exploit the additional die space that will become available in the near future. It is based on value locality: the same storage location is very likely to hold the same data value. Under a prediction of a certain value, value prediction can break data dependencies, allowing a stalled instruction that cannot otherwise advance to proceed, resulting in more parallelism between instructions. Several types of value predictors have been proposed: last-value prediction [53], stride prediction [25], context-based prediction, and hybrid models [11, 12]. A compiler-based approach was also proposed in [51].

The trace cache [72, 73, 68] is based on the idea that a certain execution order of the instruction stream is repeated frequently. A structure called the trace cache is designed to hold traces of instructions and help fetch the instruction stream. Although program flow accesses instructions from different locations, the trace cache can capture the instruction access pattern so that instructions can be fetched at a high rate.

2.2 Simultaneous Multithreading (SMT) and Chip Multiprocessing (CMP) Architectures

The wide-issue superscalar architecture relies on technologies [80, 81, 15] that enable high-speed instruction fetch in the presence of instruction cache access latencies caused by missed blocks, using multi-ported, interleaved non-blocking caches. Because of the limited inherent instruction-level parallelism of applications [91, 50, 92], raising the issue rate alone yields diminishing returns. The solution to this problem is to exploit thread-level parallelism, and there are two ways to accomplish that: simultaneous multithreading and on-chip multiprocessing.

2.2.1 Chip Multiprocessing (CMP)

Nayfeh et al. [62] developed a multiprocessor architecture in which resources are partitioned between processors on a single chip or an MCM (multichip module). They showed that a configuration of two processors per cluster with a smaller cache performed better and was more cost-effective than a single processor with a large cache. They also showed that MCM packaging techniques allow the performance of a four-cluster configuration to scale linearly as processors are added to each cluster. Hammond et al. [30, 31] developed the Hydra CMP, which contained four MIPS-based processors with on-chip L1 caches and a shared L2 cache. For effective parallel programming, Hydra supported thread-level speculation and memory
For effec tive parallel programming, Hydra supported thread-level speculation and memory 21 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. renaming. However, to fully utilize the processors, huge programming efforts seem necessary. The study justified that CMP is a better alternative to wide-issue su perscalar architecture because of implementation and performance advantages of CMP. 2.2.2 Sim ultaneous M u ltith read in g Hirata, et al. [36] proposed a simultaneous instruction issuing multithreaded proces sor architecture which is similar to SMT architecture. They showed that functional units can be efficiently utilized in conventional RISC processors if instructions are issued from different threads. Also, by developing their own static code scheduling technique, they showed that functional unit conflicts between single loop executions can be efficiently handled. They developed another scheme that may be used to parallelize loops that are difficult to parallelize in vector or VLIW architectures. They used ray-tracing and radiosity applications that are commonly used for gen erating realistic images. They used these applications because they have inherent coarse-grained parallelism and can readily generate many threads that can be si multaneously executed by their multithreaded superscalar processor. This study was mainly concerned with improving performance with loop parallelization. Loikkanen, et al. [57, 27] evaluated their fine-grain multithreading paradigm and unique multithreaded superscalar architecture. They showed their approach increases instruction-level parallelism and long-latency operations are hidden by 22 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. it. Their architecture model is similar to ours and they also employed centralized instruction window. They call their superscalar architecture a “superscalar digital signal processor” . Hily and Seznec [34] proposed three branch prediction strategies for simultane ous multithreading. They showed that the simple 2-bit predictor behaves increas ingly poorly as the number of threads increases. The result of their simulation study is as follows. First, the sizes of tables (pattern history table/branch target buffer) depend on the number of active threads with a beneficial sharing effect in a mul tiprogramming environment according to branch predictors. Second, the smaller the sizes of the tables are, the higher the misprediction rate becomes because of conflicts in the BTB (branch target buffer) in multiprogramming or parallel pro cessing. Third, they showed 12-deep return address stack per thread enhances the accuracy of branch prediction. They used several applications found in SPLASH2 benchmark suite as their workloads. In another study, they pointed out that L2 contention may limit the performance of SMT processors as the number of threads increases beyond four [35]. Another study evaluated SMT while varying its issue bandwidth and thread contexts to find out its saturation. The study found out that adding threads above a four-issue bandwidth would give only a marginal return [77]. The work also compared the multithreaded superscalar architecture with a multiprocessor chip 23 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. and found out that the multithreaded superscalar architecture showed 1.8 times speed up over the multiprocessor chip. Lo, et al. 
[55] proposed guidelines for improving the performance of programs executed on SMT architecture by applying three compiler optimization methods: Ioop-iteration scheduling, software speculative execution, and loop tiling. The re sults show that threads should be cycle-based parallelized, that software speculative execution should not be applied to non-loop programs, and that exact sizing of tiles for matching cache sizes need not be concerned. Bekerman, et al. [7 ] have investigated how hardware complexity differs in three different architectures, superscalar, SMT and CMP architectures as shown in Ta ble 2.1. The figures were obtained using their own architecture models and cannot be directly apply to other models. Also, for SMT they covered a limited set of con figurations where each thread is statically assigned I / T fetch bandwidth where I represents the total issue bandwidth of the processor and T represents the number of threads supported by the processor. In this kind of model, resource under utilization occurs when the number of threads is less than the number of threads supported by the processor. For instance, say, the processor supports 4 threads and the fetch bandwidth is 8, then in this model, each thread is statically assigned only 2 fetch bandwidth and when only 3 threads are available, 2 fetch slots are wasted. This table shows that an SMT processor needs far more hardware complexity in scheduling logic than a CMP processor does. 24 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Superscalar CMP SMT Number of Threads 1 T T Instruction Window N T(N/T) T(N/T) Fetch Bandwidth I T(I/T) T{I/T) Number of FU’s F F F Decode I Til jT) T{I/T) Rename Registers P T { I/T f T(I/T)'2 Scheduling Logic F N 4 T(F/T){N/TY FN 4 Commit I T (//T ) T(I/T) Table 2.1: Comparison of hardware space complexity [7 ] Burns et. al. [10] estimated that SMT with small-to-moderate area overhead can achieve up to 142% increase in throughput. The baseline architecture used in the study was MIPS R10000. The study went through comprehensive analysis of area complexity caused by extending the baseline architecture for SMT. Krishinan et. al. [48] examined multithreaded superscalar processors for automatically paral lelized and explicitly parallelized applications in a shared-memory multiprocessor environment. They have shown that an SMT-based multiprocessors are the most cost effective for parallel applications. They showed that SMT-based multiproces sor system has higher performance and is more robust in low-performing memory systems by comparing it to a conventional superscalar multiprocessor system where processing elements are small-scale superscalar processors. Also they showed that although CM P’s are simpler, a plain SMT-based system has higher performance and is a better option for workload with sequential applications. Another thing they found is that through optimizing hardware support in synchronization performance, SMT-based multiprocessors can perform faster than CM P’s. 25 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 2.3 R elated Works 2.3.1 H ardware T hread Scheduling for SM T Tullsen, et al. [88] compared several SMT models to a wide superscalar, a fine-grain multithreaded processor, and single-chip multiple-issue multiprocessing architec tures. The study showed that SMT has 4 times the throughput of a superscalar, and double that of fine-grain multithreading. 
Interestingly, they compared SMT to an on-chip multiprocessor system with similar degree of resources and showed that SMT outperforms. Alternative organizations in design space were explored according to complexities in design. In their other study [87], they showed that their 8-thread/8-issue SMT achieved up to a throughput of 5.4 instructions per cycle which is 2.5-fold improvement over a superscalar with similar resources. In this study, they evaluated various fetch partition policies. Also, thread selection schemes were explored. The schemes include BRCOUNT, MISSCOUNT, ICOUNT and IQPOSN where the priority is given to threads with fewer branch instructions in the pipeline, threads with fewer outstanding D-cache misses, threads with fewer instructions (or faster-moving threads) in decode, rename and centralized instruc tion queues and newer instructions in the instruction queue, respectively. 26 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 2.3.2 Softw are-O riented P rocessor C ontrol Our detector thread approach has one thing in common with the technologies used in the Code Morphing software for Crusoe™ processor [43]. Both depend on hardware and software and that the hardware-related operations are programmable and one process (Code Morphing Software or Detector Thread) is always active in the processor. However, the Crusoe software is a single-context processor and the Code Morphing software does all the work. Similarly, our detector thread just takes care of thread scheduling by having an effect on dynamic decision on fetch policies. Because the detector thread is inside the SMT, all other normal threads can be active at the same time. Another study that investigated the use of a special thread is speculative pre- computation [95] which aims at realizing speculative precomputation in one of the two threads available on the Hyper-Threading architecture. The study is targeted at improving performance of single-threaded applications on two-context SMT pro cessors. Detector thread idea itself is not entirely new. DanSoft [29] proposed the idea of nanothreads in which one of nanothreads is given the control of the processor upon the stall of a main thread. The idea was based on a CMP with dual VLIW single-threaded cores and its success hinges on the effectiveness of the compiler. Assisted Execution [82] extended the nanothread idea for architectures that allow simultaneous execution of multiple threads including SMT. It attem pts to improve 27 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. performance of a main thread by having multiple nanothreads perform prefetch and its success also hinges on compiler. Simultaneous Subordinate Microthreading (SSMT) [13] was proposed to at tem pt to improve performance of a single primary thread by having multiple sub ordinate microthreads do useful works such as running sophisticated branch predi cation algorithm. The idea was not based on SMT architecture and also requires effective compiler technology. However, their idea about implementation of the microthread was very relevant with implementation of our detector thread and our proposed implementation is greatly inspired by SSMT’s organization (Figure 2.1). Speculative data-driven multithreading [74] takes advantage of a speculative thread, called a data-driven thread(DDT) to pre-execute critical computations and consume latency on behalf of the main thread on SMT. 
This study also was fo cusing on improving the performance of a main thread. Luk [58] also proposed pre-executing for more effective prefetch for hard-to-predict data addresses using idle threads to boost performance of a primary thread. 2.3.3 O S-level Job Scheduling for SM T processors Parekh et. al. [67] investigated issues related to job scheduling for SMT proces sors. They compared performance of oblivious and thread-sensitive scheduling. The study concluded that thread-sensitive IPC-based scheduling can achieve signifi cant speedup over round-robin methods. However, this study concerns system job 28 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 1 f F etc h 3uffer llContext uRAM uRA M S e q u e n c e r u T h re a d Buffer D e c o d e L ogic / R e n a m in g Logic --S^SSKaJDif- Spawn 8 Event R e s e rv a tio n S ta tio n s . ■ ■ FU FU • • • FU FU Figure 2.1: SSMT organization [13] scheduling and cannot be directly related to dynamic thread scheduling. Also, the job scheduler will have to be brought into the processor resulting in context switch of user threads. This job scheduler, however, can take advantage of our detector thread approach and it will be discussed in section 3. Another similar study [79] investigated job scheduling for SMT processors. The study proposed a job scheduling scheme called SOS where an overhead-free sample phase is involved where performance of various schedules (mixes) are sampled and 29 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. taken into account for the selection of tasks for the next time slice. This approach can also benefit from our approach because the detector thread will be always active and make use of unused pipeline slots and resources to find out what threads should not be selected in the next job scheduling time slice while lowering the burden of the job scheduler. A study to detect per-thread cache behavior using hardware counters and help job scheduling based on the information obtained on SMT was done by Suh et al. [83]. This approach is similar to our idea of relating the detector thread with job schedulers. However, it does not aim at controlling thread fetch policies. Our adaptive dynamic thread scheduling approach should not be confused with adaptive process scheduling [59] which addresses OS job scheduling issues for SMT processors: the goal of our approach is to offer more efficient thread scheduling at the level of instruction fetch in the SMT pipeline. The goal of adaptive process scheduling is to find the best co-scheduled set of applications from the perspective of the system job scheduler. 30 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Chapter 3 A daptive D ynam ic Thread Scheduling 3.1 Overview Simultaneous Multithreading offers new opportunities in hardware thread schedul ing. It allows to select threads in every cycle for fetching instructions. Furthermore, it allows to fetch instructions from multiple threads. In conventional architectures, choosing threads is not allowed as frequently. W ith a single-context processor, a new thread cannot be chosen until a resident thread is blocked and switched out even though it would result in less overhead in context switching if the thread is a light-weight process[3]. W ith a multiple-context processor, a new thread can be chosen when the active thread is blocked if the processor exploits blocked mu- tithreading. 
Alternatively, a new thread can be chosen every cycle if the processor exploits interleaved multithreading. In an SMT processor, the following must be determined in every single cycle:

• The threads from which instructions are fetched.
• How many instructions are fetched from each chosen thread.

We assume ICOUNT2.8 as the default fetch policy for an SMT processor. In this scheme, instructions are fetched from up to two threads to prevent fetch-block fragmentation. Up to eight instructions are fetched from the first thread, and instructions are then fetched from the second thread for the remaining bandwidth. This scheme, combined with the ICOUNT fetch policy, was shown to produce the highest throughput in [87], and their fetch mechanism was based upon the fetch optimization schemes proposed in [15].

As stated earlier, we claim that a single constant fetch policy that chooses a certain number of threads among the available candidates may not achieve the highest attainable throughput, even when the best-performing fetch policy, ICOUNT, is exploited. The reason is that the set of threads running on an SMT processor can constantly change, since the job scheduler of the operating system may dynamically change the mixes depending on ever-changing system states. ICOUNT was found to perform best when its performance was averaged over the long span of application lifetimes. However, if we look more closely at small time quanta, it is very likely that other fetch policies work better than ICOUNT in many cases. We will show this through simulation in Chapter 6.

For example, suppose there are four threads in an SMT processor. Two threads are experiencing a high degree of branch mispredictions and the other two are not. Since ICOUNT looks at the counts of each thread's instructions found in the decode and rename stages and the instruction queues, it cannot take into account the fact that the counts for the two threads experiencing heavy branch mispredictions are exaggerated. It will instead keep trying to maintain a similar number of instructions among the threads in the early stages of the pipeline. However, the two clogging threads will fetch many wrong-path instructions, wasting pipeline slots. In this kind of situation, using the BRCOUNT fetch policy should help sustain the throughput of the processor. Since BRCOUNT gives priority to the threads with fewer conditional branches, the two threads with fewer conditional branches are prioritized and given more chances to have their instructions fetched into the fetch buffer. Consequently, the number of wrong-path instructions will decrease and the two clogging threads will be throttled.

Let us take another example. This time, suppose there are two kinds of applications in an SMT processor. The first kind is a memory-hog application that requires frequent memory accesses, shown as thread B in Figure 3.1(a). The applications of the second kind (A, C and D) have a relatively small number of memory accesses. The figure shows four threads (A, B, C and D), each with sixteen instructions. A small rectangle with the thread identification letter inside represents an instruction. The black rectangles represent load instructions while the white ones represent non-memory instructions.
Assume that the memory accesses in thread B result in cache misses. Then applying the ICOUNT fetch policy "fairly" to the four threads will cause "unfairness" in scheduling. In addition, assume that the instruction queues can contain sixteen instructions and that the instructions contained in the box with the dashed outline are the ones found in the instruction queues of our SMT processor.

Figure 3.1: When does ICOUNT not work well? (a) A snapshot of the instruction window for ICOUNT. (b) A snapshot of the instruction window for MISSCOUNT.

Using ICOUNT, four instructions from each thread are found in the instruction queues. However, the instructions from B are all load instructions that miss the data cache. If the four instructions all reference different blocks in the cache, they will generate four consecutive data cache misses. While a miss is pending, it occupies valuable places in the instruction queues. If those places were allocated to other threads' instructions, they would be used more efficiently, sustaining the throughput at a higher level. That is shown in Figure 3.1(b). Here, we use the L1MISSCOUNT fetch policy instead of ICOUNT. Under this policy, the number of pending L1 misses is taken into account and the threads with fewer pending L1 misses are fetched first over others. If the SMT processor is four-issue and no stalls are generated by non-memory instructions in this specific example, then for the instructions contained in the sixteen-instruction window we achieve a throughput of 3.5 IPC (14/4) with L1MISSCOUNT, whereas we achieve 2.75 IPC (11/4) with ICOUNT.

Adaptively switching fetch policies after detecting changes in the system state helps sustain throughput, though it sacrifices the performance of the clogging threads while they are clogging the pipeline. Such threads will later get second chances to have their instructions executed (especially when the processor encounters idle periods, when other threads are also clogging, or when the priority of the threads is raised) and will eventually finish their desired tasks under the control of the system job scheduler.

While adaptiveness helps improve the performance of an SMT processor, it is not worth the effort if the cost of realizing it is prohibitively expensive. In this study, we propose a way to realize adaptive dynamic thread scheduling on an SMT processor with low hardware and software costs. In our approach, we use the idle slots of the SMT processor pipeline to execute instructions of the detector thread, which runs its algorithms to detect system clogging and takes actions to remedy the current situation and sustain high throughput.

In addition to detecting low throughput and steering the hardware thread selection unit toward wiser decisions, the detector thread can also identify and mark clogging threads based upon its own identification algorithm, so that the system job scheduler, when invoked, can readily find the clogging threads and suspend them while switching in new threads that are better suited to the current state of the processor.
When the system state changes or the processor becomes idle, the once-clogging threads will be given a second chance to run on the processor. This assistance should be very helpful to the job scheduler, because the overhead of a system thread can be very high on an SMT processor, especially if the operating system kernel is implemented in a multithreaded fashion, since more threads would then go through context switching.

In this chapter, we review the thread scheduling methods used in existing SMT architecture models. We extend the existing set of fetch policies and then discuss ways to realize adaptive dynamic thread scheduling with the detector thread approach. At the end, we discuss implementation issues in both hardware and software.

3.2 Hardware Thread Scheduling in SMT

Tullsen et al. [87] evaluated thread scheduling at the two major points where a choice of threads can be made: fetch and issue. Their main concern was that the issue stage could become a bottleneck, because SMT provides higher throughput and its earlier stages find more instructions (due to thread-level parallelism), putting more pressure on the issue stage. Their conclusion, however, was that the issue bandwidth is not a bottleneck, whereas the thread scheduling in the fetch stage mostly determines the throughput of the processor. For this reason, we also focus on thread scheduling for instruction fetch, not for issue. Four different fetch policies for the SMT architecture were evaluated in [87]:

• BRCOUNT: It counts the number of conditional branch instructions in the decode/rename stages and the instruction queues and schedules first the thread with fewer branches, in an attempt to reduce the number of wrong-path instructions.

• MISSCOUNT: It prioritizes threads with fewer data cache misses. That is, threads with fewer long-latency operations are chosen over others. Consequently, instructions (and their thread) that will end up staying in the pipeline for a shorter duration are advanced first.

• ICOUNT: Threads with fewer instructions in the decode/rename stages and the instruction queues are fetched before others. It gives priority to the faster-moving threads by balancing the number of instructions of each thread across the pipeline. This is the fetch policy that generally works well.

• IQPOSN: It gives lower priority to the threads whose instructions have been staying in the pipeline the longest. The position of a thread's instructions in the instruction queues is watched, and the threads whose instructions are closer to the head get fewer chances to be fetched.

We consider additional fetch policies [75] for selecting threads for instruction fetch in a cycle:

• LDCOUNT: The threads with fewer load instructions in the recent past are given priority for instruction fetch.

• MEMCOUNT: The threads with fewer memory access instructions in the recent past are given priority.

• L1MISSCOUNT: The MISSCOUNT above covers only the data cache; L1MISSCOUNT includes instruction cache misses as well.

• L1IMISSCOUNT: The threads with fewer L1 instruction cache misses are fetched first over others.

• ACCIPC: IPC values are accumulated over the recent past and used to determine the threads for instruction fetch. The threads with lower values are prioritized.
• STALLCOUNT: Threads with fewer pipeline stalls are fetched first over other threads. A stall occurs when an instruction of a thread does not advance to the next stage of the pipeline.

The above fetch policies are summarized again in Table 5.3.

3.3 Adaptively Switching Fetch Policies

Though ICOUNT may be the fetch policy that yields the best utilization on average, as stated earlier, this does not mean that it will outperform all other fetch policies in every single scheduling quantum throughout the life of a mix of applications. Thus, we perform simulations to find out, in each quantum of eight kilocycles, which fetch policies perform best and which perform worst. That way, we can see whether there are many cases in which choosing a policy other than ICOUNT would bring a positive result. We discuss how we perform this simulation in detail in Chapter 5 and present the results in Chapter 6. To summarize the result here, we observed about 27% performance improvement over the fixed fetch policy (ICOUNT) with a pseudo policy (MAX) in which the best fetch policy in each time frame is exploited. This number roughly represents how much performance we can gain by introducing adaptive dynamic thread scheduling. It implies that if we can find a metric that tells more directly what is wrong in the system, we can switch to the policy that watches that metric and end up with better performance.

Realizing adaptive switching of fetch policies is challenging for the following reasons:

• Implementing multiple fetch policies in hardware requires costly circuitry.

• In addition, detecting low throughput and determining the best new policy to employ in the next quantum would also have to be implemented in hardware, adding even more hardware cost.

• The low-throughput detection algorithm and the new-policy determination algorithm can be improved and optimized over time. If they are implemented in hardware, they are much harder to fix or improve.

• Adding yet more hardware to detect and mark the clogging threads further increases the cost.

As stated earlier, using the detector thread approach for adaptive dynamic thread scheduling solves all of these problems. With a detector thread that runs the low-throughput detection and new-policy determination algorithms in software, the functions specified above need not be implemented in hardware; they can instead be implemented in software, offering the possibility of future fixes or upgrades of the detector thread code for these algorithms.

The detector thread realizes adaptive dynamic thread scheduling in the following way. First, depending on what kind of application mix is resident in the processor, a new fetch policy can be activated when the low-throughput condition is met. Second, unused pipeline slots can be used to detect changes in system throughput and to take action (updating specific flags) to improve system status if the change is determined to be of the kind that leads to low throughput.
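Conceptually, every fetch policy listed above reduces to a per-thread sort key read from hardware counters, so switching policies amounts to switching which counter drives the thread priority sort. The following C sketch illustrates this view; the counter and array names are hypothetical and are not taken from the dissertation.

    /* Illustrative only: fetch policies as interchangeable sort keys. */
    enum policy { POLICY_ICOUNT, POLICY_BRCOUNT, POLICY_L1MISSCOUNT };

    struct thread_counters {
        unsigned insts_in_front_end;   /* decode/rename stages + instruction queues */
        unsigned branches_in_flight;   /* unresolved conditional branches            */
        unsigned pending_l1_misses;    /* outstanding L1 misses                      */
    };

    struct thread_counters counters[8];    /* one entry per hardware context */

    unsigned sort_key(enum policy p, int tid)
    {
        switch (p) {
        case POLICY_BRCOUNT:     return counters[tid].branches_in_flight;
        case POLICY_L1MISSCOUNT: return counters[tid].pending_l1_misses;
        case POLICY_ICOUNT:
        default:                 return counters[tid].insts_in_front_end;
        }
    }
    /* The thread priority array is simply the thread IDs sorted in ascending
     * order of sort_key() under the incumbent policy; the fetch stage then
     * takes up to two fetchable threads from the front of that array each cycle. */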
A detector thread is a special thread that reads thread status indicators and updates thread control flags based on the current values of those indicators, so that the thread control hardware can take any action necessary to improve the performance of an SMT processor. The per-thread status indicators are updated by circuitry throughout the processor pipeline upon specific events such as cache misses, pipeline stalls, the population at each stage, and so on.

Figure 3.2: Detector Thread Hardware (the detector thread reads the per-thread counters and updates the thread control flags consulted by the thread selection unit)

The role of the detector thread is to check the values of the various thread status indicators and, based on conditions dynamically defined in software, properly update the thread control flags, as shown in Figure 3.2. Each thread has its own set of flags. The flags may be as simple as the priority order of the resident threads; the thread selection unit then simply issues instructions from threads in their order of priority. Although the per-thread status indicators, thread control flags, and thread selection unit are fixed in hardware, we can control the thread scheduling behavior built around those hardware resources by writing a different program for the detector thread.

Depending on the values of the indicators that the detector thread watches, some threads can be given priority over others. This prioritization is put into effect by updating the thread control flags. One instance of a thread control flag could be suspend-thread: if this flag of a thread is set, the hardware thread selection unit stops allocating any new slots to the thread until the flag is reset. Later, when the system job scheduler is activated, it will suspend the thread while bringing in a new ready thread, whether one that was previously suspended or an entirely new one.

The detector thread has its own program RAM (DT PRAM), sufficiently large (2 or 8 KB) to fit its small program text. Its data accesses should mostly be made to special registers such as the per-thread counters and to general-purpose registers. More sophisticated algorithms or heuristics might need additional temporary data storage (DT DRAM), but memory usage can readily be limited if sufficient effort is made to optimize it. The next section discusses how the detector thread is functionally organized.

3.4 Elements of Adaptive Dynamic Thread Scheduling with the Detector Thread

3.4.1 Per-Thread Status Indicators

The per-thread status indicators are a set of counters that show the most recent state of the threads resident in the processor. Included are counters for cache misses (data and instruction), the number of pipeline stalls, the number of loads, the number of long-latency operations, the number of instructions in each stage, the number of new instructions in each stage, and so on.

3.4.2 Thread Control Flags and Thread Selection Unit

Each cycle, threads are selected and instructions of the selected threads are chosen for fetch and issue. In that selection process, the thread selection hardware looks at the thread control flags to decide whether a thread is to be chosen. This hardware unit represents the thread control provisions throughout the pipeline stages; a minimal sketch of such a flag check is given below.
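The dissertation names suspend, no-fetch and no-issue flags but does not fix an encoding; the following sketch, with a hypothetical bit layout, shows how the selection hardware (or a software model of it) might consult the flags written by the detector thread.

    /* Hypothetical flag layout; illustrative only. */
    enum {
        FLAG_SUSPEND  = 1u << 0,   /* stop allocating any new slots to the thread  */
        FLAG_NO_FETCH = 1u << 1,   /* skip the thread in the fetch-stage selection */
        FLAG_NO_ISSUE = 1u << 2    /* hold the thread's instructions before issue  */
    };

    unsigned thread_ctrl_flags[8]; /* written by the detector thread, read by the TSU */

    int tsu_may_fetch(int tid)
    {
        /* The fetch stage skips any thread whose no-fetch or suspend flag is set;
         * the detector thread (or the OS) clears the flag when conditions change. */
        return (thread_ctrl_flags[tid] & (FLAG_NO_FETCH | FLAG_SUSPEND)) == 0;
    }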
For example, in the fetch stage, if a thread's no-fetch flag is set, no instructions are fetched from the thread until the flag is reset. The thread selection unit watches the thread control flags and takes appropriate actions depending on their values. The flags could include suspend, no-fetch, no-issue, and so on. However, flags that require immediate action are not included in this set, because with the detector thread at the lowest priority it cannot be guaranteed that the detector thread will update such a flag early enough. The thread selection unit is not unique to our approach; its equivalent function is required even in an SMT architecture with fixed thread scheduling. The difference is that in that case the thread scheduling policy itself is embedded in the hardware.

3.4.3 Detector Thread

As described earlier, the detector thread plays the major role in this process, as shown in Figure 3.2. It keeps watching the per-thread status indicators and updates the flags when the indicators reach critical values. The indicators are updated by hardware on predetermined events at places spread across the pipeline. As described earlier, the detector thread has the lowest priority among threads. As long as the pipeline is well utilized, the detector thread will not be activated often. However, to prevent it from starving, the thread selection unit has to select the detector thread so that its instructions can be fetched at some minimum rate.

For the detector thread approach to work successfully, it should be equipped with intelligent heuristics or algorithms to dynamically detect clogging (low throughput) and to choose a better fetch policy for the next time frame. However, since the resources allowed for the detector thread are quite limited in order to minimize hardware overhead, the algorithm employed in the detector thread code is constrained in the size of the data structures it can use and in the length of its program code. Designing efficient detection algorithms and policy selection algorithms will be our next challenge; it is beyond the scope of this dissertation, because the main goal of this work is to show that there is enough room in which the adaptive thread scheduling approach can play.

3.4.4 Why a Detector Thread?

It is true that the detector thread could be implemented in hardware. The advantage would be that an existing thread context need not be permanently assigned to the detector thread. However, the cost of hard-coding multiple policies might offset this advantage. Another likely advantage of the hard-coded approach is that scheduling actions would be taken faster, because they would not depend on the availability of empty pipeline slots. However, as stated earlier, when an SMT processor suffers from low throughput, empty pipeline slots are readily available. The most significant advantage of the detector thread approach over its hard-coded counterpart is that, because the detector thread relies on software, the algorithm is not fixed and it can be enhanced or corrected even after the processor is taped out.
Thus, after an SMT processor is taped out, the same processor can even be retargeted for two different application domains (for example, servers and desktops) by equipping it with two different detector thread algorithms.

When an SMT processor is highly utilized, the detector thread will barely find idle slots through which it can trickle its own instructions into the pipeline. However, this will not cause starvation of the detector thread: during high utilization, there is no need for the detector thread to get invoked and involved to improve throughput. On the contrary, when a processor is experiencing low throughput, many empty slots will be found simply because the processor is not highly utilized, and the detector thread will find plenty of idle slots in which to have its instructions fetched, issued and executed.

3.5 Implementation of Adaptive Dynamic Thread Scheduling with the Detector Thread

The detector thread can be implemented based on the ideas proposed in the study of Simultaneous Subordinate Microthreading (SSMT) [13]. That study showed that the small program image of a microthread can be placed in a small on-chip RAM while preventing interference between the instructions fetched from the microthread and the primary thread. However, there could still be non-trivial overhead in the data path, because both threads compete to get hold of it. This problem becomes more severe when nine threads compete for access, as in our case (the normal eight threads plus one detector thread). That is why we propose a separate data RAM for the detector thread, as shown in Figure 3.3. DT DRAM and DT PRAM are both initialized upon reset by the OS through DMA, and the DT DRAM is exclusively accessed by the detector thread. DT PRAM can be loaded with new code by the OS via DMA if the OS determines that a new algorithm should be engaged to enhance the throughput.

Figure 3.3: Hardware Implementation of Detector Thread

3.5.1 Software Overhead of Adaptive Dynamic Thread Scheduling with the Detector Thread

It is imperative to know whether the detector thread will find sufficient slots it can take advantage of to execute its own program code to detect low throughput and determine the new fetch policy to exploit. To find out, we evaluated through simulation the occupancy rate of the instruction fetch buffer. The average value was 7.7, with a deviation of about 0.1%. Considering the size of the fetch buffer, which is eight, this leaves a sufficient number of slots for the detector thread. With the eight-kilocycle low-throughput checking interval, it is thus estimated that up to 2,400 detector thread instructions may be executed per interval. This should be more than enough to handle the work of the detector thread.

3.5.2 Fetching Instructions from the Detector Thread

Figure 3.4 illustrates how instructions are fetched from the detector thread. The baseline SMT architecture we use is the ICOUNT2.8 configuration proposed in [88].
This scheme gives flexibility while avoiding fetch fragmentation. In the figure, four program counters for normal threads are shown. At this moment, the priority of the threads is in the order PC0, PC2, PC1 and PC6; the rest of the order is not shown. The thread selection unit looks at PC0 first because it is on top of the priority array. Its access happens to start at address 12 in Bank0. All instructions up to the block boundary are put into the instruction fetch buffer, in this scenario from 12 through 15. (If the address happened to be 08, the starting word of the block, all eight instructions in the block would be fetched, consuming all fetch bandwidth for one thread, and no further fetch would be attempted for other threads.) Since in ICOUNT2.8 the thread selection unit attempts to fetch from up to two threads, it continues with the next thread along the priority array. However, the next one, PC2, has an address that falls onto the same bank (Bank0) as PC0, so it cannot be fetched. The next thread, PC1, is therefore considered. Because it points to Bank1 without causing a bank conflict with the first fetched thread, its instructions can be fetched. Its address happens to be 30, and instructions up to the block boundary are fetched; thus, two instructions (30 and 31) are fetched from PC1. If a fetch encounters a cache miss, the chance is consumed; that is, if the first thread misses the cache, only one more fetch is attempted. So far, the fetch scheme is the same as the one used in [88]. PC6, which points to yet another bank, is not even considered in this example because two threads have already been fetched.

To support the detector thread, the thread selection unit (TSU) has to do one additional thing: fetch from the detector thread. In the example above, two slots remain, and the TSU fetches two instructions from the detector thread to fill them. However, if no slots are left, no instructions are fetched from the detector thread.

Figure 3.4: Fetching from the normal threads and the detector thread

Chapter 4
Detector Thread

4.1 Overview

In the previous chapter, we claimed that a single fixed scheduling policy would not reach the performance that could be achieved through adaptive scheduling, in which various fetch policies are adaptively switched depending upon the varying system state. We proposed an approach that realizes adaptive dynamic thread scheduling with the detector thread, where the core function is accomplished in software while taking advantage of idle slots found in the pipeline, especially when utilization is not maintained at a "satisfactory" level. We also provided a way to implement it in hardware while minimizing its interference with normal threads.
In this chapter, we take a closer look at the detector thread, focusing on its functionality, software architecture and design issues such as the determination of threshold values and new fetch policies.

4.2 The Detector Thread

We first define the following terminology to describe a detector thread. In an SMT processor, a thread goes through various states, as it does in a single-context processor. These states are illustrated in Figure 4.1 and explained below.

• Resident: A context is allocated to the thread. As long as the thread is not context-switched, it remains resident on the processor, and there is no overhead in making this thread active or dormant. Each cycle, a resident thread can assume one of the following two sub-states.
  - Active: Slots in some stage of the pipeline are allocated to the thread.
  - Dormant: No slots are allocated in that cycle and no resources are being used by the thread.

• Non-resident: The thread is not in the processor; it is waiting to get loaded into the processor and allocated a context. A non-resident thread can be:
  - Blocked: waiting for some event to occur
  - Ready: waiting to be chosen by the job scheduler

Figure 4.1: States a thread can assume in an SMT processor. (a) States of a thread in a single-context processor. (b) States of a thread in an SMT processor.

The detector thread can be regarded as a special thread with the lowest priority. Its instructions are not fetched until unused, otherwise wasted fetch bandwidth becomes available. Consequently, when the processor is enjoying high utilization, its instructions will not be fetched, and that is entirely normal because there is no need to improve performance. The detector thread is always resident; it becomes active when slots are left unallocated to active normal threads, and it stays dormant as long as utilization is high, because no slots are then available to it.

The responsibility of a detector thread is to detect low throughput of the processor pipeline as it occurs and to take relevant actions so that throughput can at least be sustained at a desired level. The straightforward way to achieve this is to monitor the IPC value of the last time interval (τ cycles) and to determine and switch to a new fetch policy if the IPC of the last interval is lower than a specific threshold value.

The program code of the detector thread is loaded by the operating system after the system is booted. The operating system's detector thread management kernel is expected to load the program text and the initial table of significant values, such as the IPC threshold and cache miss thresholds. The loading is done via DMA: the detector thread management kernel transfers data from main memory to DT PRAM and DT DRAM. DT PRAM can be loaded with new code by the operating system via DMA if it determines that a new algorithm should be engaged to enhance the throughput. Once loaded, the DT DRAM is exclusively accessed by the detector thread.

Figure 4.2 distinguishes the role of hardware thread scheduling from that of the system job scheduler.
The job scheduler works on a pool of jobs requested by various users. The number of jobs in the pool is far larger than the number of contexts available on an SMT processor. Figure 4.3 shows the result of a UNIX top command, which lists a score of jobs that are currently in the system. Only one job is shown to be "running" on the processor. With an SMT processor, multiple jobs would be shown as "running"; if the processor supports 8 contexts, up to 8 jobs would appear as running in the output of the command. However, that would not guarantee that instructions of a specific job (thread or process) were fetched in a specific cycle, even though it indicates that the 8 jobs are occupying the available contexts of the processor. The decision the job scheduler keeps making is which 8 jobs should occupy the 8 available contexts, whereas the decision the hardware thread scheduler makes is which 2 jobs should be selected for instruction fetch every cycle, because we assume the base fetch mechanism of ICOUNT2.8 [87].

Figure 4.2: The hardware thread scheduling works on the jobs that have already been selected by the system job scheduler (execution status monitor, hardware thread scheduler, OS job scheduler).

Figure 4.3: A pool of jobs from which the job scheduler needs to select every scheduling quantum (output of the UNIX top command; listing not reproduced).

The main goal of a hardware thread scheduler is to avoid imbalance among threads. Here, imbalance on X means that the usages or counts of X are not nearly uniform among the threads. For example, if one thread has far more instructions in the early stages of the pipeline (the decode and rename stages and the instruction queue) than the others do, we have imbalance on instruction counts. Imbalance adversely affects the throughput for the following reasons.

1. Since a small number of threads occupy one type of resource, the other threads can hardly get that resource assigned to them.

2. The average number of non-dependent, issuable instructions per thread becomes lower among the other threads, lowering the average number of instructions that can advance through the pipeline.
3. Consequently, because of 1 and 2, thread-level parallelism is lowered.

4.3 Software Architecture of the Detector Thread

The software architecture of the detector thread for adaptive thread scheduling appears in Figure 4.4. Status counters are updated every cycle by circuitry throughout the pipeline. For every period of eight kilocycles, the number of committed instructions is counted, as is the maximum number of instructions that could have been executed (8K x 8). If the interval is to remain constant, the maximum need not be counted. The detector thread checks whether the IPC (the number of committed instructions per cycle) is less than the threshold value. If so, the previous time frame is detected as low-throughput.

Figure 4.4: Software architecture of the detector thread (status counters updated; IPC below threshold?; identify clogging threads; determine new policy; policy switch; policy enforce; thread selection unit).

Once a previous scheduling quantum is determined to be low-throughput, a new fetch policy has to be determined, because the incumbent policy has turned out to perform poorly. (This scheduling quantum should not be confused with that of the job scheduler: a typical job-scheduling quantum is in the range of milliseconds, which can be equivalent to a million cycles.) The newly determined policy is then engaged. In the meantime, during the remaining idle slots, other functions can be accomplished. The first is to identify clogging threads: by looking at the per-thread status counters, the threads that are clogging the pipeline for various reasons can be identified and marked, so that the job scheduler can later suspend them once it is loaded, without going through the possibly long process of identifying them itself. The result is a shorter stay for the job scheduler. The second is to enforce the incumbent policy: the per-thread status counters are checked and the priority array is updated depending on their values. The thread selection unit then looks at this array to decide which two threads should be selected for instruction fetch every cycle.

4.3.1 Pseudo Code of the Detector Thread

The pseudo code framework of the detector thread is shown in Figure 4.5. The main subroutine Detector_Thread() has a big endless while loop with a jump label, East, right ahead of it. If the condition IPC_last < IPC_thold holds true, it is recognized as the low-throughput event and the consequent actions are taken. IPC_last is the number of committed instructions per cycle during the last eight-kilocycle quantum, and IPC_thold is the IPC threshold value, which is predetermined by the developer of the detector thread management kernel. This threshold value may also be updated by the detector thread software itself.
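The text does not spell out how IPC_last is obtained from the status counters at each quantum boundary; a minimal sketch, with hypothetical counter and constant names, might look like the following routine, run once per quantum.

    /* Illustrative sketch only; names are not from the dissertation. */
    #define QUANTUM_CYCLES 8192          /* the eight-kilocycle checking interval */

    double ipc_last(unsigned long committed_now)   /* running commit counter value */
    {
        static unsigned long prev_committed = 0;
        double ipc = (double)(committed_now - prev_committed) / QUANTUM_CYCLES;
        prev_committed = committed_now;  /* start of the next quantum's window */
        return ipc;                      /* compared against IPC_thold */
    }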
Once the low-throughput event is recognized, Identify_Clogging_Threads() is called and a low-throughput cause analysis is performed to identify the clogging threads. Determine_New_Policy() is called next to find the policy that should be engaged in the next quantum. This stage needs the most effort, since choosing a new policy significantly affects the throughput of the next time frame (we use the terms time frame and scheduling quantum interchangeably). The new policy is then engaged as the next incumbent policy by the function Policy_Switch(), and a jump to the subroutine Policy_Enforce() is made. In this routine, the thread priority array (TPA) is updated depending on the current system state and the incumbent policy, while the thread selection unit (TSU) looks at this array to determine the threads for instruction fetch every cycle. The TSU selects up to two threads every cycle because we are using ICOUNT2.8 [87].

    int incumbent_policy;

    void Detector_Thread() {
        int p;
    East:
        while (1) {
            // LTFlag is turned on by H/W
            if (IPC_last < IPC_thold) {
                Identify_Clogging_Threads();
                p = Determine_New_Policy(incumbent_policy);
                Policy_Switch(p);
            }
            goto West;
        }
    }

    void Policy_Switch(int x) {
        incumbent_policy = x;
    }

    void Policy_Enforce() {
    West:
        while (1) {
            switch (incumbent_policy) {
            case BRCOUNT:
                // Re-sort the thread priority array according to BRCOUNT
                break;
            case L1MISSCOUNT:
                // Re-sort the thread priority array according to L1MISSCOUNT
                break;
            case ...:
                break;
            default:  // ICOUNT
                // Re-sort the thread priority array according to ICOUNT
                break;
            }
            if (quantum > 8K cycles)
                goto East;
        }
    }

    int Determine_New_Policy(int p) {
        int newp;
        /* Select New Policy */
        return newp;
    }

    void Identify_Clogging_Threads() {
        /* Identify clogging threads */
        // Turn on suspend flags of identified threads
    }

Figure 4.5: The framework of a detector thread in pseudo code (abridged)

4.3.2 Determination of Threshold Values

The big question to address before determining the next fetch policy is how we know whether the processor is experiencing low throughput. What is the threshold against which we make that judgment? Figure 4.6 illustrates how the value of the IPC threshold affects the frequency of switching and the quality of each switch. If the threshold value is too low, very few switches take place, but the quality of each switch can be higher, because when the throughput is known to be that low it is more likely that the incumbent policy is incapable of improving the situation. If the value is too high, switching occurs too frequently; further, the quality of each switch can be very low, since it is more likely that the situation cannot improve even with alternative policies, because the current throughput can already be fairly high.
We will show through simulation in Chapter 6 that the behavior sketched in Figure 4.6 holds even in realistic situations. We obtain the optimal value for the threshold through simulations, as discussed in Chapter 6. We decided to use the value of 1 as the IPC threshold because it gives better throughput with optimal overhead in software.

Figure 4.6: How the threshold value affects the number and quality of switchings. (a) Relationship between the number of policy switches per 8K-cycle interval and the threshold value (IPC). (b) Relationship between the probability of benign switches and the threshold value (IPC).

4.3.3 Determination of the Next Fetch Policy

4.3.3.1 Underlying Premises

Once it turns out that the incumbent fetch policy fails to sustain high throughput, the following should be taken into consideration to determine a new fetch policy:

• What was the incumbent fetch policy for the last quantum?
• What is the current state (instruction counts, cache miss rates, etc.)?
• Is the IPC on the rise or on the fall (throughput gradient)?
• The history of a fetch policy's effect under the same condition.

The more factors we take into consideration, the more sophisticated and intelligent the determination heuristic becomes. However, too intelligent a heuristic may not fit in the available cycle budget or in the DT PRAM, whose size is limited. The fewer factors we take into consideration, the less overhead the detector thread incurs and the quicker its response becomes; however, the lower intelligence may not produce good results. Thus, we need to find a trade-off point where the overhead fits our budget while producing good results. The simplest way of determining the new policy is a fixed transition that considers no current conditions, and this is basically what our Type 1 heuristic does (Figure 4.7). However, it should be noted that switching to another specific policy can aggravate an already worsened situation instead of improving it. This kind of approach also relies heavily on the value of the threshold, because a higher threshold value is more likely to cause such adverse effects, while a lower value is less so.

4.3.3.2 Type 1 Heuristic

This heuristic is the simplest way of determining a new fetch policy. In this scheme, no status indicators are referenced for the decision, and consequently it is not sensitive to the state the system is currently in. As long as low throughput is not detected, the current state, that is, the incumbent fetch policy, is maintained. Once low throughput is detected, a transition to the other policy (either BRCOUNT or ICOUNT) is made unconditionally. Initially, the default fetch policy is ICOUNT. The advantage of this scheme is that the software overhead of the detector thread is minimal, to the degree that it could be implemented in hardware. However, the other advantages of the detector thread are not available when the scheme is implemented in hardware.
The most obvious of these advantages is the flexibility of the detector thread software.

Figure 4.7: Type 1 heuristic for determination of a new fetch policy (transitions between ICOUNT and BRCOUNT)

4.3.3.3 Type 2 Heuristic

This heuristic is another simple way of determining a new fetch policy. As in Type 1, no status indicators are referenced for the decision. The difference between Type 2 (Figure 4.8) and Type 1 is that one more state (or fetch policy) has been added to the original finite state machine. Variants of this scheme can be made by changing the sequence of transitions, currently set to the order ICOUNT, L1MISSCOUNT and BRCOUNT, or by adding more fetch policies to the current set of three.

Figure 4.8: Type 2 heuristic for determination of a new fetch policy (cycling through ICOUNT, L1MISSCOUNT and BRCOUNT)

4.3.3.4 Type 3 Heuristic

Type 1 and Type 2 consider only what the fetch policy was for the last quantum once low throughput is detected. There is only one state that can be transitioned to from each state; thus, as long as low throughput is not avoided, the two or three states are entered in a cyclic fashion. In the Type 3 heuristic (Figure 4.9), one of two states can be entered from each state depending on the value of specific conditions. The Type 3 heuristic relies on the following conditions:

• COND_MEM is true when one of the following two sub-conditions is true.
  1. The L1 miss count for the last quantum is higher than its threshold value of 0.19 misses/cycle.
  2. The load/store queue becomes full too often, more often than its threshold value of 0.45 times/cycle.

• COND_BR is true when one of the following two sub-conditions is true.
  1. The branch misprediction count for the last quantum is higher than its threshold value of 0.02 mispredictions/cycle.
  2. The count of conditional branches for the last quantum is higher than its threshold value of 0.38 branches/cycle.

The specific threshold values above for the L1 miss count, the load/store queue occupancy rate, the branch count and the misprediction count were determined by simulation: we ran eight-thread simulations in our SMT simulator with our 13 different mixes of applications and took the average value of each metric. These measures are, of course, dependent on the hardware configuration and on what kind of mixes are running in the processor; there can be no single "golden" reference that is always valid. To be more effective, the threshold values should be updated to reflect newly found information. That is one of the reasons why the detector thread approach suits adaptive dynamic thread scheduling: the system's detector thread management kernel can profile the system, determine whether the current threshold numbers are obsolete and, if so, update the values to reflect the new state of the system. This update can be done by writing values into the detector thread's DT RAM through DMA.

Figure 4.9: Type 3 heuristic for determination of a new fetch policy (transitions among ICOUNT, BRCOUNT and L1MISSCOUNT guarded by COND_BR and COND_MEM)

The Type 3 heuristic works as follows. For example, suppose BRCOUNT is the incumbent fetch policy and low throughput is detected. The simple and obvious fact is that BRCOUNT has not worked well during the last quantum.
This implies that there is no crucial imbalance in conditional branches among the threads of the current set; the imbalance may lie in other factors. We can then guess that one of the other policies, ICOUNT or L1MISSCOUNT, may work better. We therefore consider the condition COND_MEM and check its value. If it holds true, the imbalance may have been in the number of L1 cache misses or in the usage of the load/store queue, so L1MISSCOUNT is chosen. Otherwise, the problem probably does not lie in memory usage, and we engage the general-purpose policy, ICOUNT.

4.3.3.5 Type 4 Heuristic

For Type 4, we add two additional features. The first is to take into account the gradient of throughput: even when low throughput is detected, if the throughput is higher than the throughput observed one quantum earlier (a positive gradient), switching policies is not allowed. In that way, we wait for the situation to keep improving under the originally incumbent fetch policy. If the gradient is negative, switching is allowed. The second feature is to keep track of switching history. In the switching history buffer, the following are recorded for each policy switching event:

• Incumbent policy (ipol): the fetch policy that was engaged before the switching took place.
• Value of the condition (cval): for each policy, there is one condition that is checked; its value is recorded.
• Counter for positive outcomes (poscnt): this counter is incremented every time a given case ended up increasing throughput.
• Counter for negative outcomes (negcnt): this counter is incremented every time a given case ended up decreasing throughput.

Before the final decision is made, poscnt and negcnt are compared. If poscnt is greater, the regular switch is made; otherwise, the opposite direction is chosen. For instance, suppose the incumbent policy was ICOUNT and low throughput is detected. With COND_BR being true, the transition would have been toward the BRCOUNT policy under the Type 3 heuristic. In Type 4, the counters (poscnt and negcnt) are examined, and if poscnt is not greater than negcnt, the transition is made toward the opposite policy, L1MISSCOUNT.

4.3.4 Identifying Clogging Threads

The process of identifying clogging threads is not complicated. Rather, what is significant is that we have a way to perform the process by exploiting wasted pipeline slots, thanks to the detector thread. SMT job schedulers [59, 79, 67], once invoked, need to analyze what is going on in the processor. Such a process involves context-switching user threads, issuing instructions of the job scheduling kernel, and later switching back in a set of original threads or a set of new threads.

4.3.5 Changing Dispatch Policies

Another set of decisions can be made at the dispatch stage, where a choice has to be made about which thread's instructions enter the instruction windows and the reorder buffers and are allocated functional units. A study [88] examined issues regarding changing dispatch policies. It showed that, because the fetch stage is very likely to be the bottleneck, changing policies at the dispatch stage does not significantly affect the performance of the SMT pipeline.
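As a summary of Section 4.3.3, the sketch below shows how Determine_New_Policy() could encode the Type 3 transitions, using the per-quantum threshold values quoted in the text. The struct and field names are illustrative, the L1MISSCOUNT case is inferred by symmetry rather than stated explicitly, and the Type 4 refinements (throughput gradient and the poscnt/negcnt history check) are only indicated in a trailing comment.

    #include <stdbool.h>

    enum policy { ICOUNT, BRCOUNT, L1MISSCOUNT };

    struct quantum_stats {            /* per-quantum averages, illustrative names */
        double l1_miss_rate;          /* L1 misses per cycle                  */
        double lsq_full_rate;         /* load/store queue full events per cycle */
        double br_mispred_rate;       /* branch mispredictions per cycle      */
        double cond_br_rate;          /* conditional branches per cycle       */
    };

    static bool cond_mem(const struct quantum_stats *s)
    {   /* COND_MEM: memory-side imbalance in the last quantum */
        return s->l1_miss_rate > 0.19 || s->lsq_full_rate > 0.45;
    }

    static bool cond_br(const struct quantum_stats *s)
    {   /* COND_BR: control-flow imbalance in the last quantum */
        return s->br_mispred_rate > 0.02 || s->cond_br_rate > 0.38;
    }

    enum policy determine_new_policy(enum policy incumbent,
                                     const struct quantum_stats *s)
    {
        switch (incumbent) {
        case BRCOUNT:     return cond_mem(s) ? L1MISSCOUNT : ICOUNT;
        case L1MISSCOUNT: return cond_br(s)  ? BRCOUNT     : ICOUNT;   /* assumed */
        case ICOUNT:
        default:          return cond_br(s)  ? BRCOUNT     : L1MISSCOUNT;
        }
    }
    /* Type 4 would first skip switching when the throughput gradient is positive,
     * and would reverse the chosen transition when the history buffer's negcnt
     * outweighs poscnt for the matching (incumbent policy, condition value) case. */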
4.4 Applications of the Detector Thread

4.4.1 Efficient Job Scheduling

The detector thread can be used for purposes other than dynamic thread scheduling. As stated earlier, it can be used to alleviate the burden of the system job scheduler. A job scheduler is part of the operating system; it periodically checks system status and decides which jobs should be selected for the next time slice. Even on an SMT processor, the job scheduler has to be invoked periodically and run for a while to make decisions for the next time slice, as long as more threads than contexts are available. When a job scheduler is invoked, room must be made for it by suspending one of the threads. The job scheduler then gathers information by reading the hardware counters and applies its policies to choose the threads that will remain and the ones that will be suspended. Software context switching is involved for those that are suspended. The job scheduler can shorten its stay by simply checking the flags that the detector thread has updated and making decisions based on them.

4.4.2 Low Power Consumption

The detector thread approach can also be used for other applications such as power monitoring, where the pipeline is throttled to stay within a maximum power level. When determining a new fetch policy, the detector thread can enforce that threads with a larger number of high-power instructions are chosen less frequently.

4.4.3 Program and Data Prefetch

The detector thread can be given the task of watching the address patterns of memory accesses to program and data, analyzing the patterns to predict the regions to be accessed in the near future, and prefetching those blocks prior to the actual loads. Of course, this is a speculative approach, and when the prediction turns out to be incorrect there can be a significant performance overhead, possibly deteriorating performance. Also, the prefetch traffic caused by the detector thread is likely to interfere with the real traffic caused by normal threads.

4.4.4 Speculation Control

In an SMT processor, the number of resident threads varies. If an SMT processor supports eight contexts, from zero to eight threads can be resident at one time. When more threads are resident, the need for intelligent branch prediction is lessened, because instructions from other threads can take over the pipeline slots while the outcome of a conditional branch is being calculated. On the contrary, when the number of threads decreases, say to one thread, the demand for more intelligent branch prediction increases, because it becomes harder to find independent instructions to fill the voids, resulting in lower utilization. Thus, varying the degree of branch prediction depending on the number of resident threads can help improve the throughput of an SMT processor: when eight threads are resident, branch prediction is entirely turned off, and when a single thread is resident, full-fledged branch prediction is engaged. Analyzing the system status and controlling the degree of branch prediction can be assigned to the detector thread. Of course, the simplest form of this could be implemented in hardware.
However, if sophisticated heuristics need to be exploited for better analysis, the detector thread might be the only solution.

Chapter 5
Methodology

5.1 Our Simulator

We used SimpleSMT, which is part of the ALPSS framework [52] and was developed for SMT simulation. SimpleSMT is based on the SimpleScalar tool set [4]. It thus inherits most architectural specifications of the superscalar model and the software architecture employed in SimpleScalar. The SimpleSMT simulator is distinct from the SimpleScalar simulator in that SimpleSMT has been modified to support the enlarged register files and the elongated pipeline that account for the additional hardware complexity of an SMT processor. The simulation environment has been configured with resources compatible with previous research [87] on SMT for verification purposes. Table 5.1 shows the configuration used in our simulation.

  Parameter           Value
  Fetch Bandwidth     2 threads, 8 instructions total
  Functional Units    3 FP, 6 Int, 4 ld/st
  Instruction Queues  64-entry FP, 64-entry Int
  Inst Cache          32KB, direct, 64-byte lines
  Data Cache          32KB, direct, 64-byte lines
  L2 Cache            256KB, 4-way, 64-byte lines
  L3 Cache            2MB, direct, 64-byte lines
  Latency (to CPU)    L2 6 cycles, L3 12 cycles, Memory 62 cycles
  Pipeline Depth      9 stages
  Min Branch Penalty  7 cycles
  Branch Predictor    2K bimodal
  Instruction Latency Based on Alpha 21264

Table 5.1: Simulator configuration

The pipeline stages modeled by SimpleScalar and SimpleSMT are illustrated in Figure 5.1. SimpleScalar has a simple five-stage pipeline. Instruction fetch and decode are accomplished in the IF stage, and dispatch is performed in the DI stage, where instructions are assigned RUU (register update unit) entries. In the IS stage, instructions execute in functional units once their operands become ready. In the WB stage, outputs of the functional units are written back to the result bus and end up in the RUU entry's result field. In the CT stage, the results in the RUU are committed to the registers to support in-order completion.

The SimpleSMT simulator has an extended version of this pipeline. It has separate stages for instruction fetch and decode for more realistic simulation. It has the RR stage, where register renaming is performed. The EQ stage enqueues an instruction, that is, the instruction is allocated an entry in the instruction queues (modeled as the RUU of the SimpleScalar simulator) or the load/store queues. The R1 and R2 stages are register read stages for fetching the operands; we use two stages to reflect the longer latency of reading from register files whose size has grown to support multiple contexts. Once operands are ready, the instruction goes through the EX stage to execute its operation. The WB and CT stages are the same as those of SimpleScalar.

Figure 5.1: Pipeline stages modelled in SimpleScalar and SimpleSMT. (a) SimpleScalar: IF, DI, IS, WB, CT. (b) SimpleSMT: IF, ID, RR, EQ, R1, R2, EX, WB, CT.

5.2 Benchmark Applications and Their Combinations

This study targets the next generation of microprocessors. Therefore, program behavior that reflects the modern computing environment is essential to our research. SPEC CPU2000 [33] is one of the latest benchmark suites and includes 25 modern applications.
5.2 Benchmark applications and their combinations

This study targets the next-generation microprocessor. Therefore, program behavior that reflects the modern computing environment is essential to our research. SPEC CPU2000 [33] is one of the latest benchmark suites and includes 25 modern applications. Our simulations were performed with the SPEC CPU2000 benchmark applications compiled for a single Alpha 21264 processor (EV6 binaries). A combination of several benchmark programs is regarded as a set of multiple threads on an SMT processor. Each combination (or mixture) is made of 4, 6, or 8 different programs selected from the 25 benchmark programs.

Since our study focuses on how various fetch policies affect processor utilization with different mixes of applications, we attempted to cover as broad a mix of benchmarks as possible. Three parameters were used in selecting the various combinations: throughput (IPC: instructions committed per cycle) on a single-threaded machine model, memory footprint, and the characteristics of the operations (integer, floating-point, or a mixture of the two). The result of this selection is shown in Table 5.2. When forming combinations out of integer and floating-point applications, we were careful to have the same number of applications of each type. The memory footprint of each application was available from SPEC 2000 in [33]. Applications that took more than 100 MB of reserved memory were classified as "memory hog" applications, whereas those using less than 100 MB were considered to have low memory requirements. The throughput of each application on our SMT model was available from [52] and was used to determine whether an application's throughput is relatively high or low.

hiI-loM-mx: crafty, eon, equake, facerec, mesa, parser, sixtrack, vortex
hiI-hiM-mx: apsi, fma3d, galgel, gap, gcc, gzip, mcf, wupwise
loI-loM-mx: ammp, art, eon, mgrid, parser, sixtrack, twolf, vpr
loI-hiM-mx: applu, apsi, mgrid, gcc, gzip, lucas, mcf, swim
mxI-mxM-mx: ammp, crafty, equake, fma3d, gap, mcf, swim, twolf
hiI-loM-i: crafty, eon, gap, gzip, parser, twolf, vortex, vpr
hiI-hiM-i: crafty, gap, gcc, gzip, mcf, parser, vortex, vpr
loI-loM-i: eon, gcc, gzip, mcf, parser, twolf, vortex, vpr
hiI-loM-f: apsi, equake, facerec, fma3d, galgel, mesa, sixtrack, wupwise
hiI-hiM-f: applu, apsi, equake, fma3d, galgel, lucas, swim, wupwise
loI-loM-f: ammp, art, facerec, fma3d, galgel, lucas, mgrid, sixtrack
loI-hiM-f: applu, apsi, galgel, lucas, mgrid, sixtrack, swim, wupwise
mxI-mxM-f: ammp, art, equake, fma3d, lucas, mesa, swim, wupwise

Table 5.2: Various combinations of applications

5.3 Fetch Policies Modeled in Simulation

We implemented functional models of the ten different fetch policies shown in Table 5.3. BRCOUNT, L1DMISSCOUNT, ICOUNT and RR were proposed and evaluated in [87]. Additionally, we added LDCOUNT, MEMCOUNT, ACCIPC and STALLCOUNT to our list; the description of each policy is found in the table. L1MISSCOUNT and L1IMISSCOUNT were added as well to take a closer look at the effect of the caches.

Every cycle, the simulator sorts the threads according to the fetch policy. Instructions are fetched from the first thread as long as a cache block boundary is not met. If no boundary is encountered while fetching eight instructions, all eight instructions are fetched from that one thread. Otherwise, instruction fetch continues with the next thread. We limited the number of threads that can be fetched in one cycle to two, as explained in 3.5.2.
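As a rough illustration of this fetch mechanism only, the per-cycle fetch loop can be pictured as below; the function and variable names (such as sort_threads_by_policy) are our own and stand in for simulator internals, they are not SimpleSMT identifiers.

```c
#define FETCH_WIDTH        8   /* up to eight instructions fetched per cycle */
#define MAX_FETCH_THREADS  2   /* at most two threads fetched per cycle      */

/* Hypothetical helpers standing in for simulator internals. */
extern int  num_threads;                            /* up to eight in this study */
extern void sort_threads_by_policy(int *order);     /* policy-dependent priority  */
extern int  fetch_from_thread(int tid, int max);    /* stops at a cache block
                                                        boundary; returns count    */

/* One cycle of instruction fetch, following the mechanism described above. */
static void fetch_cycle(void)
{
    int order[8];                 /* thread ids in priority order              */
    int slots = FETCH_WIDTH;
    int threads_used = 0;

    sort_threads_by_policy(order);  /* e.g. fewest IQ entries first under ICOUNT */

    for (int i = 0;
         i < num_threads && threads_used < MAX_FETCH_THREADS && slots > 0;
         i++) {
        slots -= fetch_from_thread(order[i], slots);
        threads_used++;
        /* If the highest-priority thread supplied all eight instructions before
         * hitting a cache block boundary, slots reaches 0 and fetch stops;
         * otherwise the next thread in priority order fills what remains. */
    }
}
```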
A study [9] showed that fetching all eight instructions from one thread can be adversely affected by fetch fragmentation. For fair comparison, we applied the same mechanism to both fixed scheduling and adaptive scheduling.

BRCOUNT: number of total branches for a thread
LDCOUNT: number of total loads for a thread
MEMCOUNT: number of total memory accesses for a thread
L1MISSCOUNT: number of total L1 cache misses for a thread
L1IMISSCOUNT: number of total L1 I-cache misses for a thread
L1DMISSCOUNT: number of total L1 D-cache misses for a thread
ICOUNT: current instruction queue population for a thread
ACCIPC: accumulated IPC for a thread
STALLCOUNT: number of total stalls incurred for a thread
RR: round-robin scheduling

Table 5.3: Various fetch policies tested

Because of the huge size of the SPEC 2000 applications, it is almost impossible to run simulations to the end of all programs. Since the reference mode of a typical SPEC 2000 application averages 200 billion instructions, it would take about three months to run one application to completion, given that the performance of our simulator is about 25K instructions per second. To lower the time requirement and still obtain accurate simulation results, we ran simulations for ten million cycles in ten randomly chosen different intervals, taking advantage of the fast-forward feature of the SimpleScalar simulator [4].

5.4 Modeling the Detector Thread's Behavior

The functional model of the detector thread was implemented by extending the SimpleSMT simulator. All five types (Type 1 through Type 4 and Type 3') of determination heuristics were implemented in the model. Every 8K cycles, the throughput for the last interval is checked against the threshold value, and if it is lower than the threshold value, a new policy is determined. Once a policy is determined, it is used until it is switched to another one. Expiration of an interval does not necessarily mean a switch of policies: as long as the throughput stays above the threshold value, no switching takes place.

Chapter 6

Experimental Results

6.1 The Need for Adaptive Dynamic Thread Scheduling in Simultaneous Multithreading

6.1.1 Performance of Various Fetch Policies

An attempt was made by Tullsen et al. to find the optimal fetch policy in SMT [87]. Giving priority to threads with fewer instructions in the instruction queues (ICOUNT) and to threads with newer instructions in the queues (IQPOSN) proved to produce higher throughput than other heuristics. The study also noted that even though ICOUNT performed better on average, in some special cases other policies could work better. However, an SMT processor with a fixed ICOUNT fetch policy may not avoid poor throughput in situations where ICOUNT does not work. Therefore, our contention is that adaptively switching fetch policies depending on the situation will result in better throughput.

To corroborate this hypothesis, we ran simulations while keeping a record of the fetch policy that worked best in each interval (8K cycles).
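The sketch below illustrates the bookkeeping implied by this experiment; the identifiers are ours, and how the per-policy IPC values for an interval are obtained is an experimental detail not restated here, so they are simply assumed available. At the end of every 8K-cycle interval the winner's tally is incremented, and the winning IPC is also what the pseudo fetch policy Max, introduced in Section 6.1.2, accumulates.

```c
#define NUM_POLICIES 10          /* the ten fetch policies of Table 5.3 */

/* IPC observed for each policy over the interval just ended (assumed given). */
extern double interval_ipc[NUM_POLICIES];

static long   best_count[NUM_POLICIES];   /* how often each policy ranked first */
static long   worst_count[NUM_POLICIES];  /* how often each policy ranked last  */
static double max_ipc_sum;                /* accumulates the "Max" pseudo policy */
static long   intervals_seen;

/* Called once per 8K-cycle interval. */
static void record_interval(void)
{
    int best = 0, worst = 0;
    for (int p = 1; p < NUM_POLICIES; p++) {
        if (interval_ipc[p] > interval_ipc[best])  best = p;
        if (interval_ipc[p] < interval_ipc[worst]) worst = p;
    }
    best_count[best]++;                  /* data behind Figure 6.1 */
    worst_count[worst]++;                /* data behind Figure 6.2 */
    max_ipc_sum += interval_ipc[best];   /* Max: best IPC achieved in the interval */
    intervals_seen++;
    /* average IPC of Max so far: max_ipc_sum / intervals_seen */
}
```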
Figure 6.1 shows how many times each fetch policy was ranked as the best performer. (Note that the order of the fetch policies in the legend is the same as the order in which each policy appears in each bar.) Undoubtedly, ICOUNT was ranked first in more cases than any other fetch policy. However, it should be noted that in about half of all cases, other fetch policies performed better than ICOUNT did. Fetch policies such as LDCOUNT, L1DMISSCOUNT, L1IMISSCOUNT and, surprisingly, even round-robin are among them. These facts were not recognized by previous studies because they mostly relied upon average analyses. We can infer that if we have the capability of switching thread selection heuristics, then we can significantly improve the throughput of the SMT processor by using the heuristic that works best in each specific case rather than one single hardwired fetch policy in all cases.

Another interesting point is found in Figure 6.2, which shows how many times each fetch policy was ranked as the worst performer. It is apparent from the figure that ICOUNT is more likely to become the worst-performing fetch policy as the number of threads increases, reaching 20% of the time when eight threads are supported. This implies that simply balancing the number of instructions in the early stages of the pipeline does not work well when too many threads (e.g., eight) exist. When the number of threads is large, the effect of balancing those counts does not translate well into the effects targeted by other policies such as BRCOUNT or L1MISSCOUNT. Detecting such conditions and preventing ICOUNT from being the fetch policy of choice in such cases should therefore help sustain the throughput of an SMT processor. However, the number of cases in which ICOUNT is the best is insensitive to the number of threads (Figure 6.1). This implies that the mixtures that respond better to ICOUNT are not affected by the number of threads.

From the two figures, it appears that BRCOUNT is sensitive to the number of threads as well: it works better as the number of threads increases, becoming more likely to be the best performer (Figure 6.1) and at the same time less likely to be the worst performer (Figure 6.2). However, note that BRCOUNT usually is not a good fetch policy for our mixtures of applications.

[Figure: stacked bars, one per thread count, showing how often each fetch policy (brcount, icount, memcount, ldcount, l1missc, l1dmiss, l1imiss, stallcount, and the remaining policies) ranked first; x-axis: number of threads]

Figure 6.1: How often a fetch policy becomes the best performer?

[Figure: stacked bars, one per thread count, showing how often each fetch policy (accipc, brcount, icount, memcount, ldcount, l1missc, l1dmiss, l1imiss, stallcount, RR) ranked last; y-axis: 0-100%]

Figure 6.2: How often a fetch policy becomes the worst performer?

6.1.2 Pseudo Fetch Policy - Max

We define the pseudo fetch policy Max to be the fetch policy that performs best in a given interval. The policy Max is different from an oracle scheduler: the best fetch policy is chosen within each interval in isolation for Max, whereas an oracle scheduler would have to take into account the effect of its choice on future intervals. The scheduling result of an oracle scheduler would not change with the length of the intervals, while that of Max can change significantly.
Though Max cannot represent the true performance envelope, it approximates the potential of what we can achieve with the adaptive approach to thread scheduling.

Let us take a closer look at the performance of each fetch policy in Figure 6.3(a). This graph was obtained while simulating the combination mxI-mxM-mx discussed in Chapter 5, and it shows how each policy performs in each 8K-cycle interval. It is interesting to note that ICOUNT performs the worst at the three test points shown in the graph. The IPC of Max in each interval is, by definition, the best IPC achieved in that interval. During some intervals of this specific example, the IPC of Max is the same as that of BRCOUNT, LDCOUNT or ICOUNT and, interestingly, during some other intervals it matches that of the round-robin policy, because round-robin's value was higher than that of all the other "smart" fetch policies. In this example, Max reaches an accumulated IPC of about 4.2 over the whole duration shown in the graph, which is about one instruction per cycle higher than that of the group formed by BRCOUNT, L1MISSCOUNT and ICOUNT (see Figure 6.3(b)). This means that if the SMT processor is equipped with an adaptive thread scheduling capability that can dynamically change fetch policies to suit the changing environment, it can potentially achieve significant throughput improvements.

[Figure: (a) throughput of each policy in each interval vs. time (cycles); (b) accumulated throughput vs. time (cycles)]

Figure 6.3: Throughput of the virtual fetch policy Max in each interval

Figure 6.4 shows the performance of the Max strategy obtained using larger sets of data. Each data point in the graph represents, for one fetch policy, the average IPC value over the thirteen different combinations of applications discussed in Chapter 5. The topmost line is for Max, for which the IPC of the pseudo fetch policy was averaged over all thirteen mixes; for each mix, the IPC of Max was obtained by averaging the highest IPC values found in each interval throughout the simulation lifetime. We find from Figure 6.4 that ICOUNT presents an average IPC higher than all other real fetch policies regardless of the number of threads. However, its IPC decreases as the number of threads increases from 6 to 8. This diminishing return in IPC as the number of threads increases was already discussed by previous studies [87, 35].
These studies recognized that inter-thread interference such as cache pollution accounts for the drawbacks of having too many threads simultaneously active on an SMT processor. The chart shows no such diminishing returns with the Max pseudo fetch policy. Consequently, with eight threads, its throughput is more than one instruction per cycle higher than that of ICOUNT, a 27% improvement.

It is obvious from Figure 6.4 that adaptive dynamic thread scheduling will find more opportunities for performance improvement than a fixed scheduling approach as the number of threads increases. There are two reasons for this. The first lies in the fact that the overhead of adaptive dynamic thread scheduling will be relatively small when there are more threads, for example, eight. The second is that an SMT processor will suffer more adverse effects, such as cross-replacement of blocks in the shared caches, which may result in the saturation observed in the studies described earlier. In such situations, fetch policies such as LDCOUNT or L1MISSCOUNT would work better than ICOUNT because they give priority to the thread that relies less heavily on the performance of the cache.

We also observe from the chart that even with the Max pseudo fetch policy, the IPC saturates when going from 6 threads to 8 threads, although it does not degrade as ICOUNT's does. This implies that one hardware context can be allocated to the detector thread without sacrificing performance in most cases. The possible paradox is that if the detector thread becomes too effective, it could push the saturation point beyond seven threads, thereby creating the need for additional contexts.

[Figure: average IPC vs. number of threads for Max and for each real fetch policy (AccIPC, BRCount, ICount, MemCount, LDCount, L1MissCount, L1DMissCount, L1IMissCount, StallCount, RR)]

Figure 6.4: IPC of Max and average IPC of various other fetch policies

6.1.3 Accuracy of the pseudo fetch policy Max

In the previous section, we showed approximately that there is indeed a need for adaptive dynamic thread scheduling: scheduling threads based exclusively on one single fixed fetch policy, ICOUNT, did not produce the high throughput that the pseudo policy Max did. However, that experiment does not guarantee that adaptive thread scheduling will actually outperform ICOUNT. The first reason is that Max is not exactly adaptive dynamic thread scheduling; Max is closer to an oracle scheduler. Suppose intervals X and Y are consecutive in time. Max takes the throughput values Vx and Vy of the best fetch policies during intervals X and Y, call them Px and Py, respectively. However, Vx cannot simply be followed by Vy if Px and Py are not the same: the throughput value Vy is valid only if Py was also in effect during interval X. This implies that the throughput we obtain for Max can be higher or lower than what we can actually achieve with adaptive thread scheduling. For a more accurate evaluation, we implemented the policy switching features in our SMT simulator, as stated earlier in Chapter 5; a minimal sketch of this switching logic is given below.
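The sketch follows the behavior described in Section 5.4 (check the last interval's throughput every 8K cycles, switch only when it falls below the threshold). The names are ours and the policy determination heuristics themselves are only stubbed out; this is not code from the simulator.

```c
#define INTERVAL_CYCLES 8192          /* 8K-cycle scheduling quantum */

enum fetch_policy { ICOUNT, BRCOUNT, LDCOUNT, MEMCOUNT, L1MISSCOUNT,
                    L1IMISSCOUNT, L1DMISSCOUNT, ACCIPC, STALLCOUNT, RR };

extern double ipc_threshold;          /* e.g. 2.0, the value that performed best  */
extern enum fetch_policy determine_new_policy(enum fetch_policy cur,
                                              double last_ipc);
                                      /* stand-in for the Type 1..4 / 3' heuristics */

static enum fetch_policy current_policy = ICOUNT;  /* initial policy, illustrative */

/* Invoked by the detector thread at the end of every 8K-cycle interval. */
static void end_of_interval(long committed_insts)
{
    double last_ipc = (double)committed_insts / INTERVAL_CYCLES;

    /* A switch is considered only when throughput drops below the threshold;
     * otherwise the current policy stays in force, possibly for many
     * consecutive intervals. */
    if (last_ipc < ipc_threshold)
        current_policy = determine_new_policy(current_policy, last_ipc);
}
```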
6.2 The optimal threshold values and policy determination heuristics

In Figure 4.6, we estimated how the throughput threshold value would affect the frequency of policy switchings and the quality of each switching. Figures 6.5(a) and (c) verify what we had expected; Figures 6.5(b) and (d) show how the type of policy determination heuristic affects the frequency and quality of switchings. Type 3' is the Type 3 heuristic augmented with consideration of the throughput gradient. It is interesting that the Type 4 heuristic actually results in more low-quality (malignant) switchings. This implies that determining a new fetch policy based on historical performance does not work well.

[Figure: (a) number of switchings vs. threshold value; (b) number of switchings vs. policy determination heuristic type; (c) probability of benign switches vs. threshold value; (d) probability of benign switches vs. type]

Figure 6.5: Effect of the threshold value on switch occurrence and quality
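Figures 6.5(c) and (d) report the probability of benign switches. As a concrete illustration only: the precise benign/malignant criterion is defined earlier in the dissertation and is not restated in this chapter, so the criterion below (comparing IPC just before and just after a switch) is our assumption, and all identifiers are invented.

```c
/* Illustrative switch-quality bookkeeping. Assumed criterion: a switch is
 * benign if the interval after the switch achieves higher IPC than the
 * interval that triggered it; otherwise it is malignant. */
static long   benign_switches;
static long   malignant_switches;
static double prev_interval_ipc = -1.0;   /* <0 until the first interval ends */

/* Call at each interval boundary with that interval's IPC and a flag saying
 * whether a policy switch was performed at the start of the interval. */
static void tally_switch_quality(double interval_ipc, int switched_at_start)
{
    if (switched_at_start && prev_interval_ipc >= 0.0) {
        if (interval_ipc > prev_interval_ipc)
            benign_switches++;      /* throughput recovered after the switch */
        else
            malignant_switches++;   /* the switch did not help, or hurt      */
    }
    prev_interval_ipc = interval_ipc;
}
```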
Figure 6.6 shows the effect of the threshold value on throughput. The best performance is reached when the threshold value is 2 and the Type 3 heuristic is used; the performance improvement over ICOUNT is about 30%.

[Figure: (a) IPC vs. threshold value; (b) IPC vs. heuristic type; (c) IPC vs. threshold value for each type; (d) IPC vs. type for each threshold value]

Figure 6.6: Effect of the threshold value and policy determination heuristic on throughput (average of all mixes)

Now let us look at two particular mixes of applications: hiI-hiM-f and mxI-mxM-mx. Figure 6.7 shows the results for hiI-hiM-f, the mix that mostly contains applications with higher single-threaded throughput, higher memory footprint and more intensive floating-point usage. This mix shows the best performance improvement that adaptive thread scheduling achieves among all our mixes. It is interesting that in this case most heuristics (Type 2 through Type 4) perform very well with threshold values in the range of 1 to 3, inclusive. With this specific mix, about 50% performance improvement over ICOUNT was observed.

[Figure: (a) IPC vs. threshold value; (b) IPC vs. heuristic type; (c) IPC vs. threshold value for each type; (d) IPC vs. type for each threshold value]

Figure 6.7: Effect of the threshold value and policy determination heuristic on throughput (for hiI-hiM-f mix only)

Figure 6.8 shows the results for mxI-mxM-mx, the mix in which the applications are fully mixed: applications with low and high single-threaded throughput are combined, applications with low and high memory footprint are put together, and floating-point and integer applications are mixed. In this specific example, we could not see much throughput improvement over fixed ICOUNT scheduling. It is interesting that the Type 1 heuristic showed the highest throughput when the threshold value was 2. We can infer from this that the mxI-mxM-mx combination is already very well balanced in many respects, so it is hard to boost throughput over ICOUNT because ICOUNT already addresses the inherent problem of such mixes very well. That is, when the threads form a well-balanced mix, simply balancing the number of instructions in the early stages of the pipeline is sufficient. Put another way, for well-balanced mixes we may not need adaptive dynamic thread scheduling.

Consequently, we may ask the following question: why not just let the job scheduler concentrate on co-scheduling well-balanced sets of applications, so that a fixed fetch policy such as ICOUNT works well? There are two reasons. The first is that the job scheduler cannot co-schedule well-balanced sets of applications all the time; furthermore, we need to prepare for worst-case situations. The other is that, without the detector thread's help, the job scheduler has to stay resident on the processor for a significantly longer duration.

[Figure: (a) IPC vs. threshold value; (b) IPC vs. heuristic type; (c) IPC vs. threshold value for each type; (d) IPC vs. type for each threshold value]

Figure 6.8: Effect of the threshold value and policy determination heuristic on throughput (for mxI-mxM-mx mix only)

Chapter 7

Conclusions

This dissertation has presented a study of adaptive dynamic thread scheduling in comparison to the fixed scheduling approaches employed in earlier research. It proposed the detector thread approach to implement adaptive scheduling with low hardware and software overhead. The detector thread is a special thread that occupies one designated thread context with minimal extra hardware. It is scheduled for execution by the thread selection unit when an idle slot is available after the maximum of two threads (and up to eight instructions) have been fetched.

To validate the idea, we used the SimpleSMT simulator to obtain an approximate upper bound on the performance improvement achievable with our approach. SPEC 2000 applications were used to create thirteen different mixes of applications based on single-application performance, memory footprint and type (integer or floating-point). Simulation results showed that adaptive scheduling can produce up to 27% performance improvement over fixed scheduling for eight threads.
This dissertation stresses that such adaptive scheduling is particularly suitable and advantageous for SMT architectures because they can readily accommodate a resident thread with minimal overhead.

The results obtained in this study are extremely encouraging. Ever since SMT was introduced, studies have repeatedly shown that having too many threads (usually more than four) does not bring the anticipated throughput increase and, worse yet, sometimes even lowers throughput. Our study showed that adaptive thread scheduling in combination with a detector thread can significantly extend the saturation point in terms of the number of threads. It was also observed that as the number of threads increases, fixed scheduling with ICOUNT is more likely to perform poorly. This implies that adaptive dynamic thread scheduling becomes more necessary in an SMT processor with a greater number of threads.

We furthered our study by proposing ways to implement adaptive dynamic thread scheduling based on the implementation ideas for simultaneous subordinate microthreading. Our proposed high-level architecture was further developed by devising a way to avoid interference in data accesses. The software architecture for the detector thread was developed, and various heuristics were evaluated for determining the fetch policy to be used for the next scheduling quantum. Type 3 turned out to work best, with a threshold value of 2. Type 4, which keeps track of the outcomes of earlier decisions, turned out not to be worth the effort: there seemed to be no correlation in the time domain regarding the fetch policies, because there is no fixed pattern in the interactions between independent threads. Once the job scheduler is put into the picture, correlation in the time domain will be even harder to find, because the set of co-scheduled applications will change more dynamically.

This work started with the idea that a fixed fetch policy would be outperformed by adaptive dynamic thread scheduling that can switch fetch policies once low throughput is detected. It proposed a way (the detector thread) to implement this at low cost, and it proposed and evaluated different heuristics for determining a new fetch policy.

Chapter 8

Future Work

For more accurate evaluation of adaptive dynamic thread scheduling, the instruction set needs to be extended to support the detector thread instructions. Special register names need to be defined and used by the detector thread instructions to read the status counters and update the flags. When the detector thread instructions are implemented more realistically, the detector thread can be simulated so that its detection of low throughput occurs at variable times. Consequently, the effects of the detector thread can be evaluated more accurately and compared against the fixed approach that was used throughout this work.

However, even without the most accurate evaluation, our methodology should be valid because the detector thread instruction and operand fetch paths are isolated from those of the normal threads to a significant extent. Furthermore, the instructions of the detector thread are fetched only when one or more idle slots are left over after all possible instructions from the normal threads have been attempted and tested for fetch; a small sketch of this slot allocation is given below.
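The sketch uses invented names and is not the simulator's code; it only restates the constraint that the detector thread consumes fetch slots that would otherwise go idle.

```c
#define FETCH_WIDTH        8
#define MAX_FETCH_THREADS  2

extern int  fetch_from_normal_threads(int max_threads, int max_insts);
                                   /* returns the number of instructions fetched */
extern void fetch_detector_thread(int max_insts);

/* Per-cycle fetch: the detector thread never competes with normal threads;
 * it only receives fetch slots left over after the normal threads have been
 * attempted and tested for fetch. */
static void fetch_cycle_with_detector(void)
{
    int used       = fetch_from_normal_threads(MAX_FETCH_THREADS, FETCH_WIDTH);
    int idle_slots = FETCH_WIDTH - used;

    if (idle_slots > 0)
        fetch_detector_thread(idle_slots);
}
```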
Fetching operands from DT DRAM should have a minimal effect because it does not interfere with data accesses made by the instructions of the normal threads. The overhead of storing the results of a detector thread instruction should also be negligible, because the store is made exclusively to the DT DRAM without interfering with the write buffers of the normal threads. The overhead of operand fetches for the detector thread should also be minimal even when the detector thread accesses normal memory (not DT DRAM): the fact that such operand fetches are pending implies that the system's throughput is lower than the threshold value, and thus many idle slots should be available in the pipeline.

Determining a fetch policy for the next scheduling quantum is the key function of the detector thread. What we have explored in this work is a small portion of the whole possible design space. One direction to explore is to add more states (fetch policies) to the Heuristic Type 3. For example, we could use L1IMISSCOUNT and L1DMISSCOUNT instead of L1MISSCOUNT to make the states respond more specifically to varying situations. It would then be necessary to define more conditions to determine the direction of each transition.

Mixes in this study were formed based on three factors: single-thread throughput, memory footprint and type. We can employ more parameters to form mixes more finely. By doing so, we will be able to analyze the experimental results more effectively and enhance the presented heuristics.

Throughout this study, a mixture of applications remained the same from the beginning of a simulation session to its end. To evaluate the effect of the job scheduler and the interaction between the job scheduler and the detector thread, we should be able to vary the co-scheduled threads dynamically. Such a feature can be added to the simulation platform, and in such an environment the effect of the detector thread identifying clogging threads can be quantitatively evaluated, showing how much the duration for which the job scheduler process stays resident in the SMT processor is reduced.

Reference List

[1] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith. The Tera computer system. In Proceedings of the 1990 International Conference on Supercomputing, pages 1-6, 1990.

[2] Executive Summary Applications. Alpha and ia64.

[3] A. Tanenbaum. Modern Operating Systems. Prentice-Hall, Inc., 1992.

[4] T. Austin. The SimpleScalar Architectural Research Tool Set, Version 2.0. Technical Report 1342, University of Wisconsin-Madison, June 1997.

[5] M. Bach. The Design of the UNIX Operating System. Prentice-Hall Software Series. Prentice-Hall, Inc., 1986.

[6] L. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 282-293, June 1999.

[7] M. Bekerman, A. Mendelson, and G. Sheaffer. Performance and hardware complexity tradeoffs in designing multithreaded architectures. In Proceedings of Parallel Architectures and Compilation Techniques, October 1996.
[8 ] Doug Burger, James R. Goodman, and Alain Kagi. Memory Bandwidth Lim itations of Future Microprocessors. In Proceedings of the 23rd International Symposium on Computer Architecture, pages 78-89, 1996. [9 ] J. Burns and J-L Gaudiot. Exploring the SMT Fetch Bottleneck. In Proceed ings of the Workshop on Multithreaded Execution, Architecture and Compila tion (MTEAC99), Orlando, Florida, January 1999. [10] J. Burns and J-L Gaudiot. Quantifying the SMT Layout Overhead - Does SMT Pull Its Weight? In Proceedings of the 6th International Symposium on High Performance Computer Architecture, pages 109— 120, January 2000. [11] B. Calder, P. Feller, and A. Eustace. Value profiling and optimization. J. Instruction-level Parallelism, 1:1-6, 1999. 99 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [12] B. Calder, G. Reinman, and D. Tullsen. Selective value prediction. In isca26, May 1999. [13] R. Chappell, J. Stark, S. Kim, S. Reinhardt, and Y. Patt. Simultaneous Sub ordinate Microthreading (SSMT). In Proceedings of the 26th Annual Interna tional Symposium on Computer Architecture, pages 186-195, May 1999. [14] S. Chen and W. Fuchs. Compiler-assisted multiple instruction word retry for vliw architectures. IEEE Transactions on Parallel and Distributed Systems, 12(12):1293— 1304, 2001. [15] T. Conte, K. Menezes, and B. Patel. Optimization of instruction fetch mech anisms for high issue rates. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995. [16] D.Culler, J.Singh, and A.Gupta. Parallel Computer Architecture: A Hardware Software Approach. Morgan Kaufmann, 1999. [17] Keith Diefendorff and Pradeep K. Dubey. How Multimedia Workloads Will Change Processor Design. IEEE Micro, 30(9):43-45, 1997. [18] T. Diep, C. Nelson, and J. Shen. Performance Evaluation of the PowerPC 620 Microarchitecture. In Proceedings of the 22nd Annual International Sympo sium on Computer Architecture, June 1995. [19] K. Ebcioglu, J. Pritts, S. Kosonocky, M. Gschwind, E. Altman, and K. Kailas. An eight issue tree-vliw processor for dynamic binary translation. In Proc. In t’ l Conf Computer Design, October 1998. [20] S. Eggers, J. Emer, H. Levy, J. Lo, R. Stamm, and D. Tullsen. Simultaneous Multithreading: A Platform for Next-Generation Processors. IEEE Micro, pages 12-18, September/ October 1997. [21] M. Fillo, S. Keckler, W. Dally, N. Carter, A. Chang, Y. Gurevich, and W. Lee. The M-Machine Multicomputer. In Proceedings of the 28th Annual Interna tional Symposium on Microarchitecture, December 1995. [22] J. Fisher. Very Long Instruction Word Architectures and the ELI-512. In The 10th Annual International Symposium on Computer Architecture, pages 140-150. IEEE Computer Society, June 1983. [23] K. Flautner, R. Uhlig, S. Reinhardt, and T. Mudge. Thread-level Parallelism and Interactive Performance of Desktop Applications. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 129-138, Cambridge, Massachussets, November 2000 . 100 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [24] M. Flynn, P. Hung, and K. Rudd. Deep-Submicron Microprocessor Design Issues. IEEE Micro, 19(4): 11— 22, July 1999. [25] C. Fu, M. Jennings, S. Larin, and T. Conte. Value speculation scheduling for high performance processors. In Architectural Support for Programming Languages and Operating Systems, pages 262-271, 1998. [26] R. 
Govindarajan, S. Nemawarkar, and P. LeNir. Design and performance eval uation of a multithreaded architecture. In Proceedings of the 1st International Symposium on High Performance Computer Architecture, pages 298-307, 1995. [27] M. Gulati and N. Bagherzadeh. Performance Study of a Multithreaded Super scalar Microprocessor. In Proceedings of the 2nd International Symposium, on High Performance Computer Architecture, pages 291-301, Feburary 1996. [28] B. Gunther. Multithreading with Distributed Functional Units. IEEE Trans actions on Computers, pages 399-411, April 1997. [29] L. Gwenlapp. Dansoft Develops VLIW Design. Microprocessor Report, 11 (2): 18-22, Feburary 1997. [30] L. Hammond, B. Hubbert, M. Siu, M. Prabhu, M. Chen, and K. Olukolun. The Stanford Hydra CMP. IEEE Micro, 20(2):71-84, March/April 2000. [31] L. Hammond, B. Nayfeh, and K. Olukotun. A Single-Chip Multiprocessor. IEEE Computer Special Issue on Billion-Transistor Processors, 30(9):79-85, September 1997. [32] J. Hennessy and D. Patterson. Computer Organization and Design: The Hard ware/Software Interface, Morgan Kaufmann Publishers, Inc., 2nd edition, 1998. [33] J. Henning. SPEC CPU2000: Measuring CPU Performance in the New Mil lennium. IEEE Computer, 33(7):28— 35, July 2000. [34] S. Hily and A. Seznec. Branch Prediction and Simulataneous Multithreading. In Proceedings of Parallel Architectures and Compilation Techniques, 1996. [35] S. Hily and A. Seznec. Standard Memory Hierarchy Does Not Fit Simultaneous Multithreading. In Proceedings of M TEAC ’ 98 Workshop, Feburary 1998. [36] H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase, and T. Nishizawa. An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threadds. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 136-145, May 1992. 101 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [37] J. Hunter and W. Crawford. Java Servlet Programming. Prentice-Hall Soft ware Series. O’Reilly and Associates, Inc., 2nd edition, April 2001. [38] K. Hwang and Z. Xu. Scalable Parallel Computing: Technology, Architec ture, Programming. McGraw-Hill Series in Networks, Parallel and Distributed Computing. WCB/McGraw-Hill, 1998. [39] M. Johnson. Superscalar Microprocessor Design. Prentice Hall Series in Inno vative Technology. P T R Prentice-Hall, Inc., 1990. [40] Y. Kang, M. Huang, S. Yoo, Z. Ge, D. Keen, V. Lam, P. Pattnaik, and J. Tor- rellas. FlexRAM: Toward an Advanced Intelligent Memory System. In Inter national Conference on Computer Design (ICCD), 1999. [41] S. Keckler and W. Dally. Processor coupling: Integrating compile time and runtime scheduling for parallelism. In Proceedings of the 19th Annual Inter national Symposium on Computer Architecture, pages 202-213, May 1992. [42] R. Kessler. The Alpha 21264 Microprocessor. IEEE Micro, 19(2):24-36, March/April 1999. [43] A. Klaiber. The Technology behind Crusoe Processors. Technical report, Transmeta Corporation, January 2000. [44] AJ KleinOsowski, J. Flynn, N. Meares, and D. Lilja. Adapting the SPEC 2000 Benchmark Suite for Simulation-Based Computer Architecture Research. In The 3rd IEEE Annual Workshop on Workload Characterization, Austin, Texas, September 2000. [45] P.M. Kogge, J.B. Brockman, and V.W. Freeh. PIM architectures to support petaflops level computation in the HTMT machine. 
In 1999 International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems, pages 35-44, 2000. [46] Christoforos E. Kozyrakis, Stylianos Perissakis, David Patterson, Thomas An derson, Krste Asanovic, Neal Cardwell, Richard Fromm, Jason Golbus, Ben jamin Gribstad, Kimberly Keeton, Randi Thomas, Noah Treuhaft, and Kather ine Yelick. Scalable Processors in the Billion-Transistor Era: IRAM. Com puter, 30(9):75-78, 1997. [47] B. Krishna and R. Govindarajan. Classification and Performance Evaluation of Simultaneous Multithreaded Architectures. In Proceedings of the fth. Inter national Conference on High Performance Computing (HiPC-97), pages 34-39, December 1997. 102 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [48] V. Krishnam and J. Torrellas. Efficient Use of Processing Transistors for Larger On-Chip Storage: Multithreading. In Proceedings of ISCA ’ 97 - Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, June 1997. [49] M. Lam. Software pipelining: An effective scheduling technique for vliw ma chines. In Proc. ACM SIGPLAN 1988 Conf. Programming Language Design and Implementation, pages 318-328, 1988. [50] M. Lam and R. Wilson. Limits of Control Flow on Parallelism. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 46-57, Gold Coast, Australia, May 1992. [51] E. Larson and T. Austin. Compiler controlled value prediction using branch predictor based confidence. In Proceedings of the 33rd Annual International Symposium on Microarchitecture, 2000. [52] S. Lee and J-L Gaudiot. ALPSS: Architectural Level Power Simulator for Simultaneous Multithreading, Version 1.0. Technical Report TR-02-04, Uni versity of Southern California, April 2002. [53] M. Lipasti, C. Wilkerson, and J. Shen. Value locality and load value prediction. In Architectural Support for Programming Languages and Operating Systems, pages 138-147, October 1996. [54] J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen. Convert ing Thread-Level Parallelism to Instruction-Level Parallelism via Simultane ous Multithreading. ACM Transactions on Computer Systems, pages 322-354, August 1997. [55] J. Lo, S. Eggers, H. Levy, S. Parekh, and D. Tullsen. Tuning Compiler Op timizations for Simultaneous Multithreading. In Proceedings of the 30th A n nual International Symposium, on Microarchitecture, pages 114-124, December 1997. [56] J. Lo, S. Eggers, H. Levy, and D. Tullsen. Software-Directed Register Deal location for Simultaneous Multithreaded Processors. IEEE Transactions on Parallel and Distributed Systems, pages 922-933, September 1999. [57] M. Loikkanen and N. Bagherzadeh. A Fine-Grain Multithreading Superscalar Architecture. In Proceedings of the 1996 International Conference on Parallel Architectures and Compilation Techniques, pages 163-168, October 1996. [58] C. Luk. Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors. In Proceedings of the 28th Annual International Symposium on Computer Architecture, pages 40-51, June 2001. 103 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [59] Matthew McCormick, Jonathan Ledlie, and Omer Zaki. Adaptively Schedul ing Processes on a Simultaneous Multithreading Processor. Technical report, University of Wisconsin - Madison, 2000. [60] N. Mitchell, L. Carter, J. Ferrante, and D. Tullsen. ILP versus TLP on SMT. 
In Proceedings of the 1999 Conference on Supercomputing, November 1999. [61] M. Moudgili, J. Moreno, K. Ebcioglu, E. Altman, S. Chen, and A. Polyak. Ccompiler/ architecture interaction in a tree-based vliw processor. IEEE Tech nical Committee on Computer Architecture Newsletter, pages 10-12, 1997. [62] B. Nayfeh and Olukotun. Exploring the Design Space for a Shared-Cache Mul tiprocessor. In Proceedings of Parallel Architectures and Compilation Tech niques, 1994. [63] H. Oehring, U. Sigmund, and Th. Ungerer. Simultaneous Multithreading and Multimedia. In Workshop on Multi- Threaded Execution, Architecture and Compilation (M TEAC 99) in conjunction with Fifth International Symposium on High Performance Computer Architecture (HPCA-5), January 1999. [64] Mark Oskin, Frederic T. Chong, and Timothy Sherwood. Active Pages: A Computation Model for Intelligent Memory. In The 1998 Annual Interna tional Symposium on Computer Architecture, pages 192-203. IEEE Computer Society, June 1998. [65] V. Pai, P. Ranganathan, H. Abdel-Shafi, and S. Adve. The Impact of Exploit ing Instruction-Level Parallelism on Shared-Memory Multiprocessors. IEEE Transactions on Computers, 48(2):218— 226, Feburary 1999. [66] V. Pai, P. Ranganathan, and S. Adve. The Impact of Instruction-Level Par allelism on Multiprocessor Performance and Simulation Methodology. In Pro ceedings of the 3rd International Symposium on High Performance Computer Architecture, pages 72-83, Feburary 1997. [67] S. Parekh, S. Eggers, H. Levy, and J. Lo. Thread-Sensitive Scheduling for SMT Processors. Technical report, University of Washington, 2000. [68] S. Patel, M. Evers, and Y. Patt. Improving trace cache effectiveness with branch promotion and trace packing. In isca25, June 1998. [69] Y. Patt, S. Patel, M. Evers, D. Friendly, and J. Stark. One Billion Transistors, One Uniprocessor, One Chip. IEEE Micro, 30(9):51— 57, 1997. [70] D. Patterson and J. Hennessy. Computer Architecture: A Quantitative Ap proach. Morgan Kaufmann Publishers, Inc., 1990. 104 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [71] David Patterson, Thomas Anderson, Neal Cardwell, Richard Fromm, Kimberly Keeton, Christoforos Kozyrakis, Randi Thomas, and Katherine Yelick. A Case for Intelligent RAM. IEEE Micro, 17(2):34-44, 1997. [72] E. Rotenberg, S. Bennett, and J. E. Smith. Trace cache: A low latency ap proach to high bandwidth instruction fetching. In micro29, pages 24-34, De cember 1996. [73] E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. E.Smith. Trace processors. In microSO, December 1997. [74] A. Roth and G. Sohi. Speculative Data-Driven Multithreading. In Proceedings of the 7th International Symposium on High Performance Computer Architec ture, pages 37-48, Monterrey, Mexico, January 2001. [75] C. Shin, S. Lee, and J-L Gaudiot. The Need for Adaptive Dynamic Thread Scheduling in Simultaneous Multithreading. In Proceedings of the 1st Work shop on Hardware/Software Support for Parallel and Distributed Scientific and Engineering Computing (SPDSEC-02) in conjunction with the 11th In ternational Conference on Parallel Architectures and Compilation Techniques (PACT-02), September 2002. [76] U. Sigmund, M. Steinhaus, and T. Ungerer. On Performance, Transistor Count and Chip Space Assessment of Multimedia-enhanced Simultaneous Mul tithreaded Processors. In Workshop on Multi- Threaded Execution, Architec ture and Compilation (M TEAC-f), December 2000. [77] U. Sigmund and T. Ungerer. 
Evaluating a Multithreaded Superscalar Micro processor versus a Multiprocessor Chip. In Proc. of the 4 th PAS A Workshop- Parallel Systems and Algorithms, pages 147-159, April 1996. [78] B. Smith. Architecture and applications of the hep multiprocessor computer. Society of Photocoptical Instrumentation Engineers, 298:241-248, 1981. [79] A. Snavely and D. Tullsen. Symbiotic Jobscheduling for a Simultaneous Mul tithreading Architecture. In Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 234-244, Cambridge, Massachussets, November 2000. [80] G. Sohi. Instruction Issue Logic for High-Performance, Interruptible, Multiple Functional Unit, Pipelined Computers. IEEE Transactions on Computers, 39(3):349-359, March 1990. [81] G. Sohi and M. Franklin. High-Bandwidth D ata Memory Systems for Super scalar processors. In Proceedings of the 4th International Conference on Ar chitectural Support for Programming Languages and Operating Systems, pages 53-62, April 1991. 105 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [82] Y. Song and M. Dubois. Assisted Execution. Technical Report Technical Report CENG 98-25, University of Southern California, 1998. [83] G. Edward Suh, S. Devadas, and L. Rudolph. A New Memory Monitoring Scheme for Memory-Aware Scheduling. In Proceedings of the High Perfor mance Computer Architecture (HPCA ’ 02) Conference, Feburary 2002. [84] R. Thekkath and S. Eggers. The Effectiveness of Multiple Hardware Contexts. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 328-337, San Jose, California, October 1994. [85] M. Thistle and B. Smith. A processor architecture for horizon. In Proceedings Supercomputing 88, 1988. [86] H. Torng and S. Vassiliadis, editors. Instruction-Lev el Parallel Processors. IEEE Computer Society Press, 1995. [87] D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm. Exploit ing Choice: Instruction Fetch and Issue on an Implement able Simultaneous Multithreading Processor. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, pages 191-202, May 1996. [88] D. Tullsen, S. Eggers, and H. Levy. Simultaneous Multithreading: Maximiz ing On-Chip Parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 392-403, June 1995. [89] D. Tullsen, J. Lo, S. Eggers, and H. Levy. Supporting Fine-Grain Synchro nization on a Simultaneous Multithreaded Processor. In Proceedings of the 5th International Symposium on High Performance Computer Architecture, pages 54-58, January 1999. [90] T. Ungerer, B. Robic, and J. Silc. Multithreaded processors. The Computer Journal, 45(3):320-348, 2002. [91] D. Wall. Limits of Instruction-Level Parallelism. Technical Report WRL-TR- 93.6, Western Research Laboratory, Digital, 1993. [92] D. Wall. Speculative Execution and Instruction-Level Parallelism. Technical Report WRL-TN-42, Western Research Laboratory, Digital, 1994. [93] S. Wallace, B. Calder, and D. Tullsen. Threaded multiple path execution. In Proceedings of the 25th Annual International Symposium on Computer Archi tecture, pages 238-249, June 1998. 106 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [94] S. Wallace, D. Tullsen, and B. Calder. Instruction Recycling on a Multiple- Path Processor. 
In Proceedings of the 5th International Symposium on High Performance Computer Architecture, pages 44-53, January 1999. [95] H. Wang, P. Wang, R. Weldon, and et. al. Speculative Precomputation: Ex ploring the Use of Multithreading for Latency. Intel Technology Journal, 6(1), Feburary 2002. 107 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.