EFFECTS OF MEMORY CONSISTENCY MODELS ON MULTITHREADED MULTIPROCESSOR PERFORMANCE

by Yong-Kim Chong

A Thesis Presented to the
FACULTY OF THE SCHOOL OF ENGINEERING
UNIVERSITY OF SOUTHERN CALIFORNIA

In Partial Fulfillment of the Requirements for the Degree
MASTER OF SCIENCE in COMPUTER ENGINEERING

May 1993

Copyright 1993 Yong-Kim Chong

UMI Number: EP43878. Published by ProQuest LLC (2014). Copyright in the Dissertation held by the Author. Microform Edition (c) ProQuest LLC. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code. ProQuest LLC, 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106-1346.

[Signature page: this thesis, written under the guidance of the Faculty Committee and approved by all its members, has been presented to and accepted by the School of Engineering in partial fulfillment of the requirements for the degree.]

Acknowledgments

I would like to express my sincere appreciation to my advisor, Professor Kai Hwang, for his patience, support and guidance in the writing of this thesis. It has been a great pleasure and privilege to work with him. I would also like to thank Professor Viktor K. Prasanna and Professor Rafael H. Saavedra-Barrera for serving on my thesis committee and giving their valuable comments. I would like to thank Weihua Mao, Shisheng Shang and Hwang-Cheng Wang for the help they rendered.

Contents

Acknowledgments
List of Tables
List of Figures
Abstract

1. Introduction and Background
   1.1 Summary of Research Work
       1.1.1 Research Objectives
       1.1.2 Thesis Organization
   1.2 Related Previous Research
       1.2.1 Shared Memory Models
       1.2.2 Distributed Shared Memory System
   1.3 Latency Hiding Techniques
       1.3.1 Hardware-Supported Coherent Caches
       1.3.2 Relaxed Consistency Memory Models
       1.3.3 Software-Controlled Data Prefetching
       1.3.4 Multiple Contexts
   1.4 Generalized Stochastic Petri Nets
       1.4.1 Basic GSPN Model
       1.4.2 Steady State Probability Distribution
2. GSPN Model of Multithreaded Processors
   2.1 Abstract Processor Model
   2.2 Deterministic Processor Model
   2.3 The GSPN Processor Model
       2.3.1 Embedded Markov Chain
       2.3.2 Steady State Solution
       2.3.3 Performance Curve
   2.4 Conclusions

3. Effects of Sequential Consistency Memory Model
   3.1 Processor Architecture
   3.2 Constraints on Memory Accesses Under SC
   3.3 The GSPN Processor Model
   3.4 Performance Curves
   3.5 Conclusions

4. Effects of Relaxed Consistency Memory Models
   4.1 Memory Access Constraints
   4.2 Processor Consistency Model
   4.3 Weak Consistency Model
   4.4 Release Consistency Model

5. Cache Effects and Comparison
   5.1 Cache Effects on Various Memory Models
   5.2 Comparison With Simulation Results
   5.3 Relative Merits of Various Memory Models

6. Conclusions
   6.1 Summary of Research Contributions
   6.2 Suggestions for Further Research

Bibliography

List of Tables

2.1 Multithreaded Processor Efficiency for R=16 and L=130
3.1 Default parameter values for GSPN models
5.1 Parameter values obtained for MP3D, LU and PTHOR

List of Figures

1.1 Distributed Shared Memory System
1.2 Sample code segment with prefetch instruction
1.3 A simple GSPN model and its reachability set
2.1 Abstract model of a multithreaded processor
2.2 Operation modes of a multithreaded processor
2.3 Processor efficiency curve using deterministic model
2.4 GSPN model of a multithreaded processor
2.5 EMC of the GSPN model for N=2
2.6 Reduced EMC of the GSPN model for N=2
2.7 Performance curves for R=16 and L=130
3.1 Multithreaded processor architecture with coherent cache
3.2 GSPN model for Sequential Consistency
3.3 SC performance curve using default parameter values
3.4 SC performance curves with different s
3.5 SC performance curves with different mr and mw
3.6 SC performance curves with different sw
3.7 SC performance curves with different pL
4.1 GSPN model for PC with static allocation of servers
4.2 PC performance curves with static allocation of servers
4.3 GSPN model for PC with dynamic allocation of servers
4.4 PC performance curves using default parameter values
4.5 PC performance curves with different B
4.6 PC P{stall} with different B
4.7 PC performance curves with different sw
4.8 GSPN model for Weak Consistency
4.9 WC performance curves with different sacq and srel
4.10 WC P{stall} with different sacq and srel
4.11 WC performance curves with different B
4.12 WC performance curves with different sw
4.13 GSPN model for Release Consistency
4.14 RC performance curves with different sacq and srel
4.15 Combined performance curves for sacq=0.01 and B=6
4.16 Combined performance curves for sacq=0.05 and B=6
4.17 Combined performance curves for sacq=0.01 and B=4
5.1 Effects of cache degradation on different models
5.2 MP3D with estimated kr=kw=0.6
5.3 LU with estimated kr=0.4 and kw=4.4
5.4 PTHOR with estimated kr=kw=0.55
5.5 MP3D under different memory consistency models
5.6 LU under different memory consistency models
5.7 PTHOR under different memory consistency models

Abstract

In this thesis, stochastic models of multithreaded multiprocessors are provided. The multiprocessor is equipped with coherent caches in a Distributed Shared Memory (DSM) system under the constraints of various memory consistency models. The memory consistency models considered are Sequential Consistency (SC), Processor Consistency (PC), Weak Consistency (WC) and Release Consistency (RC). The memory access constraints imposed by each consistency model are specified and applied to the multithreaded processor, using a unified context switching policy. Generalized Stochastic Petri Net (GSPN) models are developed and solved by numerical solution methods using specific parameter values.

From the performance curves obtained, the effectiveness of the relaxed consistency models depends very much on the characteristics of the application program and the context switching policy used. In the case of small buffer size and high synchronization rate, the performance of the relaxed consistency models is worse than that of the SC model, due to excessive processor stalling time. With sufficiently large buffer size and low synchronization rate, significant performance gain over SC can be expected from each of the relaxed consistency models. The GSPN models for SC and RC correlate well with published simulation results from Stanford University, while the PC and WC models require further validation due to a lack of simulation results.
Chapter 1
Introduction and Background

1.1 Summary of Research Work

1.1.1 Research Objectives

The current trend in designing massively parallel computers is to build a scalable, Distributed Shared Memory (DSM) multiprocessor system with thousands of processing nodes [LH89, NL91, Bell92, Hwa93]. This is done by forming a large virtual memory space that combines the memory space of each processing node. The main advantage of a DSM system is that it provides the scalability of a distributed-memory multicomputer while keeping the programmability of a conventional shared-memory multiprocessor system. Some representative systems classified as DSM systems are the Stanford DASH [LLGx92], the Tera computer [ACCx90] and the CM5 [TMC91].

One of the main challenges in building a DSM system is the ability to hide the relatively long memory access latencies. Various latency hiding techniques, such as the use of Coherent Caches, Relaxed Memory Consistency Models, Prefetching and Multiple Contexts (multithreading), have been suggested by various researchers. Simulation studies on combinations of some or all of these techniques have been carried out and shown to be promising [WG89, GGH91, GHGx91]. Several analytical models have also been proposed to model multithreaded processors with and without other latency hiding schemes [Saav90, Agar92, MH93].

The primary objective of this research is to develop a stochastic model of a multithreaded processor equipped with a hardware coherent cache, under various memory consistency models. The consistency models considered are the Sequential Consistency (SC), Processor Consistency (PC), Weak Consistency (WC) and Release Consistency (RC) models. The Generalized Stochastic Petri Net (GSPN) is chosen for its capability of modeling both concurrency and synchronization operations, which is not possible with queuing models. The effects of some parameter changes on the relative performance of each model are also studied.

1.1.2 Thesis Organization

This thesis is divided into six chapters. Chapter 1 provides the background on DSM systems and latency hiding techniques, together with the theory behind the GSPN modeling technique. Chapter 2 presents a basic GSPN model of an abstract multithreaded processor. The performance curves obtained for various degrees of multithreading are compared with published results from an analytical model. In Chapter 3, the basic model is extended to model a multithreaded processor with a coherent cache, under the SC consistency model, in a DSM system. The architectural assumptions and parameters used for the model are defined. The GSPN model and its performance curves under some parameter changes are presented and analyzed.

In Chapter 4, the GSPN model is further extended to model the relaxed memory consistency models, namely PC, WC and RC, with their different degrees of memory access constraints. The performance curves obtained for all consistency models are compared under different parameter value changes. In Chapter 5, a simple cache degradation model is included to improve the accuracy of the GSPN models. Comparisons of the performance curves predicted by the GSPN models against published simulation results are made and possible sources of error identified. This is followed by a summary of the relative merits of the various memory models. Finally, Chapter 6 summarizes the main contributions and suggests topics for further research.
1.2 Related Previous Research

1.2.1 Shared Memory Models

The formal definitions of the memory consistency models were presented by different researchers [Lam79, Good89, DSB86, GLLx90]. Gharachorloo et al. [GGH91] performed a simulation study of each consistency model on a single-threaded multiprocessor system. The use of hardware-supported, directory-based, invalidating coherent cache protocols has been extensively covered in [LLGx90, LLGx92, LLJx92]. An analytical model of a multithreaded processor was presented by Saavedra et al. [Saav90, Saav91], based on the exact solution of the underlying Markov Chain. A separate analytical model was proposed by Agarwal [Agar92], which included both cache and network latency models. Gupta et al. [GHGx91] conducted simulation studies of the effects of combining different latency reducing techniques, including software-controlled prefetching. An analytical model was also presented by Mao and Hwang [MH93], which investigated the effects of software prefetching and RC on a multithreaded processor.

1.2.2 Distributed Shared Memory System

Two popular classes of parallel MIMD computers that have been widely used are the tightly coupled shared-memory multiprocessors and the loosely coupled distributed-memory multicomputers. In a shared-memory multiprocessor, a common global physical memory is shared by multiple processors, usually through a single or multiple bus system. This provides a natural extension of the uniprocessor programming model, but with the problem of a bottleneck on shared memory accesses, which limits the system to a small number of processors. The memory space is bounded by the bus width and the physical capacity of the shared-memory subsystem.

The distributed-memory multicomputer consists of a collection of independent processors, each having its own local memory, connected via an interconnection network using a message-passing paradigm. This provides a more scalable system in terms of memory space and processing power, but with much higher communication overhead. Even with the use of wormhole routing, the communication latency across an interconnection network with reasonably small diameter is still tens to hundreds of processor cycles for each access. Hence it is only suitable for coarse-grain parallel computation with minimal communication among the processors.

A new class of multiprocessor, called the Distributed Shared Memory (DSM) multiprocessor, has been actively studied for building the next generation of Massively Parallel Processing (MPP) systems. It applies the shared-memory abstraction on top of a message-passing, distributed-memory multicomputer system. A virtual address space, which is the combined memory space of the processing nodes, is created and shared among processes on the loosely coupled processors, as illustrated in Figure (1.1). This allows large programs or databases to be run on a multiprocessor system with relatively small individual memory space. DSM systems provide a number of benefits such as data abstraction, passing of variables by reference and lower run-time overhead. In short, DSM tries to combine the scalability of a distributed-memory system with the ease of programming of a shared-memory system, in order to achieve the ideal goal of a parallel computer.
A number of experimental systems such as DASH [LLGx92], PLUS [BR90] and Shiva [LS89] were built based on both hardware and software implementations of DSM. Some of the issues involved in the design of a DSM system are memory coherency, interprocess synchronization, structure and granularity of the shared data, heterogeneity and scalability of the system. Nitzberg and Lo [NL91] gave a detailed survey of existing DSM systems and how these issues were handled in each system.

[Figure (1.1): Distributed Shared Memory System, showing processing nodes with local memories connected by an interconnection network]

The scalability of a DSM system is affected by the amount of overhead incurred in interprocess communication, process synchronization and remote memory access latency. The remote memory access latency is determined by the bandwidth of the underlying interconnection network and the traffic density. For a large DSM system, the relatively long remote memory access latency will effectively cause low processor utilization and offset the performance gain expected from the use of parallelism. Hence techniques to reduce or hide these long latencies are critical to achieving high processor utilization and a more scalable multiprocessor system.

1.3 Latency Hiding Techniques

A number of techniques have been proposed by various researchers for reducing or hiding the remote memory access latency. These include Hardware Coherent Caches, Relaxed Memory Consistency Models, Software Prefetching and Multiple Contexts. These techniques are not mutually exclusive; each technique may interact constructively or destructively with the others. Simulation studies [GHGx91] have shown that a combination of different techniques often produces better performance than each applied individually.

1.3.1 Hardware-Supported Coherent Caches

This technique extends the conventional uniprocessor cache principle to include the caching of shared read-write data in a multiprocessor system. The idea is to exploit the locality of references for shared data, thus reducing the miss rate due to shared-data accesses. However, this gives rise to a number of cache coherence problems, as described in [DSB88]. These memory coherency problems can be easily solved for a small bus-based multiprocessor system using the Snoopy Cache Coherence Protocol [Good89]. The problem is much more complicated for a large-scale DSM system, and software or hardware support is required to enforce memory coherency at all levels.

The Stanford DASH system [LLGx90, LLGx92] is implemented with a hardware-supported, invalidating, distributed directory-based cache coherence protocol. Physical memories are distributed among all the processor nodes. A directory entry is kept for each memory block in the home node which owns the data, indicating all remote nodes caching it. Under the invalidation protocol, when a write to a shared data block occurs, point-to-point messages are sent to these nodes to invalidate remote copies of the data block. Acknowledgment messages are sent from these nodes to the originating node to signify the completion of the invalidation process.

The main benefit of employing coherent caches comes from the reduction in miss rate due to shared-data references, as compared with the case when only private data are cached. This effectively reduces the latency seen by the processor, thus improving overall processor efficiency.
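As a concrete sketch of the bookkeeping such a protocol implies, the C fragment below outlines one plausible per-block directory entry and its invalidation step. The data layout and names are my illustration, not DASH's actual implementation:

    /* Sketch (field names mine, illustrative only) of a per-block
     * directory entry for an invalidating, distributed directory-based
     * protocol of the kind described above. */
    #include <stdio.h>
    #include <stdint.h>

    #define MAX_NODES 64

    enum block_state { UNCACHED, SHARED, DIRTY };

    struct dir_entry {
        enum block_state state;
        uint64_t sharers;              /* bit n set: node n caches a copy */
    };

    /* On a write to a shared block, the home node sends point-to-point
     * invalidations to every sharer and counts the acknowledgments that
     * must arrive before the write is complete. */
    static int write_invalidate(struct dir_entry *e, int writer) {
        int pending = 0;
        for (int n = 0; n < MAX_NODES; n++)
            if (((e->sharers >> n) & 1) && n != writer) {
                /* send_invalidate(n);  network send omitted in sketch */
                pending++;
            }
        e->sharers = 1ULL << writer;   /* writer becomes the sole owner */
        e->state = DIRTY;
        return pending;
    }

    int main(void) {
        struct dir_entry e = { SHARED, 0x15 };   /* nodes 0, 2, 4 share */
        printf("%d acks awaited\n", write_invalidate(&e, 2));
        return 0;
    }

The acknowledgment count returned here corresponds to the completion messages the originating node must collect before the invalidation process, as described above, is finished.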
Simulation results obtained in [GHGx91] show significant improvement in processor performance with the caching of shared read-write data using hardware coherent caches.

1.3.2 Relaxed Consistency Memory Models

A memory consistency model defines the order in which the shared memory accesses of one process should be observed by the other processes in the system. It imposes restrictions on the order of shared memory accesses initiated by each processor, which translates into the amount of buffering and pipelining of memory accesses allowed in each processor when trying to hide the latency. Four memory consistency models have been proposed: Sequential Consistency, Processor Consistency, Weak Consistency and Release Consistency.

Sequential Consistency (SC) [Lam79] is the strictest model. It requires that the execution of a parallel program appear as some interleaving of the execution of the parallel processes on a sequential machine. Memory accesses are atomic and strongly ordered, thus preserving program order in all processors. This imposes severe restrictions on the outstanding accesses that a process may have, which in turn restricts the buffering of memory accesses.

On the other hand, Release Consistency (RC) [GLLx90] is the most relaxed model. It uses information about synchronization points and enforces explicit Acquire (lock) and Release (unlock) accesses, which are guaranteed to be processor consistent. An acquire is a read operation that gains permission to access a block of data, while a release is a write operation that gives away such permission. This allows extensive buffering and pipelining of memory accesses between synchronization points, and thus a potential for increased performance. The main disadvantages of RC lie in the more complex programming model and the increased hardware complexity, which requires lockup-free caches and a mechanism to keep track of multiple outstanding requests. However, the same hardware is also required to support prefetching and multiple contexts.

Processor Consistency (PC) [Good89] and Weak Consistency (WC) [DSB86, DSB88] fall between SC and RC in terms of potential performance. PC requires that the writes issued by each individual processor are always in program order, but the order of writes from different processors may be observed differently. This gives the opportunity for write buffering, thus allowing reads following a write to bypass the write. WC enforces consistency at the synchronization points, which usually protect the data structures within a critical section. Correctness is achieved by ensuring that all previous accesses are performed at the beginning and end of each critical section. This allows accesses within the critical section to be pipelined.

The performance gain from relaxed memory consistency models comes from the various degrees of buffering and pipelining of remote memory accesses, which depend on the program behavior and the amount of data shared among the processes. Gupta et al. [GGH91, GHGx91] presented simulation results comparing the performance of each consistency model. Chapters 3 and 4 of this thesis present stochastic models of these memory consistency models and their performance indices, using the Generalized Stochastic Petri Net (GSPN) modeling technique.
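The practical difference between these models can be seen in a standard two-thread fragment; the C program below is my illustration, not from the thesis. Under SC, at least one of the two loads must observe the other thread's store, so both registers cannot end up zero; a model that lets reads bypass buffered writes, such as PC, permits exactly that outcome:

    /* Two-thread memory-ordering illustration (hypothetical example).
     * Under Sequential Consistency the outcome r1 == 0 && r2 == 0 is
     * impossible: some interleaving must run one store before the
     * other load. Relaxed models that allow a read to bypass a
     * buffered write permit it.  Compile with: cc -pthread flag.c */
    #include <pthread.h>
    #include <stdio.h>

    int x = 0, y = 0;   /* shared flags */
    int r1, r2;         /* per-thread observations */

    void *thread_a(void *arg) {
        x = 1;          /* store */
        r1 = y;         /* load: may bypass the buffered store of x */
        return NULL;
    }

    void *thread_b(void *arg) {
        y = 1;
        r2 = x;
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, thread_a, NULL);
        pthread_create(&b, NULL, thread_b, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* On SC hardware, "r1=0 r2=0" can never print; on relaxed
         * hardware it can. (A modern C compiler may itself reorder
         * these plain accesses, which only reinforces the point.) */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }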
1.3.3 Software-Controlled Data Prefetching

Prefetching is a technique used to move data close to a processor before it is actually needed. This can be done by hardware or software methods, using either binding or non-binding schemes. With binding prefetching, the value of a later reference is bound as soon as the prefetch is completed. This poses a problem, since the value becomes stale if another process modifies the same location during the interval between the completion of the prefetch and the actual reference to the shared data. Non-binding prefetching does not have this problem, since it allows the data to remain visible to the cache coherence protocol. This ensures consistency of the data until it is actually referenced by the processor.

Hardware prefetching is normally done by having long cache lines or instruction look-ahead buffers. It is limited by the reduced spatial locality in multiprocessor applications, branches in instruction streams, finite buffer size and the fixed instruction look-ahead interval. On the other hand, software prefetching requires explicit prefetch instructions to be added by the programmer or an intelligent compiler. This allows selective prefetches to be issued only when necessary, thus reducing the amount of network traffic in the system. Furthermore, the freedom to extend the interval between a prefetch issue and the actual data reference is critical when latencies are large. The disadvantages of software prefetching are the extra instruction overhead and the programmer or compiler intervention required. An example of a program segment using a software prefetch instruction is shown in Figure (1.2).

    Prefetch(S)
    Lock(L)
    S = S + 1
    Unlock(L)

Figure (1.2): Sample code segment with prefetch instruction

The gain from the use of software prefetching depends on the amount of instruction overhead incurred and the coverage factor achieved, which is defined as the fraction of the shared-data references satisfied through prefetching. Mao and Hwang [MH93] have done a detailed analysis of the effect of software-controlled prefetching on the performance of a multithreaded multiprocessor system.

1.3.4 Multiple Contexts

Multiple contexts (or multithreading) allow multiple threads to be run on a single processor on a context switching basis. Each processor is equipped with a number of hardware contexts, with independent register sets, to achieve a fast context switching time. When a running context encounters a long-latency access (such as a miss on shared data), it is switched out and another ready-to-run context starts executing. In this manner, the long memory latency of one context can be hidden by useful computation of another context, thus increasing the processor utilization rate. Different context switching policies may be employed in deciding when to perform a context switch. These include switch on cache miss, switch on every load, switch on every instruction and switch on a block of instructions. The latter policies are mainly for the purpose of reducing the overhead of context switching.

The performance gain from employing multiple contexts on a single processor depends on several factors. The number of contexts available in the processor determines whether all the latencies can be completely hidden. The context switch overhead determines the maximum achievable processor efficiency. Other factors such as application program behavior, constructive or destructive cache interference and network traffic density also affect the expected performance gain. In general, a small number of contexts, between two and four, is sufficient to achieve close to maximum processor efficiency when combined with other latency hiding techniques [Saav90, GHGx91, Agar92, MH93]. Chapter 2 of this thesis presents a stochastic model of a multithreaded processor using the GSPN modeling technique; a rough simulation sketch of the same idea follows below.
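The C program below is my illustration, not the thesis's model: it assumes exponentially distributed run lengths, switch overheads and latencies, a switch-before-run order, and fully concurrent servicing of outstanding requests. With R=16, C=2 and L=130 its estimates follow the shape and levels of the GSPN results tabulated later in Table (2.1):

    /* Monte Carlo sketch of the multiple-context idea (illustrative).
     * N contexts share one processor; a context runs for an exponential
     * burst (mean R), issues a remote access (mean latency L, serviced
     * concurrently), and the processor pays a switch overhead (mean C)
     * to bring in the next ready context, idling only when every
     * context is waiting. Efficiency = fraction of time spent running. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    static double expo(double mean) {            /* exponential sample */
        return -mean * log(1.0 - drand48());
    }

    static double efficiency(int N, double R, double C, double L, long iters) {
        double t = 0.0, busy = 0.0, comp[64];
        int ready = N, nout = 0;
        for (long k = 0; k < iters; k++) {
            for (int i = 0; i < nout; )          /* absorb finished accesses */
                if (comp[i] <= t) { comp[i] = comp[--nout]; ready++; }
                else i++;
            if (ready == 0) {                    /* all waiting: idle */
                int m = 0;
                for (int i = 1; i < nout; i++) if (comp[i] < comp[m]) m = i;
                t = comp[m]; comp[m] = comp[--nout]; ready = 1;
            }
            ready--;
            t += expo(C);                        /* context switch overhead */
            double run = expo(R);                /* useful work until a miss */
            busy += run; t += run;
            comp[nout++] = t + expo(L);          /* remote access outstanding */
        }
        return busy / t;
    }

    int main(void) {
        srand48(1993);
        for (int N = 1; N <= 8; N++)
            printf("N=%d  E=%.3f\n", N, efficiency(N, 16.0, 2.0, 130.0, 200000));
        return 0;
    }

For N=1 the simulated cycle is run + switch + latency, so the estimate approaches R/(R+C+L) = 0.108, and for large N it approaches R/(R+C) = 0.889, bracketing the analytical results developed in Chapter 2.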
1.4 Generalized Stochastic Petri Nets

The Petri Net (PN) is an effective modeling tool for the description and analysis of concurrency and synchronization in parallel systems. Associating time, as an exponentially distributed random variable, with a PN gives rise to the stochastic PN (SPN), which is isomorphic to a homogeneous Markov Chain (MC). The use of SPNs in multiprocessor system evaluation has been very attractive due to their capability of describing both parallelism and synchronization, which is not possible in queuing network models. Moreover, the state transition rate diagram of the associated MC can be obtained directly from the SPN description.

The Generalized Stochastic Petri Net (GSPN) [Ajmo84] is an extension of the SPN that allows transitions to belong to two different classes, namely immediate and timed transitions. An immediate transition (represented by a segment) fires in zero time once it is enabled. A timed transition (represented by a white rectangle) fires after a random, exponentially distributed enabling time, identical to the SPN case; hence firing rates are associated only with timed transitions. This is particularly useful when the operating sequence of a system consists of activities whose durations differ by orders of magnitude. The short activities can be modeled from the logical point of view using immediate transitions. This effectively reduces the number of states of the associated MC, thus reducing the solution complexity as compared with the SPN.

1.4.1 Basic GSPN Model

A formal definition and derivation of the GSPN is given in [Ajmo84]. The firing rules specified for a GSPN differ from those of an SPN. If the set of enabled transitions, H, comprises only timed transitions, then transition $t_i$ ($i \in H$) fires with probability

$$P\{t_i\} = \frac{\lambda_i}{\sum_{j \in H} \lambda_j} \qquad (1.1)$$

which is the same as in an SPN, where $\lambda_j$ is the transition rate of transition j. However, if H comprises both immediate and timed transitions, then only the immediate transitions will fire. If more than one immediate transition is simultaneously enabled, then it is necessary to specify a probability density function (or switching distribution) on the set of enabled immediate transitions.

The reachability set of the GSPN can be obtained as in the case of a standard PN, together with the additional firing rules stated above. Moreover, the reachability set can be divided into two disjoint subsets. The markings that enable timed transitions only are called Tangible states, to signify the finite time spent in these states. The markings that enable immediate transitions only are called Vanishing states, since no time is spent in these states. A simple example of a GSPN model and its reachability set is shown in Figure (1.3).
[Figure (1.3): A simple GSPN model and its reachability set, with the markings partitioned into vanishing and tangible states]

1.4.2 Steady State Probability Distribution

The time behavior of a GSPN is equivalent to the time behavior of a stochastic process $\{X(t),\, t \ge 0\}$ with a finite state space. With the assumptions that the reachability set is finite, the firing rates do not depend on time parameters, and the initial marking is reachable with nonzero probability from any marking in the reachability set, we can classify the process as a finite-state-space, homogeneous, irreducible and continuous-time stochastic process.

If we disregard the concept of time and focus on the set of possible states the process enters due to the firing of any transition, an Embedded Markov Chain (EMC) can be recognized within the stochastic process. Let S be the state space of the EMC, which includes both Tangible (T) and Vanishing (V) states, such that

$$|S| = K_s, \quad |T| = K_t, \quad |V| = K_v, \quad S = T \cup V, \quad T \cap V = \emptyset, \quad K_s = K_t + K_v$$

The state transition probability matrix U of the EMC, with the $K_v$ vanishing states ordered before the $K_t$ tangible states, can then be written as

$$U = A + B = \begin{bmatrix} C & D \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ E & F \end{bmatrix} \qquad (1.2)$$

The elements of submatrix A can be obtained from the characteristics of the random switches, and the elements of submatrix B can be obtained from the firing rates of the timed transitions. The stationary probability distribution vector of the EMC, $\pi$, can be expressed as

$$\pi = \pi U \qquad (1.3)$$

Taking an arbitrary tangible state i as a reference state, the mean number of visits to state j between two successive visits to state i is given by

$$v_{ji} = \frac{\pi_j}{\pi_i} \qquad (1.4)$$

and the mean sojourn time in each state ($E[W_i]$, $i \in S$) is given by

$$E[W_i] = 0 \quad (i \in V) \qquad \text{and} \qquad E[W_i] = \frac{1}{\sum_{k \in H_i} \lambda_k} \quad (i \in T) \qquad (1.5)$$

where $H_i$ is the set of timed transitions enabled by tangible state i. The mean time for the stochastic process to return to the reference state i (the mean cycle time) is given by

$$C_i = \sum_{j \in T} v_{ji} E[W_j] \qquad (1.6)$$

where $v_{ji}E[W_j]$ is the mean time spent by the stochastic process in state j during a cycle. Thus the steady state probability distribution of the stochastic process underlying the GSPN model can be written as:

$$\pi_j = 0 \quad (j \in V) \qquad \text{and} \qquad \pi_j = \frac{v_{ji} E[W_j]}{C_i} \quad (j \in T) \qquad (1.7)$$

The solution method outlined above is computationally acceptable whenever the number of vanishing states is small ($K_v \ll K_t$). Since this method also computes the mean number of visits for all vanishing states, which is useless in terms of information content, a considerable amount of computational time is wasted. Furthermore, for large values of $K_v$ and $K_t$, the matrix U may be too large to solve.

In order to remove the vanishing states from the EMC, we need to compute the total transition probabilities among tangible states only. Let i and j represent arbitrary tangible states ($i, j \in T$) while r and s are arbitrary vanishing states ($r, s \in V$). Using $c_{rs}$, $d_{rj}$, $e_{is}$ and $f_{ij}$ to represent elements in the submatrix blocks C, D, E and F of the matrix U, the total transition probability between any two tangible states i and j ($u'_{ij}$) can be obtained by

$$u'_{ij} = f_{ij} + \sum_{r \in V} e_{ir} P\{r \to j\} \qquad (1.8)$$

where $P\{r \to j\}$ represents the probability that the stochastic process moves from vanishing state r to the tangible state j, following a path through vanishing states only.
This gives the transition probability matrix U' of the reduced EMC. The new stationary probability distribution $\pi'$ can be obtained by

$$\pi' = \pi' U' \qquad (1.9)$$

and the steady state probability distribution for the tangible states can be computed as in the previous case.

Chapter 2
GSPN Model of Multithreaded Processors

This chapter presents a GSPN model of a multithreaded processor architecture. A simple deterministic model is first derived. A GSPN model, along with its Embedded Markov Chain, is then presented, followed by the steps for the numerical solution of the steady state probability distribution. The performance curves obtained for various degrees of multithreading are compared with those of the analytical model presented in [Saav90].

2.1 Abstract Processor Model

We can represent a multithreaded processor in a multiprocessor system with an abstract model as shown in Figure (2.1). A total of N threads or contexts run in each processor. Each context runs for R cycles before making a remote memory access, at which point it is switched out and replaced by another ready-to-run context. The context switch overhead is C cycles, and each remote memory access operation takes L cycles to complete.

A number of assumptions are made in this abstract model. The synchronization accesses, which are program dependent, are ignored. Pipelining of remote memory accesses is allowed, with multiple servers in the network, thus allowing multiple outstanding requests to be serviced simultaneously. L is assumed larger than C, since it defeats the purpose of multithreading if L is equal to or smaller than C. Finally, we also assume that N remains constant throughout the computation, which means that there is sufficient parallelism for execution within each node.

[Figure (2.1): Abstract model of a multithreaded processor, showing N register sets, the context switch mechanism, local memory and remote-access request buffers]

The performance index of a multithreaded processor is its efficiency, given as the fraction of the total time during which the processor is executing useful instructions. The aim is to keep the processor busy with sufficient contexts, without having to wait for the completion of the remote memory accesses.

2.2 Deterministic Processor Model

A simple analysis of the processor efficiency can be carried out using deterministic values for the parameters R, C and L, as presented in [Saav90]. Although unrealistic, this model gives optimistic values and a first-order estimate of the processor efficiency for any given number of contexts. Two regions of operation need to be considered, namely the linear and saturation regions.

The processor is said to be in the linear region when there is an insufficient number of contexts to hide the remote access latency. This results in processor idle time during which all contexts are waiting for their remote accesses to complete, as illustrated by Figure (2.2a). Since there are a total of N contexts, the processor efficiency can be computed as

$$E_{lin} = \frac{NR}{R+L} \qquad (2.1)$$

$E_{lin}$ increases linearly with N until the point at which there are just enough contexts to completely hide all the remote access latencies. This is the saturation point, and the value of $N_{sat}$ is given by the relation

$$N_{sat} = \frac{R+L}{R+C} \qquad (2.2)$$

Beyond this point, a further increase in the number of contexts will not increase the processor efficiency. The processor is now operating in the saturation region, as illustrated in Figure (2.2b). The processor efficiency at saturation is given by

$$E_{sat} = \frac{R}{R+C} = \frac{1}{1+C/R} \qquad (2.3)$$

which depends only on the C/R ratio and is independent of L. Clearly we need to keep the C/R ratio relatively small in order to achieve a sufficiently high processor efficiency in the saturation region.

[Figure (2.2): Operation modes of a multithreaded processor: (a) linear mode, (b) saturation mode]

[Figure (2.3): Processor efficiency curve using the deterministic model: efficiency E grows linearly up to N_sat, then saturates]

Figure (2.3) shows the typical processor efficiency curve under the deterministic model and the respective regions of operation.
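As a quick numeric check of equations (2.1) to (2.3) (my arithmetic, using the parameter values R=16, L=130 and C=2 or 4 that recur throughout this chapter), the snippet below reproduces the saturation efficiencies that Table (2.1) later approaches, 0.8889 for C=2 and 0.8000 for C=4:

    /* Deterministic-model efficiency, eqs. (2.1)-(2.3): a numeric
     * check using the chapter's example parameters. */
    #include <stdio.h>

    int main(void) {
        double R = 16.0, L = 130.0, Cs[2] = {2.0, 4.0};
        for (int k = 0; k < 2; k++) {
            double C = Cs[k];
            double Nsat = (R + L) / (R + C);      /* eq. (2.2) */
            double Esat = R / (R + C);            /* eq. (2.3) */
            printf("C=%g: Nsat=%.2f Esat=%.4f\n", C, Nsat, Esat);
            for (int N = 1; N <= 4; N++)          /* linear region */
                printf("  N=%d  Elin=%.4f\n", N, N * R / (R + L));
        }
        return 0;
    }

For C=2 this prints Nsat=8.11 and Esat=0.8889, and the N=1 linear value 0.1096 agrees with the first row of Table (2.1).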
The processor is now operating in the saturation region, as illustrated in Figure (2.2b). The processor effi ciency at saturation is given by sat R 1 R + C 1 + C /R (2.3) 20 1 C T h read s N1 K , 1<T IZ Z H 1 R :i.c ■ ■ * ^ i " L - C ........ i i - \ / 1 R 1 C 1 K , 1 C 1 _ _ ji.... i : i...............R .. C L.... ...K ...J ...i l .. (a) L in ear Mode T h read s N -l I F T I (T m ::z . k 1 C 1 R 1 C TZT I ............ |J - \ t . mmmt . ..i r ..... 1 R 1 C R 1 C 1 ...i K_ C 1 R 1 ( ' 1 R (b) S a tu ratio n Mode Figure (2.2i O peration m odes of a m ultithreaded processor (Efficiency E) S a tu ra tio n L in ear 1 N s a t (N um ber of contexts N) Figure (2.3) Processor efficiency curve using determ inistic m odel 21 which only depends on C/R ratio, and is independent of L. Clearly we need to keep C/R ratio relatively small in order to achieve sufficiently high processor efficiency in the saturation region. Figure (2.3) shows the typical processor efficiency curve under determ inistic m odel and the respective regions of operation. 2.3 The GSPN Processor Model The GSPN m odel based on stochastic principles can be easily con structed for the abstract processor model. This is possible if we assume that all the param eters R, C and L to be random variables having exponen tial distribution. Hence the rate or probability of a rem ote memory request (r) is given by 1/R, which is the inverse of the average run-length. Similarly, the service rate in perform ing context sw itch (c) is 1/C and the service rate for rem ote memory accesses (/) is l/L. The assum ption that R is a random variable is valid since we would expect some variations in the actual run-length of each thread. This also applies to the param eter L since the latency is affected by the distance traversed and the network traffic density. For the case of context sw itch overhead, C, the assum ption was m ade m ainly to reduce the total num ber of possible states, with m inim al effect on the accuracy of the m odel for small values of C. This is vital in determ ining the feasibility of applying num erical solution m ethod on the large system o f equations obtained. As will be shown in the follow ing sections, this assum ption does not affect the accuracy of our m odel, when com pared to the analytical solution obtained in [Saav90]. 22 Based on the assum ptions above, the GSPN of a m ultithreaded pro cessor with N contexts is given in Figure (2.4). P2 1.00 P3 T2 P4 P5 T4 T3 Figure (2.4) GSPN m odel of a m ultithreaded processor A token in place PI indicates that a ready-to-run context is avail able for next sw itching. P I is initialized with N tokens which represents the N contexts ready for execution at the start time. A token present or absent in place P2 indicate the availability of the processor. P2 is initial ized with one ( 1 ) token to represent the single processing unit available for program execution, A token present in place P3 or P4 indicates that the processor is Running or Context Switching respectively. Finally, tokens present in place P5 indicates that there are outstanding rem ote m em ory requests being serviced. N otice that P4 and P5 are in parallel to signify concurrent operations of context switching and servicing of , rem ote memory accesses. 23 Transition T1 is an im m ediate transition with a probability of one (1). It controls the access of the processor for the next ready-to-run con text, by serving as an access gate. 
This transition w ill be enabled only when the processor is free (a token in P2) and there is at least one ready- to-run context (token or tokens in P I). Together, P2 and T1 ensures that the processor is either in the Running state or in the Context Switching state, but not in both. Transitions T2 and T3 represent the rate of rem ote m emory access (r) and service rate for context sw itching (c) respectively. Transition T4 represents the service rate o f rem ote memory accesses. The service rate for T4 is marking dependent, given by d = m5x l (2.4) where is the num ber of tokens present in place P5 at any instant of tim e. This is valid since we assume that the rem ote m em ory accesses are pipelined, or having m ultiple independent servers on the netw ork, thus allow ing m ultiple outstanding requests being serviced concurrently. Hence for m outstanding requests, the expected tim e to receive a reply is L/m, as com pared to L in the case of a single outstanding request. This translate into an increase in the service rate by a factor of m. The processor efficiency can be obtained by getting the steady state probability of having one token present in place P3. This is done by summing steady state probability of all states (or m arkings) that satisfy this condition. 24 2.3.1 E m bedded M arkov C hain The associated Em bedded M arkov Chain (EMC) of the GSPN model can be easily obtained from the reachability tree by considering only the set of states that the process entered after each transition. This includes both the vanishing and tangible states. The num ber of states gen erated under GSPN is m uch sm aller as com pared to the associated PN m odel due to the precedence rule introduced by the imm ediate transition T l. The num ber of states can be further reduced to that consisting of tangible states only by replacing all vanishing states with the appropriate transition probability, as outlined in Section 1.4. The reduced EMC will be used in constructing the transition probability matrix and the steady state solution for the GSPN m odel. j The corresponding EMC and reduced EMC of the GSPN m odel for i the case of N =2 is as shown in Figures (2.5) and (2.6) respectively. The i states are represented by 5-tuple {mj m2 m3 "U m5] which indicate the I num ber of tokens in each place for a particular state. S* and Vj represent J tangible and vanishing states respectively. The total num ber of states in the reduced EMC for N contexts, is given by I Sn = 2 ( N + 1) {25) which is linearly proportional to N, and is independent of the param eters R, C and L. 25 ( ^ 2 1 0 0 0 ^ ) ( ^ 1 0 1 0 0 ^ ) 20010 < ^ J J 3 0 1 1 ^ > SI 11001 00101 01002 00012 Figure (2.5) EMC of the GSPN model for N=2 20010 10100 00101 S I 00012 01002 S4 Figure (2.6) Reduced EMC of the GSPN model for N=2 2.3.2 S tead y S tate S olu tion From the reduced EMC state transition rate diagram , the steady state solution of the M arkov chain can be obtained by first solving the follow ing system of linear equations as outlined in Section 1.4.2: 7t = K U where U is the transition probability m atrix and n is the vector of the sta tionary probability distribution of the EMC. The elem ent «jj of m atrix U is given by the transition probability from state i to state j of the EMC. The steady state probability, 7tj, of state j can be obtained by first arbitrary taking any state, i, as a reference state. 
Compute the mean number of visits to state j between two successive visits to state i ($v_{ji}$), the average sojourn time in state j ($E[W_j]$) and the mean cycle time of state i ($C_i$). The steady state probability is then given by

$$\pi_j = \frac{v_{ji} E[W_j]}{C_i} \qquad (2.6)$$

Again, for the case of N=2, the transition probability matrix for the GSPN model is

$$U = \begin{bmatrix}
0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & \frac{l}{l+c} & \frac{c}{l+c} & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 \\
0 & \frac{l}{l+r} & 0 & 0 & 0 & \frac{r}{l+r} \\
0 & 0 & \frac{2l}{2l+c} & 0 & 0 & \frac{c}{2l+c} \\
0 & 0 & 0 & 1 & 0 & 0
\end{bmatrix}$$

For specific values of the parameters r, c and l, a numerical solution method such as Gaussian elimination can be applied to solve the system of linear equations. The processor efficiency for N=2, $E_2$, can be obtained by adding the steady state probabilities of states S0 and S3:

$$E_2 = \pi_0 + \pi_3 \qquad (2.7)$$

With N=2, R=16, C=2 and L=130, we obtain $\pi_0 = 0.024087$ and $\pi_3 = 0.192036$, which gives $E_2 = 0.21612$.

2.3.3 Performance Curve

The performance, or processor efficiency, of the multithreaded processor for various values of N can be obtained using the method described in the previous sections. This gives the performance curve for various degrees of multithreading for specific values of R, C and L. For large values of N, it is only feasible to extract the reachability tree and solve the large system of linear equations with the help of available software tools.

GreatSPN [Chi87] is one such tool, developed by G. Chiola et al. to assist researchers in analyzing GSPN models with reasonably large state spaces. It provides graphical input facilities for building a GSPN model, extracting the reachability graph and performing the steady state solution using Gauss-Seidel iteration on the embedded Markov Chain. We will use this tool for building and solving all our subsequent GSPN models.

The processor efficiencies based on the GSPN model of Figure (2.4) for N=1 to 20 are computed and tabulated in Table (2.1), together with the associated number of tangible states. Context switching overheads of C=2 and C=4 cycles are shown, using fixed values of R=16 and L=130 cycles. Notice that the number of states for the given GSPN is reasonably small and is independent of the parameters R, C and L.

Table (2.1): Multithreaded Processor Efficiency for R=16 and L=130

     N   S_N   E_N (C=2)   E_N (C=4)
     1    4    0.10956     0.10950
     2    6    0.21612     0.21547
     3    8    0.31883     0.31686
     4   10    0.41664     0.41226
     5   12    0.50826     0.49998
     6   14    0.59215     0.57809
     7   16    0.66669     0.64466
     8   18    0.73031     0.69820
     9   20    0.78193     0.73822
    10   22    0.82126     0.76560
    11   24    0.84908     0.78254
    12   26    0.86719     0.79195
    13   28    0.87796     0.79663
    14   30    0.88381     0.79871
    15   32    0.88671     0.79955
    16   34    0.88802     0.79986
    17   36    0.88857     0.79996
    18   38    0.88878     0.79999
    19   40    0.88885     0.80000
    20   42    0.88888     0.80000

In order to verify the accuracy of the GSPN model, the processor efficiency curves are compared with the analytical solution obtained by R.H. Saavedra [Saav90], as shown in Figure (2.7). The same set of parameters is used, with L corresponding to $L_m$ as defined in [Saav90]. For both C=2 and C=4 cycles, the two curves are very close to one another, and practically overlap for small values of N. The slight difference around the knee of both curves is expected, since the GSPN model assumes a stochastic nature for both L and C, which are taken as deterministic values in the analysis of [Saav90].

[Figure (2.7): Performance curves for R=16 and L=130, comparing the deterministic model, the analytical model of Saavedra et al. and the GSPN model, for C=2 (top) and C=4 (bottom)]
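The mechanics of equations (1.3) to (1.7) are easy to automate. The C sketch below solves the EMC balance equations by Gaussian elimination (the method named above) and converts the stationary vector to time-based probabilities by sojourn-time weighting. The five-state chain fed to it is my own reading of the reduced EMC for N=2, not the thesis's six-state ordering, so the efficiency it prints (about 0.2133) is close to, but not exactly, the 0.21612 quoted above:

    /* Steady-state solution of an embedded Markov chain, per Section
     * 1.4.2: solve pi = pi*U, then weight each tangible state by its
     * mean sojourn time (eqs. (1.4)-(1.7)). The chain below is an
     * assumed reading of the reduced EMC for N=2, R=16, C=2, L=130:
     * S0 run/0 out, S1 switch/1 out, S2 switch/0 out, S3 run/1 out,
     * S4 idle/2 out. */
    #include <stdio.h>

    #define NS 5

    int main(void) {
        double r = 1.0/16, c = 1.0/2, l = 1.0/130;
        double U[NS][NS] = {0};
        U[0][1] = 1.0;
        U[1][2] = l/(l+c);  U[1][3] = c/(l+c);
        U[2][0] = 1.0;
        U[3][0] = l/(l+r);  U[3][4] = r/(l+r);
        U[4][1] = 1.0;
        double W[NS] = {1/r, 1/(l+c), 1/c, 1/(l+r), 1/(2*l)}; /* sojourns */

        /* Build Ax = b: columns of (U^T - I), with the last balance
         * equation replaced by the normalization sum(x) = 1. */
        double A[NS][NS+1] = {{0}};
        for (int i = 0; i < NS; i++)
            for (int j = 0; j < NS; j++)
                A[i][j] = U[j][i] - (i == j);
        for (int j = 0; j < NS; j++) A[NS-1][j] = 1.0;
        A[NS-1][NS] = 1.0;

        for (int k = 0; k < NS; k++) {            /* Gauss-Jordan */
            int p = k;
            for (int i = k+1; i < NS; i++)
                if (A[i][k]*A[i][k] > A[p][k]*A[p][k]) p = i;
            for (int j = 0; j <= NS; j++) { double t = A[k][j]; A[k][j] = A[p][j]; A[p][j] = t; }
            for (int i = 0; i < NS; i++) if (i != k) {
                double f = A[i][k] / A[k][k];
                for (int j = k; j <= NS; j++) A[i][j] -= f * A[k][j];
            }
        }
        double x[NS], tot = 0.0;
        for (int i = 0; i < NS; i++) x[i] = A[i][NS] / A[i][i];
        for (int i = 0; i < NS; i++) tot += x[i] * W[i];   /* cycle, eq. (1.6) */
        printf("E2 = %.5f\n", (x[0]*W[0] + x[3]*W[3]) / tot);
        return 0;
    }

Because the reference-state factor in equation (1.4) cancels, the weighted form $\pi_j E[W_j] / \sum_k \pi_k E[W_k]$ used here is equivalent to applying equations (1.4) through (1.7) directly.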
2.4 Conclusions

A GSPN model based on stochastic principles has been used to analyze the performance of an abstract multithreaded processor. A detailed description of the model and the relevant steps for the computation of the processor efficiency were presented. The performance curves obtained are found to be very close to known analytical results, especially for small context switch overheads.

The advantage of the GSPN model lies in its simplicity and in the O(N) states generated by the EMC, which depend only on the number of contexts N and are independent of the other parameters used. Numerical solution methods can easily be applied to solve the system of linear equations for specific values of the parameters.

Chapter 3
Effects of Sequential Consistency Memory Model

3.1 Processor Architecture

We will analyze the performance of a Distributed Shared Memory (DSM) multiprocessor system under the Sequential Consistency (SC) model. The basic architecture of a multithreaded processor with directory-based coherent caches in a DSM system is shown in Figure (3.1). The DSM system consists of P processing nodes connected through a scalable interconnection network. Physical memory is distributed among all processing nodes, forming a single shared virtual memory space. Cache coherence is maintained by the directory controller using an invalidating, distributed directory-based protocol, as described in Chapter 1.

We will assume that all instructions and private data are stored locally in each node, with a cache hit rate close to one. This allows us to assume that all references take a single processor cycle to complete, except for shared-data references. The caches are assumed to be lockup-free, using a write-through scheme, with a sufficiently large continuation name space to cater for multiple outstanding memory requests from different threads. A memory request may be satisfied locally, either from the cache or from the local memory module, or remotely from the memory module of another processing node. The processor performs a context switch on every cache miss on shared data.

[Figure (3.1): Multithreaded processor architecture with coherent cache. Each node contains a multithreaded processor with multiple register sets, a cache, local memory, a directory controller and a network interface; the nodes are connected by an interconnection network]
We will ignore the synchronization bar- - riers in our model. I I 1 3.2 Constraints on Memory Accesses under SC Memory consistency m odels im poses restrictions on the order of shared memory accesses initiated by each process in a m ultiprocessor system. This is due to the requirem ents for event ordering to support each consistency model. An event ordering is used to specify w hether a m em ory access is legal or illegal when several processes are accessing a com- | mon set of shared memory locations. The follow ing definitions were given by Dubois et al. [DSB 8 6 ] for the purpose o f specifying various i memory consistency models: 35 j • A load by processor PL is considered performed with respect to pro- I cessor Pk at a point in time when the issuing of a store to the same ! address by P* cannot affect the value returned by the load. • A store by P ( - is considered performed with respect to P^ at a point in time when an issued load to the same address by P* returns the value by this store. I ■ • A load is globally performed if it is perform ed with respect to all processors and if the store that is the source of the returned value | has been perform ed with respect to all processors. I I i _____________ j The difference betw een performed and globally performed only i happens in m ultiprocessor architectures w ith non-atomic stores. A store j is defined as atomic if the value stored becom es imm ediately readable by j all processes. Hence in a DSM system with caches and general intercon- ! nection network, the stores are inherently non-atomic unless special hard- J ware m echanism s are used to enforce atom icity. The program order is I | defined as the order in w hich m em ory accesses occur in the execution of a ; single process, with no reordering of instructions taking place. The I I phrase all previous accesses refers to all accesses in the program order that happened before the current m em ory access. Lam port [Lam79] first provided the form al definition of SC, which ■ requires that the execution o f a parallel program to appear as some inter- \ leaving o f the execution o f the parallel processes on a sequential ! machine. Dubois et al. [DSB 8 6 ] provided two sufficient conditions based on event ordering w hich guarantees sequential consistency: (a) Before a load is allow ed to perform w ith respect to any other pro cessor, all previous load accesses m ust be globally performed and all previous store accesses m ust be perform ed. (b) Before a store is allowed to perform w ith respect to any other pro cessor, all previous load accesses m ust be globally performed and all previous store accesses m ust be perform ed. Based on the conditions above, we can define our m ultithreaded processor under SC as follow s: (1) For each thread: (1.1) Read : Processor issues read access and waits for read to perform . Context sw itch on cache miss. (1.2) Write : Processor issues w rite access and waits for write to perform . Context sw itch on cache miss. (1.3) Acquire : Same as Read. (1.4) Release : Same as Write. (2) Pending accesses from independent threads satisfying conditions (1.1 to 1.4) can be pipelined. An access is considered performed when it is a hit in the cache, or a reply is received from either local or rem ote memory m odule. Although some buffering o f write accesses w ithin each thread is possible, the potential gain is rather lim ited. 
This is due to the fact that in most appli cations, read and write accesses are well interleaved, thus m aking m ost of the write latency visible to the processor [GGH91]. In addition, a more com plicated context sw itching scheme is needed to avoid processor idling w hile w aiting for previous w rites to complete. I 37 3.3 The GSPN Processor Model We will use the GSPN m odel of a m ultithreaded processor pre sented in Chapter 3 as the basic structure of our m odel. Assume that the average run length (/?), m em ory access latencies (L^,LR), context switch overhead (C) and cache fill overhead (T) are random variables having exponential distribution, w ith service rates given by the inverse of the average time consum ed. All threads are assum ed to be independent, so that accesses from different threads in the same node can be pipelined. The num ber of threads in each processor is assum ed to be constant, im plying that sufficient parallelism exists in the program . The follow ing gives a list of param eters and their respective defi nitions that will be used in our analysis: • N : Num ber of contexts running in each processor (N > 2) • s : The shared-memory reference rate, w hich is given by the inverse of the average run length betw een two consecutive shared-m em ory references 0 = 1 / R). • Sw, srei : The fraction or probability of shared m em ory references that are writes and releases, classified under write opera tions. • sr » s a c q ' ■ The fraction or probability of shared memory references that are reads and acquires, classified under read opera tions. • mr , mw : The cache m iss rates for shared read and w rite operations. The respective hit rates hr and hw can be obtained from h = 1 - m. • PL y PR '• The fraction or probability of cache m isses being serviced by Local or Remote m em ory respectively, with p^ = 1 - p^. 38 • c : The service rate in perform ing context sw itching, which is the inverse of context sw itching overhead (c = 1 / C). • 1^ , Ir : The service rates for Local or Remote shared memory accesses, which is given by the inverse of the respective average access latencies (/^ = 1 I , Ir = 1 / Lr). • y : The service rate in perform ing cache fill operation, which is given by the inverse of cache fill overhead (y = 1 / Y). The set of param eters s, sw, srei, sacq and sr depends m ainly on the application and program behavior. The cache m iss rates depend on both program behavior and cache perform ance. Param eters p l and pr are determ ined by the program behavior and the shared data distribution among all nodes. Param eters c, lL, lR and y are determ ined by the proces sor architecture and the DSM system. The m ultithreaded processor under SC can be conveniently repre sented by a GSPN m odel as shown in Figure (3.2), based on the m em ory access constraints specified in the previous section. Two distinct branches can be identified, representing the read and write operations. A brief description of the net is given below: • Place P I contains tokens that represent ready-to-run contexts. It is initialized with N tokens. • A token present in place P2 indicate the availability of the processor for next context switch. It is initialized with one (1) token to signify single processing unit. • Transition T1 serves as an access gate to the processing unit. • A token present in place P3 indicate the processor is in the Busy state. A maxim um of one (1) token can be present in P3 at any tim e, as controlled by T l. 
The multithreaded processor under SC can be conveniently represented by a GSPN model, as shown in Figure (3.2), based on the memory access constraints specified in the previous section. Two distinct branches can be identified, representing the read and write operations. A brief description of the net is given below:

• Place P1 contains tokens that represent ready-to-run contexts. It is initialized with N tokens.
• A token present in place P2 indicates the availability of the processor for the next context switch. It is initialized with one (1) token to signify a single processing unit.
• Transition T1 serves as an access gate to the processing unit.
• A token present in place P3 indicates that the processor is in the Busy state. A maximum of one (1) token can be present in P3 at any time, as controlled by T1.

[Figure (3.2): GSPN model for Sequential Consistency]

• The timed transition T2 is assigned a firing rate s, representing the rate of shared memory accesses by each thread.
• Transitions T3, T4, T5 and T6 are assigned the switching distributions s_w, s_rel, s_acq and s_r respectively, representing the probabilities of write, release, acquire and read operations among the shared accesses, with

s_w + s_{rel} + s_{acq} + s_r = 1    (3.1)

• Transitions T9, T10, T11 and T12 model the hit and miss rates of the coherent cache. T9 and T10 are assigned probabilities h_w and m_w respectively for the write accesses, whereas T12 and T11 are assigned h_r and m_r respectively for the read accesses. A cache hit will put the processor back in the Busy state immediately, with a token in P3.
• A cache miss will put the processor in the Context Switching state, represented by a token in place P9. Transition T13 fires at a rate of c, to signify the completion of the context switching operation.
• Transitions T14 and T15 are assigned probabilities p_L and p_R, representing the probability of a write request being serviced by the local or remote memory module. The same applies to T16 and T17 for the read requests.
• Tokens present in places P12 and P15 represent pending memory requests for local memory. These requests are serviced at rates determined by the firing rates of T18 and T21, which are marking dependent, given by:

d_1 = M_{12} \times l_L    (3.2)
d_4 = M_{15} \times l_L    (3.3)

where M_j represents the number of tokens in place Pj.

• Similarly, tokens present in places P13 and P14 represent pending memory requests for remote memories, with the firing rates of T19 and T20 given by:

d_2 = M_{13} \times l_R    (3.4)
d_3 = M_{14} \times l_R    (3.5)

• Transition T22 models the cache fill operation after the reply to a read request, with a firing rate of y.
• The completion of a read or write access places a token back into P1, adding to the number of ready-to-run contexts for the next switching.

3.4 Performance Curves

Based on the GSPN model, the processor efficiency with N contexts, E_N, can be defined as the probability of having a token in place P3 at steady state. This is given by the equation:

E_N = \sum_{j \in G} \pi_j    (3.6)

where G = { set of tangible states with a token in P3 } and π_j = steady state probability of state j.

A set of default parameters, tabulated in Table (3.1), has been chosen to evaluate the impact of parameter changes on the different memory consistency models. We assume that s_w + s_rel = s_r + s_acq = 0.5, unless stated otherwise. We will limit the number of contexts in our model to N ≤ 6, because of the large number of states generated by the GSPN and the computation time involved in computing the steady state probabilities. Previous studies [Saav90, GHGx91, Agar92] have shown that a small number of contexts (N ≤ 4) is sufficient to achieve close to maximum efficiency.
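The following Python fragment illustrates, under stated assumptions, how the marking-dependent rates of equations (3.2) to (3.5) and the efficiency measure of equation (3.6) are evaluated once a steady-state solver (such as the GreatSPN package cited in the bibliography) has produced the distribution π. The data structures here are hypothetical, chosen only for readability.

```python
# Illustrative only: assumes a steady-state solver has produced pi as a
# list of (marking, probability) pairs, where a marking maps place names
# to token counts.

def firing_rate(marking, place, unit_rate):
    # Marking-dependent service rate as in equations (3.2)-(3.5),
    # e.g. d1 = M12 * l_L: each pending request is served in parallel.
    return marking[place] * unit_rate

def processor_efficiency(pi):
    # Equation (3.6): E_N is the total probability of tangible markings
    # holding a token in P3 (the processor Busy state).
    return sum(p for marking, p in pi if marking.get("P3", 0) >= 1)

# Tiny two-marking example with made-up probabilities:
pi = [({"P3": 1, "P12": 0}, 0.7), ({"P3": 0, "P12": 2}, 0.3)]
print(processor_efficiency(pi))                # -> 0.7
print(firing_rate({"P12": 2}, "P12", 1 / 16))  # d1 = 2 * l_L = 0.125
```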
Figure (3.3) shows the performance curve of the multithreaded processor under the SC model using the default parameter values. The processor efficiency for N = 1 is approximated by setting c = 10.0, such that P{Switching} ≈ 0 (< 0.001). The processor efficiency at saturation can be obtained using deterministic analysis. The effective run length, R_eff, is given by:

R_{eff} = \frac{1}{(s_w + s_{rel})\, m_w + (s_r + s_{acq})\, m_r} \cdot \frac{1}{s}    (3.7)

and the processor efficiency at saturation, E_sat, is given by:

E_{sat} = \frac{R_{eff}}{R_{eff} + \frac{1}{c}}    (3.8)

Using the default parameter values, we have R_eff = 10 cycles and E_sat = 0.7143 (reproduced in the small check at the end of this section). The performance of the processor is rather low because no buffering or pipelining of memory requests is allowed under the SC model.

The performance curves for different values of s are shown in Figure (3.4). Reducing s effectively increases the average run length of each thread, thus increasing the processor efficiency in both the linear and saturation regions.

Parameter      Default value   Remarks
s              0.2             R = 5
s_w, s_r       0.5
s_rel, s_acq   0.0
m_r, m_w       0.5
p_L, p_R       0.5
c              0.25            C = 4
l_L            0.0625          L_L = 16
l_R            0.015625        L_R = 64
y              0.5             Y = 2

Table (3.1): Default parameter values for GSPN models

[Figure (3.3): SC performance curve using default parameter values]
[Figure (3.4): SC performance curves with different s]
[Figure (3.5): SC performance curves with different m_w and m_r]

The contribution of the coherent cache can be seen by varying the parameters m_r and m_w, as shown in Figure (3.5). The case m_r = m_w = 1 represents a multiprocessor system without coherent caches, where every shared reference results in a miss, followed by a context switch. Introduction of the coherent cache generally improves the processor efficiency by increasing the effective run length of each thread.

Since no buffering is allowed for write accesses under the SC model, no significant change in the processor efficiency is expected when varying s_w. The difference between read and write latencies is the extra cache fill overhead incurred for read accesses, and in most cases L_L, L_R >> Y. Figure (3.6) shows the effect of changing s_w for Y = 2 and 8 cycles.

For the case of p_L, a higher value of p_L effectively reduces the average latency seen by the processor, thus increasing the processor efficiency in the linear region. This can be achieved by static allocation of frequently used data to the local memory of a processor node, with prior knowledge of the access patterns. In the case of a totally random access pattern, with equal probability for each node in the system, p_L = 1 / P, where P is the number of processor nodes in the DSM system. Figure (3.7) shows the effect of varying p_L on the performance curve. Notice that the efficiency at saturation is not affected by p_L.

[Figure (3.6): SC performance curves with different s_w]
[Figure (3.7): SC performance curves with different p_L]
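To make the deterministic saturation analysis of equations (3.7) and (3.8) concrete, here is a small Python check (an illustration, not part of the GSPN solution) that reproduces R_eff = 10 cycles and E_sat = 0.7143 from the Table (3.1) defaults.

```python
def saturation_efficiency(s, s_w, s_rel, s_r, s_acq, m_w, m_r, c):
    # Equation (3.7): average cycles between misses that force a switch.
    r_eff = 1.0 / ((s_w + s_rel) * m_w + (s_r + s_acq) * m_r) * (1.0 / s)
    # Equation (3.8): in saturation, each run of R_eff cycles is followed
    # by one context switch of C = 1/c cycles.
    return r_eff, r_eff / (r_eff + 1.0 / c)

r_eff, e_sat = saturation_efficiency(s=0.2, s_w=0.5, s_rel=0.0,
                                     s_r=0.5, s_acq=0.0,
                                     m_w=0.5, m_r=0.5, c=0.25)
print(r_eff, round(e_sat, 4))   # -> 10.0 0.7143
```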
3.5 Conclusions

A GSPN has been used to model the multithreaded processor under the Sequential Consistency memory model. The model is based on the constraints imposed by the SC model and on some architectural assumptions about the DSM system. It will serve as the basis of comparison with the other, relaxed memory consistency models.

The performance of the multithreaded processor is considerably poor for a small number of contexts. This is due to its inability to buffer and pipeline the memory accesses within each thread. The coherent cache generally improves the processor's performance by increasing the effective run length of each thread. A better shared data allocation scheme, assigning frequently used data to the local memory of each node, reduces the effective memory access latency, thus improving overall processor performance.

Chapter 4
Effects of Relaxed Consistency Memory Models

4.1 Memory Access Constraints

We will define the memory access constraints imposed by the relaxed memory consistency models using event ordering. These constraints will be used in deciding the level of buffering and pipelining allowed in the GSPN models.

The Processor Consistency (PC) model introduced by Goodman [Good89] requires that writes issued from a processor are always observed in program order, but the order of writes from different processors can be observed differently. The conditions for satisfying PC, as given in [GLLx90], are:

(a) Before a load is allowed to perform with respect to any other processor, all previous load accesses must be performed.
(b) Before a store is allowed to perform with respect to any other processor, all previous loads and stores must be performed.

The first condition allows reads following a write to bypass the write, thus allowing for buffering of write accesses. However, no pipelining of write accesses is allowed, as specified by the second condition.

The Weak Consistency (WC) model proposed by Dubois et al. [DSB86] requires that memory is consistent only at the synchronization points. The conditions to ensure WC are:

(a) Before an ordinary load or store is allowed to perform with respect to any other processor, all previous synchronization accesses must be performed.
(b) Before a synchronization access is allowed to perform with respect to any other processor, all previous ordinary load and store accesses must be performed.
(c) Synchronization accesses are sequentially consistent with respect to one another.

This allows for pipelining of both read and write accesses between the synchronization points. The requirement that synchronization accesses be sequentially consistent may limit the expected performance gain in applications with high synchronization rates.

The Release Consistency (RC) model proposed by Gharachorloo et al. [GLLx90] is an extension of WC that exploits the information about synchronization points using explicit acquire and release accesses. The conditions for RC are:

(a) Before an ordinary load or store access is allowed to perform with respect to any other processor, all previous acquire accesses must be performed.
(b) Before a release access is allowed to perform with respect to any other processor, all previous ordinary load and store accesses must be performed.
(c) Acquires and releases are processor consistent.

This allows for pipelining of both read and write accesses between acquire and release, as in the case of WC. In addition, acquires are allowed to bypass pending releases, thus providing higher potential for performance gain (these bypass rules are summarized in the sketch below).
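The buffering and pipelining freedoms that these conditions grant can be stated operationally. The Python sketch below is a simplified reading of them, my own abstraction that ignores the "with respect to each processor" subtleties; it decides whether a new access may issue while an earlier access from the same thread is still pending.

```python
# Simplified reading of the ordering conditions above; "sync" covers
# acquire and release, which WC treats as sequentially consistent.
SYNC = {"acquire", "release"}

def may_issue(new, pending, model):
    """May `new` issue while `pending` (earlier, same thread) is incomplete?"""
    if model == "SC":
        return False                          # every access blocks the next
    if model == "PC":
        # Reads may bypass a pending write (write buffering), but a write
        # must wait for all previous accesses: no write pipelining.
        return pending == "write" and new == "read"
    if model == "WC":
        # Ordinary accesses pipeline freely; nothing bypasses a pending
        # sync, and a sync waits for all previous ordinary accesses.
        return pending not in SYNC and new not in SYNC
    if model == "RC":
        # Only acquires fence later accesses and only releases wait for
        # earlier ones; even an acquire may bypass a pending release.
        return pending != "acquire" and new != "release"
    raise ValueError(model)

print(may_issue("read", "write", "PC"))      # True: read bypasses write
print(may_issue("write", "write", "WC"))     # True: writes pipeline under WC
print(may_issue("acquire", "release", "RC")) # True: acquire bypasses release
```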
4.2 Processor Consistency Model

Based on the memory access constraints imposed, we can define our multithreaded processor under PC as follows:

(1) For each thread:
(1.1) Read: Processor issues read access and waits for read to perform. Context switch on cache miss.
(1.2) Write: Processor sends write to write buffer, stalls if buffer is full. The write request is retired from the buffer only after the write is performed.
(1.3) Acquire: Same as Read.
(1.4) Release: Same as Write.
(2) Pending accesses from independent threads satisfying conditions (1.1 to 1.4) can be pipelined.

Notice that no context switching takes place for write operations, since writes are sent to the write buffer. We define a new parameter B as the size of the write buffer for all threads. Clearly, B should be large enough to avoid processor idling due to a buffer-full situation. (A small sketch of this stall-on-full buffer policy is given at the end of this section.)

The main task in modeling the multithreaded processor under the PC model involves accurate modeling of the number of independent write accesses that can be pipelined. This defines the number of servers for write accesses at any point in time. Two GSPN models are presented, based on static and dynamic allocation of servers for the write accesses.

Figure (4.1) shows the GSPN model using a static or fixed number of servers for the write accesses. The basic structure of the net is the same as in the case of the SC model, with d_1 to d_4 defined as:

d_1 = M_{15} \times l_L    (4.1)
d_2 = M_{16} \times l_R    (4.2)
d_3 = M_{17} \times l_R    (4.3)
d_4 = M_{18} \times l_L    (4.4)

Transition T7 represents buffering of the write request, and fires only if there is an available buffer entry. This puts the processor back into the Busy state, with a pending write still to be performed. In order to limit the number of states of the EMC, a write hit is modeled from a logical point of view, using an immediate transition T10. This is possible since in general L_L, L_R >> 1.

[Figure (4.1): GSPN model for PC with static allocation of servers]

P11 is initialized with B tokens, which represent the total available buffer entries for write operations. P10 is initialized with V tokens, which represent the number of available servers for pipelined accesses. The value of V is chosen based on the assumption made:

(a) V = 1 if we assume that all write accesses are dependent, which happens when s_w = 1.
(b) V = N if we assume that the write accesses at any point in time contain at least one request from each thread, implying perfect interleaving of write requests.
(c) V = g(N), where g(N) is obtained from the probability distribution function of write requests from each thread. For a uniform probability distribution:

g(N) = k = \frac{1}{N} \sum_{j=1}^{N} j = \frac{N+1}{2}    (4.5)

where k is the mean number of possible requesting threads.

The performance curves for the static model using B = 6 for different values of V are shown in Figure (4.2). The cases V = 1 and V = N represent the lower and upper bounds for the PC model. For V = k, the irregularity of the curve is due to the integer values used for V. The assumptions for the cases V = N and V = k fail to hold when s_w is large. When s_w ≈ 1, we should expect only a single server (V = 1) for the buffer.
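The stall-on-full write buffer behavior of condition (1.2) can be pictured with a few lines of Python. This is an illustrative toy; the deque-based buffer and the function names are mine, not part of the GSPN.

```python
from collections import deque

B = 6                       # write buffer capacity, shared by all threads
write_buffer = deque()      # pending writes awaiting performance

def issue_write(address):
    """PC write policy: buffer the write; the processor stalls only when
    the buffer is full, and keeps running otherwise (no context switch)."""
    if len(write_buffer) >= B:
        return "stall"              # buffer full: processor idles
    write_buffer.append(address)    # retired later, once the write performs
    return "busy"

def retire_oldest_write():
    """Under PC, writes retire strictly in program order: a write leaves
    the buffer only after it performs, so writes are never pipelined."""
    if write_buffer:
        write_buffer.popleft()
```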
[Figure (4.2): PC performance curves with static allocation of servers]

A second GSPN model, using a dynamic server allocation scheme, is shown in Figure (4.3). This overcomes the limitations imposed by the fixed allocation scheme. Instead of assigning a fixed value in P10, the number of servers is determined by the number of active threads in the processor. A thread is considered active if it is currently running or waiting for a read request to complete. This assumes that all the pending write accesses are from the currently active threads. A token is added to P10 when a new context starts running, to signify the additional degree of pipelining allowed. Likewise, a token is removed from P10 when a read request completes. A high switching probability of 10.0 is assigned to transition T25 to ensure that negligible or no time is wasted in removing a token from P10.

[Figure (4.3): GSPN model for PC with dynamic allocation of servers]

The following two possible sources of error are ignored, because of their complementing effects:

• The increase in efficiency due to the probability of servicing requests from previous threads, other than those from the active threads.
• The decrease in efficiency due to the probability of removing a server while there are still pending write accesses to be completed.

For the case of s_w = 1, only a single server is allocated, so processor consistency is preserved. We define the index processor stalling probability, P{stall}, as the steady state probability of the processor being stalled during execution of a thread. Hence for PC, P_PC{stall} is given by:

P_PC{stall} = P{buffer full}    (4.6)

The value of P{stall} is affected by the buffer size used, the fraction or rate of write accesses, and the effective service rate of the write buffer.

Figure (4.4) shows the performance curves obtained using default parameter values for both the PC and SC models. Notice that as N increases, the gain in performance over SC slowly diminishes, since P{stall} increases with the higher rate of write accesses. Figures (4.5) and (4.6) show the effects of different buffer sizes on processor efficiency and the corresponding P{stall} values. Increasing the buffer size generally improves the processor's performance by reducing the P{stall} values. For small buffer sizes, the fraction of processor time spent stalling for a write buffer entry can be very high, which effectively offsets the gain from buffering. The same problem occurs when the fraction of write accesses s_w is increased for a fixed buffer size, as shown in Figure (4.7). Instead of the higher expected performance with increased s_w, a loss in performance is observed for larger N due to excessive processor stalling time. We will use the GSPN model with the dynamic server allocation scheme for all later comparisons.
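As an illustration of the dynamic allocation rule (one write-pipelining server per active thread), the toy bookkeeping below mirrors the token flow into and out of P10. The class is hypothetical and tracks only counts, not the full net.

```python
class DynamicServers:
    """Toy mirror of place P10: one write-service 'server' per active
    thread, where active means running or waiting on a read."""
    def __init__(self):
        self.servers = 0            # tokens in P10

    def context_starts(self):
        self.servers += 1           # new active thread: more pipelining

    def read_completes(self):
        # The thread leaves the waiting-on-read set; remove one server
        # (transition T25 does this with negligible delay in the net).
        self.servers = max(0, self.servers - 1)

pool = DynamicServers()
pool.context_starts(); pool.context_starts()
pool.read_completes()
print(pool.servers)   # -> 1
```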
[Figure (4.4): PC performance curve using default parameter values]
[Figure (4.5): PC performance curves with different B]
[Figure (4.6): PC P{stall} with different B]
[Figure (4.7): PC performance curves with different s_w]

4.3 Weak Consistency Model

Based on the memory access constraints imposed, we can define our multithreaded processor under WC as follows:

(1) For each thread:
(1.1) Read: Processor stalls for a pending release to perform. Processor issues read access and waits for read to perform. Context switch on cache miss.
(1.2) Write: Processor sends write to write buffer, stalls if buffer is full. The write requests can be retired from the buffer without having to wait for the completion of previous writes.
(1.3) Acquire: Processor stalls for pending writes and releases to perform. Processor sends acquire and waits for acquire to perform. Context switch on cache miss.
(1.4) Release: Processor sends release to write buffer, stalls if buffer is full. The release request is retired from the buffer only after all previous writes are performed (a toy sketch of this retirement rule follows the net description below).
(2) Pending accesses from independent threads satisfying conditions (1.1 to 1.4) can be pipelined.

No context switching takes place for write and release operations, as in PC. The write buffer is shared among write and release requests.

The GSPN model for WC is shown in Figure (4.8). An additional branch is added to the net to model the release operations. The main features of the net are described below:

• For write operations, no server limit is imposed. This implies that sufficiently many pipeline stages are available to process all pending write accesses.
• The inhibitor arcs from P18 and P19 to T10 are used to prevent servicing of the next release operation while there are pending writes. We assume that all pending writes are from the currently running thread. This represents the lower bound in terms of expected performance. Alternatively, the inhibitor arcs may be removed if we assume that all pending writes are from other threads, which represents the upper bound.
• Similarly, the inhibitor arcs from P18 and P19 to T8 are used to prevent servicing of an acquire operation while there are pending writes. This is a potential bottleneck for applications with a high synchronization rate, since the processor will be stalled while waiting for the writes to complete.
• Only a single server is allocated in P13 for the release operations, to enforce the sequential consistency constraint on synchronization accesses.
• Inhibitor arcs from P27 to T11 and T12 are used to prevent servicing of acquire and read accesses before the pending release operation is completed. This poses another potential bottleneck for WC.
• Inhibitor arcs from P27 to T13 and T14 are used to prevent servicing of write accesses before the pending release operation is completed. The write requests are placed in the write buffer without stalling the processor.
• The firing rates d_1 to d_6 are given by:

d_1 = M_{18} \times l_L    (4.7)
d_2 = M_{19} \times l_R    (4.8)
d_3 = M_{20} \times l_L    (4.9)
d_4 = M_{21} \times l_R    (4.10)
d_5 = M_{22} \times l_R    (4.11)
d_6 = M_{23} \times l_L    (4.12)

[Figure (4.8): GSPN model for Weak Consistency]
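One detail worth pinning down is the WC retirement rule from condition (1.4): writes drain from the shared buffer without waiting on each other, but a release may retire only once every earlier write has performed. A hedged Python toy follows; the list-based buffer and flag layout are mine, purely for illustration.

```python
# Toy model of the shared WC write/release buffer. Each entry is a pair
# [kind, performed_flag]; list order reflects program order.
buffer = [["write", False], ["write", False], ["release", False]]

def may_retire(index):
    """WC rule: a write may retire as soon as it has performed; a release
    additionally waits until all earlier writes have performed."""
    kind, performed = buffer[index]
    if not performed:
        return False
    if kind == "release":
        return all(done for k, done in buffer[:index] if k == "write")
    return True

buffer[2][1] = True                  # the release itself has performed...
print(may_retire(2))                 # -> False: earlier writes still pending
buffer[0][1] = buffer[1][1] = True
print(may_retire(2))                 # -> True
```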
The processor stalling probability under WC, P_WC{stall}, can be obtained from:

P_WC{stall} = P{buffer full} + P{acquire stall} + P{read stall}    (4.13)

Figure (4.9) shows the performance curves under WC with different synchronization rates. For low synchronization rates (s_rel = 0), the gain in performance is significant, with all write latencies successfully hidden by the write buffer. The drop in performance under a high synchronization rate arises mainly from the increased P{stall} value. Figure (4.10) shows the effects on P{stall} under different synchronization rates. For a high synchronization rate, the main part of P{stall} comes from the acquire and read stalls.

Increasing the write buffer size generally improves the processor's performance under WC, as in the case of PC. The buffer size required to completely hide the write latencies is much smaller than that for PC, because of the higher service rate for write requests under WC. Figure (4.11) shows the effect of different buffer sizes on the performance curves. For the default parameter values, B = 6 is sufficient to hide all the write latencies. The P{stall} curves for different values of B are similar to those of PC, with much smaller corresponding values.

[Figure (4.9): WC performance curves with different s_acq and s_rel]
[Figure (4.10): WC P{stall} with different s_acq and s_rel]

Since write latencies are effectively hidden under WC with a small synchronization rate, we expect the performance to increase with higher values of s_w. Figure (4.12) shows the effects of different s_w values under the WC model. The gain is significant with higher s_w for a small number of contexts, until the constraint of the finite buffer size is felt.

[Figure (4.11): WC performance curves with different B]
[Figure (4.12): WC performance curves with different s_w]

4.4 Release Consistency Model

Based on the memory access constraints imposed, we can define our multithreaded processor under RC as follows:

(1) For each thread:
(1.1) Read: Processor issues read access and waits for read to perform. Context switch on cache miss.
(1.2) Write: Processor sends write to write buffer, stalls if buffer is full. Writes are pipelined.
(1.3) Acquire: Processor sends acquire and waits for acquire to perform. Context switch on cache miss.
(1.4) Release: Processor sends release to write buffer, stalls if buffer is full. The release request is retired from the buffer only after all previous writes and releases are performed.
(2) Pending accesses from independent threads satisfying conditions (1.1 to 1.4) can be pipelined.
As in PC and WC, no context switching takes place for write and release operations. The write buffer is shared among write and release requests, as in the case of WC.

The GSPN model for RC is shown in Figure (4.13). The basic structure is the same as that for WC, except for the release accesses. The single server for release operations in WC is now replaced by the dynamic server allocation structure used in the PC model. This models the synchronization accesses, which are processor consistent under the RC model. The write, acquire and read accesses following release requests are allowed to bypass the pending releases. In the extreme case when s_w = 0, RC is equivalent to the PC model. The processor stalling probability under RC, P_RC{stall}, is given by:

P_RC{stall} = P{buffer full}    (4.14)

[Figure (4.13): GSPN model for Release Consistency]

The performance curves for RC with different rates of synchronization are shown in Figure (4.14). All three curves overlap with each other, since the bottleneck for synchronization accesses has been removed. The processor stalling probabilities for all three cases are very small and can thus be ignored. Computing the processor efficiency at saturation from equations (3.7) and (3.8) with m_w = 0, we get E_sat = 0.8333. The performance of RC for different values of B and s_w is the same as for WC with s_rel = s_acq = 0, as shown in Figures (4.11) and (4.12) respectively in the previous section.

Figures (4.15), (4.16) and (4.17) show the combined performance curves, which summarize the relative performance of all the memory consistency models under different synchronization rates and buffer sizes. The corresponding P{stall} values are shown with dotted lines, to account for the loss of performance under the different consistency models. In general, RC gives the best performance for the same buffer size. WC gives a comparable performance gain over SC only when the synchronization rate is reasonably low. The performance of PC depends very much on the buffer size used.

[Figure (4.14): RC performance curves with different s_acq and s_rel]
[Figure (4.15): Combined performance curves for s_acq = 0.004 and B = 6]
[Figure (4.16): Combined performance curves for s_acq = 0.02 and B = 6]
[Figure (4.17): Combined performance curves for s_acq = 0.004 and B = 4]

Chapter 5
Cache Effects and Comparisons

5.1 Cache Effects on Various Memory Models

In all the models we have considered so far, we assumed that the miss rate for shared data is constant for different degrees of multithreading. This is an optimistic assumption, since cache interference occurs between the shared data of different threads sharing the same physical cache. A better approximation is to consider the effect of the increase in cache miss rate due to cache interference when multiple contexts are running on a processor. A higher number of cache misses will reduce the effective run length, which in turn will lower the processor efficiency.

We will use the cache model derived by Saavedra [Saav90] to estimate the effective miss rate due to multithreading. Assuming that the fixed component of cache interference is negligible and that the same physical cache is shared equally among all contexts, the miss rate for N contexts, m(N), can be expressed as:

m(N) = \begin{cases} m(1)\, N^k, & \text{if } N < \lfloor m(1)^{-1/k} \rfloor \\ 1, & \text{otherwise} \end{cases}    (5.1)

where m(1) represents the single-context miss rate and k is defined as the cache degradation constant, a positive number determined by the workload. For k ≈ 1 and small N, the model is close to that derived by Agarwal [Agar92], which shows a linear relationship between m(N) and the number of contexts.
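Equation (5.1) is easy to misread in prose, so a direct Python transcription follows (illustrative only; the function name is mine):

```python
from math import floor

def miss_rate(m1, k, n):
    """Equation (5.1): effective shared-data miss rate with n contexts,
    given single-context miss rate m1 and degradation constant k > 0."""
    if n < floor(m1 ** (-1.0 / k)):
        return m1 * n ** k
    return 1.0          # cache fully degraded: every reference misses

# Default m_r(1) = 0.5 with k_r = 0.4 (Figure 5.1 uses k_r = k_w = 0.4):
print([round(miss_rate(0.5, 0.4, n), 3) for n in range(1, 7)])
# -> [0.5, 0.66, 0.776, 0.871, 1.0, 1.0]
```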
To apply the cache model to our GSPN models, we replace m(N) with m_r(N) and m_w(N), representing the different miss rates for read and write accesses. Similarly, k is replaced with k_r and k_w. The performance curves for each memory consistency model can then be obtained by substituting m_r and m_w with m_r(N) and m_w(N) respectively.

Figure (5.1) shows the effects of cache degradation on the GSPN models for k_r = k_w = 0.4, using default parameter values. The corresponding dotted lines show the performance when cache interference due to multiple contexts is ignored. As expected, a drop in performance is observed for all the consistency models, due to the higher miss rate as N increases. The performance gains from the relaxed consistency models remain significant, especially for PC, where the problem of processor stalling, P{stall}, becomes less severe under a higher cache miss rate.

[Figure (5.1): Effects of cache degradation on different models]

5.2 Comparison with Simulation Results

We will compare the performance predicted by the GSPN models against the trace-driven simulation results reported by Gupta et al. [GHGx91] on Stanford's DASH multiprocessor system. The DASH architecture closely resembles our multithreaded processor model, with the exception that multiple hardware contexts were not used in the prototype construction. The benchmark applications used in the simulation are MP3D, LU and PTHOR. MP3D is a 3-dimensional particle simulator, LU performs LU-decomposition of dense matrices, and PTHOR is a parallel logic simulator.
Table (5.1) lists the parameters that will be used for the GSPN models, based on the data reported for the benchmark applications. The memory access latencies L_L and L_R are 18 and 73 cycles respectively, with the cache fill overhead Y = 8 cycles. The value of L_R is approximated by taking the average of the latencies between Home and Remote nodes. A fixed context switch overhead of 4 cycles is considered. The values of p_L are obtained using the data allocation scheme and accessing pattern of each application.

Program   s      s_r     s_w     s_acq = s_rel   m_r(1)   m_w(1)   p_L
MP3D      0.295  0.688   0.312   0.0             0.2      0.25     0.4
LU        0.297  0.6697  0.3295  0.0004          0.34     0.03     0.857
PTHOR     0.23   0.8617  0.1037  0.0173          0.23     0.53     0.0625

(L_L = 18, L_R = 73, Y = 8, C = 4)

Table (5.1): Parameter values obtained for MP3D, LU and PTHOR

For MP3D, the particles assigned to a processor are allocated from shared memory in that processor's node, but the space cells are distributed evenly among all nodes. From the data provided, 34% and 50% of the misses are from the particle and space cell data structures respectively, giving p_L ≈ 0.4, assuming the remaining misses have the same proportion.

For LU, columns are statically assigned to processors in an interleaved fashion and are allocated from local shared memory. The ratio between local and remote references can be obtained as

\frac{p_L}{p_R} \approx \frac{N_c}{2}    (5.2)

where N_c is the number of columns assigned to each node. For a 200 x 200 matrix divided among 16 processors, p_L / p_R ≈ 6, giving p_L ≈ 0.857.

For PTHOR, the logic elements are evenly distributed among all nodes. We assume equal probability for each logic element to be activated at any point in time. Thus for 16 processors, p_L = 1 / 16 = 0.0625.

Based on the values of p_L obtained, the average latencies can be approximated by:

L_{eff} \approx p_L L_L + p_R L_R    (5.3)

which gives L_eff ≈ 50, 25 and 70 cycles for MP3D, LU and PTHOR respectively. This is close to the 50, 20-27 and 60-80 cycles reported in the simulation results (a short numerical check follows the list of error sources below).

The cache degradation constants for LU are approximated using the reported miss rates, which gives k_w ≈ 4.4 and k_r ≈ 0.4. Since the miss rates for N ≥ 2 are not available for MP3D and PTHOR, we estimated their degradation constants under the assumption k_r = k_w. The values obtained are k_w = k_r = 0.6 for MP3D and k_w = k_r = 0.55 for PTHOR.

Figures (5.2) to (5.4) show the performance curves of the three applications under the SC and RC models. The continuous lines represent the predicted results using the GSPN models, while the isolated marks represent values reported from the trace simulation. We used a buffer size of B = 6 for RC, which is sufficient since the P{stall} values due to the buffer-full condition are negligible (less than 0.01) in all three cases.

For the case of LU, the predicted performance and the empirical values correlate well for both the SC and RC models. The predicted values are generally slightly higher than those reported, with a maximum absolute error of approximately 0.05. For MP3D and PTHOR, the values predicted are very close to the simulation results for the case of N = 1. Although the validity of the predicted performance values for N ≥ 2 remains open, we expect the actual values to be close to the predicted curves shown. The following are possible sources of error that might account for some of the discrepancies:

• In the simulation on DASH, the processor is blocked from accessing the cache for 4 cycles while the cache fill operation of another context completes. We did not include this blocking effect in our models, and considered only the overall cache fill overhead.
• Write hits take two cycles to complete, as compared to the zero cycles used in our model.
• Some of the parameters, such as p_L and L_R, are estimated based on the information provided, so some deviations are expected.
• Network and memory contention, which effectively increases the access latencies, was not considered in our model.
• Barriers and spinning on synchronization locks, which cause additional waiting times, were assumed to be negligible in our model.
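As a sanity check on the p_L estimates and equation (5.3), the few Python lines below reproduce the quoted effective latencies of roughly 50, 25 and 70 cycles. This is an illustration; the exact reported figures come from [GHGx91].

```python
L_L, L_R = 18, 73    # local and remote latencies in cycles (Table 5.1)

def l_eff(p_local):
    # Equation (5.3): latency averaged over the local/remote split.
    return p_local * L_L + (1 - p_local) * L_R

# LU: p_L / p_R ~ N_c / 2 with N_c = 200 / 16 columns per node, eq. (5.2).
ratio = (200 / 16) / 2                 # ~6.25
p_L_lu = ratio / (1 + ratio)           # ~0.862, quoted as ~0.857
for app, p in [("MP3D", 0.4), ("LU", p_L_lu), ("PTHOR", 1 / 16)]:
    print(app, round(l_eff(p), 1))     # -> ~51.0, ~25.6, ~69.6
```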
[Figure (5.2): MP3D with estimated k_r = k_w = 0.6]
[Figure (5.3): LU with estimated k_r = 0.4 and k_w = 4.4]
[Figure (5.4): PTHOR with estimated k_r = k_w = 0.55]

Figures (5.5) to (5.7) show the respective predicted performance curves under the SC, PC, WC and RC models, using the estimated cache degradation constants obtained earlier. For MP3D, WC and RC give the same performance, since the synchronization rate is negligible. PC performs comparably well, with some loss in performance due to the buffer size limitation. For LU, all three relaxed consistency models give nearly the same result. For PTHOR, PC and RC perform equally well, since the buffer size problem does not arise owing to the low write access rate. WC gives considerably lower performance due to the high synchronization rate, with P{stall} ≈ 0.15 for N = 4. In general, all relaxed memory consistency models provide some performance gain over the SC model for N ≤ 4.

[Figure (5.5): MP3D under different memory consistency models]
[Figure (5.6): LU under different memory consistency models]
[Figure (5.7): PTHOR under different memory consistency models]

5.3 Relative Merits of Various Memory Models

The GSPN models of a multithreaded processor under relaxed memory consistency, namely the PC, WC and RC models, have been presented. These models are based on the architectural assumptions that reads are blocking and that context switching takes place only on a cache miss. Performance curves obtained from numerical solutions of the GSPN models were presented for the different parameter values used.

While relaxed consistency models generally provide a substantial performance gain over the sequential consistency model in a single-threaded processor, the effect on a multithreaded processor depends on the system parameters and the context switching policy used. The amount of processor time spent stalling, due either to the write buffer being full or to the access constraints imposed by a specific consistency model, can have a severe impact on processor performance. A different context switching policy that performs a context switch on stall conditions may be applied to reduce the processor stall time. However, the implementation issues of how to handle the current request, and the additional context switching overhead incurred, require further investigation.

The size of the write buffer determines the amount of processor time spent stalling for a buffer entry. WC and RC require a relatively small buffer size to hide all the write latencies through pipelining of write accesses, without stalling the processor. However, PC requires a much larger buffer size, owing to its relatively slower service rate for write accesses, since pipelining within each thread is not allowed. For small buffer sizes or high write access rates, PC may perform worse than SC due to excessive processor stalling time. Given a sufficient buffer size and a low write access rate, PC may perform better than WC in applications with a high synchronization rate, such as PTHOR, since no synchronization constraint is imposed on PC.
The performance of WC is generally governed by the rate of synchronization in the program. For a low synchronization rate, WC performs as well as RC in hiding all write latencies. For a high synchronization rate, the performance is limited by the processor stalling for synchronization accesses, which have to be sequentially consistent. Given that most applications contain relatively low synchronization rates, as in MP3D and LU, WC is expected to give a performance gain comparable to that from RC. In addition, a higher fraction of write accesses provides greater potential for performance gain under WC and RC, with relatively small buffer sizes.

Chapter 6
Conclusions

6.1 Summary of Research Contributions

The main research contribution of this thesis comes from the stochastic modeling of a multithreaded processor under various memory consistency models, using GSPN modeling techniques. The capability of GSPNs to model stochastic processes demanding both concurrency and synchronization overcomes the limitations imposed on other modeling techniques, such as queueing models.

The basic GSPN model of an abstract multithreaded processor was proposed, and the performance curves obtained were verified against published analytical results. The GSPN was extended to model a multithreaded processor equipped with coherent caches in a DSM system, under the constraints of the SC consistency model. A set of parameters used in the model was defined, and the performance curves obtained under variations of selected parameter values were computed and analyzed.

The GSPN was further extended to model the multithreaded processor under the relaxed memory consistency models, namely PC, WC and RC. Each model was treated separately under the different access constraints imposed by the memory model, with a unified context switching policy. The effects of different buffer sizes and synchronization rates were presented for each of the relaxed memory models. Based on the results computed, RC always gives the best performance, hiding all write latency with a reasonably small buffer size. The performance of PC depends very much on the buffer size used and the effective service rate of the write buffer. The performance of WC is primarily affected by the synchronization rate of the application. With a low synchronization rate and a large buffer size, all three relaxed consistency models give comparable performance gains over SC.

A simple cache model was incorporated into the GSPN models to capture the effect of cache degradation on multithreaded processor performance. A comparison was carried out between the performance predicted by the GSPN models and some published simulation data. For SC and RC, the GSPN models correlate well with the published simulation results, while PC and WC require further validation due to the lack of simulation results.

6.2 Suggestions for Further Research

The GSPN models are based on a number of assumptions made in order to limit the complexity of the net.
Further refinements are required to improve the accuracy of the models. Two of the problems encountered in constructing the models are listed below as subjects for further research:

• Context Switching Policy: We have used a simple switch-on-cache-miss policy for our multithreaded processor model. This poses a potential bottleneck for the relaxed consistency models with a finite buffer size and a high synchronization rate, which stall the processor from further execution. A different context switching scheme and mechanism that allows a context switch on buffer-full or synchronization-wait conditions would be useful in reducing the processor stalling time. Further investigation into the extra switching time incurred and the corresponding implementation issues is necessary.

• Network and Memory Contention Models: In our models, we assumed the access latencies to be random variables having an exponential distribution. As the number of contexts increases, we expect the latencies to increase due to limited network bandwidth and higher memory contention. Stochastic modeling of the access latencies, based on network and memory contention models of a multithreaded processor, would provide a better approximation for different numbers of contexts in each node.

Bibliography

[ACCx90] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield and B. Smith, "The Tera computer system," Proc. ACM Int'l Conf. on Supercomputing, pp. 1-6, June 1990.

[Agar92] A. Agarwal, "Performance tradeoffs in multithreaded processors," IEEE Transactions on Parallel and Distributed Systems, 3(5), pp. 525-539, Sept. 1992.

[Ajmo84] M. Ajmone Marsan et al., "A class of Generalized Stochastic Petri Nets for the performance evaluation of multiprocessor systems," ACM Trans. on Computer Systems, 2(2), pp. 93-122, May 1984.

[Bell92] G. Bell, "Ultracomputers - a teraflop before its time," Communications of the ACM, 35(8), pp. 27-47, Aug. 1992.

[BR90] R. Bisiani and M. Ravishankar, "PLUS: A distributed shared-memory system," Proc. 17th Int'l Symp. Computer Architecture, IEEE CS Press, Los Alamitos, pp. 115-124, 1990.

[Chi87] G. Chiola, "GreatSPN user manual version 1.3," Technical report, Dipartimento di Informatica, Universita di Torino, Torino, Italy, Sept. 1987.

[Hwa93] K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programmability, McGraw-Hill, 1993.
[DSB86] M. Dubois, C. Scheurich and F. Briggs, "Memory access buffering in multiprocessors," Proc. 13th Int'l Symp. Computer Architecture, pp. 434-442, June 1986.

[DSB88] M. Dubois, C. Scheurich and F. Briggs, "Synchronization, coherence, and event ordering in multiprocessors," IEEE Computer, pp. 9-21, Feb. 1988.

[GGH91] K. Gharachorloo, A. Gupta and J. Hennessy, "Performance evaluation of memory consistency models for shared-memory multiprocessors," Proc. 4th Int'l Conf. on Arch. Support for Prog. Lang. and O.S., April 1991.

[GHGx91] A. Gupta, J. Hennessy, K. Gharachorloo, T. Mowry and W.D. Weber, "Comparative evaluation of latency reducing and tolerating techniques," Proc. Int'l Symp. Computer Architecture, pp. 254-263, May 1991.

[GLLx90] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta and J. Hennessy, "Memory consistency and event ordering in scalable shared-memory multiprocessors," Proc. 17th Int'l Symp. Computer Architecture, pp. 15-26, May 1990.

[Good89] J.R. Goodman, "Cache consistency and sequential consistency," Technical Report 61, IEEE SCI Committee, 1989.

[Lam79] L. Lamport, "How to make a multiprocessor computer that correctly executes multiprocess programs," IEEE Trans. Computers, 28(9), pp. 241-248, Sept. 1979.

[LH89] K. Li and P. Hudak, "Memory coherence in shared virtual memory systems," ACM Trans. Computer Systems, 7(4), Nov. 1989.

[LLGx90] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta and J. Hennessy, "The directory-based cache coherence protocol for the DASH multiprocessor," Proc. 17th Int'l Symp. on Computer Architecture, pp. 148-159, May 1990.

[LLGx92] D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, A. Gupta, J. Hennessy, M. Horowitz and M.S. Lam, "The Stanford Dash multiprocessor," IEEE Computer, pp. 63-79, Mar. 1992.

[LLJx92] D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta and J. Hennessy, "The DASH prototype: Implementation and performance," Proc. 19th Int'l Symp. on Computer Architecture, pp. 92-103, 1992.

[LS89] K. Li and R. Schaefer, "A hypercube shared virtual memory system," Proc. Int'l Conf. Parallel Processing, 1989.

[MH93] W. Mao and K. Hwang, "Effects of prefetching and release consistency on multithreaded multiprocessor performance," Technical Report, Dept. of EE-Systems, Univ. of Southern California, Los Angeles, CA, April 1993.

[NL91] B. Nitzberg and V. Lo, "Distributed shared memory: A survey of issues and algorithms," IEEE Computer, Aug. 1991.

[Saav90] R.H. Saavedra-Barrera, D.E. Culler and T. von Eicken, "Analysis of multithreaded architectures for parallel computing," Proc. 2nd ACM Symp. Parallel Algorithms and Architectures, July 1990.

[Saav91] R.H. Saavedra-Barrera and D.E. Culler, "An analytical solution for a Markov chain modeling multithreaded execution," Technical Report UCB/CSD-91/623, Computer Science Division, UC Berkeley, March 1991.

[TMC91] TMC, "The CM-5 technical summary," Thinking Machines Corp., Cambridge, MA, 1991.

[WGB89] W.D. Weber and A. Gupta, "Exploring the benefits of multiple hardware contexts in a multiprocessor architecture: Preliminary results," Proc. 16th Int'l Symp. on Computer Architecture, pp. 273-280, June 1989.