I-STRUCTURE SOFTWARE CACHES: EXPLOITING GLOBAL DATA LOCALITY IN NON-BLOCKING MULTITHREADED ARCHITECTURES

by

Wen-Yen Lin

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (COMPUTER ENGINEERING)

MAY 2000

Copyright 2000 Wen-Yen Lin

Dedication

To my daughter Erin and my lovely wife Shu-Chiao.

Acknowledgements

My advisor, Professor Jean-Luc Gaudiot, was the inspiration for this thesis. I would like to thank him for his inspiration, guidance, and support. It was a privilege to have been one of his students.
I would also like to express my gratitude to Professors Massoud Pedram and Ming-Deh Huang for serving on my dissertation committee. Their inquisitive questions have stimulated my research. I also thank Professors Timothy Pinkston, Viktor K. Prasanna, and Rafael H. Saavedra for serving on my Ph.D. guidance committee.

The CAPSL (Computer Architecture and Parallel Systems Laboratory) at the University of Delaware, led by Prof. Guang R. Gao, generously provided me with access to EARTH platforms. Their continuous efforts in developing, implementing, and maintaining the EARTH machines gave me a reliable experimental environment, which led to the significant results presented in this dissertation. I thank Dr. Jose Nelson Amaral for valued discussions and for reviewing portions of my analysis.

I would like to thank my colleague and very good friend, Chung-Ta Cheng, for many insightful conversations and for his friendship. I also would like to thank my former group members Dr. Yung-Syau Chen, Dr. Hung-Yu Tseng, Dr. Dae-Kyun Yoon, Dr. Halima Elnaga, and Dr. Hiecheol Kim for their advice and encouragement. I thank Dr. NamHoon Yoo for his well-developed simulator; without his solid work, I would have spent much more time developing my own. I also acknowledge PDPC group members Chulho Shin, James Burns, and Steve Jenks. Special thanks go to Mary Zittercob and Joanna Wingert for their assistance.

My sincere gratitude goes to my parents, my brother, and my sister for their love and long-time support. I must thank my wife Shu-Chiao Huang for her understanding and love, which I can never repay. Without her continuous support, I would never have been able to finish the program. Finally, I would like to thank my daughter Erin, who brought me joyful moments and strength during the last stage of my studies.

Contents

Dedication
Acknowledgements
List of Figures
List of Tables
Abstract
1 Introduction
  1.1 Synopsis
2 Background Research
  2.1 Multithreaded architectures
    2.1.1 Blocking Multithreaded Architectures
    2.1.2 Non-blocking Multithreaded Architectures
  2.2 I-Structure memory system
  2.3 Motivation
  2.4 Related Work
    2.4.1 On Memory Models and Cache Management for Shared Memory Multiprocessors
    2.4.2 IS-Cache Design on the ETS System
    2.4.3 Scalable I-Structure Cache design
    2.4.4 A Cache Design for Input Token Synchronizations
    2.4.5 Empirical Study of a Dataflow Language on the CM-5
3 I-Structure Software Caches (ISSC)
  3.1 I-Structure Cache Design
    3.1.1 Deferred Requests Handling
    3.1.2 Deferred Queue Storage
    3.1.3 Deferred Read Sharing Problem
    3.1.4 Legality of Write Operations
  3.2 The I-Structure Software Cache (ISSC) Runtime System
    3.2.1 Write-direct Policy
    3.2.2 Set-Associative Cache Allocation
    3.2.3 Cache Advance
    3.2.4 Deferred Read Sharing
    3.2.5 "Centralized" Deferred Requests and Distributed Deferred Reads
    3.2.6 Virtual Addressing
    3.2.7 Cache Replacement Policy
    3.2.8 ISSC System Overview
  3.3 Simulation Results
    3.3.1 The Simulator
    3.3.2 Simulation results
      3.3.2.1 The data locality
      3.3.2.2 The network traffic
      3.3.2.3 The system performance
    3.3.3 The effect of cache advance
    3.3.4 Cache Replacement
  3.4 Summary
4 ISSC implementation on EARTH systems
  4.1 EARTH Architecture
    4.1.1 Fine Grain Multi-Threading
    4.1.2 Split Phase Communication and Synchronization
  4.2 Single Assignment Storage Structures
  4.3 ISSC Implementation on EARTH
    4.3.1 ISSC implementation using the Threaded-C language
    4.3.2 Usage of ISSC in the Threaded-C language
5 Experiment results on EARTH systems
  5.1 Highlights of Experimental Results
  5.2 The Cost of ISSC Operations
  5.3 Description of Benchmarks
  5.4 Robustness to Latency Variation
  5.5 Summary
6 Performance Modeling
  6.1 Performance Analysis
  6.2 The Analytical Models
    6.2.1 Verifying the Model
    6.2.2 Performance Predictions
  6.3 Summary
7 Conclusions and future research
  7.1 Conclusions
  7.2 Future research
Reference List
Appendix A: ISSC's Implementation on EARTH using the Threaded-C Language
  A.1 ISSC Structure
  A.2 ISSC Operations
Appendix B: Using ISSC with the Hopfield Benchmark
  B.1 Hopfield Benchmark
  B.2 Makefile

List of Figures

2.1 Distributed Deferred Queue Storage
3.1 Centralized Deferred Queue Storage
3.2 Data Block Integration
3.3 Structure of I-Structure Software Caches
3.4 Cache Advance Allocation
3.5 Deferred Read Sharing
3.6 The overview of the I-Structure Software Cache runtime system
3.7 The Hit Ratio of Remote Requests: (a) Matrix Multiplication, (b) Conjugate Gradient, (c) 1-D FFT and (d) LU-Decomposition
3.8 The Number of Network Packets: (a) Matrix Multiplication, (b) Conjugate Gradient, (c) 1-D FFT and (d) LU-Decomposition
3.9 Speedup measurements: (a) Matrix Multiplication, (b) Conjugate Gradient, (c) 1-D FFT and (d) LU-Decomposition
3.10 The Effect of Cache Advance: (a) Matrix Multiplication and (b) Conjugate Gradient
3.11 Cache Replacement and Hit Ratio in the MM Benchmark with Varying Cache Size
3.12 Cache Replacement and Hit Ratio in the CG Benchmark with Varying Cache Size
4.1 The EARTH Model
4.2 (a) (1) An active fiber in the EU of Pi requests an EARTH split-phase block-move-sync operation; (2) the SU of Pi decodes the source address to the memory of Pj and sends a request for the block; (3) the SU of Pj receives the request and reads the block from the local memory. (b) (4) The SU of Pj sends the block over the network to the SU of Pi; (5) the SU of Pi writes the block into the local memory; (6) the SU of Pi decrements a synchronization slot counter, which becomes zero and causes the spawning of a fiber that will use the transferred block.
4.3 State Transition Diagram for the I-Structure Implementation
4.4 State Transition Diagram for the I-Structure Software Cache
4.5 Threaded-C with ISSC program example
5.1 Speedup on the MANNA machine
5.2 Absolute speedup with 10 μs communication interface overhead
5.3 Execution time with synthetically variable communication interface overhead
6.1 Execution time with add-on synthetically variable communication interface overhead: (a) Dense Matrix Multiplication, (b) Conjugate Gradient, (c) Hopfield, (d) Sparse Matrix Multiplication
6.2 Performance prediction for different benchmarks
6.3 Performance prediction for communication optimization
6.4 Performance prediction for technology improvement
List of Tables

2.1 Comparison of Blocking and Non-blocking Multithreaded Executions
5.1 Latency of EARTH and ISSC operations on EARTH-MANNA-SPN, measured in number of cycles (1 cycle = 20 ns)
5.2 I-Structure Software Cache Hit Ratios (%)
5.3 Average number of remote memory requests per node
6.1 Timing equations and the cross-points (μs)
6.2 Benchmark-related Parameters
6.3 Platform-related Parameters Measured from the MANNA machine

Abstract

Non-blocking multithreaded execution models have been proposed as an effective means to overlap computation and communication in distributed memory systems without any hardware support. Split-phase operations enable the tolerance of request latencies by decoupling the initiators and the receivers of communication/synchronization transactions. However, the locality of shared distributed global data is not exploited by conventional caches; moreover, each request also incurs the cost of communication interface overhead. In this dissertation, we design our ISSC (I-Structure Software Cache) system to further reduce the communication overhead of non-blocking multithreaded execution, and we develop a simulator to validate our design. The single assignment property of I-Structures eliminates the need for a cache coherence protocol and greatly reduces the overhead of this software cache; it is this property that makes the concept of a software cache feasible. This software cache combines the benefits of latency reduction and latency tolerance in a non-blocking multithreaded system without any hardware support. We then implement our ISSC on top of the EARTH system, a fine-grain multithreading system that can be built from off-the-shelf microprocessors, and we study the performance of ISSC on the EARTH-MANNA machine with a set of benchmarks. Our studies indicate that ISSC improves system performance and makes the system more robust. We further develop analytical models for the performance of a multithreading system with and without ISSC, and we compare the models' predictions with our experimental results on EARTH-MANNA machines. These analytical models allow us to predict at what ratio of communication latency to processing speed the implementation of ISSC becomes profitable for applications with different characteristics. As a consequence, the system can be ported to a wider range of machine platforms and deliver speedup for both regular and irregular applications.

Chapter 1

Introduction

Multithreaded architectures have been proposed as a means to overlap computation and communication in distributed memory systems. By switching to the execution of other ready threads, the communication latency can be hidden from useful computation as long as there is enough parallelism in an application.
Non-blocking multithreaded execution models, like TAM [22], P-RISC [58], and EARTH [37], have been proposed to support multithreaded execution on a conventional RISC-based multiprocessor system without the need for any specific hardware support for fast context switching. In these models, remote memory requests are structured as split-phase transactions so that the processor can continue executing other instructions which do not depend upon the request in progress. The request carries a tag, the continuation vector, indicating the return address of the requested data in the consumer thread. The requested data is sent directly to the consumer thread identified by the continuation vector. The arrival of the last requested data item in that consuming thread then activates the thread, making it ready to be executed. Therefore, in these models, thread activations are data driven: a thread is activated only when all the data elements it needs are available locally. Once a thread starts to execute, it executes to the end. In such a non-blocking multithreaded execution model, once the execution of a thread terminates, no thread context needs to be saved before switching to the execution of another active thread. Therefore, multithreaded execution is achieved without the need for any specific hardware support.

A good execution model must be based on a good memory system to achieve high system performance [25]. An I-Structure memory system [9] provides split-phase memory accesses to tolerate communication latency. It also provides non-strict data access, which allows each element of a data structure to be accessed once that element is available, without waiting for the whole data structure to be produced. Each element of an I-Structure has a presence bit associated with it to indicate the state of the element, such as Empty and Present. Data can only be written into empty elements, and the slots are set to the Present state after the data has been written into them. A read from an empty element is deferred until the data is produced. The split-phase accesses to the I-Structure elements provide fully asynchronous operations on the I-Structure. The non-strict data access on the I-Structure gives the system a better chance to exploit fine-grain parallelism. The fully asynchronous operations on I-Structures make it easier to write a parallel program without worrying about data synchronization, since the data are synchronized in the I-Structure itself. The single assignment rule of the I-Structure provides a side-effect-free memory environment and maintains the determinacy of programs. All of these features make I-Structures, along with non-blocking multithreading, an ideal model for parallel computing.

While the combination of non-blocking multithreaded execution and an I-Structure memory system appears to be a very attractive architecture for high performance computing, the major drawback of this system is that the locality of remote data is not utilized. Since all remote requests are translated into split-phase transactions, which are different from local memory read/write operations, accesses to remote data do not pass through the local cache system, and every remote data access is actually sent to the remote host.
On the other hand, in some other multithreaded architectures, like ALEWIFE [2], FLASH [50], and *T-N.G. [8], every memory access is issued as a local memory operation. Thread switching occurs when the processor stalls on cache misses or synchronization failures at run-time. This kind of model is what we call "blocking multithreaded." Thread switching involves context saving of the suspended thread, context loading of the next thread, and pipeline flushing. The overhead is much larger than in non-blocking multithreaded execution, and it usually requires specific hardware support for fast context switching. Fortunately, the use of a local cache system exploits global data locality and hence reduces the number of remote requests as well as the number of context switches.

Given the very small context-switching overhead of the non-blocking multithreaded models, the highest overhead in these models is in the communication interface. The sending and receiving of network packets may take from dozens to thousands of cycles depending on the design of the network interface [21]. Since all requests are actually sent to the remote hosts through the network, every sent and received request incurs the network interface overhead. Moreover, the requests for I-Structure memory accesses also congest network traffic, which may ultimately degrade system performance.

The goal of this research is to develop an efficient cache scheme for the I-Structure memory system in non-blocking multithreaded multiprocessor systems, so that it exploits global data locality, reduces the number of network packets, and hence improves overall system performance. The target environment we have in mind is a message-passing distributed memory multiprocessor system. The non-blocking multithreaded execution model is a compiler-controlled multithreaded execution model, and it can be implemented on any conventional RISC-based multiprocessor system without any add-on hardware support for multithreaded execution. We intend to add our cache system to this model, again without any specific hardware support, to further improve system performance. Therefore, a software implementation of this cache scheme, the I-Structure Software Cache (ISSC), is proposed here to exploit global data locality without adding any specific hardware support to the non-blocking multithreaded execution model. However, we do not limit this cache scheme to a software implementation only; this research also provides a foundational study for a hardware implementation.

1.1 Synopsis

This dissertation is organized as follows. Before discussing the design of I-Structure caches, we present broad background research and report related work in Chapter 2. Building on that background, we discuss the design issues of I-Structure caches in Chapter 3; we also describe the approaches adopted in our design and provide details of our I-Structure Software Cache implementation. We then simulate our ISSC design with selected benchmarks to validate the design. Not content with simulation results alone, we also sought a real system on which to implement our ISSC.
In Chapter 4, we give a brief introduction to our target system, EARTH, and describe the implementation of our ISSC on that system using the Threaded-C language. In Chapter 5, we measure the costs of ISSC operations on the EARTH-MANNA machine to better understand the overhead of a software cache; in that chapter, we also exercise our ISSC with a set of real benchmarks, measure its performance, and show that ISSC makes the system more robust to latency variation. In Chapter 6, we develop analytical models of a multithreading system with and without ISSC, verify these models against the experimental results measured in Chapter 5, and make performance predictions for different benchmark characteristics and a wider range of machine platforms. We then conclude the dissertation with the contributions of this research and some directions for future work.

Chapter 2

Background Research

In this chapter, I present broad background research on multithreaded architectures and the I-Structure memory system. After exploring this background, I describe the motivation of this research and introduce some related work on I-Structure cache designs.

2.1 Multithreaded architectures

Multithreaded architectures have been shown to be a promising way for distributed memory multiprocessor systems to overlap communication latency with useful computation. As VLSI technology progresses, processor speeds of hundreds of MHz are easily achieved; on the other hand, the gap between CPU and memory speeds keeps growing. In distributed memory multiprocessor systems, remote memory access latency is even worse, because it consists not only of the memory access time, but also of the network interface overhead, the network latency, and sometimes the data synchronization overhead. System performance degrades considerably if processors stall while their remote memory accesses are in progress. To increase processor utilization, overlapping communication latency with useful computation becomes necessary. By maintaining multiple process contexts and switching among them in a few cycles, multithreaded architectures can hide the communication latency behind computation and reduce processor idle time.

Multithreaded processors should be attractive as nodes for a massively parallel machine programmed using the SPMD model. The single-program multiple-data (SPMD) model [24] is currently gaining wide acceptance for massively parallel scientific computation. The model is implemented by mapping it onto an MIMD multiprocessor, either manually or by using a data-parallel compilation strategy [36, 75, 45, 48, 26]. The SPMD model provides a good target language for the array computation features of such languages as Fortran 90 [73], High Performance Fortran [35], and certain functional programming languages (Sisal [56, 17, 18, 19] and Id [59], for example).
In some multithreaded models, like TAM [22], P-RISC [58], and EARTH [37], all remote memory accesses are translated into split-phase transactions at compilation time, and a thread is activated only when all its data inputs are available locally. Therefore, once a thread starts to execute, it executes to the end. This kind of execution model is called "non-blocking multithreaded." The I-Structure memory system is a split-phase-access memory system: it provides non-strict data accesses, fully asynchronous memory operations, and fine-grain parallelism. This makes I-Structure memory, along with non-blocking multithreading, an ideal model for parallel computing. On the other hand, in some other multithreaded architectures, like ALEWIFE [2], FLASH [50], and *T-N.G. [8], every memory access is issued as a local memory operation. Thread switching occurs when the processor stalls on cache misses or synchronization failures at run-time. This kind of model is what we call "blocking multithreaded." With dedicated hardware support in this model, context switch overhead can be reduced to tens of machine cycles, so the communication latency can be overlapped by the interleaved execution of several threads.

2.1.1 Blocking Multithreaded Architectures

By "blocking," we mean blocking multithreaded architectures, in which thread executions are suspendable and resumable. The idea is that during the execution of a thread, if the processor would stall waiting on remote requests, synchronization failures, or even on local data being brought from local memory to the cache after a cache miss, the processor would rather suspend the current thread and switch to the execution of other threads than sit idle waiting for the action to complete. However, to resume the execution of a suspended thread, the thread context needs to be saved when the thread is suspended. Since the point at which a thread will be suspended cannot easily be predicted at compilation time, all the registers, status words, and some memory space have to be saved when a thread suspension occurs. Moreover, the thread context chosen to execute next has to be loaded into the processor after the context of the suspended thread is saved. All of this work during thread switching (context saving and loading) must be done very efficiently; otherwise the processor might as well stay on the same thread and simply idle until the remote request, synchronization failure, or cache miss is resolved.

Therefore, most of the architectures supporting blocking multithreaded execution, like Horizon [49], Tera [4], MASA [33], the J-Machine [61], ALEWIFE [2], FLASH [50], and *T-N.G. [8], have dedicated hardware which supports fast context switching. For example, ALEWIFE uses a modified SPARC processor, the Sparcle [3] processor, to support blocking multithreaded execution. In Sparcle, the register set is divided into several frames that are conventionally used as register windows [42],[65] for speeding up procedure calls in SPARC. In their design, the register file is partitioned into four hardware contexts. A context switch to a process whose state is currently stored in one of the register frames on the processor is effected in a small number of cycles. Each Sparcle processor supports up to four hardware threads and unlimited virtual processes. The mapping of process contexts to register frames
Each Sparcle processor will support up to four hardw are threads and unlim ited virtual processes. The m apping of process contexts to register frames 7 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. is managed by software. By this dedicated hardware design, th e context switching could be achieved w ithin 14 cycles. However, this kind of hardw are support would increase the com plexity and the cost of processor design. However, there are still some side effects of the blocking m ultithreading which are hard to minimize by dedicated hardwares. One of the side effects is pipeline flushing. In pipelined processors, all the instructions entering pipeline becom e invalid right after the thread was suspended and the first instruction of next thread has to be fetched into the first stage of pipeline. T his will result in bubbles in the pipeline. T he deeper the pipeline, the higher the overhead suffered by th e system . T he other side effect is cache contention [1], In the blocking m ultithreaded execution, all the existing threads (including the active thread, ready threads, and suspended threads) of the processor com pete for the lim ited cache space with each other. This gives rise to a higher cache miss rate. Fortunately, th e exploitation of the global data locality reduces the num ber of rem ote requests an d th e num ber of context switches. In the blocking m ultithreaded execution, all the rem ote memory accesses are treated as local accesses. In machines w ith caches, the actu al rem ote requests are sent to the rem ote hosts only when they are missed from th e local cache. The block of data located in rem ote hosts will be brought back to th e local cache with the requested d ata and the following rem ote mem ory accesses m ay hit the local cache. Therefore the th read execution could be continued w ithout suspension. The reduction of the actual rem ote requests also gives a lower netw ork traffic rate. However, th e use of caches in the m ultiprocessor systems raises another im portant issue in m ultiprocessor system design, nam ely the cache coherence problem [28, 20]. In conclusion, by m aintaining m ultiple process contexts in processors supporting blocking m ultithreaded execution with fast context switch, the thread execution will be suspended when it stalls on the rem ote requests, cache misses, or synchronization failure, and the processor switches to the execution of other threads. Such th a t the 8 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. communication, latency could be overlapped w ith useful com putation and processor utilization increases. Since there are dedicated hardw ares to support the context switching at the run-tim e, all of the m em ory accesses could be treated as local m em ory accesses and th e actual remote requests are m ade only when they are missed from the local cache. Therefore, the global d a ta locality could be easily exploited by the local cache an d hence reduces the chance of thread switching and network traffic. However, provided sufficient parallelism exists, the num ber of threads in th e processor is still lim ited by the complexity of th e processor and the increased cache miss rate [1]. Indeed, w ith the requirement of dedicated hardw are support for fast context switching and m aintained cache coherence, it would take a long tim e and would be very costly to build this kind of system s. 
2.1.2 Non-blocking Multithreaded Architectures Split-phased transaction [34, 72] is an asynchronous m em ory access scheme in m es sage passing m ultiprocessor systems. In the system s, rem ote memory requests are structured as split-phased transactions so th a t m ultiple requests may be in progress at one time. An instruction issues a request to the processor or memory m odule containing the desired data, and then other instructions which do not depend upon the result of the request in progress are executed. The request carries a tag, contin uation vector, indicating the return address in the consum ing thread at which the com putation should be continued when the response arrives. By splitting the rem ote m em ory request into two phases, requesting and consuming, the processor could con tinue executing other useful com putations w ithout wait for the d ata to arrive while the request is in progress. The arrival of the requested d ata will be sent directly to the consuming th read identified by the continuation vector. This feature of the split- phased transaction provides the ability for overlapping the communication latency w ith useful com putations. 9 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Basically, the non-blocking m ultithreaded execution model was evolved from the concept of data-flow execution model. Data-flow execution can be thought of as a very fine-grain m ultithreaded execution model. Indeed, in data-flow models, each thread contains only one instruction. The instruction will be activated only when the operands it needs are generated. After the execution of the instructions finished, the output d ata token is passed to other instructions and activates them . This would result in a sequence of activation. To improve the performance of the data flow architecture, m ost processors designed for data-flow execution are pipelined, like MONSOON [63, 64, 34] and RA PID [60]. T he stages of the pipelined pro cessor are interleaved with different sequences of activations. Therefore, the high through put could be achieved. However, due to the high cost of the m atching unit for the operands matching of instructions and the poor performance of the single sequence of activation execution, researchers from this area proposed the non- blocking m ultithreaded execution model. Examples of m ultithreaded architectures based on dataflow' models are: Iannucci’s work [39, 40] in combining dataflow ideas with sequential th read execution to define a hybrid com putation model, the EM-4 project [47, 46, 68] at the Electrotechnical Laboratory (ETL) in Japan, the successor of MONSOON project, *T [14], TAM [22], P-RISC [58], and EARTH [37]. The main idea of non-blocking m ultithreaded execution is to group the sequence of activation w ithout any remote mem ory accesses, branches, and synchronization into one thread a t compilation time. So th at, once a thread starts to execute, it executes to the end. A thread is like an atom ic execution unit, which is like an instruction in the data-flow execution model. Since the thread execution will not be suspended, no context needs to be saved for the thread at the run-tim e. As for the beginning of a thread execution, since the execution is not resum ed from previous execution, no process context needs to be loaded. Therefore, there is almost no overhead during the thread switching. This is the reason why it is easier to im plem ent this kind of model from off-the-shelf processors [55]. 
Moreover, 10 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. since the th read boundary has been determ ined at com pilation tim e, in th e pipelined processors, the first instruction of the thread which will be executed right after the current th read could be pre-fetched into to the pipeline while the last instruction of the current thread is in the execution stage. So th a t, the pipeline is also highly utilized while thread switching without any bubble stages. For th e rem ote m em ory accesses, the requesting and consuming of the rem ote data are broken into different threads by using th e split-phased transaction. In a split-phased transaction m em ory access, along w ith th e requested data address, the continuation vector (th e return address of th e requested data) was sent to the rem ote host by th e requesting thread. According to the continuation vector, th e requested data are sent back directly to the consumer th read by the rem ote host. And, the consumer thread may become active if all th e d a ta it needs are available locally. Since the requesting threads are only responsible for sending out the requests, it is not necessary for them to wait for the requests to complete after sending out the requests. T h e processor could continue the execution of current thread or other active threads while those split-phased transactions are in process. Therefore, the com m unication latency could be hidden from th e execution of other threads. However, the m ajor drawback of this non-blocking m ultithreaded execution model is th at the global d ata locality is not exploited. Since every rem ote m em ory access has been compiled into split-phased transaction explicitly, each rem ote access actu ally send out the request to the remote host and th e remote host sends back the reply message along w ith the requested data. These requests are different from the local m em ory read/w rite operations, and, therefore, these rem ote m em ory accesses do not pass through th e local cache systems and the local cache system takes no advantages of the rem ote d ata locality. On th e other hand, in the blocking mul tithreaded m odel, every m em ory access is issued as local memory operation. The request is sent to the rem ote host only when the d ata is not in the local cache. The 11 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. local cache system exploits not only the d ata locality from local memory but also the global d ata locality. Characteristics Blocking M ultithreading Non-Blocking Multithreading Thread execution Interleaved (suspendable) Atom ic entity Thread switching Run-time controlled Compilation time controlled Hardware support context switch Needed Unnecessary Rem ote memory accesses Local accesses Split-phased transactions Pipeline flushing Yes No Granularity Coarse Fine G lobal data locality exploitation Easy Difficult Network traffic Medium High Table 2.1: C om parison of Blocking and Non-blocking M ultithreaded Executions Finally, I would like to sum up the com parison between the blocking m ulti threaded model and th e non-blocking m ultithreaded model in table 2.1. 2.2 I-Structure memory system An I-Structure m em ory system [11, 9, 34] is a conventional d ata structure with some constrains on its construction and destruction. 
It is designed as data storage for scientific applications in parallel computing, to achieve efficient accesses, provide fine-grain parallelism, and preserve the determinacy of the computation. The I-Structure memory system explicitly uses split-phase transactions for memory accesses, which gives the system the ability to hide the latency of accessing I-Structure memory behind useful computation. Each element of an I-Structure has a presence bit associated with it to indicate the state of the element, such as Empty and Present. There are three primitives for operations on the I-Structure memory system.

• I-allocation allocates consecutive data elements for an array structure; these data elements are initialized to the Empty state.

• I-store stores a produced data item into an empty data element. After the data item is written into the data element, its state is set to Present.
Indeed, w ith the capability for latency tolerance brought by split- phased transactions and the small overhead of thread switching in the non-blocking m ultithreaded execution, it would appear that the length of the remote request is ir relevant. However, it turns out th at communication latency tolerance is based on one central assum ption: it could be hidden from com putation as long as there are enough ready threads. W hen there is not enough parallelism, single thread performance is closely related to the com m unication latency and it becomes critical. Exploiting the global data locality will reduce the mean tim e between thread activation and therefore the processor utilization increases in the critical section. Secondly, even with enough threads to tolerate communication latencies and low thread switching overhead, the highest overhead of this architecture is in the communication interface. The sending and receiving of network packets m ay take from dozens to thousands of cycles depending on the design of the network interface [21]. Since all requests are actually sent to the rem ote hosts through the network, all the sending and re ceiving requests incur the network interface overhead. Finally, even though m any machines include dedicated hardware to handle the netw ork communication so th at communication interface overhead is taken away from the com putation processors, the requests for all the rem ote data accesses on the network also congest network traffic, which m ay ultim ately degrade system perform ance. Therefore, an I-Structure cache system which caches these split-phase transac tions in non-blocking m ultithreaded execution is required to further reduce commu nication latencies and release the network traffic. This cache system would provide ability for com m unication latency reduction while m aintaining the com m unication 14 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. latency tolerance ability in this architecture. Therefore, our goal is to develop a novel I-Structure cache schem e to exploit global d a ta locality in non-blocking m ul tithreaded architectures. T h e target environm ent we have in m ind is a message- passing distributed m em ory m ultiprocessor system . The non-blocking m ultithreaded execution model is a com piler controlled m ultithreaded execution, and it could be im plem ented on any conventional RISC-based m ultiprocessor system w ithout any add-on hardware support for m ultithreaded execution. The single assignment prop erty of the I-Structure elim inates the cache coherence problem from the cache design. This would make it possible to implement the cache system as a software run-tim e system without being detrim ental to the system perform ance, and we would intend to include our cache system to this model w ithout any specific hardw are support for further improvement of system performance. Therefore, in this proposed research, we developed an I-Structure Software Cache (ISSC) [51] in the non-blocking m ultithreaded execution model w ith I-Structure-like m em ory environment w ithout adding any hardw are. We would like to see the im pact of the ISSC on the overall system perform ance by analyzing the data locality utilization, network traffic, overhead distribution, and speed-up curves of some ap plications. 
2.4 Related Work There is some research about the I-Structure Cache design which has been pursued elsewhere, but non of the designs are intended to im plem ent as a run-tim e system . 1 5 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 2.4.1 On Memory Models and Cache Management for Shared Memory Multiprocessors Dennis and Gao [25] proposed a cache m anagem ent scheme for their A bstract Shared- M em ory com puter system, which is a dataflow program execution model and spec ified the I-Structure model as the m em ory system to support the synchronizing m em ory operations. They proposed a high-level concept of the I-Structure Cache m anagem ent scheme but without detail im plem entation. In their design, a cache line will be allocated first in the local cache when a read miss occurs. The continuation vector of the original request will be stored in this allocated cache line, and a new request will be forwarded to the rem ote host by using the address of the allocated cache line as th e continuation vector but not the original one. The later requests for the same d ata item will be deferred in the cache line while th e first request is in progress. After th e first request is replied from the rem ote host, the d ata item is w ritten into the pre-allocated cache line and is also forw arded to all the continuation vectors which have been deferred in th at cache line. A w rite-through with w rite allocate policy is adopted in their design in a w rite miss situation. In the I-Structures, th e deferred requests from all other hosts for the sam e d ata elem ent are queued in the host or m em ory m odule which owns th a t d ata elem ent. In their design, the size of a cache line is a single I-Structure elem ent. Therefore, only the tem poral d ata locality is exploited and the spatial locality is not touched. All other details of the cache design, like cache organization and cache replacem ent algorithm , are not mentioned in th eir design. And also, no simulation or evaluation are perform ed in their work. 1 6 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 2.4.2 IS-Cache Design on the ETS System Kavi et al. [43] proposed a design of cache memories for m ultithreaded dataflow architectures. The design includes an I-Structure cache memory to exploit the data locality of the shared d ata structures in multiprocessor environm ent. Basically, the design of their I-Structure cache (IS-Cache) is a hardware supported cache system using the Explicit Token Store (ETS) model of dataflow systems. The IS-Cache keeps not only the I-Structure elements requested (I-fetch opera tions) by the processor but also the I-Structure elements produced (I-store opera tions) by the processor. A write-back on dem and policy is adopted for the I-store operations. The data item s produced by local host are kept at the local IS-Cache and are w ritten back to th e I-Structure only when there are requests for those data item s or they are replaced from the IS-Cache. As in conventional cache system de sign, a cache line is allocated only when the d ata are brought back from remote host. Therefore, in a read miss situation, the request is forwarded to the I-Structure directly without doing anything on the local IS-Cache. If the requested d ata item has been produced and is available in the I-Structure, the d ata item is sent back to the consumer thread and a copy of the d ata item is also kept at IS-Cache. 
If the requested data item is not yet available (the data element is in Em pty state) in the I-Structure, the request is deferred in the I-Structure and a message is sent to the producer of th at d ata item to indicate th at there is deferred request of that d ata item in the I-Structure. If the data item is already in the producers IS-Cache, th a t d ata item is w ritten back to the I-Structure and the deferred request for that d ata element are fulfilled. Otherwise, a missing table is m aintained in the producer’s IS-Cache to indicate the pending status of the I-Structure elem ents. After the data item is produced and stored in the producer’s IS-Cache, the missing table is checked and the data item is w ritten back to the I-Structure and the deferred request could be fulfilled. 17 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. To im plem ent th e write-back on dem and in the IS-Cache, extra space for the missing table is needed. Also, in order to send the pending status of a d ata item to its producer, an additional directory for the inform ation of th e producers of th e I- Structure elem ents is required. This would further m ake it difficult to im plem ent the dynamic allocation of d ata structures in the I-Structure. It would also be difficult to implement a thread m igration strategy which will change the producers of the d ata item s at run-tim e. Moreover, addition interrogation messages will be introduced to the network when requests for em pty I-Structure elem ents occur. 2.4.3 Scalable I-Structure Cache design Papadopoulos [62] and Cheng [31] independently proposed scalable m ethods to deal with the storage of th e deferred requests in the I-Structure. In the m ultiprocessor system s, m ultiple hosts m ay issue requests for the same d ata item in the I-Structure. If th a t d ata item is not produced yet, all the requests have to be deferred in the I-Structure. As the num ber of pending requests grows, there may be not enough space to store all the pending requests in the I-Structure which owns th at d a ta item . Moreover, when the d ata item is produced and w ritten to the I-Structure finally, all the deferred requests will be served. This m ay cause a hot spot problem on the network. Node j Node i I-Structure Node Node k Figure 2.1: D istributed Deferred Q ueue Storage 18 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Therefore, they proposed distributed m echanism s for the storage of th e pending requests: they are distributed am ong the requesting hosts. As shown in figure 2.1, every requesting host provides one (or more) slot to store each one of its own pending request(s), and all of these requests are linked in a queue. This scheme m ake the growing of the deferred queue quite scalable, since for every requesting host, only one slot is needed for the queue. It also avoids hot spots in the network if there are too m any requests pending in a single d ata location. 2.4.4 A Cache Design for Input Token Synchronizations Roh and N ajjar’s project [67] on the design of storage hierarchy in m ultithreaded architectures was trying to exploit the locality of the frame storage on th e Pebbles m ultithreaded model. The Pebble m ultithreaded model is a non-blocking m ulti threaded m odel which is the sam e as the architecture that we have in m ind. 
However, the locality exploited in their work is that of the frame storage, which is used to store the input tokens of the threads. This reduces the match time of each incoming token. They showed that the execution time becomes linearly proportional to the match time when the match time is greater than 3 cycles. In their simulation, the average match time could be reduced to 1 cycle based on the design of a fully associative cache. In this work, the locality of the global shared data is not touched. We believe that the execution time is dominated by the match time when the match cycle is large, as shown in their work. We think that the execution time with a small match cycle is dominated by the availability of threads. The I-Structure cache exploits the global data locality and hence reduces the average turn-around time of the remote requests. The smaller the remote request turn-around time, the fewer threads are needed to overlap the communication latencies. Therefore, by incorporating the I-Structure cache with their work, the system could be further improved.

2.4.5 Empirical Study of a Dataflow Language on the CM-5

Culler et al. [23] implemented the idea of I-Structure caching in software in the Id90 compiler for their Threaded Abstract Machine (TAM) implemented on the CM-5. The idea of I-Structure caching is similar to our work, but they also cached single I-Structure data elements, as in Dennis and Gao's work introduced in the previous section. In their implementation, the unit of a cache block is a single I-Structure data element. Therefore, only temporal data locality is exploited. With a cache block size of one I-Structure data element, no deferred read sharing problem occurs. This made several aspects of their design, such as cache replacement and deferred read handling, comparatively easier. However, our simulations show that spatial data locality does play an important role in the performance improvement. Moreover, temporal data locality can easily be utilized by the programmer or the compiler without implementing I-Structure caching, as we show in our FFT benchmark.

Chapter 3
I-Structure Software Caches (ISSC)

3.1 I-Structure Cache Design

In one aspect, I-Structure cache design is simpler than the cache design for conventional memory systems: no cache coherence problem is encountered in the I-Structure cache design. This is because of the inherent coherence of I-Structures. Indeed, an I-Structure is a single assignment memory system. In single assignment memory systems, multiple updates of a data element are not permitted. Once a data element is defined in a single assignment memory system, it will never be updated again. The copies of the data elements in the local cache will never be updated. Therefore, cache coherence is already embedded in I-Structure memory systems. It makes the design of an I-Structure cache much simpler, as it does not have to take care of the cache coherence problem. However, in other aspects, the design of the I-Structure cache is not as straightforward as the cache design for conventional memory systems.
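Before turning to these issues, it is useful to fix a concrete picture of what a cached I-Structure element looks like. The following plain-C sketch shows one possible representation of an element with a presence state and a queue of deferred reads. The type and field names (is_element_t, deferred_read_t, and so on) are our own illustration under these assumptions, not code from an actual I-Structure implementation; later sketches in this chapter reuse these definitions.

#include <stdlib.h>

/* Presence state of one I-Structure element (hypothetical names). */
typedef enum { IS_EMPTY, IS_PRESENT, IS_DEFERRED } is_state_t;

/* A deferred read: the continuation vector to which the data item
   must eventually be delivered once it is produced. */
typedef struct deferred_read {
    void                 *cont_vector;  /* where to deliver the value */
    struct deferred_read *next;         /* next pending read, if any  */
} deferred_read_t;

/* One single-assignment element: a value, its presence state, and
   the queue of reads that arrived before the value was written. */
typedef struct {
    is_state_t       state;
    double           value;    /* valid only when state == IS_PRESENT */
    deferred_read_t *waiters;  /* non-empty only when IS_DEFERRED     */
} is_element_t;

/* Queue a read on an element that is not yet present. */
static void enqueue_deferred(is_element_t *e, void *cv) {
    deferred_read_t *d = malloc(sizeof *d);
    d->cont_vector = cv;
    d->next        = e->waiters;
    e->waiters     = d;
    e->state       = IS_DEFERRED;
}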
The complications come from certain characteristics of I-Structures, such as split-phase transactions, the single assignment property, deferred reads, and the presence bits of data elements. Some design concerns and issues therefore arise in the I-Structure cache design.

3.1.1 Deferred Requests Handling

In an I-Structure, a request may be deferred in the I-Structure if the request arrives while the data item has not yet been written into the I-Structure. The deferred request is satisfied after the data item is produced and written back to the I-Structure. The servicing of deferred reads has to be guaranteed; otherwise some threads may wait forever for data elements that have already been produced, which may result in deadlock situations. In an I-Structure memory system without an I-Structure cache, there is no problem at all in providing this guarantee, since the I-store operations write the produced data back to the I-Structure directly, and all the pending requests in the I-Structure are fulfilled as soon as the data are written into the I-Structure. However, adding an I-Structure cache to the system may keep the data of I-store operations in the local cache without writing them back to the I-Structure immediately. If no cache replacement occurs, the produced data might be kept in the local cache and never written back to the I-Structure. If there happen to be pending requests for those data in the I-Structure, then these pending requests will never be satisfied. Therefore, the design of an I-Structure cache has to avoid this situation carefully.

One of the solutions is the write-back on demand policy as used in Kavi's IS-Cache design [43]. The produced data which are kept in the cache by the local host are written back to the I-Structure not only when they are replaced from the cache, but also when there are requests for these data from other hosts. After the data are written back to the I-Structure, the deferred reads can be satisfied. This scheme prevents data from being unnecessarily written back to the I-Structure when no other host requests them. However, a write-through cache design provides a simpler solution to guarantee the service, because the produced data element is written to the I-Structure as soon as it is produced. Once the data element is written to the I-Structure, the deferred reads queued on the data element slot can be satisfied. Both solutions guarantee the servicing of deferred requests. The write-back on demand cache design reduces some unnecessary network traffic, but it is more complex and expensive to implement than the write-through cache design. In section 4, we discuss this issue further and explain the reasons why we chose the write-through policy in our design.

3.1.2 Deferred Queue Storage

Figure 3.1: Centralized Deferred Queue Storage

In a multiprocessor system, there may be several requests pending on a data element before the data is generated. How the system maintains the queue of these deferred requests is also an issue.
The conventional method is the "Centralized" storage method: all of the deferred requests are stored at the owner's place, as shown in figure 3.1. This method is very simple, and since all the deferred requests are kept at the owner of the data element, all of the pending requests can be satisfied as soon as the data is written into its location. However, the number of pending requests depends solely on the application, and as the number of pending requests grows, there may not be enough space to store them all. Therefore, this scheme may not be scalable. Furthermore, even when there is enough space to store all the pending requests, whenever the data is generated, all of the pending requests on this data have to be serviced simultaneously. This may cause a hot spot problem on the network.

The "Distributed" storage method, independently proposed by Papadopoulos [62] and Cheng [31], provides a scalable solution to the unbounded growth of pending requests and also avoids the hot spot problem on the network caused by the servicing of those pending requests. Moreover, since the deferred queue is distributed among the requesting processors, the I-Structure needs to serve only the first pending request, which is stored in the I-Structure data element. After the reply to the first pending request arrives at the requesting host, the next pending request, which came from another host and is stored at that host, can be satisfied. This services the pending requests on different data elements in a pipelined fashion, and therefore increases the throughput of the I-Structure memory operations. However, as in Cheng's design, the storage slots of this distributed deferred queue are provided by the I-Structure cache of each requesting host. The cache lines allocated for the distributed deferred queue may be replaced, and the queue will then be broken; additional effort must be expended to recover the queue once it is broken. Moreover, requests pending at the end of the queue may wait a long time to be served. Ultimately, the likelihood that the pending requests exhaust the available space in the "Centralized" storage method plays an important role in deciding whether or not to use the "Distributed" storage method.

3.1.3 Deferred Read Sharing

In I-Structure memory systems, every data element has a presence bit associated with it to indicate its state (Present, Empty, or Deferred). To exploit spatial data locality, a whole block of data elements should be requested by the cache instead of the requested data item only.

Figure 3.2: Data Block Integration

As shown in figure 3.2, the data elements in the same block may be in different states; some of them may be in the Present state and some of them may be in the Empty state. How the data elements in the different states are integrated into a whole data block needs to be carefully handled in the I-Structure cache design. Deferred read sharing is one of the issues that arises in the integration of data elements in different states into one data block.
Without doubt, the present data should be brought back into the cache, and a deferred request should be stored in the slot of the requested data element if it is in the Empty state. The issue arises when there are other empty data elements within the same data block. Is the deferred read going to be put on every empty data element, or just on the requested data element, with the other empty elements left empty? This is up to the choice of the designer and has different impacts on the cache performance.

3.1.4 Legality of Write Operations

As we discussed before, an I-Structure memory system is a single-assignment memory environment in and of itself. For instance, it must be ensured that write operations are only made to empty locations. If this can be guaranteed by the compiler or the language, then the write operations can be delayed in the local cache until the data is needed by other hosts or the cache block is replaced. However, if the legality of write operations is not ensured by the compiler or language, a write-back cache design may result in non-deterministic behavior. In this case, a write to an already defined element may occur, but the doubly-written data may be kept in the local cache forever while the rest of the system is not aware of this situation. A write-through cache is much safer in an I-Structure cache system if the legality of write operations is not enforced by the compiler or the language.

3.2 The I-Structure Software Cache (ISSC) Runtime System

The I-Structure Software Cache runtime system proposed here takes advantage of the spatial and temporal localities of the global data in I-Structure memory systems, without any hardware support. The runtime system works as an interface between the user applications and the network interface. A block of memory space is reserved by this run-time system as the software cache space. It filters every remote request and reserves a memory space in local memory as a cache of remote data. A remote memory request is sent out to the remote host only if the requested data is not available in the software cache of the local host. Instead of asking for the requested data item only, the whole data block surrounding the requested data item is brought back to the local host and stored in the software cache. Therefore, spatial data locality can also be exploited. There are several features of our ISSC system that I want to discuss first, before giving an example providing an overview of the whole ISSC system.

3.2.1 Write-direct Policy

As described in section 3 regarding the deferred request handling and the legality of write operations in the I-Structure cache design, different write policies have different impacts on the I-Structure cache design. In our I-Structure Software Cache design, we did not adopt the write-back policy, for the following reasons:

• I-Structures are a producer-consumer type of memory system. There may be requests pending on an element before the data item is produced. We want to satisfy those pending requests as soon as the data becomes available. If a write-back policy were adopted in our software cache, some threads might still be waiting for data even though they have been produced.
This may result in a shortage of ready threads in some processors, thereby rendering them idle. Even though write-back on demand may solve this problem, in order to write the data back to the I-Structure as soon as there are requests from other hosts, the producers of data elements have to be known before the data are actually produced, and extra tables need to be checked in write operations and read miss situations. This would increase the overhead of the cache system, and of course, this is not what we want for a software cache runtime system. Knowing the producers of every data element in advance also makes it difficult to dynamically allocate space for the data structures, and makes it difficult to migrate threads at run-time, since migration may change the producers of data during run-time.

• The main reason for using a write-back cache is to prevent the unnecessary memory updates which happen when the data in the local cache are updated again before they are read by other processors. However, in a single assignment memory system, the data in the cache will never be updated. Therefore, the write-back cache design does not have an advantage over a write-through cache design in a single assignment memory system.

• We want to ensure that write operations are made only into empty locations as soon as the write operation has been issued. If the write operation is cached in the local software cache and written back to the remote host only when the cache block is replaced or when there are requests from other hosts, the write operation might attempt to modify an element which is already in the Present state, due to some error. This might result in the local host using this illegal data from the local cache.

For these reasons, a write-direct or write-through policy could be adopted in our I-Structure Software Cache system. Therefore, data are written to the I-Structures as soon as the write operations are issued. This simply guarantees the servicing of deferred reads after the requested data elements are produced. However, we use the write-direct policy rather than write-through to prevent the node that issued the write from replying to a read request before the legality of the write is verified. In other words, there is no caching for write operations. This simple write-direct policy ensures writing to empty locations only, satisfies deferred reads as soon as possible, and avoids deadlock situations.

3.2.2 Set-Associative Cache Allocation

Figure 3.3: Structure of I-Structure Software Caches

Cache search schemes play a very important role in cache performance. In hardware design, a fully associative cache has the highest performance because of its parallel search and the full utilization of the cache space, but it is very expensive to implement. In a software implementation, a parallel search is obviously impossible inside a single processor. The direct-mapping scheme has the fastest search time in a software implementation; however, it has the worst cache utilization.
In order to have higher search performance and better cache utilization, a set-associative search mechanism is adopted in the ISSC. A requested data address is mapped to a set of cache blocks by a hash function. If the address matches the tag of one of the cache blocks in the set, then we have a cache hit. Otherwise, it is a cache miss, and a cache line is allocated for this request as described under the cache advance feature. Figure 3.3 shows the structure of the ISSC. Each cache line has a deferred bit, a reference bit and a tag field which indicates the address of the first element in the block, and it contains a block of cache elements. Cache lines are allocated in pre-reserved consecutive memory blocks to store the data of the cache blocks, so that they can be directly accessed by index addressing. Each set has a victim pointer which is used in cache replacement.

3.2.3 Cache Advance

In conventional cache designs, the cache space is allocated when the data block is brought back to the local host. In the ISSC, however, the cache space is allocated in advance, when a read miss is detected. This is what we call the "Cache Advance." Indeed, due to the long latency and unpredictable characteristics of the network in a distributed memory system, a second remote access to the data elements in the same data block may be issued while the first request is still traveling through the network. With conventional cache allocation methods, multiple outstanding memory requests for the same data block from the same host are possible. Using our approach, the second and later requests are deferred in the pre-allocated cache space while waiting for the data block to come back.

Figure 3.4: Cache Advance Allocation

In the example shown in figure 3.4, a cache block size of 4 data elements is assumed. As in the I-Structure memory, each data element in the cache is also associated with a presence bit, so that each element can be distinguished as being in one of the states Present, Empty, and Deferred. This presence bit provides a second-level data synchronization point for the data, so that the fully asynchronous memory operation feature of the I-Structure memory can be maintained. To exploit spatial data locality, a whole cache block is allocated on a read miss instead of a single data element, and all data elements in this cache block are initialized to the Empty state. In this example, a read request "R" asking for the data "A(5)" is made and misses in the software cache. A cache block "CL" is allocated for this missing read before the request is sent to the remote host. Instead of sending the original request "R" to the remote host, a new request "N" asking for the data block beginning with "A(4)", along with the new continuation vector "CL", is sent to the remote host, and the original request "R" is deferred in the second element of the pre-allocated cache block "CL". Therefore, the following requests asking for A(4), A(5), A(6), and A(7) will hit the cache and will be deferred in CL while the request "N" is in progress. This allows duplicate remote memory requests to be eliminated and therefore ultimately improves overall network performance.
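To make the read path concrete, here is a plain-C sketch combining the set-associative search of section 3.2.2 with the cache advance allocation just described. It reuses the is_element_t type and the enqueue_deferred helper from the sketch in section 3.1. The structure layout, the hash function, and the extern hooks (deliver_to_cv, forward_request, send_block_request, choose_victim) are hypothetical names of ours, not the actual ISSC interface.

#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4    /* elements per cache block, as in figure 3.4 */
#define NUM_SETS   256
#define SET_ASSOC  4

typedef struct {                 /* one cache line (figure 3.3)       */
    int          valid;
    uint64_t     tag;            /* address of first element in block */
    int          deferred_bit;   /* set while any element is deferred */
    int          reference_bit;  /* used by the replacement policy    */
    is_element_t elem[BLOCK_SIZE];
} cache_line_t;

typedef struct {
    cache_line_t line[SET_ASSOC];
    int          victim;         /* per-set victim pointer            */
} cache_set_t;

static cache_set_t issc[NUM_SETS];

/* Hooks assumed to exist elsewhere in the runtime (hypothetical). */
extern void deliver_to_cv(void *cont_vector, double value);
extern void forward_request(uint64_t addr, void *cont_vector);  /* bypass */
extern void send_block_request(uint64_t block_addr, cache_line_t *cl);
extern cache_line_t *choose_victim(cache_set_t *set);  /* section 3.2.7 */

/* Map a global element address to a set with a simple hash. */
static unsigned set_index(uint64_t addr) {
    return (unsigned)((addr / BLOCK_SIZE) % NUM_SETS);
}

void issc_read(uint64_t addr, void *cont_vector) {
    uint64_t     blk = addr - (addr % BLOCK_SIZE);
    cache_set_t *set = &issc[set_index(addr)];

    for (int w = 0; w < SET_ASSOC; w++) {        /* set-associative search */
        cache_line_t *l = &set->line[w];
        if (l->valid && l->tag == blk) {
            l->reference_bit = 1;
            is_element_t *e = &l->elem[addr % BLOCK_SIZE];
            if (e->state == IS_PRESENT)
                deliver_to_cv(cont_vector, e->value); /* hit, data present */
            else
                enqueue_deferred(e, cont_vector);  /* hit, block in flight */
            return;
        }
    }

    /* Miss: "cache advance" allocates the line before the block
       request is sent, so later reads to this block defer locally. */
    cache_line_t *l = choose_victim(set);
    if (l == NULL) {                         /* all lines irreplaceable: */
        forward_request(addr, cont_vector);  /* bypass the ISSC          */
        return;
    }
    l->valid = 1; l->tag = blk;
    l->reference_bit = 0; l->deferred_bit = 1;
    for (int i = 0; i < BLOCK_SIZE; i++) {
        l->elem[i].state   = IS_EMPTY;
        l->elem[i].waiters = NULL;
    }
    enqueue_deferred(&l->elem[addr % BLOCK_SIZE], cont_vector);
    send_block_request(blk, l);              /* new request "N", CV = CL */
}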
3.2.4 Deferred Read Sharing

As described earlier, there are two ways to deal with the deferred read sharing problem. One way is to append the request to all the data locations which are empty. The other is to defer the request on the requested data location only and leave the other empty locations in the block still empty.

Figure 3.5: Deferred Read Sharing

In our ISSC, a deferred read is shared by all the empty data elements located in the same data block. In the example shown in figure 3.5, a request "N" asking for the data block beginning with A(4) arrives at the I-Structure. Among the four data elements in this block, two of them, A(4) and A(6), are in the Present state, one, A(5), is in the Empty state, and the other one, A(7), is in the Deferred state with a deferred request "Q". This request "N" is deferred not only on A(5), which was originally requested, but also on A(7). The valid data of A(4) and A(6) are sent back to "CL" in the software cache of the requesting node. In the requesting node, read requests which hit the cache but find that the data elements are in the Empty or Deferred state are simply deferred in the local software cache, without sending the requests to the I-Structure. Since the deferred read has been shared by all the empty data elements of a data block in the I-Structure, once the data elements are filled with valid data, they are sent back to the local software cache and the requests deferred in the local cache can be satisfied.

For applications with good spatial locality, placing a deferred read indication on all empty elements within the data block yields better performance than just putting the deferred read on the requested data element, since all the data within a data block then require only one request. By appending the request to all of the empty locations, the data will be sent to the caches after they are produced, without making another request. However, for applications with poor spatial data locality, the deferred read of a whole block may introduce more network traffic, because it may send to the caches data which may never be needed by the local host. This is due to the fact that the data has not actually been requested but just happens to reside in the same data block alongside other requested data. We believe that for most numerical applications, there is plenty of structural parallelism with spatial locality. Therefore, deferred read sharing is implemented in our current ISSC runtime system. From our simulation results, we will demonstrate that spatial data locality dominates the data locality in the matrix multiplication benchmark.

3.2.5 "Centralized" Deferred Requests and Distributed Deferred Reads

A simple "centralized" method is used for the implementation of the queues of deferred requests. Since the ISSC is a software runtime system, the space to store the pending requests can be dynamically allocated if needed. There is therefore no scalability problem in our design. It should be noted that the implementation of this runtime system should be as thin as possible in order to reduce its overhead. However, to implement the distributed method, extra messages would be introduced into the network to link the requests together. A link recovery scheme is also needed for the distributed method when the link is broken. All of these would introduce
more overhead to the runtime system, which is of course undesirable. Therefore, the centralized deferred read method is used in the ISSC. Indeed, with the "cache advance" and "deferred read sharing" features of the ISSC, the length of the queue of deferred requests for each element in the I-Structure is bounded by the number of nodes in the system. This is because at most one request is sent from each node to the host node. Subsequent deferred reads are kept locally in the node. (In the rare situation in which all the lines in the ISSC are irreplaceable, reads bypass the ISSC and are sent directly to the host node.) However, the potential hot spot problem of the "centralized" deferred read method has to be considered further in the future.

3.2.6 Virtual Addressing

Even though we recognize that the single assignment rule of I-Structures simplifies the cache coherence problem, some cache coherence problems still occur when the I-Structure memory space is de-allocated and re-utilized. To totally avoid the cache coherence problem, a logical address, such as the data structure ID, must be used. It is the job of the compiler to make sure that no two data structures have identical IDs.

3.2.7 Cache Replacement Policy

Because of the single assignment feature, the intermediate data structures, which are storage neither for input data nor for final output data, are deallocated sooner during the computation than the data structures in multiply-updatable memory systems. These intermediate data structures have a short lifetime in the cache. Therefore, page faults in the I-Structure cache occur more frequently than in conventional memory caches. This means that cache replacement is very important in the I-Structure cache design.

A simple pseudo-LRU policy is adopted as the replacement policy in our implementation of the ISSC. A cache block that has any element in the deferred read state is irreplaceable, and its deferred bit is set. A single reference bit is attached to each cache block, as shown in figure 3.3. The reference bit is set whenever there is a read to the block. A victim pointer is used to select a block to be replaced. When a cache replacement is needed, the block pointed to by the victim pointer is tested to verify that it is replaceable (by checking the deferred bit) and that its reference bit is zero. If either condition is not satisfied, the reference bit of the block is reset and the victim pointer is advanced. When a replaceable block with a zero reference bit is found, it is replaced and the victim pointer is advanced to the next block. The victim pointer points to the first block after cache initialization. However, if all the blocks in the set are irreplaceable, reads bypass the ISSC and are sent directly to the host, as if there were no I-Structure cache. A sketch of this victim-selection loop is given below.
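The following plain-C sketch shows one way to implement this victim selection, operating on the cache_set_t structure assumed in the sketch of section 3.2.3. As before, it is an illustration under those assumed definitions rather than the actual ISSC source.

#include <stddef.h>

/* Select a replaceable line in a set using the reference bit and the
   per-set victim pointer (pseudo-LRU).  Returns NULL when every line
   in the set is irreplaceable, in which case the caller lets the
   read bypass the ISSC and go straight to the host node. */
cache_line_t *choose_victim(cache_set_t *set) {
    /* Two sweeps suffice: the first sweep clears reference bits, so
       the second sweep must find a victim unless every line has its
       deferred bit set. */
    for (int i = 0; i < 2 * SET_ASSOC; i++) {
        cache_line_t *l = &set->line[set->victim];
        set->victim = (set->victim + 1) % SET_ASSOC;   /* always advance */
        if (!l->deferred_bit && l->reference_bit == 0)
            return l;              /* replaceable and not recently read */
        l->reference_bit = 0;      /* give the line a second chance     */
    }
    return NULL;                   /* all lines hold deferred reads     */
}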
Figure 3.6: The overview of the I-Structure Software Cache runtime system

3.2.8 ISSC System Overview

An overview of the operation of the ISSC is shown in figure 3.6. In this example an application makes a request R for the element A(5) of the I-structure A located in the remote host N. Since a split-phase transaction is used, the request R must include the destination host N, the address of the requested data A(5), and the continuation vector CV of the requested data. Without the ISSC, the request R would be sent directly to the remote host N through the network. However, the ISSC intercepts the request R before it is sent to the network. In this example, A(5) was in the invalid state. Instead of sending the original request R, a new request S asking for a data block which includes the requested data A(5) is generated. Before sending the new request, a cache line space CL in the software cache is reserved for the newly requested block. In our example, the location of the requested data A(5) is in the second slot of the cache line, which begins with A(4). The original request R is stored in a dynamically allocated queue. A pointer to the head of the queue is stored in the cache location of A(5), and the state of this location is marked as deferred read, "dr." All other elements in this block are marked as deferred request, "dq." Meanwhile, the new request S, which contains the destination host number N, the beginning address of the data block A(4), and the reserved cache line location CL for this request, travels to the remote host N through the network.

In our example, when the block request S is received by the remote host N, it finds two valid data elements, A(4) and A(6), two empty data elements, A(5) and A(7), and one deferred read Q pending on A(5) and A(7) in this data block. The ISSC in host N then reads the valid data elements, A(4) and A(6), and defers the request S on A(5) and A(7). The two valid data elements A(4) and A(6) are sent back to the requesting host. When the local host receives the elements A(4) and A(6), the ISSC fills the corresponding slots. If there are any pending requests on those data elements, the ISSC satisfies them by sending the requested data to the CVs specified in the pending requests. When the data element A(5) is produced and written to its location in the remote host N, the deferred read S in the remote host N is serviced and a data packet carrying A(5) is sent back to CL in the original requester. Upon receiving the data packet containing A(5), the ISSC of the original requester places the data element A(5) into its slot at CL and satisfies the request R by sending the data to the CV specified in R.

3.3 Simulation Results

We have performed simulation experiments to validate our ISSC scheme.

3.3.1 The Simulator

Our simulator for the ISSC is built on top of the Generic MultiThreaded machine (GMT) simulator [74] developed at the University of Southern California. The GMT simulator provides a generic platform of a non-blocking multithreaded machine, parameterizing various architectural details.
The global heap memory is an I-Structure-like system for shared global data storage. There are two instructions for global structure access, AREAD (array read) and AWRITE (array write). The AREAD instructions to remote hosts are cached by the ISSC runtime system. In order to test the effects of different cache block sizes, the cache block size is configurable in the simulator. However, the cache size remains the same across the different cache block size configurations. This means that when the cache block size increases, the total number of available cache lines decreases.

3.3.2 Simulation Results

The goal of the simulation is to demonstrate the impact of our ISSC on distributed memory multiprocessor system environments which use split-phase memory transactions to tolerate the communication latencies. We want to demonstrate the kind of data locality which can be exploited by the ISSC, and what kind of impact the ISSC has on the network traffic. Furthermore, we want to verify the effect of the ISSC on the system performance. Therefore, four benchmark programs with different characteristics were tested in the simulation. One is a matrix multiplication with a matrix size of 32x32 double precision floating-point numbers, and another is the kernel function of the conjugate gradient method for solving 256 linear equations with 256 unknown variables. The other two benchmarks were chosen from the SPLASH-2 kernels: 1-D FFT with 512 complex data points, and LU-Decomposition of a 32x32 matrix. These four benchmarks have different categories of data reference locality. The matrix multiplication benchmark has excellent temporal locality of data references, while in the other three benchmarks spatial data locality dominates the locality of data references. This is because in matrix multiplication the two input matrices are constantly referenced during the whole computation, whereas in the other benchmarks intermediate vectors or matrices are generated and are not referenced again after the computation passes by.

In our simulations, we wanted to test how the ISSC performs with varying cache block sizes for different system sizes. We first simulate the ideal case by eliminating the performance degradation caused by a small cache size. Therefore, in these simulations, each PE is configured with 24K words of software cache, which is large enough for the problem sizes we are testing. The communication latency between two PEs is set by the parameter "COM", which is the mean communication delay between two PEs. In this part, we chose a reasonable communication latency by setting "COM" to 2.0. With hardware configurations of 2, 4, 8, 16, 32, and 64 PEs, and different cache block sizes, 0 (no cache), 1, 2, 4, 8, and 16 words, the simulation results are shown in the following figures.

3.3.2.1 The data locality

Figure 3.7(a) shows the cache hit ratio of the remote requests of matrix multiplication for the various hardware configurations. When no cache is configured (block size CB=0), every remote request is sent to the remote host, so the hit ratios are obviously always 0%. For a cache block size of 1 (CB=1), only the temporal locality is exploited.
For a small number of processors, like 2 PEs, the temporal locality dominates the whole data locality (this can be seen by comparing the cache hit ratios for cache block sizes 1, 2, 4, 8, and 16 in figure 3.7(a)). However, in most MPP systems, there are tens, hundreds or even thousands of processors in a system. With the same problem size, the remote request hit ratio decreases linearly as the number of processors increases. This means that the spatial data locality becomes dominant. This is because the data are also distributed among the processors and, therefore, the number of remote requests increases. This shows that it is not enough to rely only on the exploitation of temporal data locality in MPP systems. Also, from the figure, we can see that the degradation of the hit ratios becomes faster for smaller cache block sizes as the system scales up. With the help of our ISSC runtime system, a remote request hit ratio of 90% can be achieved with a cache block size of 8 in a 64-PE configuration, and the ISSC also reduces the gap in hit ratios between different numbers of processors.

Figure 3.7(b) shows the cache hit ratio of the remote requests of the conjugate gradient benchmark. In the conjugate gradient method, most of the computation consists in updating array elements. With a good data partition scheme, the updates of array elements can be done locally. Therefore, when the number of processors increases, the number of remote requests does not increase much and the remote request hit ratio decreases much more slowly compared to matrix multiplication in figure 3.7(a). This benchmark does not have as much temporal data locality of references as the matrix multiplication does. Therefore, the hit ratios are less than 50% when only temporal data locality is exploited. However, an increased cache block size takes advantage of the increased spatial data locality and improves the overall hit ratio.

Figure 3.7: The Hit Ratio of Remote Requests: (a) Matrix Multiplication, (b) Conjugate Gradient, (c) 1-D FFT and (d) LU-Decomposition

Indeed, with a cache block size of 16, the hit ratio increases from 47% (with CB=1) to 97% in the 8-PE configuration.

Figure 3.7(c) shows the cache hit ratio of the 1-D FFT benchmark. It shows that the hit ratio is still 0% when the cache block size equals 1. This is because each data element is actually referenced twice during the entire computation, and in our implementation the data element is stored as a local variable for the next reference after it is accessed from the remote host. This is a typical example of the fact that temporal data locality can be utilized by the programmer or by compiler optimization techniques. It is interesting to note that the cache hit ratio remains the same as the number of processors scales up. This is unlike the other benchmarks, where the cache hit ratio decreases as the number of processors increases. It is due to the way we distributed the data among processors and to the memory access patterns of the 1-D FFT algorithm.

Finally, figure 3.7(d) shows the cache hit ratio of the LU-Decomposition benchmark. It is similar to the conjugate gradient benchmark. However, the cache hit ratios are almost the same for 32 and 64 PEs. This is because the problem size we tested, 32x32, is small relative to the system size. In summary, we can see that exploitation of spatial data locality is really necessary, especially at larger system sizes. From our simulation results, hit ratios of over 90% are achieved in all benchmarks with a cache block size of 16 words in all system configurations.

3.3.2.2 The network traffic

Agarwal [1] showed that the performance of multithreaded processors is traded off against network contention. In the non-blocking multithreaded execution model, the situation is even worse, because a finer granularity is being exploited and more communication is necessary between processors.
It is interesting to note th a t the cache hit ratio rem ained th e sam e while th e num ber of processors scaled up. This is unlike other benchm arks when the cache hit ratio decreased as the num ber of processors increased. T his is due to the way we distributed data among processors and the memory access p attern s of 1-D F F T algorithm . Finally, figure 3.7(d) shows the cache hit ratio of the LU-Decomposition bench m ark. It is sim ilar to the Conjugate G radient benchm ark. However, the cache hit ratios are alm ost the sam e for 32 and 64 PEs. This is because th a t the problem size we tested, 32x32, is sm all relative to the system size. In sum m ary, we could see th a t exploitation of spatial d ata locality is really nec essary especially in larger system size. From our simulation results, over 90% hit ratios are achieved in all benchm arks w ith the cache block size of 16 words in all system configurations. 3.3 .2 .2 T h e n etw o rk traffic Agarwal [1] showed th a t the perform ance of m ultithreaded processors is traded off against netw ork contention. In th e non-blocking m ultithreaded execution m odel, the situation is even worse, because a finer granularity is being exploited and m ore com m unication is necessary between processors. 40 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. itoom 3 s s m 0 2 PEP s a n e ■ 4PEP m 1 29QOI nSPEP 'yrmwt a 16 PEP . M s 2 190001 0 3 2 PEP lo o m 06 4 PEP so n 0 c i - o c t - t C I-2 C I-1 C l-S CI-16 (a) Matrix Multiplication o 2 PEP 0 iPEp a SPEX a 16 PEP a 32 PEP a 64PEP C I-0 C l - t C I-2 C l- i C I-8 CI-16 (b) Conjugate Gradient m 2w a C I-0 C l- t C I-2 C I-4 C l-S CI-1S (c) 1-D FFT 0 2 P i * I i PEP OS PEP 0 1 6 PEP 0 3 2 PEP 0 6 4 PEP C I-0 C I-1 d - 2 C I-l C l-S CI-16 (d) LU-Decomposition a 2 PEP ■ 4 PEP OS PEP 0 1 6 PEP 0 3 2 PEP O S i PEP Figure 3.8: The N um ber of Network Packets: (a) M atrix M ultiplication, (b) Conju gate Gradient, (c) 1-D F F T and (d) LU-Decomposition 4 1 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. In figure 3.8(a), we show the num ber of network packets with th e sam e prob lem size as the m atrix m ultiplication benchm ark program and the sam e hardware configurations as before. The num ber of network packets is counted at the network interfaces of each host. It is the total num ber of packets issued to the network by all the hosts. W hen no cache is configured, the total number of network pack ets increases while the num ber of processors increases. Increasing the num ber of processors m eans th at we are trying to distribute the data and tasks am ong more processors in order to improve the system performance. This result m atches the con clusion shown in Agarwal’s analysis. However, by exploiting both th e tem poral and spatial global d ata locality, the num ber of network packets decreases dram atically. O ur sim ulation results also show th a t the num ber of network packets for the 64 PEs system decreases from 130,032 w ithout the ISSC runtim e system to 9072 with the ISSC runtim e system and a cache block size of 16. More than 90% of the network traffic is reduced by the ISSC runtim e system . Figure 3.8(b) shows the num ber of network packets in the conjugate gradient benchm ark. 
Because of the fine grain parallelism of this benchm ark, the ratio of the num ber of thread activation packets to the total number of network packets is larger than in the m atrix m ultiplication benchm ark. Therefore, the effect of network packet reduction is not as significant as in m atrix m ultiplication. However, 70% of the network traffic is still reduced by the ISSC runtim e system for the 64 PEs system with a cache block size of 16. Figure 3.8(c) and (d) show the num ber of network packets in the 1-D F F T and LU-Decomposition benchmarks respectively. One interesting observation is th at the num ber of network packets increases slightly in a 64 PEs system when the cache block size increases from 8 to 16 in both benchm arks. Increasing the cache block would only fetch into more d ata which will not be referenced. This will not harm the hit ratio th at the system could achieve. However, because of the deferred read 42 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. sharing, those un-referenced data are still sent back to th e cache after th ey are produced. This is w hat will increase th e network traffic. 3 .3 .2 .3 T h e s y ste m p erform ance 30.0 30.0 25.0 Q .20.0 a . 20.0 T3 IS 15.0 c Jd 10.0 ro io.o 5.0 20.0 40.0 60.0 Number of Processors 20.0 40.0 60.0 Number of Processors 80.0 80.0 (a) Matrix Multiplication (b) Conjugate Gradient * --* C B = 0 O DCBsl r c b = 2 ^•"0 C B = 4 CB=8 '~'CB=16 ______ (deal 6.0 30.0 25.0 5.0 O.20.0 ID 4.0 If? 15.0 q . 3.0 C O 0 0 10.0 2.0 5.0 20.0 40.0 60.0 Number of Processors 20.0 40.0 60.0 Number of Processors 80.0 80.0 (c) 1-D FFT (d) LU-Decomposition Figure 3.9: Speed up measurements: (a) M atrix M ultiplication, (b) Conjugate Gra dient, (c) 1-D F F T and (d) LU-Decomposition Figure 3.9 shows the speed up m easurem ents of our benchm arks. The speed up is m easured by th e execution time in different configurations related to the execution tim e in a single processor system w ithout ISSC enabled. From our sim ulations, we could observed th a t our ISSC im proved th e system perform ance by a factor of 75% up to 95%. T he utilization of d ata locality in the non-blocking m ultithreaded 43 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. execution shortens the m ean tim e between two th read activations, and hence reduces the system idle tim e. Therefore, the total execution tim e was reduced by our ISSC. In figure 3.9, we could see th a t our ISSC could achieve optim al perform ance at cache block size of 8 words. Even though increasing th e cache block size to 16 did yield a b etter cache hit ratio th an on cache block size of 8, as we could see in figure 3.7, the im provem ent in system performance, however, is not th a t much. Indeed, in the LU-Decomposition benchm ark, w ith a cache block size of 16 the system perform ance even degrades a little bit com pared to a cache block size of 8, as shown in figure 3.9(d). This is because th a t, as shown in figure 3.8(d), th e network traffic increases when th e cache block size increases from 8 to 16, and therefore the system incurs more overhead by handling those extra data requests which may not be referenced eventually. 3.3.3 The effect of cache advance (a) Matrix Multiplication54x64), 16 PEs, C*cha_Stzs«1B384 (b) Conjugal* Graderrt(256). 16 PEs. 
Cacha_Stze«16384 Communication Latency (COM) Communication Latency (COM) Figure 3.10: T he Effect of Cache Advance: (a) M atrix M ultiplication and (b) Con jugate G radient The cache advance feature in our design is a very unique feature in th e I-S tructure cache design. By allocating a cache block for a read miss before sending out th e request to th e rem ote host, th e following requests for th e data located in the sam e 44 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. block could be deferred in this pre-allocated cache block. To verify how this scheme affect, the cache perform ance, we varied the communication latency by setting the “CO M ” param eter to different values (1.0, 2.0, 4.0, 8.0, 10.0, 12.0, and 16.0) in our sim ulator with the cache advance respectively enabled and disabled. The same benchm arks were sim ulated with variable cache block sizes in th e system with 16 PEs and 16K words caches. The results are shown in figure 3.10. Again the hit ratios are plotted in a 3-D form at, so th a t we could easily see how the hit ratios change with different configurations. T he results w ith cache advance enabled and disabled are plotted in the same figure, so that we could easily compare the effect of cache advance. In figure 3.10 (a) and (b), the upper surfaces are the hit ratios w ith cache advance enabled and the lower surfaces are th e hit ratios without cache advance. We can see th a t th e cache h it ratios are not affected by the variation of com m unication latencies for a fixed cache block size w ith th e cache advance turned on. However, w ithout the cache advance, the cache hit ratio decreases while the com m unication latency becomes higher and higher. 3.3.4 Cache Replacement In a real situation, th e cache will not be sufficiently large to hold all the d ata referenced by a local host. Therefore, the cache replacem ent scheme plays a very im portant role in the cache design. In our design, a m ultiple-queue LRU algorithm is used. Cache lines are linked in different queues according to how many elem ents are in the em pty state in th e cache line. When a cache replacem ent occurs, the least recently used(LRU) cache line in the queue which keeps the cache lines with th e m ost em pty elem ents will be chosen as the victim. However, a cache line with a deferred read pending on any one of its elements will never be replaced to prevent deadlocks. Therefore, if all th e cache lines have at least one deferred request inside, 45 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. T 100 - 9 0 360000 - 90 300000 ■ -70 o 250000 -•40 ■5 150000 - 3 0 100000 90000 -■ 20 -■ 10 128 160 192 224 256 268 320 352 364 416 448 512 1024 2048 4096 60 i 3 J & O s Cache Size Figure 3.11: Cache Replacem ent and Hit Ratio in MM Benchm ark with Varying Cache size. = 3 Repl aoem ent ■ 4 — Hit Ratio T 100 120000 T -■90 ■ g 100000 - 8 0 -■70 - 60 o - ■ 50 60000 -■ o w 40000 - ■ - 3 0 J Q - 20 20000 -■ 10 - ^ -t-— -t-— -t-— h — -t-— h-— t-— -+ — - — i — i n t r u - n |------1 ---- 128 160 192 224 256 263 320 352 384 416 448 512 1024 2048 4096 8192 Cache Size 7 0 E o Figure 3.12: Cache Replacem ent and Hit Ratio in CG Benchm ark with Varying Cache size. 4 6 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. th e missed read will be directly forwarded to the rem ote host and no cache line is replaced. 
In this series of sim ulations, we have fixed the system size at 16 PEs , th e cache block size at 8 and set the com m unication latency param eter COM to 2.0. We varied the cache size from 128 words to 8192 words and recorded the num ber of cache blocks being replaced and the cache hit ratio for each configuration. The sam e benchmarks tested in the ideal case are sim ulated in this part with th e same problem sizes. Figures 3.11 and 3.12 show the sim ulation results. The bar charts are the numbers of cache blocks being replaced and the lines are the cache hit ratios in different cache sizes. T he hit ratios are very small when there are only small cache sizes available. W ith lim ited cache sizes, 70% and 80% of hit ratios achieved in m atrix m ultiplication and conjugate gradient respectively. In figure 3.11, the hit ratio jum ps from 27% to 72% when the cache size increases from 160 words to 192 words, and in figure 3.12, the hit ratio jum ps from 19% to 81% when th e cache size increases from 224 words to 256 words. Increasing the cache size a little, the d ata locality is fully exploited while there are still thousands of cache blocks being replaced. This shows th a t our ISSC still performs reasonably well with lim ited cache space by using m ultiple-queue LRU replacem ent scheme. The results in figure 3.7, 3.8, 3.11, and 3.12 show th at the ISSC not only helps the system by exploiting th e d ata locality for split-phase type rem ote mem ory accesses in different type of applications, but th a t it also reduces the num ber of network packets in the network. In figure 3.10, we show the effect of the cache advance scheme on the system in the aspect of rem ote request hit ratios. By applying the cache advance scheme, we provides an adaptive cache system which will not be affected by the varying of com m unication latency. This is really useful in th e M PP system s whose communication latency is usually long and unpredictable. How these advantages of ISSC would effect the overall system perform ance should be fu rth er exam ined and sim ulated. 47 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 3.4 Summary In this chapter, we proposed a split-phase transaction caching scheme for the I- Structure-like m em ory system s. We discussed several issues of I-S tructure cache design and described our design approaches. W e also described the details of our ISSC im plem entation. We validated our design by Generic M ultiT hreaded machine (G M T) sim ulator with several benchm arks. From the simulations, we have dem onstrated the impact of our ISSC ru n tim e system on the split-phased transaction m em ory accessing in the non-blocking m ultithreaded execution model. W ith a cache block size of 16, a hit ratio of 90% could be easily achieved in all benchm ark programs. T he num ber of network packages also decreases a lot com paring to the original qu an tity without ISSC. W ith all these effects, our ISSC increased th e system utilization and improves the overall system perform ance up to 95%. The cache advance scheme in our ISSC also provides th e adaptability to the unpredictable com m unication characteristics in DSM system s. This makes our ISSC achieve th e sam e perform ance w ithout being affected by th e variation of th e communication latency. 
A lthough som e of the sim ulation results are prelim inary and need to be conducted with a wider array of benchm arks, we are encouraged by the dram atical reduction in network traffic, by the evidence of global d ata locality exploited by our ISSC and by the im pact of our ISSC on the overall system perform ance. We continues our studies by expanding the benchm arks to a variety of appli cations. In th e m eantim e, th e overhead of this software cache has to be further evaluated. However, as the speed of the processors increases dram atically, the gap between com putation speed and the network overhead becomes larger and larger. The idea of this software cache becomes more prom ising. We looks for an appro- preate platform to im plem ent our ISSC and find the EARTH [37] as our target for the im plem entation. In the next chapter, we describe our im plem entation of ISSC 48 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. on the EARTH m achines and in chapter 5 we show the performance m easurem ent of our ISSC on EARTH-M ANNA machines. 4 9 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. C h a p ter 4 ISSC implementation on EARTH systems 4.1 EARTH Architecture The EARTH, Efficient A rchitecture for .Running T hreads, project [37, 69] lead by Prof. G uang Gao originally from McGill University, C anada in th e Fall of 1993 and now continued at the University of Delaware is a fine-grain non-blocking mul tithreaded execution model for the efficient im plem entation of m ultithreading on off-the-shelf m icroprocessors with minimal additional hardware support for m ulti threading. 4.1.1 Pine Grain Multi-Threading Modern m ulti-threaded systems can be classified into two broad classes according to the granularity of the threads th at they can efficiently support while yielding good performance: coarse grain m ulti-threading and fine grain m ulti-threading. Typi cally in a coarse grain m ulti-threading system (1) th e thread switching mechanism involves interactions with the operating system; and (2) there is a lim ited number of light-weighted processes to which threads m ust be bound. In a coarse grain m ulti threading system , a thread can be viewed as a refinem ent of an operating system process. In contrast, in a fine grain m ulti-threading system : (1) the unit of compu tation is a collection of instructions grouped in a code block; (2) th e system does 50 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. not impose lim its on the num ber of threads th at can be active at the same tim e; (3) the system does not require binding to any sort of lim ited resources; 1 and (4) the thread switching m echanism is quite efficient and does not involve the operating system , it typically requires th at only a small amount of sta te inform ation be saved in each switching. In a fine grain m ulti-threading system a th read can be viewed as the coarsening of an instruction. T he fine grain m ulti-threading system studied here, EA RTH , is derived from the data-flow model of com putation. In th e classical strict data-flow model an instruc tion is enabled for execution when all its operands are available [30, 38, 29, 69]. To enforce this enabling condition, the instructions th at produce operands must be able to send a synchronization signal to all th e instructions th at will consume their results. 
This model proved unwieldy for the im plem entation of m achines based on current standard off-the-shelf hardw are and com piler technology. In EARTH, the unit of com putation is not an instruction, but a code-block form ed by m any instructions. An instantiation of the code-block running on a processing node is called a fiber, and m ultiple code-blocks are grouped into threaded functions. A successful program w ritten in Threaded-C [70], the program m ing language for EA RTH , will produce enough fibers to m aintain the local processor busy while rem ote com putations and d ata fetching operations are performed. Figure 4.1 shows the EARTH m odel and it assumes th a t each processing node has an Execution U nit (EU) that executes the fibers and a Synchronization Unit (SU) that is responsible for: (1) the em ulation of a global address space; (2) the communications through the network; (3) the inter-fiber synchronization; and (4) the im plem entation of a load balancing mechanism. W hen the m odel is implemented 1The only lim itations on the number of active threads in a fine grain multi-threading system are caused by the memory space available to store active thread descriptors. If the data structure that holds these descriptors is stored in virtual memory, a very large number of active threads can indeed be supported. 51 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. SU SU EU EU Local Memory Local Memory Network Figure 4.1: T he EARTH Model on processors w ith a single processor per processing unit, the functions of the SU are em ulated in software by a R unT im e System (RTS). 4.1.2 Split Phase Communication and Synchronization A cornerstone of the EARTH model is th e m echanism th at enables the superposition of local com putation and rem ote operations: the split-phase transaction. W henever an operation involves a long an d /o r unpredictable latency, the statem ent th at re quests th at th e operation be perform ed is issued in one fiber and the statem ent that depends on the result of the operation is issued on a different fiber. A dorm ant fiber receives synchronization signals from other fibers — executing either in the same processor or on a rem ote processor — through a synchronization slot. A typical split phase operation, an EARTH block-move-sync operation, is illus trate d in Figure 4.2. In Figure 4.2(a): (1) a fiber running on the execution unit of processor Pi issues a request th at a block of d ata be copied from the m em ory of a processor Pj to its local memory. T he requesting fiber m ay continue perform ing op erations th at do not depend on the arrival of the requested block, but will eventually term inate and allow the EU of processor Pi to run other enabled threads. The block move request m u st specify the source and destination addresses for th e movement 52 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 
INTER-NODE NETWORK INTER-NODE NETWORK MEMORY M is Figure 4.2: (a) (1) An active fiber in the EU of P,- requests an EARTH split-phase block-move-sync operation; (2) The SU of P,- decodes the source address to the m em ory of Pj and sends a request for the block; (3) The SU of Pj receives the request and reads the block from the local memory, (b) (4) The SU of Pj sends the block over the network to the SU of Pt - ; (5) T he SU of P ,- writes the block in the local m em ory; (6) The SU of Pt - decrements a synchronization slot counter, th at becomes zero and causes the spawning of a fiber th at will use the block transferred. as well as the address of a synchronization slot th a t will receive a synchronization signal when the d ata transfer is complete. (2) The SU of P,-, having received the request for the block move, sends a block request to the SU of Pj through the net work. (3) the SU of Pj reads the requested block from the local memory of P j. In Figure 4.2(b): (4) the SU of Pj sends, through the network, the requested block to the SU of Pi. (5) T he SU of P, writes th e block into th e destination address. Finally (6) the synchronization slot indicated in the block move request receives a synchronization signal and causes the fiber th a t will use the transferred d ata to be spawned and executed in the EU of processor P,-. In this exam ple we assume th at the destination of the block move and the synchronization slot th a t received the signal upon the completion of the d ata transfer were in the same processor th at requested the d ata movement. However the EARTH m odel is general enough to allow each one of these addresses to be in a different processor. Observe in the exam ple presented in Figure 4.2 th at the EU of processor P j is never involved in the d ata transfer requested by processor P,-. Thus if two processing units are actually available in the machine to support the EU-SU model of EARTH, 5 3 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. the only im pact of the d a ta m ovem ent on the execution of fibers in Pj would be possible conflicts on accesses to the m em ory between th e SU and the EU. Moreover, during th e steps (2) to (6) in Figure 4.2, the EU of Pi is not involved and is free to execute other enabled thread. T he capacity to overlap the rem ote d ata transfer with the execution of other fibers in th e EU is a distinguishing characteristic of a fine grain m ulti-threading system . 4.2 Single Assignment Storage Structures In this chapter we study the use of software cache for I-structures, a single-assignment data structure, in the EARTH model. T he nam e I-structure was originally used by Arvind an d Thom as in the context of functional languages to designate an array built w ith a fine-grained update operator w ith no repeated indexes [11]. Later, I- structures were proposed as separate d a ta structures for functional program s. In [10], Arvind, Nikhil, and Pingali dem onstrate, through several program m ing examples, th at the introduction of I-structures in functional languages eliminates inefficien cies and increases the program m ability of functional languages. The proposition to incorporate I-structures in functional languages was derived from the observation th at w ithout the ability to store a state, it is very difficult to solve even simple problems in a m anner th at is efficient, easy to code, and enables the exploitation of parallelism [10]. 
Our motivation to introduce a single assignment structure in Threaded-C stems from the observation that the use of such structures significantly reduces the number of synchronization operations required in some programs. The single-assignment characteristic of I-structures eliminates the need for consistency-related network operations when these structures are enhanced with temporary storage buffers. The former makes it easier to code problem solutions in Threaded-C, and the latter makes it easier to implement software caches for I-structures.

Figure 4.3: State Transition Diagram for the I-Structure Implementation (states: Empty, Deferred, and Full; allocate takes an element from the heap to Empty; a read of an Empty element moves it to Deferred; a write moves an Empty or Deferred element to Full; a write to a Full element is a fatal error; delete returns an element to the heap, and reset returns it to Empty)

Originally, an I-structure was defined as an array of elements, [2] where each element of the array can be in one of three states: empty, full, and deferred. Each element of the array can only be written once (thus the name single-assignment), but it can be read many times. When the I-structure is created, all the elements of the array are empty. If a read occurs before the write, the element goes into the deferred state and the read operation is kept in a queue associated with that element. Subsequent reads are also queued. When a write to an empty element occurs, the value is written and the element becomes full. If the element was in the deferred state, all the reads that were queued for that element are serviced before the writing operation is complete, and the element goes into the full state. A read to a full element returns immediately with the value previously written. A write to a full element is considered a fatal error and causes the program to terminate. Figure 4.3 shows the state diagram of an I-structure. Notice that this state diagram includes the operations delete and reset, which were not in the original definition of an I-structure. These operations were included in our implementation because, unlike the functional language environment in which the I-structure was originally defined, Threaded-C is an imperative language that does not offer garbage collection. Therefore the programmer must delete data structures after they are no longer needed. The reset operation allows the reuse of I-structures, avoiding frequent deletion/allocation in some applications.

[2] However, nothing prevents the implementation of a single element I-structure, or other data structure organizations.

Observe that, for its proper functioning, the state transitions of the I-structure must be atomic. For instance, if a write is performed on a deferred element, all reads in the queue of the deferred element must be served the value written before another operation on the same element can be performed. In the current implementations of Threaded-C, this atomicity is derived from the fact that fibers are non-preemptive and that, with a single processor in each processing node, only a single thread can run on a node at a time.

The two key functions to implement I-structures in Threaded-C are the I_READ and I_WRITE operations.

THREADED I_READ_x(int iid, int index, void *GLOBAL place, SPTR slot_adr)
    Reads the element index of the I-structure iid. The value read is written in place by a split-phase transaction that, when completed, synchronizes the slot slot_adr. If the element index is empty, I_READ stores place and slot_adr in the reading queue corresponding to that element. When the write operation to that element is performed, the value written is copied into place and the slot slot_adr is synchronized.

THREADED I_WRITE_x(int iid, int index, T value)
    Writes value to the element index of the I-structure iid. If the element index is full, I_WRITE prints a fatal error message on the standard error output and terminates the program.
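The semantics of these two operations can be summarized by a short sketch in plain C. The structures and names below (ielem_t, reader_t) are illustrative only and do not reflect the actual layout of the I-Structure library; the synchronization of slot_adr is abstracted here as a callback.

    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { EMPTY, DEFERRED, FULL } istate_t;

    typedef struct reader {          /* one queued (deferred) read request */
        double        *place;        /* where the value must be copied     */
        void         (*sync)(void);  /* stands in for signaling slot_adr   */
        struct reader *next;
    } reader_t;

    typedef struct {
        istate_t  state;
        double    value;
        reader_t *queue;             /* deferred readers; empty when FULL  */
    } ielem_t;

    /* I_READ semantics: serve immediately if FULL, otherwise defer. */
    static void i_read(ielem_t *e, double *place, void (*sync)(void))
    {
        if (e->state == FULL) { *place = e->value; sync(); return; }
        reader_t *r = malloc(sizeof *r);
        r->place = place; r->sync = sync; r->next = e->queue;
        e->queue = r;
        e->state = DEFERRED;
    }

    /* I_WRITE semantics: a second write is a fatal error; the first write
     * stores the value and services every queued reader atomically. */
    static void i_write(ielem_t *e, double v)
    {
        if (e->state == FULL) {
            fprintf(stderr, "fatal: multiple write to I-structure element\n");
            exit(1);
        }
        e->value = v;
        while (e->queue) {
            reader_t *r = e->queue; e->queue = r->next;
            *r->place = v; r->sync(); free(r);
        }
        e->state = FULL;
    }

In the real system the body of i_write runs without preemption, which is exactly the atomicity argument made above.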
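The transitions taken on a local read, as just described, can be condensed into a small sketch in C. The state names follow Figure 4.4; the function is illustrative rather than the library's actual code.

    typedef enum {
        SC_INVALID,           /* not present; no space allocated in the cache */
        SC_DEFERRED_REQUEST,  /* block requested because a neighbor was read  */
        SC_DEFERRED_READ,     /* requested and locally read                   */
        SC_FULL               /* value present; reads hit immediately         */
    } sc_state_t;

    /* Transition taken when the local node reads this element.
     * Returns 1 if a block request must be sent to the host node. */
    static int sc_on_read(sc_state_t *s)
    {
        switch (*s) {
        case SC_INVALID:          *s = SC_DEFERRED_READ; return 1; /* miss */
        case SC_DEFERRED_REQUEST: *s = SC_DEFERRED_READ; return 0; /* reuse
                                     the request already in flight         */
        case SC_DEFERRED_READ:    return 0; /* read is queued on the element */
        case SC_FULL:             return 0; /* hit: served immediately       */
        }
        return 0;
    }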
When a write to a deferred read or deferred request element is performed in the host node and the value written is sent to the local node, the element goes into the full state. Read operations for elements in the full state are serviced immediately and do not cause any state transition. A write to a full element is a fatal error. Both a full element and a deferred request element can be evicted from the ISSC, either by a replacement operation or by an invalidation operation. A deferred read element is irreplaceable: an invalidation or a replacement of such an element is a fatal error. A write to an invalid element is ignored and the element is not placed in the cache.

The ISSC is implemented in the Threaded-C [6, 70] language for EARTH [37] systems. Our implementation of the ISSC builds on the I-Structure user library [7, 5]. In this section we describe the key data structures, functions, and policies implemented in the ISSC library.

DATA STRUCTURE Cache
    This is the main data structure for I-Structure software caches. The layout of our software cache is that of a set-associative cache. Set-associative software caches have faster cache entry searching times than fully associative caches and better cache utilization than direct mapped caches. The caching address consists of the node number of the host node, the I-Structure I.D., and the index of the element for which a read is requested. Upon receiving a read request, the caching address is mapped to a set by a hash function, and a software search is performed to see whether there is a match for the address in the set. In our simulation studies [51, 52], we determined that a cache block of 8 data elements yields a reasonable cache hit ratio. Therefore, in the experiments discussed in Chapter 5, we use a cache block size of 8 and implement the software cache with 256 sets and 8 cache blocks within each set, for a total of 16K elements in the cache. The complete definition of the data structures used in the ISSC implementation is shown in Appendix A.

THREADED InitCache(SPTR done)
    InitCache allocates memory space for the software cache in the local node and initializes it. The initialization should be done before any cache accesses. After the initialization, a synchronization signal is sent to the address done.

THREADED SC_I_READ(int node, int i_id, int index, int type, void *GLOBAL place, SPTR slot_adr)
    This is the read function for I-Structure elements through the software caches. Instead of invoking the original I_READ_x at the remote node in which the I-Structure is allocated to request an I-Structure data element, SC_I_READ is invoked in the local node. No request is sent to the owner node of the I-Structure if the data already exists in the local software cache or if the element has already been requested. The node that hosts the I-Structure is node, i_id is the I-Structure requested, index is the element of the I-Structure, type is the data type of the element, and place and slot_adr are the address to which the requested data will be sent and the slot that will be synchronized when the data is back.
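A minimal sketch in C of the set-associative lookup just described follows. The hash function and structure layout are illustrative assumptions (the actual definitions are in Appendix A); only the geometry, 8-element blocks in 256 sets of 8 ways, follows the text.

    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK_ELEMS 8    /* elements per cache block                    */
    #define NUM_SETS    256
    #define NUM_WAYS    8    /* 256 sets x 8 ways x 8 elems = 16K elements  */

    typedef struct {
        int     valid;
        int     node, iid;             /* host node and I-Structure I.D.    */
        int     block;                 /* element index / BLOCK_ELEMS       */
        double  data[BLOCK_ELEMS];
        uint8_t state[BLOCK_ELEMS];    /* per-element ISSC state            */
    } sc_block_t;

    static sc_block_t cache[NUM_SETS][NUM_WAYS];

    /* The caching address (node, iid, index) is mapped to a set by a hash
     * function; the multipliers below are arbitrary illustrative choices. */
    static unsigned sc_hash(int node, int iid, int block)
    {
        return ((unsigned)node * 31u + (unsigned)iid * 17u
                + (unsigned)block) % NUM_SETS;
    }

    static sc_block_t *sc_lookup(int node, int iid, int index)
    {
        int block = index / BLOCK_ELEMS;
        sc_block_t *set = cache[sc_hash(node, iid, block)];
        for (int w = 0; w < NUM_WAYS; w++)
            if (set[w].valid && set[w].node == node &&
                set[w].iid == iid && set[w].block == block)
                return &set[w];
        return NULL;  /* miss: SC_I_READ must request the whole block */
    }

On a miss, SC_I_READ would choose a victim way in the same set (never one holding a deferred read element, per the replacement rule above) and issue a single block request to the host node.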
4.3.2 Usage of ISSC in the Threaded-C language

A simple example showing how the ISSC library is used in a Threaded-C program is given in Figure 4.5. In this example, an I-Structure floating point array of length 8 is allocated on the last node of the system. The data of these I-Structure elements are then generated by a node, and node 0 reads back the values of those 8 data elements.

     1: #include <stdio.h>
     2: #define EXTERN
     3: #include "issc.h"
     4: #define I_NODE NUM_NODES-1
     5:
     6: THREADED ARRAY_INIT(int i_node, int i_id, int length)
     7: {
     8:   int i;
     9:   for (i = 0; i < length; i++)
    10:     INVOKE(i_node, I_WRITE_F, i_id, i, (float)i);
    11:   END_FUNCTION();
    12: }
    13:
    14: THREADED MAIN()
    15: {
    16:   SLOT SYNC_SLOTS[3];
    17:   int F_str, i;
    18:   float F_variable[8];
    19:
    20:   INIT_SYNC(0, NUM_NODES+1, NUM_NODES+1, 1);
    21:   INVOKE(I_NODE, I_INIT, SLOT_ADR(0));
    22:   /* Allocate cache space on each node */
    23:   for (i = 0; i < NUM_NODES; i++)
    24:     INVOKE(i, InitCache, SLOT_ADR(0));
    25:   END_THREAD();
    26:
    27:   THREAD_1:
    28:   INIT_SYNC(1, 1, 1, 2);
    29:   /* Allocate I-Structure */
    30:   INVOKE(I_NODE, I_ALLOCATE, 8, TO_GLOBAL(&F_str), SLOT_ADR(1));
    31:   END_THREAD();
    32:
    33:   THREAD_2:
    34:   TOKEN(ARRAY_INIT, I_NODE, F_str, 8);
    35:   INIT_SYNC(2, 8, 8, 3);
    36:   /* Read from F_str[0:7] */
    37:   for (i = 0; i < 8; i++)
    38: #ifdef CACHE
    39:     INVOKE(NODE_ID, SC_I_READ, I_NODE, F_str, i, F,
    40:            TO_GLOBAL(&F_variable[i]), SLOT_ADR(2));
    41: #else
    42:     INVOKE(I_NODE, I_READ_F, F_str, i,
    43:            TO_GLOBAL(&F_variable[i]), SLOT_ADR(2));
    44: #endif
    45:   END_THREAD();
    46:
    47:   THREAD_3:
    48:   for (i = 0; i < 8; i++) printf("%f\n", F_variable[i]);
    49:   RETURN();
    50: }

Figure 4.5: Threaded-C with ISSC program example

In line 4, we define the I-Structure host, I_NODE, as the last node of the system, NUM_NODES-1. In THREAD_0 (lines 20-25), we initialize the I-Structure system on I_NODE and the software caches on each node. In THREAD_1 (lines 27-31), a floating point I-Structure array of 8 elements is allocated on I_NODE. The handle for the allocated I-Structure is stored in F_str. In THREAD_2 (lines 33-45), the data of the I-Structure array F_str are generated by the ARRAY_INIT function (lines 6-12), which is invoked by the TOKEN function in line 34. Then, the data of this I-Structure array are read back in lines 35-44. In line 38, we use a compiler flag to activate or deactivate the ISSC. If the CACHE flag has been defined in the compilation, the function SC_I_READ is invoked locally (at node NODE_ID); otherwise, the function I_READ_F is invoked at the I-Structure host (node I_NODE). After all 8 data elements have been read back from node I_NODE, THREAD_3 is activated, prints out the data, and terminates the program.

In this program, if the ISSC is not used, 8 data requests for the I-Structure array F_str are sent to the remote node, I_NODE, by invoking 8 I_READ_F functions on I_NODE. However, if the ISSC is used, even though 8 SC_I_READ functions are invoked locally, only one data request is sent to the remote node I_NODE.

A more complete example of using ISSC in a real application written in the Threaded-C language is shown in Appendix B.
Chapter 5

Experiment results on EARTH systems

To study the effectiveness of our implementations of both I-structures and ISSC, we coded four benchmarks (see Section 5.3) in both versions of the system (Threaded-C with I-Structures, and Threaded-C with I-Structures and ISSC). Our experimental results were obtained on the MANNA machine. We also measured the latency of basic EARTH operations and of I-structure based operations.

5.1 Highlights of Experimental Results

Our main results can be summarized as follows:

1. The addition of ISSC to the EARTH system results in increased robustness to latency variation. The speedup obtained with ISSC increases for machines with higher costs for remote operations (see Figures 5.1 and 5.2 for details).

2. The ISSC significantly reduces the amount of traffic in the network. As shown in Table 5.3, in all applications the number of remote requests for I-structure elements was reduced by one up to four orders of magnitude.

3. The sole addition of I-Structures (without ISSC) decreases the performance of the EARTH system. Even for machines with higher latencies, the overhead of the software emulation of I-structures hurts performance (as shown in the graphs of Figure 5.3) unless it is offset by the benefits of the ISSC (see Section 5.3 for details).

4. ISSC operations can be implemented very efficiently on the MANNA machine. In Section 5.2 we demonstrate that the MANNA machine network interface is very efficient. Our experiments demonstrated that our implementation of the ISSC on top of the EARTH operations is also efficient (see Table 5.1).

5. The performance of the system with ISSC improves for all benchmarks on machines with moderately high latency for remote operations. As shown in Figure 5.2, for all four benchmarks, if 10 µs (500 cycles) are added to the latency of MANNA (which is 3.5 µs = 175 cycles), the benchmarks running on the software with the ISSC produce greater speedup than on the system with I-structures only.

5.2 The Cost of ISSC Operations

Our studies are based on an implementation of EARTH on the MANNA machine. MANNA is a 20 node, 40 processor machine. Each node has two Intel i860 XP processors running at 50 MHz with 32 MB of memory and is interconnected with the other nodes through a crossbar switch network. The MANNA machine is a research platform of which only a few were constructed. With full control of the network interface in the MANNA machine, the implementations of inter-node communication and synchronization are very efficient, as demonstrated by the measurements presented in this section. We measured the latency of some EARTH and ISSC operations on the EARTH-MANNA-SPN machine. EARTH-MANNA-SPN is an implementation of the EARTH model on the MANNA machine in which only one processor is used in each node [69].

Table 5.1: Latency of EARTH and ISSC operations on EARTH-MANNA-SPN, measured in number of cycles (1 cycle = 20 ns)

    Operation       Local   Remote
    Get_Sync          141      348
    Fun. Call         250      451
    I_READ_F          317      492
    ISSC hit          479        -
    ISSC miss        2693        -
    ISSC deferred    1354        -
However, this network efficiency is usually not available in affordable and widely available networks of workstations. The sending and receiving of network packets may take from hundreds to thousands of cycles depending on the design of the network interface [21]. In some machines, a parallel environment is built on top of the TCP protocol and the communication interface overhead may be as high as hundreds of microseconds [44]. Even with improved protocols, like Fast Sockets [66] and Active Messages [72], it still costs 40~60 microseconds to send a message to the network. The latency of the operations required to communicate and synchronize across processing nodes is a determinant factor in the performance of some applications.

Table 5.1 lists the latency of some EARTH and ISSC operations on the MANNA platform; these values are also used as the platform-related parameters of the analytical model in Chapter 6. Observe that the processor is not busy with the operation for the number of clock cycles shown in Table 5.1: most of the remote operation time is spent either waiting on queues or in the network, thus releasing the processor to execute other ready fibers. In a local measurement all operations are within a processor, while in a remote measurement all operations are issued to other nodes through the network. The EARTH operations measured in Table 5.1 include a get_sync operation, in which thread 1 requests a word of data from thread 2 and thread 2 synchronizes thread 1 when the data arrives, and function calls, which represent the invocation of a threaded function either in the same node or in a remote node.

At the bottom of Table 5.1 are the measured latencies of the ISSC operations and of the basic I-Structure read function, I_READ_F. The measurement starts when thread 1 invokes the I_READ_F function on the I-Structure node (either the same node or a remote node) and ends when the I_READ_F function finishes and synchronizes thread 1 upon the arrival of the data. ISSC hit measures invoking an I_READ_F for a remote datum, finding the requested data in the local software cache, and synchronizing the requesting thread with the data found in the software cache. ISSC miss is the case in which the entire surrounding data block is not found in the software cache: a new request for the whole block is issued to a remote node, and finally the requested data, along with the whole data block, is sent back from the remote node and the synchronization is performed. Note that this measurement is made by issuing multiple requests in a pipelined fashion; therefore the time spent on the remote node is overlapped with the issuing of other requests, and only the time spent in the local node is measured. ISSC deferred is the case in which the surrounding data block is already allocated in the local software cache but the requested data element is not there yet. The original request is then deferred in the software cache until the requested data is available, either along with the entire data block or sent back individually from the remote I-Structure node. The same measurement method as for ISSC miss is used to ensure that no idle time or remote operation time is measured.
The difference between the local and remote cases of I_READ_F represents four times the one-way communication interface overhead: once for the requester sending the request, once for the I-Structure node receiving the request, once for the I-Structure node sending the data back, and finally once for the requester receiving the data. Since this difference is 492 - 317 = 175 cycles, the one-way communication interface overhead takes only 175/4 processor cycles (0.875 µs). This measurement indicates that inter-node communication on MANNA machines is very efficient when compared with networks of workstations.

5.3 Description of Benchmarks

To measure the improvement in system performance when both I-structures and ISSC are used, we selected four different benchmarks: dense matrix multiplication, Conjugate Gradient, Hopfield network, and sparse matrix multiplication. To compare the performance of the software cache with the original system, we implemented three versions of code for each benchmark: a plain Threaded-C code; a Threaded-C code using the I-Structure library, Threaded-C+IS; and a Threaded-C code using both the I-Structure library and the I-Structure Software Cache (ISSC), Threaded-C+ISSC. All our experiments were performed on the MANNA machine. The two processors of a processing node on MANNA share 32 Mbytes of DRAM. The nodes of MANNA are diskless; therefore all the code, the runtime system, the data, and the software emulations of the I-structure and the ISSC must fit in 16 Mbytes per node. Consequently, we were only able to test moderate data set sizes for the benchmarks. In a related research work, Theobald developed a detailed cycle-by-cycle simulation of the MANNA architecture and demonstrated that applications scale well for larger versions of the platform [69].

Dense Matrix Multiplication. Two 128x128 dense matrices are multiplied. The algorithm that we use in this study is a simple-minded, non-blocking algorithm that computes C = A x B. The computation of the rows of the resulting matrix C is evenly distributed among all nodes. Node 0 invokes threads on each node to compute the rows that each is responsible for. The results of the C elements are written directly to where they reside.

Conjugate Gradient. The Conjugate Gradient algorithm from the NAS benchmark suite [13, 12] uses the inverse power method to find an estimate of the largest eigenvalue of a symmetric positive definite sparse matrix with a random pattern of non-zeros. In our experiment, the problem size is 256 linear equations with 256 unknown variables. The calculations of matrix-vector multiplications are done in parallel across all the nodes, and the calculations of vector-vector multiplications are done on node 0. In this algorithm, most of the computation consists in updating array elements. Therefore, the benchmark does not have much temporal data locality.

Hopfield Network. Hopfield is a kernel benchmark [7] based on the Hopfield network, a recursive neural network that is often used in combinatorial optimization problems as well as an associative memory. The network is formed by a set of neurons that are connected by synapses.
At time k + 1, the activation value of each neuron is updated based on the activation values of the neurons at time k, weighted by the synapse values. In the I-Structure and ISSC implementation, two I-Structure arrays are used to store the current and previous activation values of the neurons. Before updating to the current values, the I-Structure array is reset and reassigned a new I.D.; therefore the same memory space can be re-utilized and no cache flush is needed. The problem size we tested is 256 neurons.

Sparse Matrix Multiplication. Sparse matrix multiplication is an application with an irregular data access pattern. Two unstructured sparse 256x256 matrices A and B are randomly generated with a density of 10%. Matrix A is then stored in Compressed Row Storage (CRS) format and matrix B is stored in Compressed Column Storage (CCS) format. A dense resulting matrix C is generated by multiplying A and B.

Table 5.2: I-Structure Software Cache Hit Ratios (%)

    Number of Nodes   Dense M.M.   C.G.    Hopfield   Sparse M.M.
    2                 99.71        93.70   99.90      99.92
    4                 99.52        93.69   99.80      99.87
    8                 99.13        93.52   99.61      99.76
    16                98.35        92.92   99.22      99.53

Table 5.3: Average number of remote memory requests per node

    Number    Dense M.M.          C.G.              Hopfield          Sparse M.M.
    of Nodes  no ISSC   w/ISSC    no ISSC  w/ISSC   no ISSC  w/ISSC   no ISSC   w/ISSC
    2         528384    1536      33536    2112     32768    32       986668    761
    4         396288    1920      25152    1587     24576    48       731842    971
    8         231168    2016      14672    950      14336    56       426002    1038
    16        123840    2040      7860     557      7680     60       227979    1078

Table 5.2 shows the cache hit ratios of the four benchmarks in our experiments, and Table 5.3 shows the average number of remote memory requests in each benchmark both without and with ISSC. The ISSC did help the system to exploit global data locality. For three of the four benchmarks (all except Conjugate Gradient), cache hit ratios of more than 99% were achieved, and even in the Conjugate Gradient algorithm, which has poor temporal data locality, a 93% cache hit ratio was achieved. Table 5.3 shows that the ISSC reduces the number of remote memory requests actually sent to remote nodes. In all cases, at least 93% of the original remote memory requests are eliminated by the I-Structure Software Cache.

5.4 Robustness to Latency Variation

We measured the speedup of the I-Structure Software Cache version of the benchmarks relative to a version of the same benchmarks written in plain Threaded-C and running on a single processing node. We performed two sets of experiments. The first set, shown in Figure 5.1, measures the performance on the MANNA machine.

Figure 5.1: Speedup on the MANNA machine. (a) Matrix Multiplication; (b) Conjugate Gradient; (c) Hopfield; (d) Sparse Matrix Multiplication

As a result of the efficient implementation of the network and its interface on the MANNA machine, the plain Threaded-C version has the best performance for all benchmarks.
When the cost of executing split-phase operations is very low, the overhead incurred by the I-Structure and software cache operations in Threaded-C+IS and Threaded-C+ISSC can degrade performance. In the Conjugate Gradient and Hopfield benchmarks, the Threaded-C+IS version has better performance than the Threaded-C+ISSC version. This is because of the poor temporal data locality in these algorithms, which results in a high ratio of deferred hits among the cache hits. The overhead of a deferred hit is much larger than that of an I-Structure access or an ISSC hit, as reported at the beginning of this section.

In our second set of experiments, shown in Figure 5.2, we add 10 µs to the cost of both I-structure and ISSC operations. This is equivalent to a machine with a higher communication/computation cost ratio, i.e., a machine in which requesting a remote datum is more expensive. Figure 5.2 shows the absolute speedup of the four applications on the MANNA machine with a 10 µs add-on synthetic communication interface overhead.

Figure 5.2: Absolute speedup with 10 µs communication interface overhead. (a) Matrix Multiplication; (b) Conjugate Gradient; (c) Hopfield; (d) Sparse Matrix Multiplication

In this set of experiments, the Threaded-C+ISSC version out-performs the other two versions in all applications. Even though we added a 10 µs synthetic communication interface overhead in this set of experiments, this is still far less than the communication interface overhead of a fast local area network [66], which costs 40~60 microseconds. However, the plain Threaded-C version still has better performance than Threaded-C+IS because of the I-Structure access overhead. These experiments show that, even though the ISSC pays extra overhead for its operations, by taking advantage of the global data locality in applications and the amount of communication interface overhead it saves, the I-Structure Software Cache improves system performance on Network of Workstations platforms.

From the previous experiments, we know that the communication interface overhead is a determinant factor in the performance of I-Structure Software Caches. To gain a better understanding of the relationship between ISSC performance and the communication interface overhead, we ran our experiments on a 16 node system with a variable synthetic communication overhead for our selected benchmarks. Figure 5.3 shows the execution time of the applications under different overheads. For each application, we marked the point of communication interface overhead where the Threaded-C+ISSC version starts to out-perform the plain Threaded-C implementation. The ISSC starts to help the system when the communication interface overhead is greater than 6.3 µs, 9.2 µs, 6 µs, and 3.2 µs, respectively, in dense matrix multiplication, Conjugate Gradient, Hopfield, and sparse matrix multiplication. When the communication interface overhead exceeds 100 µs, the Threaded-C+ISSC versions run almost 10 times faster than the plain Threaded-C versions in our benchmarks.
Figure 5.3: Execution time with synthetically variable communication interface overhead. (a) Matrix Multiplication; (b) Conjugate Gradient; (c) Hopfield; (d) Sparse Matrix Multiplication

5.5 Summary

In this chapter, we compared the performance of two extensions to the original programming environment of the EARTH system: (i) the programming environment extended with an implementation of I-structures, a single assignment data structure that facilitates the implementation of synchronizations across multiple processing nodes; and (ii) the programming environment extended with I-structures and an I-Structure Software Cache (ISSC) that enables the exploitation of temporal and spatial locality in the I-structures. The motivation to introduce both I-structures and ISSC into the original EARTH programming environment stems from the single assignment nature of I-structures: when single-assignment storage structures are used, the need for consistency-related transactions in the network is eliminated.

Our studies are based on an implementation of EARTH on the MANNA machine. MANNA is a 20 node, 40 processor machine. Each processing node has two processors. The nodes are interconnected through a crossbar switch network. In the EARTH-MANNA version that we use, the functions of the synchronization unit are emulated in the same processor that executes the fibers. Because neither MANNA nor the original programming environment for EARTH provides direct support for single assignment storage such as I-structures, we emulate both the I-structure operations and the ISSC in software. Our study focuses on the robustness of the resulting programming model to latency variation. Therefore, we measured the latency of basic EARTH operations, I-structure operations, and ISSC operations. We then varied these latencies by introducing delays in the operations to identify the lower bound of latency (measured in processor cycles) beyond which the introduction of I-structures and ISSC in the system has a positive effect on performance. Our results indicate that the extension of the Threaded-C programming environment with ISSC is robust to variations in latency. This robustness is reflected in better speedup curves for machines with higher costs for remote operations.

Chapter 6

Performance Modeling

Our ISSC is a pure software approach to exploiting global data locality in non-blocking multithreaded execution without adding any hardware complexity. It provides the ability to reduce communication latency while maintaining the ability to tolerate communication latency in multithreaded execution. Some reasonable research questions are: Do software caches really work? Will the overhead of software cache operations compromise their performance? What are the conditions for ISSC to improve system performance? What kinds of applications could benefit from ISSC? It is the single assignment property of the I-Structure memory system that makes the use of a software cache profitable.
Because the cache of a single assignment memory is inherently coherent, no cache coherence problem is encountered in an I-Structure cache design. The absence of coherence-related operations significantly reduces the overhead of the software cache system. Indeed, given the capability of communication latency tolerance in multithreaded execution, the major benefit of the ISSC comes from the savings in communication interface overhead.

In this chapter, we present an analytical model for the performance of a multithreading system with and without ISSC support. This model consists of two sets of factors, platform-related and benchmark-related parameters, which affect the performance of the ISSC. From this model, we can analyze the lower bound of communication interface overhead beyond which the ISSC starts to yield performance gains for different benchmarks and platforms.

6.1 Performance Analysis

Before we present the analytical model, we analyze the performance measurements shown in Chapter 5. Table 5.2 showed the cache hit ratios of the four benchmarks in our experiments, and Table 5.3 showed the average number of remote memory requests in each benchmark both with and without ISSC. The ISSC did help the system to exploit global data locality. For three of the four benchmarks (all except Conjugate Gradient), cache hit ratios of more than 99% were achieved, and even in the Conjugate Gradient algorithm, which has poor temporal data locality, a 93% cache hit ratio was achieved. Table 5.3 showed that the ISSC reduces the number of remote memory requests actually sent to remote nodes. In all cases, at least 93% of the original remote memory requests are eliminated by the I-Structure Software Cache.

With the capability of communication latency tolerance in multithreaded execution, the major benefit of the ISSC comes from the savings in communication interface overhead. The measurements presented in Section 5.2 show that the network implementation of MANNA is very efficient. However, this efficiency is usually not available in affordable and widely available networks of workstations. To gain a better understanding of the relationship between ISSC performance and the cost of remote operations, we ran our experiments on a 16 node EARTH-MANNA system, adding a variable synthetic communication interface overhead on top of the existing overhead.

Figure 6.1 shows the execution time of the applications as the communication overhead increases. The marked points are values measured from actual runs on the EARTH-MANNA machine, and the timing curves are fitted to the measurements in a least-squares sense with degree 1.

Figure 6.1: Execution time with add-on synthetically variable communication interface overhead. (a) Dense Matrix Multiplication; (b) Conjugate Gradient; (c) Hopfield; (d) Sparse Matrix Multiplication

The timing equations are shown in Table 6.1. To find where the ISSC starts to improve system performance, we can simply solve the inequality T_Threaded-C+ISSC < T_Threaded-C, i.e., find the cross points of the T_Threaded-C+ISSC and T_Threaded-C curves. These points are shown at the right of Table 6.1.
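As an illustration of how the fitted curves and their cross points are obtained, the following C sketch fits a degree-1 least-squares line to (overhead, time) measurements and solves for the intersection of two fitted lines. The data points here are generated from the fitted dense matrix multiplication equations of Table 6.1 purely for illustration; with them the program prints a cross point of about 5.3 µs, matching the table.

    #include <stdio.h>

    /* Degree-1 least-squares fit of y = a*x + b, as used for Figure 6.1. */
    static void fit(const double *x, const double *y, int n,
                    double *a, double *b)
    {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; sxy += x[i] * y[i];
        }
        *a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        *b = (sy - *a * sx) / n;
    }

    int main(void)
    {
        /* Add-on overheads (us) and execution times (us); illustrative
         * points generated from the fitted dense M.M. equations.        */
        double coa[4]     = { 0.0, 5.0, 10.0, 20.0 };
        double t_plain[4] = { 3.139e6, 4.378e6, 5.617e6, 8.095e6 };
        double t_issc[4]  = { 4.431e6, 4.450e6, 4.470e6, 4.509e6 };
        double a1, b1, a2, b2;

        fit(coa, t_plain, 4, &a1, &b1);
        fit(coa, t_issc, 4, &a2, &b2);
        /* Cross point: a1*x + b1 = a2*x + b2  =>  x = (b2-b1)/(a1-a2). */
        printf("ISSC pays off for Coa > %.1f us\n", (b2 - b1) / (a1 - a2));
        return 0;
    }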
Table 6.1: Timing equations (execution time in µs as a function of the add-on overhead Coa) and the cross points (µs)

    Benchmark     Threaded-C                       Threaded-C+IS                    Threaded-C+ISSC                Cross Point
    Dense M.M.    2.478 x 10^5 Coa + 3.139 x 10^6  2.471 x 10^5 Coa + 4.174 x 10^6  3877.6 Coa + 4.431 x 10^6      5.3
    C.G.          1.855 x 10^4 Coa + 1.074 x 10^5  1.847 x 10^4 Coa + 1.301 x 10^5  1274.8 Coa + 2.618 x 10^5      8.9
    Hopfield      1.548 x 10^4 Coa + 7.198 x 10^4  1.541 x 10^4 Coa + 9.114 x 10^4  118.8 Coa + 1.505 x 10^5       5.1
    Sparse M.M.   4.855 x 10^5 Coa + 2.040 x 10^6  4.811 x 10^5 Coa + 3.542 x 10^6  1798.7 Coa + 3.521 x 10^6      3.95

The meaning of these points is that when the communication interface overhead of a system is greater than this value plus the existing communication interface overhead of the MANNA machine (0.875 µs), the ISSC yields a performance gain. As we can see, the ISSC starts to help the system when the communication interface overhead is greater than 7.1 µs, 9.7 µs, 5.9 µs, and 4.8 µs, respectively, in dense matrix multiplication, Conjugate Gradient, Hopfield, and sparse matrix multiplication; these thresholds are still far below the overheads of most networks of workstations.

6.2 The Analytical Models

The experiments presented in Chapter 5 provide useful information about the performance of the ISSC on an existing hardware platform. However, we would like to be able to predict, for machines yet to be built, under which circumstances the implementation of an ISSC becomes profitable. To enable such predictions, in this section we develop analytical models for the execution time of benchmarks written in plain Threaded-C, T_Threaded-C; in Threaded-C with I-Structure support, T_IS; and in Threaded-C with both I-Structure and I-Structure Software Cache support, T_ISSC. The base for our analytical models is T_B, the execution time of the benchmark on a fine grain multi-threaded machine without I-Structures and with the cost of split-phase memory accesses deducted. Our models use the following sets of benchmark-related and platform-related parameters.
The analytical m odels are defined as follows: T t h r e a d e d —c — T b + { N L , + N f t ) O r + N r 2 ( C 0 + C o a ) T i s = T b + N l O i + N r O t + Nr2{C0 + C o a ) T issc = T b + N l O i 4 - N a R h i t { 1 — R d - h i t ) O h u + N r R h u R d - h i t O d e / + / V g ( l — R h i t ) O m i s s + N r ( 1 — R h i t ) 2 ( C 0 + C o a ) In the developm ent of the analytical model, we assum e owner com putation rule. Therefore, all the w rite operations are perform ed locally and incur no com m unication overhead. We also assum e th a t the I-structure arrays are evenly distributed across the nodes. Therefore th a t the jobs are also evenly distributed. We assume the sam e basic execution tim e, T b for all three versions of the system. In fact, Tg in T issc should be sm aller than the ones in T t h r e a d e d - c and 77s because caching rem ote m em ory requests decreases th e average turn-around tim e for all the requests and as, a result, it increases parallelism and processor utilization. However, this assum ption 79 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. P aram eters Benchm arks Dense M.M. C.G. Hopfield Sparse M.M N l 139328 614 512 19140 N r 123840 9210 7680 234784 R hu (%) 98.35 93.65 99.22 99.55 Rd-hit (%) 20.00 51.80 100.00 21.20 Table 6.2: Benchm ark-related Param eters P aram eters a O i O r O h it O m is s O d e f m icro-second 0.875 6.34 2.82 9.58 51.54 27.08 Table 6.3: P latform -related Param eters M easured from MANNA machine in T issc provides the upper-bound of the execution tim e for the system with ISSC. In our im plem entation, only rem ote reads are cached in ISSC. Hence, those local I-Structure reads in T iss c still need the I-S tructure read service in local node. In these models, th e rem ote costs for Tthreaded-c and T/s are N r 2 ( C 0 + Coa) and f°r T issc is N r (1 — Rhit)2(C0 + Coa) which only include the com m unication overhead incurred in the local node. The overheads in rem ote node are actually hidden by the m ultithreaded execution. 6.2.1 Verifying the Model To verify the analytical models, we compare th e execution tim e prediction obtained from th e models w ith our experim ental results on EARTH-MANNA shown in Chap ter 5. In Table 6.2, we list the benchm ark-related param eters which are collected from our experim ents on M ANNA for a selected set of benchmarks. T he platform - related param eters of M ANNA machine, m easured in Section 5.2, are listed in term s of (xs in Table 6.3. From our analytical m odels, we know th a t the execution tim e of Threaded-C and Threaded-C+IS versions are linear proportional to the add-on com m unication overhead, Coa, w ith the factor of 2 tim es num ber of remote reads, '2N r . which are 247680, 18420, 15360, and 469568 respectively in dense m atrix m ultiplication, 80 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. conjugate gradient, Hopfield and sparse m atrix m ultiplication. Also, the execution tim e of Threaded-C+ISSC is also linear proportional to Co a with the factor of two tim es the num ber of cache misses, 2 N r ( 1 — R h . i t ) - , which are 4086, 1170, 119, and 2113 respectively. These num bers m atch the curve-fitting tim ing equations from our experiments described in Table 6.1 within 10% error range. According to the analytical models, for T issc < Tthread.ed.-c-, we need, (N 'l + N p ) O r 4- N r '2 (C 0 + Coa) > N ^ O l - ( - N r R h . 
    (N_L + N_R) O_r + 2 N_R (C_o + C_oa) >
        N_L O_l + N_R R_hit (1 - R_d-hit) O_hit + N_R R_hit R_d-hit O_def
        + N_R (1 - R_hit) O_miss + 2 N_R (1 - R_hit) (C_o + C_oa)

which, after cancelling the communication terms for cache misses on both sides and dividing by (N_L + N_R), becomes

    [N_R / (N_L + N_R)] R_hit (2 C_o + 2 C_oa) >
        [N_L / (N_L + N_R)] O_l
        + [N_R / (N_L + N_R)] R_hit ((1 - R_d-hit) O_hit + R_d-hit O_def)
        + [N_R / (N_L + N_R)] (1 - R_hit) O_miss - O_r                     (1)

The meaning of Equation 1 is quite straightforward: the condition for the ISSC to start improving the system is that the communication interface overhead saved by the ISSC (the left-hand side of the equation) must be greater than the I-Structure read service time required for local accesses plus the ISSC operation overhead, minus the read request handling time of the original system (the right-hand side of the equation).

We plug the N_L, N_R, R_hit, and R_d-hit parameters of each benchmark and the MANNA platform parameters into Equation 1 to derive the minimum add-on communication interface overhead beyond which the ISSC starts to improve system performance. We obtain 6.7 µs, 9.0 µs, 11.5 µs, and 4.6 µs, respectively, for dense matrix multiplication, Conjugate Gradient, Hopfield, and sparse matrix multiplication.

Our analytical model for T_ISSC defines an upper bound on the execution time. Therefore, the cross point derived from Equation 1 is a lower bound on the communication interface overhead beyond which the ISSC starts to improve system performance. For example, if the point derived from our models is 10 µs, then, given this upper-bound estimate of T_ISSC, we can say that as long as the communication interface overhead is larger than 10 µs, our ISSC will improve performance. The cross points derived from our analytical models are greater than, but close to, the values we measured in our experiments and reported in Table 6.1, except in Hopfield. This is because the synchronization of the activation updates after each time step yields partially sequential behavior; in this case, the basic execution time in T_ISSC is much smaller than in T_Threaded-C, and therefore the cross point we predict is much larger than what we measured.
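A small C program evaluating Equation 1 with the parameters of Tables 6.2 and 6.3 reproduces these predicted cross points; the routine below simply solves the inequality for C_oa. It prints approximately 6.7, 8.8, 11.6, and 4.6 µs, matching the values derived above to within the rounding of the tabulated parameters.

    #include <stdio.h>

    /* Smallest add-on overhead Coa (us) for which T_ISSC < T_Threaded-C,
     * obtained by solving Equation 1 for Coa.  Platform constants are the
     * MANNA values of Table 6.3, in microseconds.                        */
    static double crossover(double NL, double NR, double Rhit, double Rdhit)
    {
        const double Co = 0.875, Ol = 6.34, Or = 2.82;
        const double Ohit = 9.58, Omiss = 51.54, Odef = 27.08;
        /* Right-hand side of Equation 1, multiplied back by (NL + NR).   */
        double rhs = NL * Ol - (NL + NR) * Or
                   + NR * Rhit * ((1.0 - Rdhit) * Ohit + Rdhit * Odef)
                   + NR * (1.0 - Rhit) * Omiss;
        /* Left-hand side is NR * Rhit * (2*Co + 2*Coa); solve for Coa.   */
        return (rhs / (NR * Rhit) - 2.0 * Co) / 2.0;
    }

    int main(void)
    {
        /* Benchmark parameters (N_L, N_R, R_hit, R_d-hit) from Table 6.2. */
        printf("Dense M.M.:  %.1f us\n",
               crossover(139328, 123840, 0.9835, 0.2000));
        printf("C.G.:        %.1f us\n",
               crossover(   614,   9210, 0.9365, 0.5180));
        printf("Hopfield:    %.1f us\n",
               crossover(   512,   7680, 0.9922, 1.0000));
        printf("Sparse M.M.: %.1f us\n",
               crossover( 19140, 234784, 0.9955, 0.2120));
        return 0;
    }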
more th an 98% of cache hit ratio w ith 0 deferred hit, ISSC starts to improve the system for the com m unication overhead as low as 5/xs. Some researcher dedicate their work on com m unication optim izations to reduce the num ber of rem ote m em ory accesses. This kind of optim izations are based on 82 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. Pwtonnanea ptaddion ferdtferant banctanarfcs (wflh Nr/(f*>NI) • 03) 1 25 \ 0.7 0.75 0.0 (SSC hft ratio C % ) Figure 6.2: Perform ance prediction for dif ferent benchm arks Performance preddicn for commuucatlan optimization (wth Rd-htt=03 and Rh4=O.B) Ratio d remote accesses (Nr/(Nr«NI)) Figure 6.3: Perform ance prediction for communication optim ization R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. th e static analysis of the program behavior which is different from exploiting th e d ata locality during the run-tim e by the caches. However, ISSC could still yield perform ance gain in the benchm arks compiled with the com m unication optim ization techniques. In Figure 6.3, we vary th e ratio of remote m em ory requests to the to tal num ber of memory requests. We find out th a t even in an application with only 10% of m em ory accesses are rem ote and m oderate cache hit ratio (Rhit = 0.8 and R d - h i t = 0.5) ISSC still improves the system at 33.5/^s of com m unication overhead. Parformanca prediction for technology improvement -o 14 5 12 5 0 100 150 200 350 400 450 500 Figure 6.4: Perform ance prediction for technology im provem ent As the speed of processors becomes faster and faster, the gap between the com putatio n and communication latencies become larger and larger. Because, our ISSC is a pure software im plem entation, th e ISSC operation overhead decreases propor tional to th e increase of processor speed. In Figure 6.4 we vary th e platform -related param eters based on 50MHz MANNA processor by increasing th e speed of proces sors for an application w ith 50 % of rem ote memory accesses, 80% cache hit ratio, and 50% deferred hit ratio. From this curve, we could predict th a t if we have a 500MHz processor available, which is already there on the m arket, the cross-point drops to less than 2/as. In this case, ISSC could almost yield perform ance gain on any parallel machine. 84 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. 6.3 Summary Do software caches really work? In this chapter, we dem onstrated a software imple m entation of I-Structure cache, i.e. ISSC , can deliver performance gains for most distributed m em ory system s which don’t have extrem ely fast inter-node communi cations, such as network of workstations [21, 44, 66, 41]. ISSC caches values obtained through split-phase transactions in th e operation of an I-Structure. It also exploit spatial data locality by clustering individual element requests into block. O ur experim ent results show th at the inclusion of ISSC in a parallel system th at provides split-phase transactions reduces the num ber of rem ote memory requests dram atically and reduces th e traffic in the network. The most significant effect to the system performance is the elimination of the large am ount of com m unication interface overhead which is incurred by rem ote requests. We developed analytical models for the perform ance of a distributed memory m ultithreading system with and without I-Structure Software Cache support. 
We verified these models with our experim ent results on an existing m ultithreaded ar chitecture, EARTH-M ANNA. These models consist of two sets of factors, platform - related and benchm ark-related. Platform -related param eters are those latencies in curred by rem ote m em ory requests and ISSC operations. B enchm ark-related pa ram eters are the characteristics of applications, such as num ber of rem ote and local memory accesses and d ata locality. By finding th e cross-point of two execution tim e curves, which have the communication interface overhead as variable, of the systems without and w ith ISSC, we could find when ISSC starts to yield perform ance im provement for different benchm arks and platform s. Through system atic analysis, we show th a t ISSC delivers performance gains for a wide range of applications in most of the parallel environm ents, especially in network of workstations. 8 5 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. C h a p ter 7 Conclusions and future research 7.1 Conclusions In this dissertation, a split-phased transaction caching scheme for the I-Structure- like memory system s is proposed and im plem ented as a runtim e system to exploit global data locality in the non-blocking m ultithreaded system s. Our ISSC provides a software caching m echanism to further reduce the com m unication latency by caching the split-phase transactions while m aintaining the benefits of latency tolerance in m ultithreaded execution. The ISSC design was first validated by our Generic M ultiThreaded machine (GM T) sim ulator w ith several benchm arks. Then, we im plem ented our ISSC as an user library on EA R TH systems using Threaded-C language. W ith the im plem enta tion on real m achines, we were able to m easure the overhead of the ISSC operations and measure its actu al performance w ith some benchm arks. We further developed analytical models for the for the perform ance of a m ultithreading system with and w ithout ISSC support. From these m odels, we can analyze the lower bound of com m unication interface overhead from which ISSC starts to yield perform ance gain in different benchm arks and platforms. The following contributions are achieved in this research work, 86 R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission. • Com bination of the benefits of latency tolerance and latency reduction in dis tributed m em ory m ultiprocess systems. Traditional m ultithreading models provide th e capability of latency tolerance through overlapping useful com putation w ith the long communication overhead in d istributed memory en vironm ent. Caching provides the capability of this latency reductions in the shared m em ory environm ent. However, our ISSC provides a software caching m echanism to further reduce the com m unication latency by caching the split- phase transactions while m aintaining the benefits of latency tolerance in m ul tithreaded execution in distributed m em ory m ultiprocessor systems. • Network traffic reduction to reduce com m unication overhead and network con tention. From our experim ents, we shown the effect of our ISSC on network traffic reduction. M ore than 90% of the original rem ote mem ory request are elim inated out by our ISSC. Each rem ote m em ory request needs to be sent through netw ork interface to the rem ote node, and each request will suffer from the network interface overhead four tim es. 
Therefore, our ISSC eliminates a large amount of the network interface overhead incurred by remote memory requests. Moreover, it relieves network traffic and avoids potential network contention problems.

• Harmless low-cost software implementation. ISSC is a pure software approach to exploiting global data locality without adding any hardware complexity. The design of ISSC is efficient enough to be implemented in the software layer without degrading system performance. Indeed, the overhead of ISSC by itself would drag down the system performance, but the tremendous amount of communication interface overhead saved by the ISSC not only compensates for its own overhead but also improves the overall system performance.

• Single thread performance improved by latency reduction. In some applications lacking abundant parallelism, the long communication latency may not be tolerated by enough threads. In these applications, ISSC's capability of latency reduction can improve the system performance.

• Consistent cache performance and robust fine-grain multithreaded execution on Network of Workstations platforms. The cache advance scheme in our ISSC provides adaptability to the unpredictable communication characteristics of Network of Workstations environments. This lets the system achieve the same performance without being affected by variations in the communication latency. ISSC also eliminates a tremendous amount of network interface overhead incurred by the large number of split-phase remote memory requests in fine-grain multithreaded systems. This makes fine-grain multithreaded execution more robust on NOW platforms.

• Framework for further split-phase transaction cache designs. This research established a solid foundation for further split-phase transaction cache designs. The design issues we discussed and the approaches adopted by our design provide fundamental knowledge for them. The analytical model we developed allows us to predict the performance of such caching under advanced technology improvements that may not be available today.

7.2 Future research

Several research directions can be derived from this research.

• Cache coherence protocol design and implementation for multiple-assignment split-phase transactions. The ISSC could be extended with a cache coherence protocol when multiple-assignment storage systems are required. In some applications, frequent updates of variables are desired, and using I-Structures in this kind of application may degrade the system performance because of the excessive overhead caused by frequent I-Structure deallocation and reallocation. Extending the split-phase transaction software caches with proper coherence protocols could greatly help in exploiting global data locality for all applications. Relaxing the memory construct and destruct constraints of I-Structures would give programmers full control of memory usage, and hence more flexibility in implementing applications. However, the extra overhead incurred by the cache coherence protocol needs to be evaluated in more detail. It may make the software caches less beneficial because of heavier software cache operation overhead.
Fortunately, relaxing the single-assignment constraint of the memory construct will simplify the cache design in some respects. For example, there are no more deferred reads on data elements, and therefore no deferred read handling is needed. Furthermore, the whole memory block can be brought back to the requester without checking the states of the individual data elements. All of this may still make the idea of a split-phase transaction software cache for multiple-assignment memory systems feasible and beneficial.

• Hardware support for I-Structure caching. Hardware-supported cache systems for split-phase transactions could further manifest the benefits of I-Structure caching in non-blocking multithreaded architectures. There are two different approaches to implementing a hardware-supported cache. The first approach is to use a piece of dedicated hardware. The concept of I-Structure caches is not limited to software implementations; it could be implemented in hardware as well. With a customized chip or an FPGA acting as the controller for I-Structure cache management, along with SRAM for the cached data storage, the overhead of I-Structure cache operations could be reduced dramatically, and hence the I-Structure cache would deliver performance gains, and more significant performance improvements, on all platforms. Indeed, the I-Structure memory system management could also be incorporated into this dedicated hardware to further improve the system performance. The alternative for hardware-supported I-Structure caching is to use a decoupled processor. A decoupled processor for communication and memory management has been adopted in many multiprocessor system designs. The operations of I-Structure memory access and caching could also be executed on this decoupled processor. This would off-load the I-Structure cache overhead from the other processors, which could then be dedicated to useful computation. As a matter of fact, no complex floating point operations are needed in these management jobs, and therefore a low-cost micro-controller or DSP could be used for this purpose.

• Network caching. While some researchers concentrate on the development of faster network interfaces [16, 27, 57], the concept of our split-phase transaction caches for distributed data could be integrated into next-generation network interface designs. In such a network interface, a message from the local processor requesting a remote data element would be translated into a new message requesting the whole data block containing the originally requested element. A cache block space would be reserved for the new request in the network interface before the message is injected into the network, and successive requests for elements in this block would wait in the interface without actually being sent to the remote nodes. The requested remote data element, along with the other surrounding elements in the same data block, is cached in the interface when brought back from the remote node. A cache coherence protocol should also be implemented in the design to provide general-purpose usage for parallel computing on NOW platforms. Using this next-generation network interface with the capability of network caching in NOW, network traffic could be reduced dramatically and fine-grain parallelism on NOW platforms would become possible.
• Integrating data caches into non-strict caches. The hardware-supported I-Structure cache could be further integrated with the local L2 caches. With this approach, the remote data fetched via split-phase transactions are stored only in the integrated cache. There is no local storage for remote data; therefore, all data references are made through global addresses, as in shared memory systems.

In this approach, the continuation vector carried by a split-phase fetch includes only the number of the thread which is going to consume the data, without providing a local storage address for the fetched data. When the data is brought back to the local host and stored in the integrated non-strict data cache, a signal is sent to the consuming thread to announce the arrival of the requested data item. When all its data become available in the non-strict data cache, the thread is enabled, and the data are accessed directly from the cache during execution.

To prevent the cache controller from automatically replacing data that was just fetched before the corresponding thread has actually executed, we can allow memory reservations in the local cache. When a cache line is first allocated for a missing read, the cache line is reserved and the missing read is deferred at the location corresponding to the requested element. The deferred read is pending on the data cache until the data item is brought back from the remote node and is referenced by the consuming thread. To implement this, when a remote data item arrives and is stored into the cache, if the cache element is in the deferred state, signals are sent to the consuming threads indicated by the pending reads to announce the arrival of this data item. However, these pending reads are not de-queued until the data item is actually referenced by the consuming threads. When a consuming thread starts to execute and references the data item, the pending read associated with this consuming thread is removed from the queue. The cache element is not changed to the present state until all the pending reads are removed. As long as there is any deferred cache element in the cache line, the cache line remains reserved, until all the cache elements in the line are either in the present or the empty state. A reserved cache line is not subject to replacement by the cache controller until its state has been changed to non-reserved by the runtime system. Our initial studies indicate that this property can be integrated with the support for a non-strict hardware cache with the single addition of one bit to the state representation of each cache line.

• Applying the non-blocking multithreaded execution model with split-phase transaction cache support to SMT architectures. A robust fine-grain non-blocking multithreaded execution model with ISSC support could be implemented on versatile architectures. It would be very interesting to implement it on SMT architectures [71, 54, 32, 15]. Each ready thread from the execution model has all the variables it needs locally and is guaranteed to run from start to end without synchronization, remote memory requests, or other long-latency operations inside the thread. Each ready thread is an independent entity, and threads do not interfere with each other.
This would very likely drive the SMT processors to very high throughput. Since all the variables needed by a thread are available locally, we could further bring all of the data frame memory and instruction frame memory into the caches right before the thread is scheduled for execution. All of these features of the non-blocking multithreaded execution model applied to SMT architectures could fully exploit the benefits of the SMT architecture for single-application performance.

Reference List

[1] A. Agarwal. Performance Tradeoffs in Multithreaded Processors. IEEE Transactions on Parallel and Distributed Systems, September 1992.

[2] A. Agarwal, R. Bianchini, D. Chaiken, K. L. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, K. Mackenzie, and D. Yeung. The MIT Alewife Machine: Architecture and Performance. In ISCA 95, 1995.

[3] A. Agarwal, J. Kubiatowicz, D. Kranz, B.H. Lim, D. Yeung, G. D'Souza, and M. Parkin. Sparcle: An Evolutionary Processor Design for Multiprocessors. IEEE Micro, pages 48-61, June 1993.

[4] Robert Alverson, David Callahan, Daniel Cummings, Brian Koblenz, Allan Porterfield, and Burton Smith. The Tera computer system. In Conference Proceedings, 1990 International Conference on Supercomputing, June 1990.

[5] J. N. Amaral, G. Gao, and X. Tang. An implementation of a Hopfield network kernel on EARTH. In X Brazilian Symposium on Computer Architecture and High Performance Processing, pages 223-232, Buzios, RJ, Brazil, Sept. 1998.

[6] J.N. Amaral, Z. Ruiz, S. Ryan, A. Marques, C. Morrone, and G.R. Gao. Portable Threaded-C Release 1.1. Technical Note 05, Computer Architecture and Parallel Systems Laboratory, University of Delaware, September 10, 1998.

[7] Jose Nelson Amaral and Guang R. Gao. Implementation of I-Structures as a Library of Functions in Portable Threaded-C. Technical Note 04, Computer Architecture and Parallel Systems Laboratory, University of Delaware, July 28, 1998.

[8] B. S. Ang, Arvind, and D. Chiou. StarT the Next Generation: Integrating Global Caches and Dataflow Architecture. CSG Memo 354, Laboratory for Computer Science, MIT, February 1994.

[9] Arvind, R. S. Nikhil, and K. K. Pingali. I-Structures: Data Structures for Parallel Computing. ACM Transactions on Programming Languages and Systems, October 1989.

[10] Arvind, Rishiyur S. Nikhil, and Keshav K. Pingali. I-structures: Data structures for parallel computing. ACM TOPLAS, 11(4):598-632, October 1989.

[11] Arvind and R. E. Thomas. I-structures: An efficient data structure for functional languages. Technical Report MIT/LCS/TM-178, Massachusetts Institute of Technology, Cambridge, 1981. MIT Lab. for Computer Science.

[12] D. Bailey, E. Barszcz, K. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi, S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakrishnan, and S. Weeratunga. The NAS parallel benchmarks. Technical Report RNR-94-007, RNR, March 1994.

[13] D. H. Bailey, J. T. Barton, T. A. Lasinski, and H. D. Simon. The NAS parallel benchmarks. Technical Report NASA Technical Memorandum 103863, NASA Ames Research Center, July 1993.

[14] Michael J. Beckerle. Overview of the START(*T) multithreaded computer.
In Digest of Papers, 38th IEEE Computer Society International Conference, COMPCON Spring '93, Feb. 1993.

[15] M. Bekerman et al. Performance and hardware complexity tradeoffs in designing multithreaded architectures. In Proceedings of Parallel Architectures and Compilation Techniques, 1996.

[16] M. A. Blumrich, K. Li, R. Alpert, C. Dubnicki, and E. Felten. Virtual Memory Mapped Network Interface for the SHRIMP Multicomputer. In Proceedings of the 21st Annual International Symposium on Computer Architecture, 1994.

[17] D. Cann. Compilation techniques for high performance applicative computation. Technical Report CS-89-108, Colorado State University, 1989.

[18] D. Cann and J. Feo. Sisal 1.2: An Alternative to FORTRAN for Shared Memory Multiprocessors. Technical Report UCRL-102263, Lawrence Livermore National Laboratory, 1989. Rev. 1, for ACM SIGPLAN '90.

[19] D. Cann and R. Oldehoeft. A guide to the optimizing Sisal compiler. Technical Report UCRL-MA-108369, Lawrence Livermore National Laboratory, Sep. 1991.

[20] David Chaiken, John Kubiatowicz, and Anant Agarwal. LimitLESS directories: A scalable cache coherence scheme. In Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 224-234, April 8-11, 1991.

[21] D. Culler, R. Karp, D. Patterson, A. Sahay, K. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a Realistic Model of Parallel Computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, May 1993.

[22] D. E. Culler, A. Sah, K. E. Schauser, T. von Eicken, and J. Wawrzynek. A compiler-controlled threaded abstract machine. In Proceedings of ASPLOS-IV, April 1991.

[23] David E. Culler, Seth Copen Goldstein, Klaus Erik Schauser, and T. von Eicken. Empirical study of a dataflow language on the CM-5. In G.R. Gao, L. Bic, and J-L. Gaudiot, editors, Advanced Topics in Dataflow Computing and Multithreading, pages 187-210. IEEE Press, 1994.

[24] F. Darema, D.A. George, V.A. Norton, and G.F. Pfister. A single-program-multiple-data computational model for EPEX/FORTRAN. Parallel Computing, 7:11-24, April 1988.

[25] J. B. Dennis and G. R. Gao. On Memory Models and Cache Management for Shared-Memory Multiprocessors. CSG Memo 363, Laboratory for Computer Science, MIT, March 1995.

[26] Jack B. Dennis. The Paradigm Compiler: Mapping a functional language for the Connection Machine. In Scientific Applications of the Connection Machine, pages 301-315, 1989.

[27] C. Dubnicki, A. Bilas, Y. Chen, S. Damianakis, and K. Li. VMMC-2: Efficient Support for Reliable, Connection-Oriented Communication. In Proceedings of the Hot Interconnects Symposium V, August 1997.

[28] Michel Dubois and Faye A. Briggs. Effects of Cache Coherence in Multiprocessors. In Proceedings of the 9th Annual Symposium on Computer Architecture, pages 299-308, May 1982.

[29] Guang R. Gao. An Efficient Hybrid Dataflow Architecture Model. Journal of Parallelism, 19(4), December 1993.

[30] Guang R. Gao, Herbert H. J. Hum, and Yue-Bong Wong. Parallel Function Invocation in a Dynamic Argument-Fetching Dataflow Architecture. In Proc. of PARBASE-90: Intl. Conf.
on Databases, Parallel Architectures, and their Applications, Miami Beach, Florida, pages 112-116, March 1990.

[31] J-L. Gaudiot and C-T. Cheng. A Scalable Cache Design for I-Structures. In Proceedings of the International Conference on Parallel Processing, Aug. 1996.

[32] M. Gulati and N. Bagherzadeh. Performance study of a multithreaded superscalar microprocessor. In Proceedings of Int'l Symp. on High-Performance Computer Architecture, 1996.

[33] Robert H. Halstead Jr. and Tetsuya Fujita. MASA: A multithreaded processor architecture for parallel symbolic computing. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 443-451, 1988.

[34] J. Hicks, D. Chiou, B. S. Ang, and Arvind. Performance Studies of Id on the Monsoon Dataflow System. Journal of Parallel and Distributed Computing, pages 273-300, 1993.

[35] High Performance Fortran Forum. High-performance Fortran language specification. Technical report, Rice University, May 1993.

[36] Seema Hiranandani, Ken Kennedy, and Chau-Wen Tseng. Compiler optimizations for Fortran D on MIMD distributed-memory machines. In Proceedings of Supercomputing '91, pages 86-100, Nov. 1991.

[37] H. H.J. Hum, O. Maquelin, K. B. Theobald, X. Tian, X. Tang, G. Gao, P. Cupryk, N. Elmasri, L. J. Hendren, A. Jimenez, S. Krishnan, A. Marquez, S. Merali, S. S. Nemawarkar, P. Panangaden, X. Xue, and Y. Zhu. A Design Study of the EARTH Multiprocessor. In PACT 95, June 1995.

[38] Herbert Hing-Jing Hum. The Super-Actor Machine: a Hybrid Dataflow/von Neumann Architecture. PhD thesis, School of Computer Science, McGill University, Montreal, Quebec, 1992.

[39] Robert A. Iannucci. A Dataflow/von Neumann Hybrid Architecture. PhD thesis, Massachusetts Institute of Technology, July 1988.

[40] Robert A. Iannucci. Toward a dataflow/von Neumann hybrid architecture. In Proceedings of the 15th Annual International Symposium on Computer Architecture, pages 131-140, May 1988.

[41] V. Karamcheti and A. Chien. Software overhead in messaging layers: Where does the time go? In Proceedings of the 6th ACM International Conference on Architectural Support for Programming Languages and Systems (ASPLOS VI), Oct. 5-7, 1994.

[42] M. Katevenis. Reduced instruction set computer architectures for VLSI. PhD thesis, Comput. Sci. Division (EECS), UCB/CSD 83/141, Univ. of California at Berkeley, Oct. 1983.

[43] K. M. Kavi, A.R. Hurson, P. Patadia, E. Abraham, and P. Shanmugam. Design of Cache Memories for Multi-Threaded Dataflow Architecture. In ISCA 95, pages 253-264, 1995.

[44] K. Keeton, T. Anderson, and D. Patterson. LogP Quantified: The Case for Low-Overhead Local Area Networks. In Hot Interconnects III: A Symposium on High Performance Interconnects, August 1995.

[45] Kathleen Knobe, Joan D. Lukas, and Guy L. Steele, Jr. Data optimization: Allocation of arrays to reduce communication on SIMD machines. Journal of Parallel and Distributed Computing, 8(2):102-118, February 1990.

[46] Yuetsu Kodama et al. A prototype of a highly parallel dataflow machine EM-4 and its preliminary evaluation. In Proceedings of Info Japan 90, pages 291-298, October 1990.

[47] Yuetsu Kodama et al.
EMC-Y: Parallel processing element optimizing communication and computation. In Conference Proceedings, 1993 International Conference on Supercomputing, pages 167-174, July 1993.

[48] Charles Koelbel. Compile-time generation of communications for scientific programs. Technical Report CRPC-TR91089, Center for Research on Parallel Computation, Rice University, January 1991.

[49] James T. Kuehn and Burton J. Smith. The Horizon supercomputing system: Architecture and software. In Proceedings of Supercomputing '88, pages 28-34, November 1988.

[50] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, and J. Hennessy. The Stanford FLASH Multiprocessor. In ISCA 94, 1994.

[51] Wen-Yen Lin and Jean-Luc Gaudiot. I-Structure Software Caches - A split-phase transaction runtime cache system. In Proceedings of the 1996 Parallel Architectures and Compilation Techniques Conference, Oct. 1996.

[52] Wen-Yen Lin and Jean-Luc Gaudiot. Exploiting Global Data Locality in Non-Blocking Multithreaded Architectures. In Proceedings of the Third International Symposium on Parallel Architectures, Algorithms and Networks, Dec. 1997.

[53] Wen-Yen Lin and Jean-Luc Gaudiot. The Design of An I-Structure Software Cache System. In Workshop on Multithreaded Execution, Architecture and Compilation, 1998. Held in conjunction with HPCA-4, Feb. 1998.

[54] M. Loikkanen and N. Bagherzadeh. A fine-grain multithreading superscalar architecture. In Proceedings of Parallel Architectures and Compilation Techniques, 1996.

[55] O. C. Maquelin, H. H.J. Hum, and G. R. Gao. Costs and Benefits of Multithreading with Off-the-Shelf RISC Processors. In Proceedings of EURO-PAR '95, August 1995.

[56] J. R. McGraw et al. SISAL: Streams and Iteration in a Single Assignment Language - language reference manual version 1.2. Technical Report M-146, Lawrence Livermore National Laboratory, 1985.

[57] S. S. Mukherjee, B. Falsafi, M. D. Hill, and D. A. Wood. Coherent Network Interfaces for Fine-Grain Communication. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, 1996.

[58] R. S. Nikhil and Arvind. Can dataflow subsume von Neumann computing? In Proceedings of ISCA-16, May-Jun 1989.

[59] Rishiyur S. Nikhil and Arvind. Id: a language with implicit parallelism. CSG Memo 305, Computation Structures Group, 1990.

[60] H. Nishikawa, H. Terada, S. Komori, K. Shima, T. Okamoto, and S. Miyata. Architecture of a VLSI-Oriented Data-Driven Processor: the Q-v1. In J-L. Gaudiot and L. Bic, editors, Advanced Topics in Data-Flow Computing. Prentice Hall, 1991.

[61] Michael D. Noakes, Deborah A. Wallach, and William J. Dally. The J-Machine multicomputer: An architectural evaluation. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 224-235, May 1993.

[62] G.M. Papadopoulos. Implementation of a General-purpose Dataflow Multiprocessor. PhD thesis, Laboratory for Computer Science, MIT, August 1988.

[63] G.M. Papadopoulos. Implementation of a General-purpose Dataflow Multiprocessor. The MIT Press, 1991.

[64] G.M. Papadopoulos and D. Culler. Monsoon: An Explicit Token-Store Architecture.
In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 82-91, June 1990.

[65] D. Patterson and C. Sequin. A VLSI RISC. IEEE Computer Mag., 15(9):8-21, Sept. 1982.

[66] S. Rodrigues, T. Anderson, and D. Culler. High-Performance Local Area Communication With Fast Sockets. In USENIX 1997 Annual Technical Conference, Jan 1997.

[67] L. Roh and W. A. Najjar. Design of Storage Hierarchy in Multithreaded Architectures. In IEEE Proceedings of MICRO-28, pages 271-278, 1995.

[68] Shuichi Sakai et al. An architecture of a dataflow single chip processor. In Proceedings of the 16th Annual International Symposium on Computer Architecture, pages 46-53, May 1989.

[69] Kevin B. Theobald. EARTH - An Efficient Architecture for Running THreads. PhD thesis, School of Computer Science, McGill University, Montreal, Quebec, 1999.

[70] Kevin B. Theobald, Jose Nelson Amaral, Gerd Heber, Olivier Maquelin, Xinan Tang, and Guang R. Gao. Overview of the Portable Threaded-C language. CAPSL Technical Memo 19, University of Delaware, http://www.capsl.udel.edu, March 16, 1998.

[71] D.M. Tullsen, S.J. Eggers, and H.M. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In 22nd Ann. Int'l Symp. on Computer Architecture, 1995.

[72] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active messages: a mechanism for integrated communication and computation. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 256-266, May 19-21, 1992.

[73] X3J3. FORTRAN 90, draft of the international standard. The FORTRAN Technical Committee of ANSI, 1990.

[74] N. Yoo. Generic MultiThreaded machine (GMT) simulator. Computer engineering technical report, Department of Electrical Engineering - Systems, University of Southern California, December 1993.

[75] Hans P. Zima, Heinz-J. Bast, and Michael Gerndt. SUPERB: a tool for semi-automatic MIMD/SIMD parallelization. Parallel Computing, 6:1-18, January 1988.

Appendix A

ISSC's Implementation on EARTH using Threaded-C Language

The complete definitions of the data structures for the I-Structure Software Cache (ISSC) implementation are given in the "issc.h" header file. The ISSC operations are also declared in that header file.

A.1 ISSC Structure

/* include the data structures defined for i-struct */
#include "i-struct.h"

/* total of 16K words by default */
#define CacheBlockSize 8
#define CacheSetSize 8
#define NofCacheSet 256

int NI_DELAY;     /* parameter for add-on network interface delay */
int NET_LATENCY;  /* parameter for add-on network latency */

int CNofRead;
int CNofHit;
int CNofDeferred;
int CNofRemoteHit;
int CNofRemoteMiss;

/********** Data Structure definition for Software Cache ************
 * Cache Block, is the basic unit of Cache access.
 *   "g_array", is the array ID (now using the beginning
 *              address of the referred I-Structure array).
 *   "b_index", is the index of the first cache element in
 *              the original I-Structure array.
 *   "handler", the array which stores the sync. slot
 *              of the deferred request service handler for
 *              each element, to handle the pending
 *              requests.
 * In a cache block,
 *   if ((deferred_flag==0) && (tag==NULL)) {
 *       means an empty cache block
 *   } else if ((deferred_flag==1) && (tag!=NULL)) {
 *       means the request for this cache block is on
 *       going.
 *   } else if ((deferred_flag==0) && (tag!=NULL)) {
 *       means this cache block is a valid cache
 *       block.
 *   } else means in error state
 *
 *   If (reserved==1) {
 *       means that this cache block has at least one
 *       deferred request, so that it can not be
 *       replaced.
 *   } else means this cache block is free to be
 *       replaced.
 *
 * Cache Set, contains the index of the next victim when a cache
 *            replacement is needed.
 *
 * Cache, contains several performance measurement variables and
 *        an array of CacheSet.
 *        The location of the CacheSet to be checked is indexed by a
 *        simple hash function.
 ********************************************************************/

typedef struct CacheBlock_str CacheBlock;
struct CacheBlock_str {
    char deferred_flag;
    char referenced;
    char type;
    unsigned long g_array;
    int b_index;
    array_cell element[CacheBlockSize];
    SPTR handler[CacheBlockSize];
};

typedef struct CacheSet_str CacheSet;
struct CacheSet_str {
    int victim;
    CacheBlock block[CacheSetSize];
};

typedef struct Cache_str Cache;
struct Cache_str {
    int NofRead;
    int NofLocal;
    int NofHit;
    int NofDeferred;
    int NofInitMiss;
    int NofReplaced;
    int NofPassed;
    int NofWrite;
    int NofRemoteHit;
    int NofRemoteMiss;
    double hit_time;
    double miss_time;
    double tag_time;
    double service_time;
    CacheSet set[NofCacheSet];
};

enum { B=0, S, L, F, D, G, BL };

/* Software Cache space is allocated here */
Cache *cache;

A.2 ISSC Operations

The following functions define the ISSC operations, invoked either by the user program or from within other ISSC operations.

/* Flushing ISSC */
THREADED FlushCache(SPTR done);

/* ISSC initialization function */
THREADED InitCache(int ni_delay, SPTR done);

/* These threaded functions should be invoked locally */

/* I-Structure element fetch using ISSC */
THREADED SC_I_READ(int i_node, int iid, int index, int type,
                   void *GLOBAL place, SPTR slot_adr);

/* I-Structure block fetch using ISSC */
THREADED SC_I_READ_BLOCK(int i_node, int iid, int index, int block_size,
                         void *GLOBAL place, SPTR slot_adr);

/* This function is called by the software cache library to be
   invoked on the I-Structure node */
THREADED I_BLKMOV_RSYNC(int iid, int index, void *GLOBAL c_block,
                        void *GLOBAL place, int block_size,
                        unsigned long g_array, SPTR slot_adr);
THREADED I_BLKMOVBLOCK_RSYNC(int iid, int index, void *GLOBAL c_block,
                             void *GLOBAL place, void *GLOBAL data_buf,
                             int block_size, unsigned long g_array,
                             SPTR slot_adr);

THREADED Block_Handle(int i_node, int iid, int index, int type,
                      int set_no, int block_no, int element_no);

THREADED Block_Handle_BLOCK(int i_node, int iid, int index, int block_size,
                            int set_no, int block_no, int element_no);

THREADED Single_Handle(int type, int set_no, int block_no,
                       int element_no, SPTR done);

THREADED Deferred_Server(int type, int set_no, int block_no,
                         int element_no);

THREADED Deferred_Server_BLOCK(int block_size, int set_no, int block_no,
                               int element_no);

THREADED Or_Deferred_Server(int type, int set_no, int block_no,
                            int element_no);

THREADED Or_Deferred_Server_BLOCK(int block_size, int set_no,
                                  int block_no, int element_no);

/* The following functions are used to add the network
   interface overhead */
THREADED BLOCK_I_READ(int i_node, int iid, int index, int type,
                      void *GLOBAL place);

THREADED NOW_I_READ(int i_node, int iid, int index, int type,
                    void *GLOBAL place, SPTR slot_adr);

THREADED NOW_I_READ_TEST(int i_node, int iid, int index, int type,
                         void *GLOBAL place, SPTR slot_adr);

THREADED NOW_I_WRITE_B(int i_node, int iid, int index, char value);

THREADED NOW_I_WRITE_S(int i_node, int iid, int index, short int value);

THREADED NOW_I_WRITE_L(int i_node, int iid, int index, long int value);

THREADED NOW_I_WRITE_F(int i_node, int iid, int index, float value);

THREADED NOW_I_WRITE_D(int i_node, int iid, int index, double value);

THREADED NOW_I_WRITE_G(int i_node, int iid, int index, void *GLOBAL value);

THREADED NOW_I_WRITE_BLOCK_SYNC(int i_node, int iid, int index,
                                void *GLOBAL origin, SPTR slot_adr);

THREADED NOW_GET_RSYNC(void *GLOBAL src, void *GLOBAL dest, int type,
                       SPTR slot_adr);

void delay(int delay_par);

THREADED Print_Cache_Util(SPTR done);

THREADED Gather_Cache_Stat(int if_cache, int dim, int delay_time,
                           int exec_time);

Appendix B

Using ISSC with the Hopfield Benchmark

The following program shows how the Hopfield benchmark is implemented in Threaded-C with the support of I-Structures and ISSC. A makefile is also shown, to illustrate how the program is compiled and how to enable the ISSC option.

B.1 Hopfield Benchmark

/********************************************************************
 *
 * hopfield - An implementation of a Hopfield kernel in Threaded-C.
 *
 * Author: Jose Nelson Amaral <amaral@capsl.udel.edu>
 *         Computer Architecture and Parallel Systems Laboratory
 *         (http://www.capsl.udel.edu) - University of Delaware
 *
 * Purpose: Implement a solution for a Hopfield network kernel
 *          demonstrating the use of the I-STRUCTURES in a
 *          synchronization mechanism.
 *
 * Release Date: June 11 1998
 *
 * Revised by Wen-Yen Lin <wenyenl@usc.edu>
 *
 * Purpose: Implement a larger number of neurons suitable for testing
 *          of I-Structure and I-Structure Software Caches.
 *
 * Release Date: April 29 1999
 *
 ********************************************************************/

#include <stdio.h>

/*
 * Because the implementation of the I-structure library uses
 * conditional pre-processing, the user must include the empty
 * definition of EXTERN in the file that contains the MAIN function
 * before the file i-struct.h is included. This line must not be
 * present in any other files.
 */
#define EXTERN
#include "issc.h"

Cache *cache;

/* Cyclic 1 distribution macros */
#define OWNER(index)     ((index) % NUM_NODES)
#define POSITION(index)  ((index) / NUM_NODES)
#define IMAP(node, pos)  ((pos)*NUM_NODES + (node))

/* Block distribution macros */
/*
#define OWNER(index)     ((index) / ((DIM*DIM)/NUM_NODES))
#define POSITION(index)  ((index) % ((DIM*DIM)/NUM_NODES))
#define IMAP(node, pos)  ((node)*((DIM*DIM)/NUM_NODES) + (pos))
*/

#define STOPPING_CRITERIUM 0.0001

int NET_SIZE;

/* float synapse[NET_SIZE]; */
float *synapse;

/* I-Structure IDs for neuron activations */
float *i_old, *i_new, *temp;

float change_of_state;

/* Declare 2 arrays here for storing the activation */
float Array1[256];
float Array2[256];

/********************************************************************
 *
 * Adds the value of all the changes of state in each neuron, and
 * synchronizes the slot specified by done. This function is
 * invoked by the function activation_update and effectively
 * synchronizes the completion of the function execution.
 *
 ********************************************************************/
THREADED compute_change_of_state(float change, SPTR done)
{
    change_of_state += change;
    RSYNC(done);
    END_FUNCTION();
}

/********************************************************************
 * THREAD_0 allocates the memory necessary for the a_old array.
 * THREAD_0 is also responsible for invoking the I_READ_F function
 * for each element of the I-structure array.
 *
 * Because the sync slot is initialized with num_neurons,
 * THREAD_1 is not spawned until all the read operations are
 * serviced. THREAD_1 computes the new activation for the
 * neuron NODE_ID, invokes the I_WRITE_F function to
 * write this new activation value to the "new" I-structure,
 * computes the square of the amount of change in the activation
 * value, and reports this change to node 0 by invoking the function
 * compute_change_of_state(). This latter function synchronizes the
 * sync slot done to signal that the activation update is
 * complete.
 ********************************************************************/
THREADED activation_update(int neuron_id, float *new, float *old, SPTR done)
{
    SLOT SYNC_SLOTS[2];
    float *a_old;
    float activation;
    float change;
    int i;

    INIT_SYNC(0, NET_SIZE, NET_SIZE, 1);

#ifdef DEBUG
    fprintf(stderr, "Activation Update for neuron#%d...\n", neuron_id);
#endif

    a_old = (float *) malloc(NET_SIZE*sizeof(float));

    for(i=0 ; i<NET_SIZE ; i++) {
/*
#ifdef CACHE
        INVOKE(NODE_ID, SC_I_READ, OWNER(i), old, POSITION(i), F,
               TO_GLOBAL(&a_old[i]), SLOT_ADR(0));
#else
        INVOKE(NODE_ID, NOW_I_READ, OWNER(i), old, POSITION(i), F,
               TO_GLOBAL(&a_old[i]), SLOT_ADR(0));
#endif
*/
        INVOKE(NODE_ID, NOW_GET_RSYNC,
               MAKE_GPTR((float *)old+POSITION(i), OWNER(i)),
               TO_GLOBAL(&a_old[i]), F, SLOT_ADR(0));
    }
    END_THREAD();

THREAD_1:
#ifdef DEBUG
    fprintf(stderr, "Node %d - activation update: Spawned THREAD_1.\n",
            NODE_ID);
#endif
    activation = 0;
    for(i=0 ; i<NET_SIZE ; i++)
        activation += synapse[POSITION(neuron_id) * NET_SIZE+i] * a_old[i];
#ifdef DEBUG
    fprintf(stderr, "Node %d updates activation=%f\n", NODE_ID, activation);
#endif
    activation = (activation > 0.0) ? +1.0 : -1.0;
#ifdef DEBUG
    fprintf(stderr, "Node %d: activation turned to =%f\n", NODE_ID,
            activation);
#endif
/*
    INVOKE(OWNER(neuron_id), I_WRITE_F, new, POSITION(neuron_id),
           activation);
*/
    INIT_SYNC(1, 1, 1, 2);
    DATA_RSYNC_F(activation,
                 MAKE_GPTR((float *)new + POSITION(neuron_id),
                           OWNER(neuron_id)),
                 SLOT_ADR(1));
#ifdef DEBUG
    fprintf(stderr, "Old activation=%f\n", a_old[neuron_id]);
#endif
    change = (activation - a_old[neuron_id]);
    change = change*change;
#ifdef DEBUG
    fprintf(stderr, "change=%f\n", change);
#endif
    INVOKE(0, compute_change_of_state, change, done);
    free(a_old);
    END_THREAD();

THREAD_2:
    END_FUNCTION();
}

THREADED InitGlobal(int dim, int ni_delay, SPTR done)
{
    int i, j;

    NET_SIZE = dim;
    NI_DELAY = ni_delay;

    /* Allocate and initialize synapses array */
    synapse = (float *)malloc(NET_SIZE * NET_SIZE / NUM_NODES
                              * sizeof(float));
    for(i=0; i<(NET_SIZE/NUM_NODES); i++) {
        for(j=0; j<NET_SIZE; j++) {
            synapse[i*NET_SIZE + j] = 0.01*(IMAP(NODE_ID,i)+1)*j;
        }
    }
#ifdef DEBUG
    printf("\nNI_DELAY = %d\n", ni_delay);
#endif /* DEBUG */
    RSYNC(done);
    END_FUNCTION();
}

THREADED LocalAllocate(int num, SPTR done)
{
/*
    INVOKE(NODE_ID, I_ALLOCATE, num/NUM_NODES, TO_GLOBAL(&i_old),
           SLOT_ADR(0));
    INVOKE(NODE_ID, I_ALLOCATE, num/NUM_NODES, TO_GLOBAL(&i_new),
           SLOT_ADR(0));
*/
    i_old = Array1;
    i_new = Array2;
    RSYNC(done);
    END_FUNCTION();
}

THREADED RESET_I_NEW(SPTR done)
{
    float *temp;

    temp = i_old;
    i_old = i_new;
    i_new = temp;
    RSYNC(done);
    END_FUNCTION();
}

THREADED MAIN(int argc, char** argv)
{
    SLOT SYNC_SLOTS[5];
/*
    void *GLOBAL i_old;
    void *GLOBAL i_new;
    void *GLOBAL temp;
*/
    float *final;
    int i, par_no;
    int flip;
    int num_neurons;
    int num_iter;
    unsigned long t1, t2, dt1, dt2, delay_time;

    NI_DELAY = 0;
    NET_LATENCY = 0;
    NET_SIZE = 16;
    if (argc > 1) {
        par_no = 0;
        while(par_no < argc) {
            if(!strcmp(argv[par_no], "-ni")) {
                sscanf(argv[par_no+1], "%d", &NI_DELAY);
                par_no = par_no+2;
            } else if (!strcmp(argv[par_no], "-nl")) {
                sscanf(argv[par_no+1], "%d", &NET_LATENCY);
                par_no = par_no+2;
            } else if (!strcmp(argv[par_no], "-d")) {
                sscanf(argv[par_no+1], "%d", &NET_SIZE);
                par_no = par_no+2;
            } else {
                par_no++;
            }
        }
    }
    num_neurons = NET_SIZE;

    INIT_SYNC(0, NUM_NODES, NUM_NODES, 1);
    INIT_SYNC(1, NUM_NODES, NUM_NODES, 2);
    INIT_SYNC(2, num_neurons, NUM_NODES, 3);
    INIT_SYNC(3, num_neurons, num_neurons, 4);
    INIT_SYNC(4, num_neurons, num_neurons, 5);

    final = (float *)malloc(num_neurons*sizeof(float));

    for(i=0; i<NUM_NODES; i++)
        INVOKE(i, InitGlobal, NET_SIZE, NI_DELAY, SLOT_ADR(0));
    END_THREAD();

THREAD_1:
#ifdef DEBUG
    fprintf(stderr, "MAIN: Allocating i_old and i_new\n");
#endif
    /* Allocates two I-Structures i_old and i_new on each node */
    for(i=0; i<NUM_NODES; i++)
        INVOKE(i, LocalAllocate, num_neurons, SLOT_ADR(1));
    END_THREAD();

THREAD_2: /* synchronized by I_ALLOCATEs of i_old and i_new */
#ifdef DEBUG
    fprintf(stderr, "MAIN: Initializing i_old.\n");
#endif
    flip = -1.0;
    for(i=0 ; i<num_neurons ; i++) {
        flip = -1.0*flip;
/*
        INVOKE(OWNER(i), I_WRITE_F, i_old, POSITION(i),
               flip*0.01*(float)(i+1));
*/
        DATA_RSYNC_F(flip*0.01*(float)(i+1),
                     MAKE_GPTR((float *)i_old+POSITION(i), OWNER(i)),
                     SLOT_ADR(2));
    }
    num_iter = 0;
    dt1 = ct_read();
    delay(NI_DELAY);
    dt2 = ct_read();
    delay_time = (dt2-dt1)/25;
    END_THREAD();

THREAD_3:
#ifdef DEBUG
    fprintf(stderr, "MAIN: Activation Update\n");
#endif
    if(num_iter == 0) t1 = ct_read();
    num_iter++;
    change_of_state = 0.0;
    for(i=0 ; i<num_neurons ; i++)
        INVOKE(OWNER(i), activation_update, i, i_new, i_old, SLOT_ADR(3));
    END_THREAD();

THREAD_4:
#ifdef DEBUG
    fprintf(stderr, "MAIN: Criterium check.\n");
#endif
/*
    temp = i_old;
    i_old = i_new;
    i_new = temp;
*/
#ifdef DEBUG
    fprintf(stderr, " => change_of_state = %f\n", change_of_state);
#endif
    if (change_of_state > STOPPING_CRITERIUM) {
        for(i=0; i<NUM_NODES; i++)
            INVOKE(i, RESET_I_NEW, SLOT_ADR(2));
    } else {
        t2 = ct_read();
/*
        for(i=0 ; i<NUM_NODES; i++)
            INVOKE(i, I_DELETE, i_old);
*/
        for(i=0 ; i<num_neurons ; i++)
/*
            INVOKE(OWNER(i), I_READ_F, i_new, POSITION(i),
                   TO_GLOBAL(&final[i]), SLOT_ADR(4));
*/
            INVOKE(NODE_ID, NOW_GET_RSYNC,
                   MAKE_GPTR(i_new+POSITION(i), OWNER(i)),
                   TO_GLOBAL(&final[i]), F, SLOT_ADR(4));
    }
    END_THREAD();

THREAD_5:
#ifdef DEBUG
    fprintf(stderr, "MAIN: Finishing.\n");
#endif
    free(final);
    printf("Number of iterations=%d\n", num_iter);
    printf("Execution time = %dus\n", (t2-t1)/25);
    CALL(Gather_Cache_Stat, -1, NET_SIZE, delay_time, (t2-t1)/25);
    RETURN();
}

B.2 Makefile

CC = etcc
CFLAGS = -O4
TARGET = -target manna-spn
INCLUDE = ~wenyenl/lib/i-struct
LIB = /m/capslguests/wenyenl/lib/i-struct/i-struct.o \
      /m/capslguests/wenyenl/lib/i-struct/issc.o

all: hopfield.is hopfield.issc

hopfield.is: hopfield1.c
	$(CC) $(CFLAGS) -I$(INCLUDE) $(TARGET) $(LIB) -o hopfield.is hopfield1.c

hopfield.issc: hopfield1.c
	$(CC) $(CFLAGS) -I$(INCLUDE) $(TARGET) $(LIB) -DCACHE -o hopfield.issc hopfield1.c

clean:
	-rm -f *.o core hopfield.is hopfield.issc
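As a closing illustration, the fragment below sketches how a user thread could fetch a single I-Structure element through ISSC, following the (commented-out) read pattern in the benchmark above; building with -DCACHE, as the hopfield.issc target does, selects the cached path. This sketch is not part of the original distribution: the function name fetch_one, the slot numbering, and the local variable names are illustrative assumptions, while SC_I_READ, NOW_I_READ, and the INVOKE/TO_GLOBAL/SLOT_ADR idioms are taken from the code in this appendix.

/* Illustrative sketch only: fetch element `index` of the I-Structure
 * array `arr` (owned by node i_node) and consume it in THREAD_1 once
 * the sync slot fires.  Assumes issc.h is included and the runtime
 * conventions shown in the Hopfield benchmark above.                 */
THREADED fetch_one(int i_node, float *arr, int index, SPTR done)
{
    SLOT SYNC_SLOTS[1];
    float value;

    INIT_SYNC(0, 1, 1, 1);   /* enable THREAD_1 after one read completes */
#ifdef CACHE
    /* Cached path: the request may be satisfied locally by ISSC. */
    INVOKE(NODE_ID, SC_I_READ, i_node, arr, index, F,
           TO_GLOBAL(&value), SLOT_ADR(0));
#else
    /* Uncached path: plain split-phase remote read. */
    INVOKE(NODE_ID, NOW_I_READ, i_node, arr, index, F,
           TO_GLOBAL(&value), SLOT_ADR(0));
#endif
    END_THREAD();

THREAD_1:
    /* value now holds the fetched element. */
    RSYNC(done);
    END_FUNCTION();
}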