VECTORIZED INTERPROCESSOR COMMUNICATION AND DATA MOVEMENT IN SHARED-MEMORY MULTIPROCESSORS

by

Dhabaleswar Kumar Panda

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (Computer Engineering)

December 1991

Copyright 1991 Dhabaleswar Kumar Panda

UMI Number: DP22828. All rights reserved. INFORMATION TO ALL USERS: The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion. Dissertation Publishing, UMI DP22828. Published by ProQuest LLC (2014). Copyright in the Dissertation held by the Author. Microform Edition (c) ProQuest LLC. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code. ProQuest LLC, 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106-1346.

UNIVERSITY OF SOUTHERN CALIFORNIA, THE GRADUATE SCHOOL, UNIVERSITY PARK, LOS ANGELES, CALIFORNIA 90089-4015. This dissertation, written by Dhabaleswar Kumar Panda, under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree DOCTOR OF PHILOSOPHY. Dean of Graduate Studies, 1991. Dissertation Committee: Chairperson.

DEDICATED TO my loving parents, Bapa and Maa, for all their encouragement and sacrifices.

Acknowledgements

First of all, I would like to thank my thesis advisor, Prof. Kai Hwang, for all his guidance, encouragement, and support. It has been a great learning experience to work with him as a research and teaching assistant. His continuous encouragement and critical feedback on my research ideas have always driven me to accept challenging problems and to provide simple solutions to them. His strong confidence in me and his expectation of seeing me as an established researcher have made me work harder throughout my Ph.D. student life. My deepest gratitude for having him as my teacher and a good friend.

I am indebted to Prof. Viktor Prasanna for giving me the opportunity to pursue my Ph.D. career at USC, providing me with initial support and guidance during my early work for the Ph.D., exposing me to the field of theoretical computer science, and building a strong research foundation with me. I am also indebted to my thesis committee members Prof. Michel Dubois, Prof. C. S. Raghavendra, and Prof. Ming-Deh Huang for their help, criticism, ideas, and suggestions. I would like to thank the members of the VISCOM project team for their critical comments on my initial ideas and work, which later gave rise to this thesis.

Being an international student, life during graduate study comes with frustration, agony, and pain. I was fortunate to have many friends who came forward when needed to extend help in spite of their own difficulties. This friendly help and these discussions have made my USC life full of memorable moments. My sincere gratitude to all these friends: Suresh, Rajagopal, Sunanda, Saumya, Sharad, Rachna, Meera, Santosh, Mao, Rajendra, and Aarti. This thesis would remain incomplete without mentioning my deep gratitude to Lucille Stivers and Gandhi Puvvada.
I am very indebted to them for providing help that nobody other than family members would have given.

I gratefully acknowledge the financial support provided by NSF grant No. MIPS 89-04172.

My parents and family members have provided continuous encouragement and moral support during my higher studies, including my stay in the USA. I feel very delighted to fulfill their dream of seeing me with a doctorate degree. It has been very memorable to have Debashree come into my life, two and a half months before my graduation, as my life partner. It has been her love, affection, companionship, and understanding that gave this thesis work a big push in its later stages. I feel very happy to share the joy of writing this thesis with my wife, parents, and other family members.

Contents

Dedication
Acknowledgements
List of Figures
List of Tables
Abstract

1 Introduction
1.1 Data Movement in Multiprocessors
1.2 Vectorized Memory Access
1.3 On-the-fly Data Manipulation
1.4 Communication Vectorization
1.5 Organization of the Thesis

2 Interleaved Memory Access in Multiprocessors
2.1 Introduction
2.2 Three Multiprocessor Configurations
2.2.1 Bus-based Multiprocessor
2.2.2 Crossbar-Connected Multiprocessor
2.2.3 Orthogonally-connected Multiprocessor
2.3 Data Allocation in Shared Memory
2.4 Interleaved Memory Access Types

3 Data Manipulation Hardware
3.1 Introduction
3.2 Vector Register Windows
3.2.1 Organization
3.2.2 Reconfigurability
3.2.3 Data Coherency
3.3 Index Manipulator
3.3.1 Organization
3.3.2 Interleaved-Read-Write with On-the-fly Indexing
3.4 Parallel Data Movement and Manipulation
3.5 Software Interface and Programmability
4 Fast Data Manipulation with On-the-fly Indexing
4.1 Introduction
4.2 Equivalence to a Generalized Network
4.3 Data Movement Cost Model
4.3.1 Similarity Between OMP and MCC
4.3.2 Using Clos Network to Analyze Cost
4.3.3 A Modified Clos Network
4.4 Data Movement Complexity Analysis
4.4.1 Comparing OMP with CCM
4.4.2 Reduction of Data Movement Steps on OMP
4.4.3 Comparing OMP with MCC
4.5 Simulation Experiments and Results
4.5.1 A CSIM-based Multiprocessor Simulator
4.5.2 Simulation Experiments
4.5.3 Simulation Results and Implications

5 Vectorized Interprocessor Communication
5.1 Introduction
5.2 From Message Passing to Shared-Variable Communications
5.2.1 Message-Passing Steps
5.2.2 Primitive Message-Passing Patterns
5.2.3 Shared-Variable Mailbox Communication
5.3 Vectorizing Communication Patterns
5.3.1 Operational Digraphs
5.3.2 The Vectorization Procedure
5.3.3 An Example Program Conversion
5.4 Communication Bandwidth

6 Program Conversion from Multicomputers to Multiprocessors
6.1 Introduction
6.2 Converting Hypercube Programs
6.2.1 Vectorizing Communication Patterns
6.2.2 Hypercube Communication Complexity
6.2.3 Vectorized Communication Complexity
6.2.4 Reduction in Communication Complexity
6.3 Simulation Experiments and Results
6.3.1 Simulation Experiments Performed
6.3.2 Simulation Results and Implications
6.3.3 Tradeoffs in Computations and Communications
6.4 Converting Mesh Programs
6.4.1 Mesh with Boundary Wrap-around
6.4.2 Mesh with Generalized Wrap-around

7 Conclusions and Suggested Future Research
7.1 Summary of Research Contributions
7.2 Suggestions for Future Research

Bibliography

Appendix A: Architecture Modeling CSIM Macros

Appendix B: Simulation Program Listings

List of Figures

1.1 A typical shared-memory multiprocessor configuration with shared interleaved memories.
1.2 Use of vectorized data access in multiprocessors in supporting (a) computation, (b) data manipulation, and (c) interprocessor communication.
1.3 Traditional scalar mailbox approach for interprocessor communication in shared-memory multiprocessors.
2.1 Three shared-memory multiprocessor configurations using interleaved memory organizations: (a) single bus-based multiprocessor, (b) crossbar-connected multiprocessor, and (c) orthogonally-connected multiprocessor.
2.2 Shuffled data allocation of an (8 x 8) matrix onto a (4 x 4) memory array in example multiprocessor organizations with 1-D and 2-D memory interleaving.
2.3 Possible different scalar and vector memory accesses.
2.4 Bus protocols and corresponding access time to implement different scalar and vector memory accesses.
3.1 Block diagram organization of vector register windows with on-the-fly index manipulator.
3.2 Reconfigurability in vector register windows.
3.3 Address mapping scheme for vector register windows.
3.4 Functional organization and operating principles of an example index manipulator with 4 memory modules on an interleaved bus.
3.5 Illustrative examples of selected classes of data movement operations with a (4 x 4) matrix data structure.
4.1 Equivalence of on-the-fly indexing scheme to a generalized switch.
4.2 Logical connectivity in the distributed memories of a mesh-connected computer.
4.3 Clos network models to analyze data movements on a (4 x 4) OMP: (a) a 3-stage Clos network model supporting alternate row and column operations, (b) a modified Clos network modeling row, column, row-column, and column-row operations.
4.4 Basic data movement steps in the orthogonal multiprocessor and the associated state transition.
4.5 Single step or multiple substeps realization of an intra-column data movement operation on an example 3-processor crossbar-connected multiprocessor with 9 memory modules and 3 interleaved buses.
4.6 Source tag generation for linear array data routing in MCC from Clos network switch settings.
4.7 Factors of reduction in (a) computation overhead (instructions used for computation and data manipulation) and (b) total execution time, by using on-the-fly index manipulation to shift 5 columns of a (P x P) matrix on different orthogonal multiprocessor configurations.
4.8 Comparison of (a) instruction overheads and (b) total execution time in shifting a (P x P) matrix by 5 columns on a 16-processor orthogonal multiprocessor.
4.9 Comparison of (a) computational overhead and (b) total execution time in performing row and column shifts of a (P x P) matrix by using on-the-fly index manipulation on OMP and CCM.
5.1 Message-passing steps in a sample program for a 4-node multicomputer.
5.2 (a) Communication graphs for primitive message-passing patterns and (b) transformation to memory-write and memory-read graphs.
5.3 Memory access types and corresponding memory-read and memory-write graphs.
5.4 (a) Connectivity graph and (b) operational digraphs of crossbar-connected multiprocessor configuration.
6.1 Different clustering options for a 32-node hypercube to convert its program to run on a 16-processor multiprocessor.
6.2 Comparison of communication complexities on 16-processor systems for various problem sizes.
6.3 Comparison of timing complexities for three problems on different hypercube, orthogonal multiprocessor, and crossbar-connected multiprocessor configurations.
6.4 Comparison of timing complexities for two problems on different hypercube, orthogonal multiprocessor, and crossbar-connected multiprocessor configurations.
6.5 Converting a (4 x 4) mesh with wrap-around connections onto a 4-processor OMP.
6.6 Converting a (4 x 4) generalized mesh program onto a 4-processor OMP by mapping each column of 4 nodes to run on a single processor of the OMP.

List of Tables

2.1 Possible Scalar and Vector Memory Accesses with Interleaved Memory Organization and Corresponding Access Time.
3.1 Reconfigurability in vector register windows for variable matrix sizes.
4.1 Comparison of Maximum Time Overheads for Implementing Data Movement Operations with n^2 data.
5.1 Possible Message-Passing Patterns.
5.2 Memory-access Types Leading to Minimal Access Time for Implementing Primitive Patterns on Three Multiprocessors Using Shared-Variable Communication Vectorization.
5.3 Estimated Vectorized Communication Bandwidth in Mbytes/sec on a 32 x 32 Orthogonal Multiprocessor for varying message lengths.
5.4 Estimated Communication Bandwidth in Mbytes/sec for Different OMP Sizes.
6.1 Equivalent Memory Accesses to Implement Primitive Message Patterns on Multiprocessors by Communication Vectorization.
6.2 Communication Complexities for Primitive Patterns on an m-node Hypercube using Circuit-Switched Communication.
6.3 Time Complexities and Associated Parameters to Implement Primitive Patterns on Two Multiprocessors using Communication Vectorization.
6.4 Analytically Estimated Asymptotic Percentage Reductions in Communication Complexity by Converting a 16-processor Hypercube Program onto 16-processor Multiprocessors.
6.5 Percentage Reductions in Communication Complexity Derived by Simulation while Converting a 16-processor Hypercube Program to run on 16-processor Multiprocessors.

Abstract

Vectorized memory access schemes have traditionally been used in multiprocessors to enhance computational efficiency. However, applications requiring dense communication and data manipulation are unable to take advantage of these memory access schemes. In this thesis, we take a new approach to vectorized shared-memory access, with the objective of implementing processor-memory data movement, memory-to-memory data manipulation, and processor-processor communication, all in a vectorized manner.

This thesis has two major contributions. The first contribution lies in developing a novel vectorized memory access scheme that blends with an interleaved memory organization. During vector data transfer between a processor and the interleaved memory system, this scheme allows the data elements of a vector to be manipulated on-the-fly under program control. Using this scheme, we develop the new concept of an atomic vector read-modify-write cycle and demonstrate parallel data manipulation with minimal overhead from processors. With a two-dimensional interleaved memory organization, we demonstrate up to 75% savings in computational bandwidth in implementing matrix shifts and rotations. This scheme demonstrates the potential to achieve concurrent computation and data manipulation.

The second contribution is in developing a new concept of memory-based vectorized interprocessor communication on multiprocessors with interleaved shared memories. We configure this shared memory as a collection of vector mailboxes. With a suitable allocation of these mailboxes, we demonstrate that processors can exchange messages by vector memory-write and memory-read accesses. Similar to vectorizing computational steps, this approach allows the communication steps of a parallel program to be vectorized. We present a communication vectorization scheme. This scheme vectorizes the interprocessor communication steps of a distributed-memory multicomputer program and implements them on a shared-memory multiprocessor. Due to vector-oriented communication, such program conversion leads to a significant reduction in communication complexity. Three multiprocessor configurations are evaluated in their capabilities to support this vectorization. Communication complexities in these multiprocessors are compared with those of a hypercube system using circuit-switched message passing.
For applications requiring all-to-all types of dense message patterns, the communication complexity is reduced by a factor of two to four when a hypercube system is compared with a shared-memory multiprocessor of the same size.

Chapter 1

Introduction

1.1 Data Movement in Multiprocessors

Efficient processor-memory data movement often leads to significant performance enhancement in shared-memory multiprocessors [Bai87, BC90, YTL87]. The traditional approach in multiprocessor architecture development has been to implement high-bandwidth, low-latency processor-memory data movement. This objective has been geared towards matching the computational bandwidth of the system with the memory bandwidth and achieving high-performance computation. However, many applications, such as numerical simulations, large-scale matrix computations, and image processing, demand operations like shift, rotate, and row-column exchanges on matrix/image data. These data movement operations either occur between computational steps or are required between consecutive phases of a multiphase algorithm while solving a problem. Data exchange between processors during computation is well known as interprocessor communication. The latter type of data movement occurs when a given data allocation in memory is not an optimal allocation for the parallel program to operate on; this requires a change in data allocation. In the absence of any computation or processing, these types of data movement are known as data manipulation. Hence, data movement plays an important role in all three aspects of multiprocessing: computation, interprocessor communication, and data manipulation.

1.2 Vectorized Memory Access

Interleaved memory organization supports vectorized data movement between memory modules and processors. Vector supercomputers use this memory interleaving extensively to implement pipelined data transfers between interleaved memory modules and pipelined computational units. Memory interleaving is used in multiprocessors to support high-bandwidth, low-latency shared-memory access.

Consider multiprocessor configurations supporting interleaved shared memories as shown in Fig. 1.1. Each processor has its own local memory. The system interconnect facilitates vectorized data transfer between processors and interleaved memory modules. These vectorized memory accesses have traditionally been used to read vector data from memory, compute on the data, and write the results back to memory as vectors. Such processor-memory vector data transfer facilitates computation, as indicated by Fig. 1.2a.

Figure 1.1: A typical shared-memory multiprocessor configuration with shared interleaved memories (processors with local memories, a system interconnect, and interleaved memory modules).

In this thesis, our goal is to investigate the effective usage of vectorized memory access in implementing efficient memory-to-memory data manipulation (Fig. 1.2b) and processor-to-processor interprocessor communication (Fig. 1.2c). We consider multiprocessors with one- and two-dimensional memory interleaving. Multiprocessors with single or multiple buses support one-dimensional memory interleaving; two-dimensional memory interleaving is supported by the orthogonally-connected multiprocessor [HTK89]. Using these multiprocessor configurations as test beds, we compare and contrast the effect of memory interleaving in supporting vectorized data manipulation and interprocessor communication.

Figure 1.2: Use of vectorized data access in multiprocessors in supporting (a) computation, (b) data manipulation, and (c) interprocessor communication (P: processor subsystem, I: system interconnect, M: memory subsystem).
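To make the notion of an interleaved vector access concrete, the following C sketch is purely illustrative and not taken from the dissertation: it assumes a conventional low-order interleaving in which word k of the shared address space lives in module (k mod n) at offset (k div n), and it shows that one logical access gathers n elements, one per memory module, into a processor-side buffer. All names here are hypothetical.

```c
#include <stdio.h>

#define N 4                 /* degree of memory interleaving (n-way)    */
#define WORDS_PER_MODULE 8  /* capacity of each memory module, in words */

/* One bank of N interleaved memory modules; word k of the shared address
 * space is assumed to live in module (k % N) at offset (k / N).          */
static int module[N][WORDS_PER_MODULE];

/* Vector-read: one interleaved access returns N elements that share the
 * same address offset, one element streamed from each module in turn.   */
static void vector_read(int offset, int buffer[N]) {
    for (int m = 0; m < N; m++)    /* minor cycles of the pipelined bus */
        buffer[m] = module[m][offset];
}

int main(void) {
    /* Fill the modules so that word k of the shared space holds value k. */
    for (int k = 0; k < N * WORDS_PER_MODULE; k++)
        module[k % N][k / N] = k;

    int vreg[N];
    vector_read(1, vreg);          /* fetch the vector at address offset 1 */
    for (int m = 0; m < N; m++)
        printf("buffer[%d] = %d\n", m, vreg[m]);   /* prints 4, 5, 6, 7 */
    return 0;
}
```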
1.3 On-the-fly Data Manipulation

One straightforward way of using vectorized memory access for data manipulation is a three-step sequence: processors read data from shared memory using vector-read accesses, manipulate the data in their respective data buffers by instruction execution, and finally write the manipulated data back to shared memory using vector-write accesses. In this sequence, processors take an active part in the data manipulation operation, which diverts them from their regular computational duties. Moreover, data manipulation operations are themselves defined as operations requiring neither processing nor computation. This leads to the problem of determining whether data manipulation operations can be implemented using vectorized memory access with minimal (or no) participation from the processors.

We take an on-the-fly indexing approach to solve this problem. During a vector-read access, the data elements fetched from the interleaved memory modules are written into data buffers associated with the respective processors. Similarly, a vector-write access reads the associated data buffers and writes their contents to the interleaved memories. We introduce a scheme for indexing these data buffers. During vector-read and vector-write accesses, the indexing scheme provides the flexibility to select the appropriate data buffers. This selection is programmable and is implemented on-the-fly during the vector-read and vector-write accesses.

Using this scheme, we develop the concept of a vector-read-modify-write access. With n-way memory interleaving, this scheme reads n data elements in one interleaved-read cycle and writes them back with any desired mapping in the following interleaved-write cycle. These two back-to-back cycles implement an atomic vector-read-modify-write access. A data manipulation operation using this atomic access is initiated by the processor at the cost of only a few instructions; the rest of the operation needs no attention from the processor and is implemented concurrently with computation. We show that such an atomic access provides the functionality of a generalized interconnection network, as characterized by Thompson [Tho78]. The effectiveness of this scheme is evaluated by analyzing the time complexity of realizing permutations and generalized mappings. The two-dimensionally interleaved orthogonal multiprocessor is shown to be more powerful with this scheme than the one-dimensionally interleaved crossbar-connected multiprocessor. Simulation experiments indicate that as the degree of memory interleaving increases, on-the-fly indexing becomes more powerful and provides significant savings in computational bandwidth.
1.4 Communication Vectorization

Processors in a shared-memory system have traditionally used memory-based mailboxes [Zho90] to communicate with each other by passing messages, as shown in Fig. 1.3. These communication operations are primarily scalar in nature and have been used for synchronization and semaphore implementations. The mailbox type of scalar communication works well for parallel programs requiring only one-to-one or permutation types of communication. However, most scientific and numerical applications use dense message patterns like broadcast, multicast, and personalized multicast [HJ89, Kum88, LEN90, LN90]. Several routing schemes have been proposed in the literature [CE90, LS90] for implementing these communication-intensive message patterns efficiently on distributed-memory multicomputers.

Figure 1.3: Traditional scalar mailbox approach for interprocessor communication in shared-memory multiprocessors (processors exchange data through scalar mailboxes allocated in the memory modules).

These patterns can easily be implemented through memory-based message passing by breaking them into multiple scalar communication steps. However, this approach limits the communication efficiency due to network and memory-access conflicts [Map90]. In this thesis, we take up the challenge of finding new schemes for the fast implementation of broadcast and multicast types of message patterns on shared-memory multiprocessors.

We solve this problem by using memory-based communication and vectorized memory access techniques. A methodology is provided to convert the send and receive operations of an interprocessor communication step into equivalent write and read memory-based mailbox accesses. Instead of scalar mailboxes, we develop the new concept of vector mailboxes. These vector mailboxes are allocated to interleaved memory modules in such a way that the maximum number of mailbox accesses are implemented as vector accesses. This facilitates fast implementation of broadcast and multicast types of message patterns.

We determine the primitive message patterns used in parallel programs and present a communication vectorization scheme. This scheme analyzes the processor-memory interconnection network and the interleaved memory organization of a shared-memory system, allocates vector mailboxes to memory modules, and determines the optimal vector and scalar memory accesses to implement a message pattern in minimal time.

Compared to link-based communication, memory-based message passing requires low software overhead [WB91]. Vectorized message passing further reduces the communication overhead of a parallel program by taking advantage of pipelined message transfer to vector mailboxes. This leads to the use of shared memory as an effective interprocessor communication medium. The processors considered in our multiprocessor models have local memories attached to them. Hence, the communication vectorization concept leads to new ways of implementing distributed-memory parallel programs on shared-memory multiprocessors by using the shared memory to implement vectorized communication. This broadens the scope of a shared-memory system as a flexible system architecture [Gha89] and supports both shared-memory and message-passing models of parallel computation [BM89].

We demonstrate such program conversion from various distributed-memory multicomputers, such as the hypercube and the mesh, to run on crossbar-connected and orthogonally-connected multiprocessors. We concentrate on the circuit-switched hypercube [Bok91a], due to its popularity among parallel computing researchers. We convert several hypercube programs representing primitive message patterns. The associated reductions in communication complexity are determined both analytically and through simulations. Significant reductions in communication complexity are observed when converting hypercube programs using one-to-all, all-to-one, and all-to-all dense interprocessor communication.
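To illustrate the vector mailbox idea, the sketch below is not the thesis design but a minimal assumption-laden model: a mailbox that spans the n modules of one interleaved bus at a fixed offset, so that an n-word message moves with a single vector-write by the sender and a single vector-read by the receiver, rather than n separate scalar mailbox accesses. All layout choices and names here are hypothetical.

```c
#include <stdio.h>

#define N 4   /* memory modules on one interleaved bus = message vector length */

/* A vector mailbox occupies the same address offset in all N modules, so a
 * message of N words moves in a single pipelined (vector) bus access.        */
static int module[N][8];           /* N interleaved modules, 8 words each     */

static void vector_write(int offset, const int msg[N]) {
    for (int m = 0; m < N; m++)    /* one pipelined minor cycle per module    */
        module[m][offset] = msg[m];
}

static void vector_read(int offset, int msg[N]) {
    for (int m = 0; m < N; m++)
        msg[m] = module[m][offset];
}

int main(void) {
    /* Sender posts a 4-word message into the vector mailbox at offset 2;
     * a receiver later picks it up with one vector-read.                    */
    int out[N] = {7, 8, 9, 10};
    int in[N];

    vector_write(2, out);          /* send: one interleaved-write access      */
    vector_read(2, in);            /* receive: one interleaved-read access    */

    for (int m = 0; m < N; m++)
        printf("received word %d = %d\n", m, in[m]);
    return 0;
}
```

Under this layout a broadcast needs only the single vector-write by the sender, after which each receiver issues its own vector-read of the same mailbox; this is the intuition behind implementing dense message patterns as a small number of vector accesses.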
1.5 Organization of the Thesis

This dissertation is organized as follows. Chapter 2 focuses on interleaved memory access in multiprocessors. We present three multiprocessor configurations using one-dimensional and two-dimensional memory interleaving. Different data allocation and memory access schemes with interleaved memory organization are discussed, and the access times for these memory accesses are derived.

Data manipulation using the vector-read-modify-write access requires an efficient data buffer organization associated with each processor. We present such a hardware buffer organization, defined as vector register windows, in Chapter 3. This data buffer organization is reconfigurable to match the degree of memory interleaving. We also discuss the capability of this buffer organization to support data coherency in multiprocessors with shared memories. The hardware design and programmable controls of an index manipulator supporting on-the-fly indexing and vector-read-modify-write access are presented. We illustrate a few parallel data movement and manipulation operations on crossbar-connected and orthogonally-connected multiprocessors using the on-the-fly indexing scheme.

Chapter 4 analyzes the data manipulation capability of the on-the-fly indexing scheme. We demonstrate that the vector-read-modify-write access is functionally equivalent to realizing interconnections with a generalized interconnection switch box. We analyze the data manipulation capability of crossbar-connected and orthogonally-connected multiprocessors using this indexing scheme and compare it with that of a mesh-connected computer of comparable size. We take a Clos network modeling approach in this comparison. Related simulation experiments and results are presented.

Chapter 5 centers around the theme of vectorized interprocessor communication. The modeling of interprocessor communication steps and the determination of primitive message patterns are presented. We introduce vector mailboxes and the associated memory-based communication. A communication vectorization methodology is presented to determine the optimal vector and scalar memory accesses to implement a message pattern with minimal access time.

Conversion of multicomputer programs to run on multiprocessors is emphasized in Chapter 6. Analytical and simulation results indicate the reductions in communication complexity. Chapter 7 concludes this dissertation and suggests future research directions: the first part summarizes the results and contributions, and the second part indicates short-term and long-term directions for future research.

Chapter 2

Interleaved Memory Access in Multiprocessors

2.1 Introduction

Memory interleaving leads to pipelined data transfer between a processor and memory modules. For a multiprocessor with an interleaved memory organization, it is critical that data be allocated efficiently to these interleaved memories to facilitate computation. It is also important to identify the vector and scalar access types feasible with a given memory interleaving. In this chapter, we present three multiprocessor configurations. The first two use one-dimensional memory interleaving, while the third uses two-dimensional memory interleaving. We discuss allocating matrix/image data to these two different interleaved memory organizations to facilitate data access by rows and columns. We analyze the basics of memory interleaving, identify all possible memory accesses, and determine their respective access times.
2.2 Three Multiprocessor Configurations

Figure 2.1 illustrates three shared-memory configurations: single bus-based, crossbar-connected, and orthogonally-connected. All these configurations use interleaved shared memories.

Figure 2.1: Three shared-memory multiprocessor configurations using interleaved memory organizations: (a) single bus-based multiprocessor, (b) crossbar-connected multiprocessor, and (c) orthogonally-connected multiprocessor (Pi: processor, Mi or Mi,j: memory module, Bi: interleaved access bus, MCi: memory controller).

2.2.1 Bus-based Multiprocessor

Consider an n-processor system with n memory modules as shown in Fig. 2.1(a). The memory modules, M0, M1, ..., Mn-1, are connected to a single bus B0. The processors, P0, P1, ..., Pn-1, are connected to the bus through n identical memory controllers, MC0, MC1, ..., MCn-1. In addition to scalar memory access, each memory controller supports vector-read and vector-write accesses. The memories are n-way, one-dimensionally interleaved. The system provides full shared-memory capability. An optional interprocessor interrupt bus is used to enable fast synchronization among the processors.

2.2.2 Crossbar-Connected Multiprocessor

Figure 2.1b represents an n-processor system with n^2 memory modules, M0,0, M0,1, ..., Mn-1,n-1. These memory modules are connected to n parallel buses. The processors are connected to these n buses through a system interconnect, which can be a crossbar, multiple buses, or any multistage interconnection network. Without any loss of generality, we assume a crossbar interconnect for our analysis. This configuration supports one-dimensional memory interleaving and provides full shared-memory capability. We refer to such a configuration as a crossbar-connected multiprocessor (CCM) throughout the thesis.

2.2.3 Orthogonally-connected Multiprocessor

Figure 2.1c shows an orthogonally-connected multiprocessor configuration with n processors, 2n buses, and n^2 memory modules using two-dimensional memory interleaving. This architecture concept was originally reported in [HTK89], and a design implementation of this orthogonal multiprocessor was reported in [HDP+90]. We refer to this configuration as either the orthogonally-connected multiprocessor or the orthogonal multiprocessor (OMP). A group of n row buses, RB0, RB1, ..., RBn-1, is directly connected to the processors in the horizontal dimension. The remaining n column buses, CB0, CB1, ..., CBn-1, are distributed across the two-dimensional memory organization in an orthogonal way. Consider indices in the range 0 <= i, j, k <= n-1. A memory module Mi,j is connected to two buses, RBi and CBj. Both RBi and CBi are controlled by the same memory controller MCi. Both row and column buses support n-way interleaving. This configuration allows a memory module to be shared between two processors Pi and Pj. A processor Pk is capable of accessing the n memory modules Mk,j (or Mj,k) in row (or column) mode. This orthogonal access mode underlies the basic principle behind the OMP and provides conflict-free shared-memory access [HTK89, SM89].
This memory organization makes the multiprocessor a partially shared-memory system. In this thesis, we emphasize only the two-dimensional OMP, but the principle of orthogonal access can be extended to higher dimensions. Readers are referred to [HDP+90, HK88] for OMP(n, k), a multidimensional orthogonal multiprocessor architecture with dimension n and multiplicity k along each dimension.

2.3 Data Allocation in Shared Memory

Each interleaved bus accesses a group of n memory modules in a pipelined manner. The S-access interleaving, with a fixed address offset applied to all n memory modules on the bus, corresponds to an implicit vector load/store operation with a vector length of n. We define the content of a memory location as an element. During a vector-read (write) access, a group of n elements forming a vector is read from (written to) the n memory modules in a pipelined manner.

Both the crossbar-connected multiprocessor and the OMP use a two-dimensional (n x n) memory organization. Consider a (p x p) matrix A = (aij) with p > n. The p^2 matrix elements can be allocated to the (n x n) memory modules in several ways. A uniform allocation distributes (p^2/n^2) elements to each memory module. Figure 2.2 illustrates a shuffled partition approach to distributing the elements of an (8 x 8) matrix onto a (4 x 4) memory array. Each row of the target matrix is folded onto two consecutive address offsets. In this example, the capacity of each memory module is assumed to be four words.

Figure 2.2a shows the effect of 1-D row interleaving corresponding to the crossbar-connected multiprocessor. A vector-read row access with address offset 1 results in vectors with elements (a0,4, a0,5, a0,6, a0,7) and (a2,4, a2,5, a2,6, a2,7) for processors P0 and P2, respectively. Figure 2.2b shows 2-D row and column interleaving. Similar to the row access, a vector-read column access with address offset 1 results in the fetching of two vectors with elements (a0,4, a1,4, a2,4, a3,4) and (a0,6, a1,6, a2,6, a3,6) for processors P0 and P2, respectively.

Figure 2.2: Shuffled data allocation of an (8 x 8) matrix onto a (4 x 4) memory array in example multiprocessor organizations: (a) one-dimensional row interleaving in the crossbar-connected multiprocessor and (b) two-dimensional row and column interleaving in the orthogonal multiprocessor.

2.4 Interleaved Memory Access Types

The memory subsystem in all three multiprocessor configurations is assumed to support both scalar and vector memory accesses. This flexibility gives rise to six different types of memory write accesses, as shown in Fig. 2.3. Without loss of generality, we assume the bus width to be equal to the width of a memory word. A scalar-write allows a processor to write one datum to a fixed memory module in a single access step. A block-scalar-write writes a block of l data to consecutive addresses of a single memory module. A single datum is written to identical addresses of all memory modules on a bus using a broadcast access, and a block-broadcast allows a block of data to be broadcast to all memory modules.

Multiple data (or a vector of data) are written to the same address of the set of memory modules connected to a single bus by a vector-write access.
This access is alternatively referred to as an interleaved-write access. It is used extensively in traditional vector supercomputers to support high-bandwidth data transfer between pipelined computational units and the memory system. A block-vector-write (or block-interleaved-write) access allows multiple vectors to be written to consecutive addresses of the interleaved memory modules. Such a memory subsystem also supports the corresponding read accesses: scalar-read, block-scalar-read, vector-read, and block-vector-read. The vector read accesses are also referred to as interleaved-read and block-interleaved-read in this thesis.

Figure 2.3: Possible scalar and vector memory accesses: (a) scalar-write, (b) block-scalar-write, (c) broadcast, (d) block-broadcast, (e) vector-write or interleaved-write, and (f) block-vector-write or block-interleaved-write.

These memory accesses take different amounts of access time. Consider a conflict-free scalar memory access by a processor Pi to an interleaved memory module Mi,0. As shown in Fig. 2.4a, there is a fixed time α to initiate the memory access. This time includes the time to activate the memory controller, the time for the memory controller to put the required address on the bus, and the address propagation delay on the bus. Let β represent the time needed for the memory access and the data propagation delay on the bus. Thus, a single scalar memory access (either read or write) takes (α + β) time.

Figure 2.4: Bus protocols and corresponding access times for the different scalar and vector memory accesses: (a) scalar-read or scalar-write, (b) vector-read or vector-write, and (c) block-vector-read or block-vector-write.

Consider a vector access as shown in Fig. 2.4b. The vector length is assumed to be equal to the degree of memory interleaving, which is n for n-way interleaved memories. For a read access, it takes (α + β) time to read the first data element from memory module Mi,0. Successive data elements from the other memory modules are streamed out of the bus in (n - 1) minor cycles with a pipelined cycle time of τ. Similarly, for a write access, the data elements are first streamed into registers associated with the memory modules via the bus; a parallel write control is then activated to implement the vector-write.
The time for this parallel write is included in β as the memory access time. Thus, the overall time to perform a single vector access is (α + β + (n - 1)τ).

Figure 2.4c shows the protocol for a block-vector access. A block of l vectors is written to consecutive addresses of the memory modules in l consecutive interleaved cycles. We assume an overhead of δ to increment the subsequent addresses of the memory modules. For a typical implementation, τ <= δ <= β <= α. Thus, a block-vector-read or a block-vector-write access takes (α + β + (l - 1)δ + (nl - 1)τ) time.

Consider a case where a processor reads n data elements from an interleaved bus and writes them back immediately to the same address offset after rearranging the data elements. With appropriate hardware support, these two interleaved (vector) accesses can be implemented as one atomic step. The bus protocol for this cycle is similar to that of Fig. 2.4c with l = 2. The time δ is used to switch from read to write access and to change an index set (we define the concept of an index set in Chapter 3). The first memory access is an interleaved-read and the following access is an interleaved-write. Similar to a scalar read-modify-write access, this cycle implements a vector-read-modify-write access. We define this atomic vector access as an Interleaved-Read-Write (IRW) step. The overall time to perform an IRW step is (α + β + (2n - 1)τ + δ).

Table 2.1 identifies all possible memory accesses and their associated access times. Single-word and single-vector accesses are identified as A1-A6; the corresponding block memory accesses are identified as BA1-BA5. Memory access A6 represents the atomic IRW step. Using current bus technology, one can have τ = 50 nsec, α = 800 nsec, β = 200 nsec, and δ = 200 nsec for n <= 32 [HDP+90].

Table 2.1: Possible scalar and vector memory accesses with an interleaved memory organization and the corresponding access time (α = fixed access initiation time, β = memory access and bus transfer time, τ = pipelined bus transfer time, δ = address change overhead in a block access, l = number of words/vectors in a block access, n = vector length and degree of memory interleaving).

Identifier | Access Type                                                | Access Time
A1, A2     | scalar-read, scalar-write                                  | α + β
BA1, BA2   | block-scalar-read, block-scalar-write                      | α + lβ
A3         | broadcast                                                  | α + β
BA3        | block-broadcast                                            | α + lβ
A4, A5     | vector-read (interleaved-read), vector-write (interleaved-write) | α + β + (n - 1)τ
BA4, BA5   | block-vector-read, block-vector-write                      | α + β + (nl - 1)τ + (l - 1)δ
A6         | interleaved-read-write (IRW)                               | α + β + (2n - 1)τ + δ
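The access-time expressions of Table 2.1 can be evaluated directly. The C sketch below is illustrative only (it is not part of the thesis); it encodes the formulas above, using the example bus parameters quoted in the text (τ = 50 ns, α = 800 ns, β = 200 ns, δ = 200 ns), and checks that an IRW step equals a block-vector access with l = 2.

```c
#include <stdio.h>

/* Example bus parameters from the text (nanoseconds). */
static const double ALPHA = 800.0;  /* fixed access initiation time      */
static const double BETA  = 200.0;  /* memory access + bus transfer time */
static const double TAU   =  50.0;  /* pipelined bus transfer time       */
static const double DELTA = 200.0;  /* address change overhead           */

/* Access-time formulas of Table 2.1 (n = degree of interleaving,
 * l = number of words or vectors in a block access).                 */
static double t_scalar(void)               { return ALPHA + BETA; }
static double t_block_scalar(int l)        { return ALPHA + l * BETA; }
static double t_vector(int n)              { return ALPHA + BETA + (n - 1) * TAU; }
static double t_block_vector(int n, int l) { return ALPHA + BETA + (n * l - 1) * TAU
                                                    + (l - 1) * DELTA; }
static double t_irw(int n)                 { return ALPHA + BETA + (2 * n - 1) * TAU + DELTA; }

int main(void) {
    int n = 16;   /* 16-way interleaving, as in the example system */
    int l = 4;    /* block of 4 vectors                            */

    printf("scalar access          : %6.0f ns\n", t_scalar());
    printf("block-scalar, l = 4    : %6.0f ns\n", t_block_scalar(l));
    printf("vector access          : %6.0f ns\n", t_vector(n));
    printf("block-vector, l = 4    : %6.0f ns\n", t_block_vector(n, l));
    printf("IRW step               : %6.0f ns\n", t_irw(n));
    /* One IRW step costs the same as a block-vector access with l = 2. */
    printf("block-vector, l = 2    : %6.0f ns\n", t_block_vector(n, 2));
    return 0;
}
```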
Chapter 3

Data Manipulation Hardware

3.1 Introduction

The interleaved memory organization provides vectorized data access to and from the memory modules. Data buffers attached to the processors are used to store these vector data during memory access. The functionality associated with the data buffer design facilitates efficient vector access. We take such an approach in this chapter in designing a new type of data buffer organization, defined as vector register windows. The design of an associated index manipulator is presented to implement on-the-fly data manipulation during vector memory access. We illustrate parallel data movement operations on multiprocessors using this data manipulation hardware.

3.2 Vector Register Windows

Consider an n-way interleaved memory organization and processors with vector computation capability. If vector data reside in shared memory, then all arithmetic-logic operations on vector data are done from vector registers to vector registers attached to the processors. An input vector operand must first be loaded from memory to a vector register before it can be used by a processor. Similarly, an output vector operand must be stored into a vector register first and written to memory later.

We consider a data buffer organization consisting of several windows of such vector registers. We define this organization as Vector Register Windows (VRWs). The elements of the VRWs are accessed by a processor in its local memory address space. The only memory transfers possible with the VRWs are from memory to vector registers (vector-read) and from vector registers to memory (vector-write). A read (write) access to (from) a designated vector register window from (to) the shared memory is specified by the programmer and is implemented as a DMA-like operation.

3.2.1 Organization

The VRW organization associated with each processor is shown in Fig. 3.1. There are two components in this organization; we first consider the data buffer organization, and the index manipulator is discussed later. Each interleaved memory access results in a vector of n elements. A set of v row or column vectors is grouped together to form a window of vectors, and the VRWs consist of w windows. Consider a multiprocessor system with n = 16 and memory elements 4 bytes long. Using 64 Kbytes of static RAM, one can implement VRWs with a capacity of 1K vectors.

Figure 3.1: Block diagram organization of vector register windows with the on-the-fly index manipulator (ATU: Address Translation Unit, TRS: Transceiver, MC: Memory Controller).

3.2.2 Reconfigurability

The vector registers are dynamically reconfigured into variable-size windows as illustrated in Fig. 3.2 and Fig. 3.3. For a target application using (p x p) matrices, any matrix row or column is allocated to the interleaved memories as row or column vectors as shown in Fig. 2.2. A row or column of this matrix becomes equivalent to ⌈p/n⌉ vectors. Hence, v = ⌈p/n⌉ provides a natural window size for the VRWs.

The address mapping scheme for the vector register windows works as follows. The two least significant bits, A0 and A1, are used to select one of the four bytes within an element. The four bits A2-A5 are used to select an element within a vector. The remaining 10 bits, A6-A15, are dynamically partitioned between the vector field and the window field. Depending on the size of the application matrix, log2⌈p/n⌉ bits are used to identify a vector within a window; the remaining 10 - log2⌈p/n⌉ bits are used to identify a window.

Consider using VRWs with a 64-Kbyte capacity. Table 3.1 shows the number of available windows (w) and the window size (v) on the example multiprocessor system for a wide range of application matrix sizes. The VRWs provide a sufficient number of windows for handling large matrices so as to encapsulate locality of data references. The dynamic reconfigurability of the window size supports programming flexibility in applications involving multiple matrices with different dimensions. A detailed hardware design supporting this reconfigurability feature is presented in [PHRH90].

Figure 3.2: Reconfigurability in vector register windows.

Figure 3.3: Address mapping scheme for vector register windows (window, vector, element, and byte fields, with a reconfigurable partition between the window and vector fields).

Table 3.1: Reconfigurability in vector register windows for variable matrix sizes (window size = number of vector components in each window).

Matrix Size   | Window Size (v) | Number of Windows (w)
16 x 16       | 1               | 1024
32 x 32       | 2               | 512
64 x 64       | 4               | 256
128 x 128     | 8               | 128
256 x 256     | 16              | 64
512 x 512     | 32              | 32
1024 x 1024   | 64              | 16
2048 x 2048   | 128             | 8
4096 x 4096   | 256             | 4
8192 x 8192   | 512             | 2

3.2.3 Data Coherency

From a processor's perspective, the VRWs are partitioned into two separate address spaces to treat global read-write and global read-only data separately. These two address spaces divide the VRWs into noncacheable and cacheable spaces, respectively. A caching boundary allows the user to partition the VRWs: VRW addresses falling below the caching boundary are cacheable. This boundary is programmable and can be moved dynamically during the execution of an application program [PHRH90].

If a vector operand is of the global read-only data type, it is read into a vector register in the cacheable space of the VRWs; otherwise, it is read into the noncacheable space. When the processor accesses elements of a global read-only vector from the VRWs, the elements are allowed to be cached in the internal data cache of the processor, if available. Noncacheable data are accessed directly from the VRWs and bypass the internal cache. This scheme offers the following advantages:

1. If a read-only vector is used many times by an application, the computation becomes fast by keeping the data in the internal cache of the processor.

2. In case of context switching, all modified global register windows in the noncacheable address space of the VRWs are written back to the shared memory. The cacheable register windows containing global read-only data are not written back to the shared memory.

3. Only a few selected windows need to be flushed for small applications. This provides a program-controlled, selective, and fast flushing scheme.

3.3 Index Manipulator

3.3.1 Organization

Figure 3.1 shows the block diagram of the index manipulator hardware. An index memory, mapped into the processor's local address space, allows a number of index sets to be resident at any point in time. The frequently used index sets, once generated, remain resident in this index memory until overwritten by other index sets. These index sets are computed off-line at compile time and downloaded to the processor before program execution starts. An index set is required to be resident in the index memory before its associated bus transfer starts.

The index manipulator operates in three modes. The first mode allows a processor to access the index memory in its local memory address space to store index sets. The second mode corresponds to pipelined bus access, where an index set controls the on-the-fly memory read/write operation. The third operational mode allows the processor to access (both read and write) the data buffers in its local address space as plain data buffers without any indexing. In the second mode, the entries of the required index set are read out of the index memory by a log2 n-bit counter during the n minor cycles of the interleaved access.
The selected index set entries pass through address translation logic to generate effective addresses for the data buffers. The accesses to the index memory and the data buffers are pipelined; that is, the address generated by the index memory during one minor cycle is used as the address of the data buffers in the next minor cycle. Hence, the pipelined cycle time τ is constrained only by the bus propagation delay; it does not include the index memory access time, the counter incrementing time, and so on. Present technology allows such an index manipulator to be built for τ = 50 nsec using data buffers and an index memory with fast static RAMs of 25 and 15 nsec access times, respectively [HDP+90, HP91].

In this thesis, we consider only a single set of n data buffers with one word each. Multiple sets of data buffers can also be used to implement complex data movements; in that case, the indexing scheme has the flexibility to index data buffers from different sets. For large data sizes, p > n, the index manipulation concept can easily be extended to achieve data manipulation across several rows or columns mapped to the same set of interleaved memory modules. Readers are referred to [PH90] for a reconfigurable vector register window organization supporting such index manipulation for large data sizes.

3.3.2 Interleaved-Read-Write with On-the-fly Indexing

Consider the pipelined data transfers during an interleaved-read access by a processor. The n elements are loaded from the n memory modules to a set of n data buffers associated with the processor. During a load operation, the data elements are written to the different buffers; similarly, during a store operation, the data elements are read from the different buffers and written to the memory modules. Let e, 0 <= e <= n - 1, be the indices of these buffers. Consider any permutation or mapping p on the index set E = {0, 1, ..., n - 1} of these indices. As the data elements are transmitted to (or received from) the interleaved bus in a pipelined manner, the source (or destination) data buffers can be selected based on p. We define p as an index set, and the operation of selecting the appropriate buffer during a load/store operation as indexing. These index sets are generated off-line at compile time, as specified by the programmer, and stored in the index memories associated with the VRWs. During each memory access, a desired index set is selected from the index memory to implement the required data movement.

Figure 3.4a shows the functional organization of the index manipulation logic. The interleaved bus is assumed to have 4 memory modules, and each processor is assumed to have 4 data buffers.

Figure 3.4: Functional organization and operating principles of an example index manipulator with 4 memory modules on an interleaved bus: (a) functional organization, (b) a read operation from address offset 1 with index set {0,3,1,2}, and (c) a write operation to address offset 1 with index set {1,1,2,0}.
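Before walking through Figs. 3.4b and 3.4c in prose, the operation they depict can be summarized with a short sketch. The C code below is purely illustrative (it is not the hardware or software described in the thesis; the function names are invented): it models the four buffers and the two index sets of the example and reproduces the read with index set {0,3,1,2} followed by the write with index set {1,1,2,0}.

```c
#include <stdio.h>

#define N 4   /* memory modules on the interleaved bus = data buffers per processor */

/* Interleaved-read with on-the-fly indexing: as element m streams off the
 * bus, it is written into buffer idx[m] instead of buffer m.               */
static void indexed_read(const char mem[N], char buf[N], const int idx[N]) {
    for (int m = 0; m < N; m++)
        buf[idx[m]] = mem[m];
}

/* Interleaved-write with on-the-fly indexing: bus slot m (memory module m)
 * is fed from buffer idx[m]; repeating an index duplicates data on the fly. */
static void indexed_write(const char buf[N], char mem[N], const int idx[N]) {
    for (int m = 0; m < N; m++)
        mem[m] = buf[idx[m]];
}

int main(void) {
    char mem[N] = {'d', 'f', 'g', 'h'};      /* contents at address offset 1 */
    char buf[N];

    int read_set[N]  = {0, 3, 1, 2};         /* Fig. 3.4b */
    int write_set[N] = {1, 1, 2, 0};         /* Fig. 3.4c */

    indexed_read(mem, buf, read_set);        /* first half of an IRW step  */
    indexed_write(buf, mem, write_set);      /* second half of an IRW step */

    printf("%c %c %c %c\n", mem[0], mem[1], mem[2], mem[3]);  /* prints: g g h d */
    return 0;
}
```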
3.4 Parallel Data Movement and Manipulation

The proposed on-the-fly indexing scheme provides substantial power to overlap CPU computation with data manipulation. In the absence of such a scheme, the required manipulation can still be achieved by the CPU. In this case, the CPU, after an interleaved-read access, has to manipulate the data elements in its data buffers through instruction execution and then write them back to the memories. In the worst case, this manipulation requires 2(n + 1) accesses to the data buffers. The present scheme completely avoids this CPU involvement; the CPU only initiates the data movement operation by executing a few instructions.

The concept of index sets for data manipulation is identical to routing control steps in an interconnection network. Based on a desired mapping or permutation, the associated index sets for the processors are computed off-line at compile time and stored in their respective index memories. Hence, the overhead associated with index set generation does not affect the performance of on-the-fly indexing.

For structured data movements like shift and rotate, index set generation by a compiler is straightforward, because the intermediate data movement steps are easy to determine. In this section, we provide two examples of parallel data movement to illustrate this concept. For non-structured data movements like arbitrary permutation or generalized mapping, it is necessary to determine the intermediate data movement steps first. We present a theoretical framework in the next chapter to generate such intermediate data movement steps.

Consider a (4 x 4) matrix as shown in Fig. 3.5. Four different data movement operations are illustrated. We consider a (4 x 4) memory organization with 4 interleaved buses and 4 processors. We assume the matrix is stored in the memory, as shown in Fig. 2.2, with an address offset A0. We use another address offset A1 for temporary locations. We present two algorithms. Algorithm 1 demonstrates a shift-left matrix operation on a crossbar-connected multiprocessor using 1-D memory interleaving.
The system is assumed to consist of 4 processors and 16 memory modules. Two different index sets, identity and shift-2, are used. The complexity of this shift operation is one IRW step.

Algorithm 1: Matrix Shift on a Crossbar-Connected Multiprocessor
; the algorithm shifts a (4 x 4) matrix by two columns left
; the following index sets are used:
; C0 = { 0,1,2,3 } indicating an identity permutation
; C1 = { 2,3,c0,c0 } indicating a shift-2 mapping, c0 is constant 0
begin
  Parbegin
    For each processor Pi, i = 0 to 3, do
    begin
      interleaved-read from bus Bi with address offset A0 and index set C0;
      interleaved-write to bus Bi with address offset A0 and index set C1;
    end;
  Parend;
end.

Figure 3.5: Illustrative examples of selected classes of data movement operations with a (4 x 4) matrix data structure: (a) original matrix, (b) shift-left by 2 columns, (c) shift-up-right by 2 rows and 1 column, (d) select 3rd row to 2nd column, (e) row permutation.

Algorithm 2 implements a matrix manipulation on a 4-processor OMP. A select-row-to-column operation (selecting the 3rd row to the 2nd column) on an OMP using 2-D memory interleaving is demonstrated. Two different index sets are used. The manipulation takes place in two phases. The first phase shifts the 3rd row to the 2nd row. Then processor P2 moves the second row to the second column. This operation takes 2 IRW steps.

Algorithm 2: Matrix Manipulation on an Orthogonal Multiprocessor
; the algorithm selects the 3rd row to the 2nd column of a (4 x 4) matrix
; the following index sets are used:
; C0 = { 0,1,2,3 } indicating an identity permutation
; C1 = { c0,c0,3,c0 } indicating a select-2 mapping, c0 is constant 0
begin
  Parbegin
    Switch to column mode;
    For each processor Pi, i = 0 to 3, do
    begin
      interleaved-read from bus CBi with address offset A0 and index set C0;
      interleaved-write to bus CBi with address offset A1 and index set C1;
    end;
  Parend;
  For processor P2 do
  begin
    Switch to row mode;
    interleaved-read from bus RB2 with address offset A1 and index set C0;
    Switch to column mode;
    interleaved-write to bus CB2 with address offset A0 and index set C0;
  end;
end.

3.5 Software Interface and Programmability

Data elements, once resident in the VRWs, are directly accessed by the processor in its local address space in a pipelined fashion. These data accesses bypass the index manipulator and the associated index manipulation logic. During these accesses, the elements of the VRWs are physically addressed by the processor based on the array addressing mechanism. To access an element in the VRWs, the window number, the vector number, and the element number are made implicit in the processor's address. As an example, consider a multiprocessor with 16-way memory interleaving and v vectors per window. To access the ith element of the jth vector in the kth window, the effective address generated by the processor (with assistance from the compiler) is (k x 16 x v) + (j x 16) + i + the base address of the VRWs.
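The address computation just described can be written down directly. The sketch below is only an illustration of that formula; the function name, the assumption of word-granular (element) addressing, and the sample parameter values are not from the thesis.

/* Effective address (in VRW words) of element i of vector j in window k,
 * for 16-way interleaving and v vectors per window.                     */
#include <stdint.h>
#include <stdio.h>

#define INTERLEAVING 16u   /* degree of memory interleaving in the example */

static uintptr_t vrw_effective_address(uintptr_t vrw_base, unsigned v,
                                       unsigned k, unsigned j, unsigned i)
{
    return vrw_base + (k * INTERLEAVING * v) + (j * INTERLEAVING) + i;
}

int main(void)
{
    /* e.g., with v = 4 vectors per window: element 5 of vector 2 in window 3 */
    printf("%lu\n",
           (unsigned long)vrw_effective_address(0, 4, 3, 2, 5));  /* 229 */
    return 0;
}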
By adding an offset argument to the effective address computation, several abstract data types are supported in the VRWs. This feature is used by the programmer to define overlapped-window access operations on the VRWs [PH90]. A window provides a one-to-one mapping to a row (or column) of the target application matrix. Several smaller windows can be grouped to form a larger window. This mechanism provides scope to define higher-level abstract data structures, such as block-rows or block-columns. The flexibility also includes overlapped window access to define data structures on a set of contiguous elements of adjacent rows (or columns) of an application matrix. A window can also be divided into multiple smaller windows to support sub-vector data structures. This modular and structured abstraction of the VRWs allows a programmer of a multiprocessor system to operate on a large application matrix by defining the matrix as a small collection of vector windows rather than a large collection of shared-memory row (or column) vectors.

Chapter 4
Fast Data Manipulation with On-the-fly Indexing

4.1 Introduction

Vector register windows with on-the-fly indexing support efficient data manipulation. We illustrated this concept in the last chapter through examples. However, a natural question arises about how to determine its capability in performing general classes of data movement like permutation and generalized mapping. In this chapter, we take a theoretical approach to determining the capability of the index manipulation hardware. Using the on-the-fly data manipulator on an interleaved bus, we perform a time complexity analysis for different types of data movement and compare the complexity with that on a mesh-connected computer. Simulation experiments and results are presented to indicate the capability of such data manipulation in preserving computational bandwidth.

4.2 Equivalence to a Generalized Network

The index manipulation scheme is powerful due to its programmability. Consider an (n x n) switch network. Let the n elements fetched during an interleaved-read access be designated as the inputs and the n elements stored during the following interleaved-write access with indexing as the outputs of this switch network. The index manipulation scheme has the capability to map any input to an arbitrary number of outputs, as long as each output receives only one input. Thus, this switch realizes generalized interconnections from inputs to outputs as suggested by Thompson [Tho78]. This switch can also be viewed as providing arbitrary many-to-one connectivity from the outputs to the inputs [Kum88]. We denote such a switch with generalized interconnection capability as a generalized switch network. The capability of such a switch network is much more powerful than that of a crossbar switch.

Theorem 1: Any one of the n^n possible mappings over n data elements can be carried out in one interleaved-read-write step using the on-the-fly index manipulation scheme with an n-way interleaved memory organization.

Proof: Consider a set D = {d0, d1, ..., d(n-1)} of n data elements. Let a one-dimensional interleaved memory organization have n memory modules M0, M1, ..., M(n-1). We distribute these data elements across the memory modules such that di is stored in Mi, 0 ≤ i ≤ n - 1, at a fixed address offset A. Let J = {0, 1, ..., n - 1} be an ordered set of n indices.
Consider a mapping f from the set D to J, written as f(D) = {i | di → j, for all j in J, di in D}, which lists for each index j the element di mapped to it. Clearly, there are n^n such possible mappings. Using the on-the-fly index manipulation scheme, we perform an interleaved-read access to address A and fetch the n data elements into n data buffers with the identity permutation. Subsequently, we perform an interleaved-write access to address A with the index set f(D). Now the data elements are stored in the memory modules with the desired mapping f(D). ■

Corollary 1: The on-the-fly indexing scheme with an n-way interleaved memory organization is functionally equivalent to an (n x n) generalized switch network as suggested by Thompson [Tho78].

In addition to realizing any of the n^n arbitrary mappings in a single IRW step, the indexing scheme provides the flexibility to write back one or more memory locations with single (or multiple) constants like 0, 1, etc. Without any loss of generality, we consider duplication of only one constant, 0, identified by the symbol c0. This duplication is quite useful in parallel processing for filling patterns during shift and rotate operations without any wrap-around. Hence, a one-step IRW operation can be represented as an ((n + 1) x n) generalized switch network with the capability of realizing any of the (n + 1)^n mappings. Figure 4.1a illustrates the potential of a (5 x 4) switch box to implement any of the possible 5^4 mappings. Figures 4.1b and c show switch settings to realize the mappings (2,0,c0,3) and (0,c0,0,c0), respectively. These switch settings are nothing but the respective index sets needed in the indexing scheme to realize the mappings.

Figure 4.1: Equivalence of the on-the-fly indexing scheme to a generalized switch (c0 is a constant pattern 0): (a) generalized switch connections using on-the-fly indexing, (b) a mapping (2,0,c0,3), (c) a mapping (0,c0,0,c0).

4.3 Data Movement Cost Model

From the point of view of data allocation and data movement, the similarity between the OMP and the MCC is shown first. Data movements on the OMP using only intra-row and intra-column operations are then analyzed using a Clos network model. This Clos network is then modified to reflect row-column and column-row operations on the OMP.

4.3.1 Similarity Between OMP and MCC

The two-dimensional memory organization of an OMP resembles an MCC organization in many ways, as shown in Fig. 4.2. Each node, MP, of the MCC consists of a local memory and a processor. These nodes are organized in a grid structure with horizontal and vertical links. The horizontal links, n of them in a row, are joined together to form a virtual row bus in an OMP. Similarly, the vertical links are joined together to form a virtual column bus. The processing tasks on the n nodes in a single row of the MCC are assigned to a single processor in the OMP. Data in the local memory of node MPij of the MCC are mapped to memory module Mij of the OMP, where 0 ≤ i, j ≤ n - 1. Thus, for two-dimensional data allocation and data movement, these two architectures are functionally identical.

Figure 4.2: Logical connectivity in the distributed memories of a mesh-connected computer (MP = combined Memory and Processor module).

4.3.2 Using a Clos Network to Analyze Cost

We first analyze basic data movements on the OMP in terms of MCC data movements. An IRW operation on an OMP can be implemented either in row or in column mode.
A single IRW step, due to its generalized data manipulation capability, can simulate up to (n - 1) row or column shift operations of an MCC.

Consider data movements to realize a permutation. Based on a 3-stage Clos network analysis, Raghavendra and Prasanna Kumar [RK86] have demonstrated that any permutation can be realized in 3 stages consisting of 3(n - 1) steps on an (n x n) Illiac-IV type network. Based on the rearrangeability property of a 3-stage Clos network, the result remains valid for an MCC. Hence, any permutation over n^2 data can be realized in a maximum of 3 row and column operations on an (n x n) MCC. Each operation may take one or more row (column) shift steps. We consider this time overhead aspect in the next section.

Consider a 3-stage Clos(n,n,n) network, as shown in Fig. 4.3a, with (n x n) permutation switch boxes. Let these permutation switch boxes be replaced with generalized switch boxes, which functionally represent the on-the-fly indexing scheme. We number the inputs and outputs of the switch boxes of these 3 stages in row, column, and row major order, respectively. Let the switch boxes be numbered S(i,j), 0 ≤ i ≤ 2, 0 ≤ j ≤ n - 1. Since a Clos(n,n,n) network is rearrangeable, it can realize any permutation between its n^2 inputs and n^2 outputs. The switch settings of the first stage reflect the ordering of data within the jth row; similarly, the switch settings of the second and third stages reflect the ordering of data within the jth column and the jth row, respectively. Let the switch setting of S(i,j) be the index set for processor j in the ith stage. The data movement corresponding to the switch settings of one stage can be realized in a single IRW time step by n processors working in parallel with the associated index sets. Hence, we claim that any permutation over n^2 data can be realized in a maximum of 3 IRW steps on an (n x n) OMP using alternate row and column operations.

Next, we consider an arbitrary generalized mapping of n^2 elements on an (n x n) OMP.

Figure 4.3: Clos network models to analyze data movements on a (4 x 4) OMP: (a) a 3-stage Clos network model supporting alternate row and column operations, (b) a modified Clos network model for row, column, row-column, and column-row operations.
The switches, considered in this network with general broad- ! i jcast capability, are equivalent to our generalized switch associated with the indexing! scheme. The switch setting of S(i,j) can be treated as the index set of processor' j to realize ith stage of data movement. Each stage of this Clos network can be I i realized in a single IRW step on an OMP using the indexing scheme. i i C o ro lla ry 2 Any generalized mapping of n2 data can be realized in a maximum of 5 row and column operations on an (n x n) MCC. But, the question remains how many row (column) shift steps are needed to; 'implement an intra-row (intra-column) operation on the MCC and how does the j ; lassociated time overhead compare to that of an IRW step. We emphasize on this' 1 aspect in the next section. Until then, we refer MCC data movements in term s of! i operations and OMP data movements in term s of IRW steps. I 4 .3 .3 A M o d ified C los N etw o rk I So far we have only considered IRW steps on an OMP, which are equivalent to MCC intra-row and intra-column operations. These row and column oriented IRW steps on an OMP can be defined as Row-Read-Row-Write (RRRW) and Column-Read- | Column-Write (CRCW). An OMP, due to its two-dimensional memory organization, j provides flexibility of two additional IRW step types known as Row-Read-Column- Write (RRCW ) and Column-Read-Row-Write (CRRW). Consider the Clos network shown in Fig. 4.3a. The inputs and outputs of each stage of a Clos network are numbered based on alternate row and column m ajor ordering. Based on this ordering, data movements corresponding to RRCW and CRRW operations provide straight links between adjacent stages of a Clos network. Figure 4.3b provides a modified Clos network model to analyze all possible data movement steps on an OMP. From a given step (stage), it allows two options to go to the next step (stage). Any data movement operation on an OMP can start with any of the RRRW ,CRCW ,RRCW , or CRCW basic steps and follow through a sequence of these steps. Though it is logically possible to go from one basic step to any other step, some of these sequences are redundant. For example, a RRRW step followed by another RRRW step can be combined together to a single RRRW step. Figure 4.4a shows in dependent two-step sequences. Figure 4.4b shows valid transition sequences between i these basic steps. Any data movement operation on an OMP makes finite number of' transitions (5 transitions for generalized mapping according to Theorem 2) through this state-transition diagram. The reducible steps shown in Fig. 4.4a are valid only when all the processors are active. The equivalence conditions do not hold good in case only a set of processors are active or the set of active processors in two consecutive steps are different. For example, if processor Pq only performs a RRCW operation in step k and processor P3 only performs a CRRW operation in step (k + 1), then these two steps are independent. These two steps can not be reduced to a single RRRW step. We denote such reductions as conditional reductions. Figure 4.4b also distinguishes 39 [between these two types of transitions. The solid line transitions are always valid. The dotted line transitions are valid conditionally. RRRW. RRRW = RRRW RRCW = RRCW CRRW CRCW RRCW RRRW RRCW CRRW CRCW CRRW. 
Figure 4.4: Basic data movement steps in the orthogonal multiprocessor and the associated state transitions: (a) reducible two-step sequences, (b) valid transition sequences (ST = start state, any other state is a final state; solid edges are always valid, dotted edges are conditionally valid; RRRW = Row Read Row Write, RRCW = Row Read Column Write, CRRW = Column Read Row Write, CRCW = Column Read Column Write).

4.4 Data Movement Complexity Analysis

We compare the complexity of data movement on the OMP and the CCM. A reduction technique is proposed to reduce certain classes of data movement operations into fewer steps on an OMP. Finally, we compare the OMP with the MCC.

4.4.1 Comparing OMP with CCM

Consider the 4 basic data movement steps RRRW, CRCW, RRCW, and CRRW on an OMP. The results derived in Theorem 2 use only RRRW and CRCW steps. Both the OMP and the crossbar-connected multiprocessor, as shown in Fig. 2.2, exhibit identical row interleaving. Hence, both multiprocessors perform equally well with respect to a RRRW step. They differ in their capabilities to implement an intra-column CRCW step. Hence, we analyze the complexity involved on a crossbar-connected multiprocessor in implementing a data movement equivalent to a CRCW step on an OMP.

Since the crossbar-connected multiprocessor does not have independent column buses, parallel intra-column data movement is not guaranteed to be implemented in a single IRW step. We illustrate this idea with a simple 3-processor example, as shown in Fig. 4.5. The switch settings of S(i,j) in the ith stage of the Clos network model correspond to the intra-column operations f(j) to be implemented by processors Pj during the ith step of the data movement. If all f(j) are identical, as shown in Fig. 4.5a, then there is no bus conflict and only one IRW step is sufficient to implement such an intra-column data movement. If they are different, as shown in Fig. 4.5b, multiple substeps are needed so that the processors avoid conflicts in accessing the buses during each substep.

The number of substeps required to implement an intra-column step depends on whether the desired data movement operation is a permutation or a generalized mapping. If the desired operation is a permutation, the required intra-column operations are also permutations of the respective column data. In the case of a generalized mapping, the intra-column operations may be either permutations or generalized mappings. Additionally, the interleaved memory organization does not allow scalar access and restricts all memory accesses to be interleaved (vector) accesses.

Figure 4.5: Single-step or multiple-substep realization of an intra-column data movement operation on an example 3-processor crossbar-connected multiprocessor with 9 memory modules and 3 interleaved buses: (a) identical intra-column operations implemented in a single step, (b) different intra-column operations implemented in multiple substeps (f(j), 0 ≤ j ≤ 2, is the intra-column operation belonging to column j).

Algorithm 3 implements an arbitrary intra-column generalized mapping on the crossbar-connected multiprocessor. Two sets of n data buffers, V0 and V1 (each data buffer consisting of a single word), are used for every processor. The elements of the index set are identified by the notation Vx.y, which indexes the yth data buffer of set Vx.
The intra-column operations are defined by functions f(k), 0 ≤ k ≤ n - 1.

Algorithm 3: Intra-column Generalized Mapping on the Crossbar-Connected Multiprocessor
1   parbegin
2     For each processor Pi, i = 0, 1, ..., n - 1, do
3     begin
4       For k = 0 to (n - 1) do
5       begin
6         Perform interleaved-read from bus Bi to data buffer set V0 with identity index set;
7         Perform interleaved-read from bus B_f(k) to data buffer set V1 with identity index set;
8         If (k = 0) then
9           Perform interleaved-write to Bi with index set {V1.0, V0.1, ..., V0.n-1}
10        else
11          If (k = n - 1) then
12            Perform interleaved-write to Bi with index set {V0.0, ..., V0.n-2, V1.n-1}
13          else
14            Perform interleaved-write to Bi with index set {V0.0, ..., V0.k-1, V1.k, V0.k+1, ..., V0.n-1};
15      end;
16    end;
17  parend;

Theorem 3: Using 2n data buffers per processor, an (n x n) crossbar-connected multiprocessor takes up to a maximum of (n^2 + 2n)/2 IRW time steps to implement an intra-column generalized mapping and 3n/2 IRW time steps to implement an intra-column permutation, each of which is equivalent to a single CRCW operation on the OMP.

Proof: Consider the steps in Algorithm 3. For a permutation, step 7 is implemented in a conflict-free manner. In the presence of arbitrary broadcasting in a generalized mapping, step 7 may take up to n interleaved-read steps to take care of bus conflicts. This arises because a single data element might have to be broadcast to many locations in a particular column. Steps 7 and 8 represent an atomic IRW step. The overhead of the interleaved-read operation in step 6 can be approximated as 0.5 IRW time step for large n. Hence, Algorithm 3 takes up to (n^2 + 2n)/2 IRW steps for a generalized mapping and 3n/2 IRW steps for a permutation. For specific selected mappings, such as complete broadcast and intra-column broadcast, the above complexity can be reduced to n or log n IRW steps. ■

By providing up to n^2 data buffers (n sets of n data buffers each) for each processor, the above algorithm can be improved. Each processor can read a maximum of n rows and then perform a final interleaved-write to its own row with the desired data. According to the earlier results, any permutation (generalized mapping) over n^2 data can be realized on an OMP in 3 (5) alternate RRRW and CRCW steps, so a maximum of 2 (3) of these steps can be CRCW steps. This leads to the following results on permutation and generalized mapping of n^2 data elements on an (n x n) crossbar-connected multiprocessor:

Corollary 3: (a) Using n^2 data buffers per processor, it takes (n + 1)/2 IRW time steps to implement an intra-column permutation or generalized mapping. (b) Any permutation can be realized in a maximum of (n + 2) IRW time steps with n^2 data buffers per processor. With limited 2n data buffers, the same operation takes a maximum of (9n + 4)/2 IRW time steps. (c) Any generalized mapping can be realized in a maximum of (3n + 7)/2 IRW time steps using n^2 data buffers per processor. With limited 2n data buffers, the same operation takes a maximum of (3n^2 + 6n + 4)/2 IRW time steps.
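For concreteness, the bounds of Theorem 3 and Corollary 3 can be evaluated for a particular machine size. The short program below is only an illustration (n = 32 is an arbitrary choice); the printed worst-case IRW step counts contrast with the constant 3 (permutation) and 5 (generalized mapping) steps needed on the OMP.

/* Evaluate the CCM worst-case bounds of Theorem 3 and Corollary 3. */
#include <stdio.h>

int main(void)
{
    double n = 32.0;

    printf("intra-column permutation, 2n buffers : %.1f IRW steps\n", 3 * n / 2);
    printf("intra-column gen. mapping, 2n buffers: %.1f IRW steps\n", (n * n + 2 * n) / 2);
    printf("intra-column op, n^2 buffers         : %.1f IRW steps\n", (n + 1) / 2);
    printf("any permutation, n^2 buffers         : %.1f IRW steps\n", n + 2);
    printf("any permutation, 2n buffers          : %.1f IRW steps\n", (9 * n + 4) / 2);
    printf("any gen. mapping, n^2 buffers        : %.1f IRW steps\n", (3 * n + 7) / 2);
    printf("any gen. mapping, 2n buffers         : %.1f IRW steps\n", (3 * n * n + 6 * n + 4) / 2);
    return 0;
}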
4.4.2 Reduction of Data Movement Steps on OMP

According to Theorem 2, data movement operations corresponding to both permutation and generalized mapping can be implemented on the OMP using alternate RRRW and CRCW steps. We denote such sequences of RRRW and CRCW steps using a sequence operator ∘; a RRRW step followed by a CRCW step is denoted RRRW∘CRCW. Given a sequence, the steps are implemented from left to right. Since any generalized mapping can be realized in a maximum of 5 intra-row and intra-column steps, we consider sequences with a maximum of 5 steps. The possible sequences of RRRW and CRCW steps to realize any data movement operation on an OMP are as follows:

    RRRW
    CRCW
    RRRW∘CRCW
    CRCW∘RRRW
    RRRW∘CRCW∘RRRW
    CRCW∘RRRW∘CRCW
    RRRW∘CRCW∘RRRW∘CRCW
    CRCW∘RRRW∘CRCW∘RRRW
    RRRW∘CRCW∘RRRW∘CRCW∘RRRW
    CRCW∘RRRW∘CRCW∘RRRW∘CRCW

Each of these sequences can be identified as a q-sequence, where q, 1 ≤ q ≤ 5, is the length of the sequence. For example, RRRW∘CRCW is a 2-sequence and RRRW∘CRCW∘RRRW∘CRCW is a 4-sequence. The length q also reflects the corresponding number of intra-row and intra-column operations required on the MCC to implement the required data movement.

The OMP provides the additional RRCW and CRRW steps discussed earlier. Hence, we consider possible reduction of q-sequences into smaller sequences using RRCW and CRRW steps. Consider a RRCW step on an OMP: data elements from one or multiple rows are read by the respective processors and written to their own columns using the indexing scheme. Multiple RRRW and CRCW steps are needed to implement such a transposition operation. This leads to the following sequencing result:

Theorem 4: Each RRCW or CRRW step has an equivalent q-sequence, 2 ≤ q ≤ 5.

Proof: Consider a RRCW or CRRW step on the OMP. This step may correspond to a permutation or to a generalized mapping. In the case of a permutation, the step can be carried out in a maximum of 3 alternate RRRW and CRCW steps. One RRCW or CRRW step can never be equivalent to a single RRRW or CRCW step. Hence, in the case of a permutation, there exists an equivalent 2- or 3-sequence for each RRCW or CRRW step. A similar argument holds for a generalized mapping, where the equivalent q-sequence is either a 4-sequence or a 5-sequence. ■

Consider the q-sequences with a single RRCW or CRRW equivalent. We define these q-sequences as reducible sequences and categorize them into a reducible class called R. Though it is important to formally characterize the class R, our objective in this paper is to show the principles of reduction. We only demonstrate the existence of the class R and demonstrate reduction techniques.

For certain classes of data movement operations, 2-sequence steps are reducible, as shown below. Consider indices in the range 0 ≤ i, j ≤ n - 1. Let (i,j) and (i',j') represent the indices of an element of a two-dimensional data array before and after a data movement step, i.e., the data element with index (i,j) moves to the new location with index (i',j'). A RRCW step can be functionally defined as i' = f(i,j) and j' = i. If f(i,j) is a symmetric function of i and j, then this RRCW step can be decomposed into two steps: a CRCW step followed by a RRRW step. As an example, consider f(i,j) = (i + j + u) mod n for 0 ≤ u ≤ n - 1. A CRCW step moves data element (i,j) to (i'',j''), where j'' = j and i'' = (i + j + u) mod n. The following RRRW step moves the data element to (i',j'), where i' = i'' and j' = (i'' - j'' - u) mod n. A similar symmetry property also makes a RRRW operation followed by a CRCW operation reducible to a single CRRW operation. A large number of characterizations can be used to demonstrate 3-, 4-, and 5-sequences as reducible.
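Returning to the 2-sequence example above, the decomposition can be verified mechanically. The following check is an illustration only (the array size N = 8 is an arbitrary choice): it confirms that, for f(i,j) = (i + j + u) mod n, the composition of the stated CRCW and RRRW steps reproduces the RRCW mapping (i,j) → (f(i,j), i) for every index pair and every u.

/* Verify that CRCW followed by RRRW equals the symmetric RRCW step. */
#include <assert.h>
#include <stdio.h>

#define N 8   /* array dimension, chosen arbitrarily for the check */

static int mod(int a, int n) { return ((a % n) + n) % n; }

int main(void)
{
    for (int u = 0; u < N; u++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                /* direct RRCW step: i' = f(i,j), j' = i */
                int ip = mod(i + j + u, N), jp = i;

                /* CRCW step: j'' = j, i'' = (i + j + u) mod n */
                int i2 = mod(i + j + u, N), j2 = j;
                /* following RRRW step: i' = i'', j' = (i'' - j'' - u) mod n */
                int i3 = i2, j3 = mod(i2 - j2 - u, N);

                assert(i3 == ip && j3 == jp);
            }
    printf("CRCW followed by RRRW reproduces the RRCW mapping for all (i,j,u)\n");
    return 0;
}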
We use the notation ⊢ to denote a potential reduction from a sequence to an equivalent sequence involving a smaller number of steps; A ⊢ B indicates that the operation sequence A may be reduced to the equivalent operation sequence B. The reduction operations with 2-sequences can then be defined by the following two notations:

    D1 : (CRCW∘RRRW) ⊢ RRCW
    D2 : (RRRW∘CRCW) ⊢ CRRW

where D1 and D2 are tags identifying the respective reduction operations. Figure 4.4a shows reducible steps such as RRCW followed by CRCW being equivalent to RRCW, etc. Some of these conditionally reducible operations, involving RRCW and CRRW operations combined with RRRW or CRCW operations, can be defined using the following notations:

    D3 : (RRCW∘CRCW) ⊢ RRCW
    D4 : (CRRW∘RRRW) ⊢ CRRW
    D5 : (RRRW∘RRCW) ⊢ RRCW
    D6 : (CRCW∘CRRW) ⊢ CRRW

The reduction types D1 and D2, combined with D3-D6, give rise to chain-reductions. Depending on the type of data movement, sequences with alternate RRRW and CRCW steps can be replaced with shorter sequences involving RRCW and CRRW steps. Sometimes even a whole sequence of RRRW and CRCW steps can be reduced to a single RRCW or CRRW step. An example of a 4-sequence reduction (the possibility of reduction is checked from left to right) is as follows:

    RRRW1∘CRCW1∘RRRW2∘CRCW2 ⊢D1 RRRW1∘RRCW1∘CRCW2 ⊢D5 RRCW2∘CRCW2 ⊢D3 RRCW3

The possibility of chain reduction depends entirely on the data movement operation. For certain data movement operations it might not be possible to reduce their q-sequences at all; in this case, the complexity of realizing a permutation (generalized mapping) on an OMP remains 3 (5) IRW steps. But using this reduction methodology, data movement operations belonging to the reduction class R can be implemented in fewer steps on an OMP involving RRCW or CRRW steps.

Now we determine the time overhead to implement a RRCW or a CRRW operation on an OMP in terms of IRW time steps. In the case of a RRCW or CRRW operation, Tp is counted twice for two different addresses, one for the interleaved read and one for the interleaved write. Besides, there is an additional overhead of γ for synchronization before bus switching. Let the additional overhead, compared to a normal IRW cycle, be identified as e. Based on the OMP design parameters used at USC [HDP+90], we have e = 0.35 for n = 32, Tp = 1000 nsec, τ = 50 nsec, β = 300 nsec, and γ = 500 nsec. This e reduces to 0.2 for n = 64. Hence, each RRCW or CRRW operation on the OMP is equivalent to (1 + e) IRW time steps. So we have the following result:

Corollary 4: Data movement operations belonging to the reducible class R can be implemented on an OMP in (1 + e) IRW time steps using row-column and column-row operations.
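The left-to-right chain reduction can be mimicked by a toy rewriter. The sketch below is an illustration only, not the thesis software; which of D1-D6 are actually admissible depends on the data movement, so the admissible set is an input. With only D1, D3, and D5 enabled, it reproduces the 4-sequence example above, collapsing RRRW∘CRCW∘RRRW∘CRCW to a single RRCW.

/* Toy left-to-right rewriter for the pairwise reductions D1-D6. */
#include <stdio.h>
#include <string.h>

typedef enum { RRRW, CRCW, RRCW, CRRW } Step;
static const char *step_name[] = { "RRRW", "CRCW", "RRCW", "CRRW" };

/* Pairwise rules D1-D6; admissible[k] != 0 enables rule Dk. */
static int reduce_pair(Step a, Step b, const int admissible[7])
{
    if (admissible[1] && a == CRCW && b == RRRW) return RRCW;  /* D1 */
    if (admissible[2] && a == RRRW && b == CRCW) return CRRW;  /* D2 */
    if (admissible[3] && a == RRCW && b == CRCW) return RRCW;  /* D3 */
    if (admissible[4] && a == CRRW && b == RRRW) return CRRW;  /* D4 */
    if (admissible[5] && a == RRRW && b == RRCW) return RRCW;  /* D5 */
    if (admissible[6] && a == CRCW && b == CRRW) return CRRW;  /* D6 */
    return -1;
}

int main(void)
{
    Step seq[8] = { RRRW, CRCW, RRRW, CRCW };
    int  len = 4;
    int  admissible[7] = { 0, 1, 0, 1, 0, 1, 0 };   /* D1, D3, D5 only */
    int  changed = 1;

    while (changed) {                      /* repeat until nothing reduces */
        changed = 0;
        for (int i = 0; i + 1 < len; i++) {
            int r = reduce_pair(seq[i], seq[i + 1], admissible);
            if (r >= 0) {                  /* replace the pair by one step */
                seq[i] = (Step)r;
                memmove(&seq[i + 1], &seq[i + 2],
                        (size_t)(len - i - 2) * sizeof(Step));
                len--; changed = 1; break;
            }
        }
    }
    for (int i = 0; i < len; i++)
        printf("%s%s", step_name[seq[i]], i + 1 < len ? " o " : "\n");
    return 0;                              /* prints: RRCW */
}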
4.4.3 Comparing OMP with MCC

We consider a centralized routing scheme in both architectures. For the 3-stage Clos network model, switch settings can be generated for an arbitrary permutation using well-known Benes network algorithms. Similarly, switch settings on the 5-stage Clos network model for an arbitrary generalized mapping can be determined using the algorithm suggested in [Kum88].

For an OMP, the switch settings of each stage of this Clos network are translated to appropriate index sets and implemented as alternate RRRW and CRCW steps. Depending on the type of data movement, some of these RRRW and CRCW steps demonstrate the potential to be implemented as RRCW or CRRW steps.

In the case of an MCC, each stage of the Clos network is equivalent to a single intra-row or intra-column operation. These operations are equivalent to linear array operations. Let the associated time overhead to realize either a permutation or a generalized mapping on a linear array be T_linear. Given a switch setting for a linear array operation, the nodes on that linear array can be programmed to receive data from designated source nodes. We analyze the complexity of this linear array routing to determine the overall time complexity of MCC data movements.

Consider a switch setting S(k,r) in the kth stage of the Clos network, reflecting intra-row data movement operations in the rth row of an MCC. Figure 4.6a shows an example for a (4 x 4) MCC. For a given switch setting, each destination node can be programmed with its source-tag, i.e., the node number from which it will be receiving data, as shown in Fig. 4.6b. Each source node attaches its own tag to the data before the data movement starts. This is equivalent to the random access read operation suggested in [NS81]. We consider a linear array with n processors, Pi, i = 0, 1, ..., n - 1. Each processor is associated with two buffers, left (L) and right (R), which can be accessed by its respective neighboring nodes. The following algorithm routes the data to the appropriate nodes in two passes, both for permutation and for generalized mapping.

Figure 4.6: Source-tag generation for linear array data routing in the MCC from Clos network switch settings: (a) an example switch setting S(k,r) for the rth row in the kth stage of a Clos network, (b) corresponding source tags for the destination nodes.

Algorithm 4: Linear Array Data Routing for Permutation and Generalized Mapping
begin
  parbegin
    For all processors do
    begin
      For k = 1 to (n - 1) do
      begin
        Read data from left neighbor's buffer R;
        Compare received source-tag with its own source-tag;
        Keep the data in case of a match;
        Write data to its own buffer R;
      end;
    end;
  parend;
  parbegin
    For all processors do
    begin
      For k = 1 to (n - 1) do
      begin
        Read data from right neighbor's buffer L;
        Compare received source-tag with its own source-tag;
        Keep the data in case of a match;
        Write data to its own buffer L;
      end;
    end;
  parend;
end.

Theorem 5: The time overhead to realize an arbitrary generalized mapping over n^2 data elements on an (n x n) MCC using alternate row and column operations is comparable to 5 IRW time steps on an (n x n) OMP. Any permutation realization is comparable to 3 IRW time steps.

Proof: Algorithm 4 requires 2(n - 1) routing steps and 2(n - 1) compare operations. Let the time overheads for a near-neighbor link communication and a compare operation in the MCC be r' and c', respectively. Hence, T_linear = 2(n - 1)c' + 2(n - 1)r'. The minor cycle time τ of an interleaved bus operation on the OMP depends only on the bus propagation delay. Based on present-day bus technology [Bal84] and for
Similar argument holds good for perm utation. j Table 4.1 provides a summary of all the comparisons [PH91b] reported in this section. The tim e overheads are normalized to IRW tim e steps. (1 + e) is the number | of IRW tim e steps needed for a row-column or column-row operation on an OMP,, I I as discussed in section 6 .2 . ; Table 4.1: Comparison of Maximum Time Overheads (normalized to IRW tim e steps! on OMP) For Implementing D ata Movement Operations with n2 data. Data Movement Operations (n x n) MCC (n x n) OMP (n x n ) CCM n 2 nodes n processors and n2 memories n processors and n2 memories n data buffers per processor n2 data buffers per processor 2 n data buffers per processor without reduction with conditional reduction Perm utation 3 3 1 + e (n + 2 ) (9n + 4)/2 Generalized Mapping 5 5 1 + e (3 n + 7)/2 (3n2 -f 6n -f 4)/2 4.5 S im u la tio n E x p e r im e n ts an d R e su lts We carried out simulations to determine advantages of on-the-fly indexing. We had objectives to observe the effect of vector length and the degree of memory interleaving on data m anipulation using on-the-fly indexing. 4.5.1 A C S IM -b a sed M u ltip r o c e sso r S im u la to r Our simulator runs on top of a simulation kernel called CSIM [Sch8 6 ]. CSIM is a process-oriented simulation language based on C. It runs under UNIX 1 and sev eral of its vendor-specific variations. Our simulator is capable of running on both SUN-3 and SUN-4 platforms. This multiprocessor simulator is process-oriented and algorithm-driven [HC91, MCH+90]. All system delays included in this simulator are explicit values reflecting the hardware design. The simulator consists of three mod ules: the user developed algorithm, an architecture specification file, and the CSIM simulation kernel. The architecture specification file defines components of a system in term s of CSIM facilities, events, and C data structures. For example, a hypercube system is represented as a collection of node and link facilities. The communication jbuffers are represented as C data structures. Events reflect data-ready conditions indicating arrival of data from sources to destinations. Similar facilities are used to model processors, buses, and memory-modules of a multiprocessor. Macros in the architecture specification file reflect correct system operations. For example, a circuit-switched communication is represented as a sequence of opera tions: getting hold of required links, reserving it for a duration of communication time, performing communication by duplicating data from source-node buffer to jdestination-node buffer, setting the event of data ready for destination processor, and releasing communication links in a reverse order they were reserved. In case af conflicting access to a facility by multiple processes, the underlying CSIM kernel grants it in a first-come-first-served basis. This makes the simulator completely de terministic. Since this simulator is built using C, it produces both final results of a program and the timing estimates. UNIX is a registered trademark of AT&T Bell Laboratories 52 H O .2 S im u la tio n E x p e r im e n ts | ! ! < j ! We carried out two different sets of experiments to determine capability of on-the-fly: indexing. The following two problems were taken: ' • Shifting a (P x P) m atrix by x, 1 < x < P — 1 columns. i • Shifting a (P x P) m atrix byx, 1 < x < P — 1 columns and y, 1 < y < P — 1: j rows. 1 ! i The first set of experiments implemented shifting a m atrix by 5 columns on different OMP configurations. 
We considered shifting both by processor executing instruction and by using on-the-fly data manipulation. OMP configurations with 8,16, and 32-processors were considered. These configurations provided different degrees of memory interleaving and hence, different vector length. We carried out 6 experiments on each of this processor configuration for P = 32,64,128,256,512, j and 1024. ■ The second set of experiments were carried out on 16 processor OMP and CCM I configurations. We considered shifting both by processor executing instruction and i j by using on-the-fly data manipulation. For similar m atrix sizes, we considered shift- | ing a m atrix by rows and columns. I i i j 4 .5 .3 S im u la tio n R e s u lts an d Im p lic a tio n s I i , In the first set of experiment, we observed two parameters: reduction in number of ' instructions by using data m anipulator and reduction in total execution time. Fig- \ \ ure 4.7a shows the reduction factor in number of instructions used for computation i ! and data manipulation. This reduction increased with vector length. W ith a vector I t length of 32, we got up to a 75 % reduction. Figure 4.7b shows reduction factor I in total execution time. This also increased with vector length. However, for small ! i vector length, the reduction factor remained constant over different m atrix sizes. I ' 75.0 | 67.5 [Reduction factor I in num ber of 52.5 ■ I instructions j used for com putation | and data ! m anipulation I 45.0 + 37.5 30.0 + 22.5 15.0 7.5 shifting a matrix > columns 0 32 -w I 6 .0 * 5.4 4.8 Reduction factor 4.2 in total 3 6 - execution tim e „ „ 3.0 2.4 + 1.8 1. 2 - 0 .6 - shifting a matrix by 5 columns o ★ OMP 32 OMP 16 OMP 8 64 128 256 512 1024 M atrix Size (P) (a) instructions used for data m anipulation o : OMP 32 • : OMP 16 * : OMP 8 32 -w 64 128 256 512 1024 J M atrix Size (P) j (b) total execution tim e I ^Figure 4.7: Factors of reduction in (a) com putation overhead (instructions used for. I com putation and data manipulation) (b) total execution tim e by using on-the-fly lindex m anipulation to shift 5 columns of a (P x P) m atrix on different orthogonal multiprocessor configurations (OMP = orthogonal multiprocessor). I t L 5 4 We compared overheads for computation, data m anipulation, memory access, and synchronization. Figure 4.8a shows a comparison of instructions for various jmatrix sizes. A significant reduction was observed in num ber of instructions used 'for m anipulating data. Figure 4.8b shows the respective execution times. Memory i access constituted a significant factor in overall execution time. This explained reduction factor remaining constant over m atrix sizes as shown in Fig. 4.7b. I The next set of experiments compared the effect of one and two dimensional memory interleaving on data manipulation. The problem considered both row and i columnwise data movements. Figure 4.9a shows the number of instructions used. For Itwo-dimensionally interleaved OMP, minimal number of instructions were used for jperforming the task with on-the-fly indexing. Since CCM does not support memory jinterleaving in column dimension, column-oriented data movements took place in Ja scalar manner. This did not allow on-the-fly indexing to be used. Figure 4.9b shows total execution tim e for the respective configurations. OMP with on-the-fly indexing won in this case too. i i I I ! i I I i i I ! I 55 shifting a matrix by 5 columns on a 16-processor OMP Computation + Data manipulation ■ Shared memory Access ■ Synchronization a. 
x CL © N '«0 X L < 3 E 1024 x 1024 ^ without OIM '— with OIM 512 x 512 256 x 256 128 x 128 o © o o o C M O o o o o o 0 0 o o o o 0 0 © o C M # of instructions in thousands (a) CL X CL © N S x ‘ d c a E 1024 x 1024 512 x 512 256x256 128 x 128 ^ without OIM with OIM + H - o C M H - o C O o 'J- o i n o C O o h- Total execution time in milliseconds (b) jFigure 4.8: Comparison of (a) instruction overheads and (b) total execution tim e in ^shifting a (P x P) m atrix by 5 columns on a 16-processor orthogonal multiprocessor. 56 Number of instructions used for com putation and data m anipulation (in thousands) 6000 5400 4800 4200 3600 3000 2400 1800 1200 600 shifting a matrix by 5 columns and 3 rows 0 32 64 2*56 1024 128 M atrix Size (P) (a) instructions used for data m anipulation 75.0 67.5 60.0- 52.5- Total 45Q execution tim e 37-5 ' (in milliseconds) 30.0- 22.5- 15.0- 7.5- shifting a matrix by 5 columns and 3 rows 32 — * — 64 : OMP 16 : OMP 16 with OIM : COM 16 : COM 16 with OIM : OMP 16 : OMP 16 with OIM : COM 16 : COM 16 with OIM 512 1024 128 256 M atrix Size (P) 1 (b) total execution time Figure 4.9: Comparison of (a) computational overhead (instructions used for com- jputation and data manipulation) (b) total execution tim e in performing row and column shifts of a (P x P) m atrix by using on-the-fly index m anipulation on or- jthogonal multiprocessor (OM P) and crossbar-connected multiprocessor (CCM). C h a p ter 5 V ecto rized In terp ro cesso r C o m m u n ica tio n 5.1 In tro d u ctio n 'Memory-based mailboxes allow processors to communicate each other by writing and reading messages through appropriate mailboxes. Vectorized memory access schemes in multiprocessor provide high bandwidth and low latency data transfers between processor and memory. In this chapter, we continue taking this vector- oriented approach. We develop a general framework of communication vectoriza- tion to implement memory-based interprocessor communication in shared-memory multiprocessors. We configure interleaved shared memory as a collection of vector •mailboxes. Prim itive message patterns are derived by modeling interprocessor com m unication steps of a parallel program. A methodology is provided to convert send and receive operations of message patterns into equivalent write and read memory- based mailbox access steps. We investigate the effect of architectural param eters of a shared-memory system like interconnection network, memory access tim e, de gree of memory interleaving, and data contention on the efficiency of communication vectorization. 58 5.2 F rom M essa g e P a ssin g To S h ared -V ariab le C o m m u n ica tio n s 'Programs w ritten for a loosely-coupled m ulticom puter [LA81, Zho90] realize inter- iprocessor communication by passing messages over communication links. Multicom p u te r such as hypercube, torus, tree, and pyramid use these programs. Consider ■one such program consisting of m processes which are statically allocated to m nodes jof a m ulticom puter. During the execution of this program, processors communicate using send and receive communications. We assume this message passing scheme to be unblocked send and blocked receive [Kar87]. W ith this scheme, the sender proces sor sends message and the receiver processor waits for the message to arrive. The program is assumed to be deadlock free so th at no two processors at any instant of time wait for messages from each other. 
5.2.1 Message-Passing Steps

Without any loss of generality, we present an example for m = 4 on a 4-node multicomputer, as shown in Fig. 5.1. Each processor alternates between computation and communication steps. With the exchange of messages, processors get synchronized and program execution proceeds in a wave-like manner. The message-passing steps in this program demonstrate different characteristics. For example, the first step is a one-to-one personalized message exchange between nodes N0 and N1. The second step is a many-to-many personalized message exchange between two sets of nodes, (N1, N2) and (N3, N0). In the third communication step, node N0 broadcasts a message to the rest of the nodes. Node N0 receives personalized messages from all other nodes in the fourth communication step. The last step combines two operations: a multicast (broadcasting a message to selected others) from node N3 to nodes N1 and N2, and a many-to-one operation from nodes N1 and N2 to N0.

Figure 5.1: Message-passing steps in a sample program for a 4-node multicomputer (N0-N3: computing nodes).

Depending on the source and destination node sets and the type of communication, i.e., broadcast or multicast, we categorize the message-passing steps into various patterns. Table 5.1 lists all possible message patterns together with their characteristic identifiers, C1-C12. We use the notation (Nx, Ny) to represent a single message transfer from node Nx to node Ny. The notation ((Nx1, Ny1), (Nx2, Ny2)) represents multiple message transfers from a set of source nodes, Nx1 and Nx2, to a set of destination nodes, Ny1 and Ny2. The message-passing steps in the sample program can now be represented as:

    Step 1: C1(N1, N0)
    Step 2: C2((N1, N2), (N3, N0))
    Step 3: C3((N0, N1), (N0, N2), (N0, N3))
    Step 4: C4((N0, N1), (N0, N2), (N0, N3))
    Step 5: C10((N1, N0), (N2, N0)), C8((N3, N1), (N3, N2))

Table 5.1: Possible message-passing patterns (m = total number of nodes, s = number of sources, d = number of destinations, 1 ≤ s, d ≤ m - 1, pers = personalized message, broad = broadcast message).

    ID    Pattern          Type     Source    Destination    Distinct     Total        Copies
                                    nodes     nodes          messages     messages
    C1    one-to-one       pers     1         1              1            1            1
    C2    permutation      pers     m         m              m            m            1
    C3    one-to-all       broad    1         m-1            1            m-1          m-1
    C4    one-to-all       pers     1         m-1            m-1          m-1          1
    C5    all-to-one       pers     m-1       1              m-1          m-1          1
    C6    all-to-all       broad    m         m-1            m            m(m-1)       m-1
    C7    all-to-all       pers     m         m-1            m(m-1)       m(m-1)       1
    C8    one-to-many      broad    1         d              1            d            d
    C9    one-to-many      pers     1         d              d            d            1
    C10   many-to-one      pers     s         1              s            s            1
    C11   many-to-many     broad    s         d              s            s.d          d
    C12   many-to-many     pers     s         d              s.d          s.d          1

5.2.2 Primitive Message-Passing Patterns

It can be seen that not all patterns are independent. For example, the many-to-many pattern is a subset of all-to-all with selective masking.
Thus, it is important to identify a set of patterns which are mutually independent and can be used to represent all other patterns. It is obvious that all communication patterns can be represented as multiple steps of the one-to-one operation. But we look for a maximal set of primitive message-passing patterns. Based on Table 5.1, we have:

    Primitive patterns = {C1, C2, C3, C4, C5, C6, C7}                    (5.1)

All other patterns, C8-C12, can be expressed in terms of these primitive patterns. Each primitive pattern is represented as a Communication Graph (CG), as shown in Fig. 5.2a. These graphs are unidirectional in nature and represent message passing from a set of source nodes to a set of destination nodes, independent of the multicomputer architecture. An appropriate number of switch boxes is used to provide the required number of message transfers. The indegree and outdegree of each switch box is one. The number of switch boxes used in CGi is equal to the total number of messages involved in pattern Ci, 1 ≤ i ≤ 7. These communication graphs also distinguish between broadcast and personalized types of message exchange.

5.2.3 Shared-Variable Mailbox Communication

Consider the communication graphs CG1-CG7 in Fig. 5.2a. Each switch box associated with a message transfer can be replaced by a memory mailbox using a shared variable. The sender processor writes a message to this mailbox and the receiver processor reads it back. Depending on the message-passing pattern and the architecture of a given multiprocessor, each of these two steps (a write step followed by a read step) may or may not be implemented in a conflict-free manner. This indirectly depends on the allocation of mailboxes to appropriate shared-memory modules. Our objective is to allocate these mailboxes in such a way that the maximum number of these write and read accesses can be implemented as vector memory accesses. Since vector accesses are efficient, this effectively leads to minimal communication complexity.

Figure 5.2: (a) Communication graphs (CGs) for the primitive message-passing patterns and (b) their transformation to memory-write and memory-read graphs (MGs).

We take a graph-theoretic approach to allocating these mailboxes. Consider the communication graph CG1 for the one-to-one pattern. This message transfer can be achieved through shared memory by creating a single mailbox, as shown in Fig. 5.2b. Message passing is achieved in two steps: the sender processor writes the message to the mailbox and the receiver processor reads the message from the mailbox. These two steps are represented by the associated memory-write graph (MG1w) and memory-read graph (MG1r) corresponding to the write and read accesses. For each primitive pattern Ci, 1 ≤ i ≤ 7, we generate its equivalent scalar memory-write and memory-read accesses as MGiw and MGir graphs. Figure 5.2b illustrates such graphs for pattern C5. Consider the memory graph MG5r corresponding to the all-to-one message transfer.
Let the required (n - 1) mailboxes (3 for this example) be allocated to (n - 1) memory modules connected to a single bus which is accessible by the destination processor. This allows the destination processor to read the (n - 1) messages from these memory modules in a single vector-read access. In the absence of vector access, this step is implemented by (n - 1) scalar accesses with considerably higher access time. Thus, vector access facilitates implementing vector mailboxes with interleaved memories. Figure 5.3 shows five different memory access types, A1-A5 as listed in Table 2.1, and their corresponding memory-write or memory-read graphs.

5.3 Vectorizing Communication Patterns

Whether a primitive pattern can be implemented on a given multiprocessor as an optimal combination of vector and scalar memory accesses depends on the following parameters:

1. Type of message pattern
2. Processor-memory interconnection network of the multiprocessor
3. Interleaved memory organization of the multiprocessor

Figure 5.3: Memory access types and corresponding memory-read and memory-write graphs: (a) scalar write, (b) scalar read, (c) broadcast, (d) vector read, (e) vector write.

This optimality is targeted towards the objective of implementing shared-variable communication with minimal access time. Determining this optimal combination is defined as communication vectorization. We propose a four-step methodology to achieve this vectorization:

1. Grouping the memory-write and memory-read graphs into four memory access sets: inherently scalar access, potential for parallel-scalar access, potential for vector access, and potential for parallel-vector access.
2. Checking which members of these access sets can be implemented on the multiprocessor in a conflict-free manner as single-step memory accesses.
3. For those members demonstrating conflict, determining alternate memory access types to implement them as multiple memory-access steps with minimal increase in access time.
4. Implementing the communication steps as a combination of memory-write and memory-read access steps with appropriate synchronization.

Consider the memory-write and memory-read graphs MGiw and MGir, 1 ≤ i ≤ 7. We define four memory-access sets, MAs, MAps, MAv, and MApv, corresponding to scalar, potential parallel-scalar, potential vector, and potential parallel-vector memory accesses, respectively. Consider the characteristics of the primitive patterns shown in Table 5.1. The memory-write and memory-read graphs reflect the sending and receiving characteristics of the message-passing patterns. For example, the all-to-one personalized pattern C5 involves (m - 1) sender nodes, 1 receiver node, and (m - 1) total messages. This pattern requires (m - 1) mailboxes for shared-variable communication. Since there are multiple source nodes, the MG5w graph has potential for parallel-write access. Since the number of messages transferred per sending processor is one, this write access does not exhibit potential for vectorization. Thus, the MG5w graph belongs to the parallel-scalar memory-access set MAps. Similarly, the single receiver node forces the MG5r graph to be implemented sequentially, while the multiple messages per receiver node make this read access a potential candidate for vectorization. Hence, the MG5r graph belongs to the vector memory-access set MAv.
For each primitive pattern, compute the indices s, d, n_s, and n_d corresponding to the number of sending nodes, the number of receiving nodes, the number of messages sent per processor, and the number of messages received per processor, respectively.

Property 1: Grouping of memory-write and memory-read graphs into memory-access sets obeys the following membership rules:
(s > 1, n_s > 1) or (d > 1, n_d > 1)  ->  MApv
(s = 1, n_s > 1) or (d = 1, n_d > 1)  ->  MAv
(s > 1, n_s = 1) or (d > 1, n_d = 1)  ->  MAps
(s = 1, n_s = 1) or (d = 1, n_d = 1)  ->  MAs

Definition 1: Based on Property 1, the memory-write and memory-read graphs are grouped into memory-access sets for scalar, parallel-scalar, vector, and parallel-vector accesses as follows:
MAs  = {MG1w, MG1r}
MAv  = {MG3w, MG4w, MG5r}
MAps = {MG2w, MG5w, MG2r, MG3r, MG4r}
MApv = {MG6w, MG7w, MG6r, MG7r}

5.3.1 Operational Digraphs

Consider a crossbar-connected multiprocessor as shown in Fig. 2.1b. The connectivity between n processors, n^2 memory modules, and n buses can be expressed by a connectivity graph as shown in Fig. 5.4a. The bidirectional nature of the edges reflects both read and write capabilities. This graph, however, does not reflect the operational characteristics of the system. Different scalar and vector memory accesses are possible on this configuration. During each memory access, an electrical connection is established between processors, buses, and the associated memory modules. This connectivity constitutes a subgraph of the connectivity graph. For a conflict-free memory access, the associated subgraph is identified as an operational digraph.

Consider such an operational digraph. This subgraph represents a conflict-free read or write memory access. For a read access, a processor node has indegree 1 and a memory node has outdegree 1. For a write access, a memory node has indegree 1 and a processor node has outdegree 1. The bus is connected to either a single memory node or multiple (at most n, where n is the degree of memory interleaving) memory nodes, depending on whether the operational digraph represents a scalar or a vector access. This leads to the following property:

Property 2: An operational digraph satisfies the following constraints:
processor node: (indegree = 1) or (outdegree = 1)
bus node: (1 ≤ indegree ≤ n, outdegree = 1) or (indegree = 1, 1 ≤ outdegree ≤ n)
memory node: (outdegree = 1) or (indegree = 1)

[Figure 5.4: (a) Connectivity graph and (b) a few operational digraphs of the crossbar-connected multiprocessor configuration.]

Depending on the connectivity, an operational digraph represents one of the following memory accesses: scalar-read, scalar-write, vector-read, vector-write, parallel-scalar-read, parallel-scalar-write, parallel-vector-read, or parallel-vector-write. Figure 5.4b shows a few operational digraphs of the crossbar-connected multiprocessor. When the nodes corresponding to processors, memories, and buses in an operational digraph satisfy the above properties, the related memory access becomes conflict-free. This leads to the following definition:

Definition 2: Each operational digraph represents a single-step memory access.
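Both checks above translate directly into code. The following is a minimal illustrative sketch, not from the thesis: classify() applies the membership rules of Property 1 to one memory graph, and is_operational() applies the degree constraints of Property 2 to a candidate access subgraph. All names and the node-naming convention (P, B, M prefixes) are assumptions made here for illustration.

from collections import Counter

def classify(nodes, per_node):
    """Property 1: 'nodes' is s for a write graph (d for a read graph),
    'per_node' is n_s for a write graph (n_d for a read graph)."""
    if nodes > 1 and per_node > 1:
        return "MApv"        # potential parallel-vector access
    if nodes == 1 and per_node > 1:
        return "MAv"         # potential vector access
    if nodes > 1 and per_node == 1:
        return "MAps"        # potential parallel-scalar access
    return "MAs"             # inherently scalar access

def is_operational(edges, n):
    """Property 2: check the degree constraints on a directed access
    subgraph.  edges is a list of (src, dst) pairs over node names such
    as 'P0', 'B1', 'M2'; n is the degree of memory interleaving."""
    indeg  = Counter(dst for _, dst in edges)
    outdeg = Counter(src for src, _ in edges)
    for v in {x for e in edges for x in e}:
        i, o = indeg[v], outdeg[v]
        if v.startswith("P"):                       # processor node
            ok = (i == 1) or (o == 1)
        elif v.startswith("B"):                     # bus node
            ok = (1 <= i <= n and o == 1) or (i == 1 and 1 <= o <= n)
        else:                                       # memory node
            ok = (o == 1) or (i == 1)
        if not ok:
            return False
    return True

# Pattern C5 with (m - 1) = 3 senders and one message per sender:
print(classify(nodes=3, per_node=1))   # MAps  (the MG5w graph)
print(classify(nodes=1, per_node=3))   # MAv   (the MG5r graph)
# A vector read of three interleaved modules over one bus into P0:
print(is_operational([("M0", "B0"), ("M1", "B0"), ("M2", "B0"), ("B0", "P0")], n=4))  # True
# Two independent scalar reads forced onto the same bus violate Property 2:
print(is_operational([("M0", "B0"), ("M1", "B0"), ("B0", "P0"), ("B0", "P1")], n=4))  # False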
5.3.2 The Vectorization Procedure

In this subsection we emphasize steps 2 and 3 of the vectorization methodology; step 4 is illustrated in the next subsection with an example program conversion. For any multiprocessor configuration, memory graphs belonging to access set MAs are always implemented as scalar accesses. For the other sets, MApv, MAv, and MAps, we first check whether their members can be implemented as conflict-free parallel-vector, vector, and parallel-scalar accesses, respectively. In case of conflict, we determine alternate memory accesses with minimal increase in access time.

Consider a memory graph with potential for parallel-vector access. There are three possibilities: (a) the graph can be implemented on a configuration in a conflict-free manner, (b) it leads to conflict, or (c) it is not feasible to implement. Let T_pv be the associated access time to implement it in a conflict-free manner. In the presence of conflict or in the absence of feasibility, this access can alternatively be implemented as multiple steps of vector, parallel-scalar, or scalar access. Define N_v, N_ps, and N_s as the required number of steps to implement this parallel-vector access as alternate vector, parallel-scalar, or scalar accesses, respectively, and let T_v, T_ps, and T_s be the associated access times.

Lemma 1: A parallel-vector memory access step can be implemented as multiple steps of vector, parallel-scalar, or scalar access with the following time constraints: T_pv ≤ N_ps·T_ps ≤ N_v·T_v ≤ N_s·T_s or T_pv ≤ N_v·T_v ≤ N_ps·T_ps ≤ N_s·T_s.

Proof: Consider a parallel-vector memory access involving m processors, with each processor accessing a vector of length n. Assume the equivalent vector and parallel-scalar accesses are feasible. We have N_ps = n and N_v = m. If neither the parallel-scalar nor the vector access is conflict-free or feasible, the access can alternatively be implemented as N_s = mn scalar access steps. Consider the respective access times estimated in Table 2.1; we have (α + β) > τ. For m ≥ n, the first constraint is satisfied; the second constraint holds for m < n.

We identify a memory access as valid if the access is feasible and can be implemented in a conflict-free manner; otherwise it is denoted as invalid. Lemma 1 leads to the following theorem.

Theorem 6: Replacement of an invalid parallel-vector access with multiple steps of valid parallel-scalar or vector access leads to minimal increase in access time. In case both accesses are invalid, a parallel-vector access is replaced with multiple steps of scalar access. Replacement of invalid parallel-scalar and vector accesses with multiple steps of scalar access leads to minimal increase in access time.

Next we provide a vectorization procedure to determine the replacement for invalid accesses on a given multiprocessor. Corresponding to a primitive pattern, this step requires allocation of mailboxes to memory modules. Without loss of generality, we give priority to write vectorization. This allows us to allocate mailboxes during a write access so that it gets implemented as a vectorized access as far as possible. Depending on this allocation, the corresponding read access may or may not demonstrate vectorization. We introduce four shared-variable communication sets: SCpv, SCps, SCv, and SCs. These sets contain the memory-write and memory-read graphs that can be implemented in minimal time using parallel-vector, parallel-scalar, vector, and scalar accesses, respectively.

Vectorization Procedure
; Memory-access sets MApv, MAps, MAv, and MAs, with memory-write and
; memory-read graphs as members, are used as inputs.
; Ways to implement these memory graphs with minimal access time are put into
; the shared-variable communication sets SCpv, SCps, SCv, and SCs.
; m = number of processors in a memory graph
; n = length of vector per processor
begin
  for 1 ≤ i ≤ 7 do
    for (graph = MGiw) and (graph = MGir) do
    begin
      ST: determine the memory-access set to which graph belongs;
      if access set is MAs
        include the graph in SCs;
      if access set is MAps
      begin
        if it is a write access
          replace it with multiple graphs of access type T1 (Fig. 5.3);
        else
          replace it with multiple graphs of access type T2;
        allocate memory modules as mailboxes to support the required access type;
        check whether the graph is an operational digraph;
        if it is an operational digraph
          include the graph in SCps with the associated access type;
        else
        begin
          delete the graph from access set MAps;
          if (m > n) then add it to access set MAv (Lemma 1);
          else add it to access set MAs;
          goto ST;
        end
      end
      if access set is MApv or MAv
      begin
        if access set is MApv then x = multiple; else x = single;
        if graph is a write operation
        begin
          if the write operation is a broadcast
            replace it with x graph(s) of access type T3;
          else
            replace it with x graph(s) of access type T4;
        end
        else
          replace it with x graph(s) of access type T5;
        allocate memory modules as mailboxes to support the required access type;
        check whether the graph is an operational digraph;
        if it is an operational digraph
        begin
          if access set is MApv
            include the graph in SCpv with the associated access type;
          else
            include the graph in SCv with the associated access type;
        end
        else
        begin
          if access set is MApv
          begin
            delete the graph from access set MApv;
            if (m ≥ n) then add it to access set MAps;
            else add it to access set MAv;
          end
          else
          begin
            delete the graph from access set MAv;
            if (m < n) then add it to access set MAps;
            else add it to access set MAs;
          end
          goto ST;
        end
      end
    end
end

Consider the above vectorization procedure. It determines replacements for invalid memory graphs with minimal increase in access time. Consider an all-to-all personalized message pattern being vectorized on a crossbar-connected multiprocessor. This pattern C7 is first transformed into the memory graphs MG7w and MG7r. According to Definition 1, both graphs belong to memory-access set MApv. Since the MG7w graph corresponds to personalized access, the vectorization procedure replaces it with multiple graphs of access type T4. Next, mailboxes are allocated to support this access type: memory modules Mi,j, 0 ≤ i, j ≤ n - 1, are used as mailboxes between processors Pi and Pj. This allocation results in a valid operational digraph, so MG7w is included in SCpv.

For MG7r, the mailbox allocation of its corresponding write graph MG7w is used. This allocation requires a processor to read messages from its column memory modules. Since a crossbar-connected multiprocessor does not provide column-oriented vector access, this allocation of mailboxes does not give rise to an operational digraph. Assuming m = n, the graph MG7r is deleted from memory-access set MApv and added to MAps. This deletion and addition leads to minimal increase in access time according to Lemma 1. The algorithm finally includes this graph in set SCps. Similar analysis can be made for the other memory graphs corresponding to primitive patterns.
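The fallback logic inside the procedure (replace an invalid access with the cheapest valid alternative, in the order given by Lemma 1) can be sketched compactly as follows. This is an illustrative outline rather than the thesis code: the operational-digraph test is abstracted behind a caller-supplied valid() predicate, the step counts mirror the proof of Lemma 1, and the m-versus-n comparisons mirror the conditions used in the procedure above.

# Hypothetical sketch of the Lemma-1 fallback ordering.
FALLBACK = {
    "pv": lambda m, n: ["ps", "v", "s"] if m >= n else ["v", "ps", "s"],
    "v":  lambda m, n: ["ps", "s"]      if m < n  else ["s"],
    "ps": lambda m, n: ["v", "s"]       if m > n  else ["s"],
    "s":  lambda m, n: [],
}

def steps(kind, m, n):
    """Number of single-step accesses when a parallel-vector access over m
    processors with vector length n is re-expressed as 'kind' accesses
    (N_pv = 1, N_v = m, N_ps = n, N_s = m*n, as in Lemma 1's proof)."""
    return {"pv": 1, "v": m, "ps": n, "s": m * n}[kind]

def choose_access(preferred, m, n, valid):
    """Return (access kind, number of steps) for the cheapest valid access."""
    for kind in [preferred] + FALLBACK[preferred](m, n):
        if valid(kind):
            return kind, steps(kind, m, n)
    raise ValueError("even scalar access is unavailable")

# Example mirroring MG7r on a crossbar-connected machine with m = n:
# the column-oriented parallel-vector and vector reads are invalid, so the
# graph falls back to parallel-scalar access.
print(choose_access("pv", m=4, n=4, valid=lambda k: k in {"ps", "s"}))   # ('ps', 4)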
Theorem 7: The memory-write and memory-read graphs corresponding to the primitive patterns can be implemented on a crossbar-connected multiprocessor with minimal time according to the following shared-variable communication sets, corresponding to parallel-vector, vector, parallel-scalar, and scalar accesses, respectively:
SCpv = {MG6w, MG7w}
SCv  = {MG3w, MG4w}
SCps = {MG2w, MG2r, MG3r, MG4r, MG5w, MG6r, MG7r}
SCs  = {MG1w, MG1r, MG5r}

Proof: Consider the communication vectorization procedure analyzing replacements for the memory graphs corresponding to primitive patterns on a crossbar-connected multiprocessor. For each memory-write and memory-read graph, the procedure terminates and determines the shared-variable communication set to which the graph belongs. The procedure starts its search with the access types demonstrating potential for maximum vectorization and (or) parallelism, as determined by Definition 1, and the search is prioritized between alternative memory access types. During each search, minimal increase in access time is guaranteed by Lemma 1. Thus, the final replacement always leads to minimal access time; any other replacement increases the access time.

Similar vectorization determines how to implement the memory graphs on the OMP and on single-bus-based multiprocessors with minimal access time. The vectorization results for the primitive patterns on the three multiprocessors are summarized in Table 5.2.

Table 5.2: Memory-access types leading to minimal access time for implementing primitive patterns on three multiprocessors using shared-variable communication vectorization (pv = parallel-vector access, v = vector access, ps = parallel-scalar access, s = scalar access).

  Pattern, access                    Single-bus   Crossbar-connected   Orthogonally-connected
  one-to-one, write                  s            s                    s
  one-to-one, read                   s            s                    s
  permutation, write                 s            ps                   ps
  permutation, read                  s            ps                   ps
  one-to-all-broadcast, write        v            v                    v
  one-to-all-broadcast, read         s            ps                   ps
  one-to-all-personalized, write     v            v                    v
  one-to-all-personalized, read      s            ps                   ps
  all-to-one, write                  s            ps                   ps
  all-to-one, read                   v            s                    v
  all-to-all-broadcast, write        v            pv                   pv
  all-to-all-broadcast, read         v            ps                   pv
  all-to-all-personalized, write     v            pv                   pv
  all-to-all-personalized, read      v            ps                   pv

5.3.3 An Example Program Conversion

The communication steps of a multicomputer program are converted [PH91c, PH91a] into memory-write and memory-read steps on a given multiprocessor configuration using the following steps:

1. Reduce the communication steps of the multicomputer program in terms of the primitive patterns.
2. Perform communication vectorization of the primitive patterns on the given multiprocessor configuration and determine the shared-variable communication sets.
3. Convert the communication steps into combinations of write and read accesses corresponding to the shared-variable communication sets.
4. Restructure the memory-write and memory-read accesses by inserting synchronization primitives.

We illustrate these steps by converting the sample program presented in Fig. 5.1 to run on the three multiprocessors. First we consider the crossbar-connected multiprocessor. The sample program consists of the message patterns C1, C2, C3, C4, C8, and C10. First the patterns C8 and C10 are reduced to the respective primitive patterns C3 and C5. Theorem 7 then gives the communication vectorization for the primitive patterns on the crossbar-connected multiprocessor.
Memory-write and memory-read accesses related to shared-variable communication are implemented in a producer-consumer manner with explicit synchronization between them. We assume a static barrier MIMD hardware synchronization scheme [OD90] on the interprocessor interrupt bus. Each receiver processor executes a wait-sync operation on the sender processor before a memory-read access, and every sender processor executes an activate-sync operation after each memory-write access. Since we assume the original multicomputer program to be deadlock-free, this synchronization scheme ensures no deadlock in the converted program. These synchronization primitives are included in the program code before execution starts, and are denoted as sync primitives between the memory-write and memory-read accesses (a small software sketch of this ordering appears after the converted steps below). Memory graphs are identified by the corresponding primitive patterns with subscripts w and r. Allocation of mailboxes to memory modules is based on the shared-variable communication sets derived in Theorem 7; this mailbox activity is reflected by identifying the corresponding processor and memory module numbers. The communication steps of the sample program are converted as follows¹:

¹We use the following notation: (Px → My) for a scalar-write from processor Px to memory module My; (Px → My, Mz, ..., Mw) for a vector-write; ((Px1 → My1), (Px2 → My2)) for a parallel-scalar-write; ((Px → My, Mz, ..., Mw), (Px1 → My1, Mz1, ..., Mw1)) for a parallel-vector-write; (My → Px) for a scalar-read; (My, Mz, ..., Mw → Px) for a vector-read; ((My → Px), (My1 → Px1)) for a parallel-scalar-read; and ((My, Mz, ..., Mw → Px), (My1, Mz1, ..., Mw1 → Px1)) for a parallel-vector-read.

A. Crossbar-Connected Multiprocessor System:
Step 1: C1w(P0 → M0,1), sync, C1r(M0,1 → P1)
Step 2: C2w((P1 → M1,2), (P3 → M3,0)), sync, C2r((M1,2 → P2), (M3,0 → P0))
Step 3: C3w(P0 → M0,1, M0,2, M0,3), sync, C1r(M0,1 → P1), C1r(M0,2 → P2), C1r(M0,3 → P3)
Step 4: C4w(P0 → M0,1, M0,2, M0,3), sync, C1r(M0,1 → P1), C1r(M0,2 → P2), C1r(M0,3 → P3)
Step 5: C1w(P1 → M1,0), C1w(P2 → M2,0), C3w(P3 → M3,1, M3,2), sync, C1r(M1,0 → P0), C1r(M2,0 → P0), C1r(M3,1 → P1), C1r(M3,2 → P2)

B. Orthogonally-Connected Multiprocessor System:
Step 1: C1w(P0 → M0,1), sync, C1r(M0,1 → P1)
Step 2: C2w((P1 → M1,2), (P3 → M3,0)), sync, C2r((M1,2 → P2), (M3,0 → P0))
Step 3: C3w(P0 → M0,1, M0,2, M0,3), sync, C3r((M0,1 → P1), (M0,2 → P2), (M0,3 → P3))
Step 4: C4w(P0 → M0,1, M0,2, M0,3), sync, C4r((M0,1 → P1), (M0,2 → P2), (M0,3 → P3))
Step 5: C5w((P1 → M1,0), (P2 → M2,0)), C3w(P3 → M3,1, M3,2), sync, C5r(M1,0, M2,0 → P0), C3r((M3,1 → P1), (M3,2 → P2))

C. Single-Bus Multiprocessor System:
Step 1: C1w(P0 → M1), sync, C1r(M1 → P1)
Step 2: C1w(P1 → M2), C1w(P3 → M0), sync, C1r(M2 → P2), C1r(M0 → P0)
Step 3: C3w(P0 → M1, M2, M3), sync, C1r(M1 → P1), C1r(M2 → P2), C1r(M3 → P3)
Step 4: C4w(P0 → M1, M2, M3), sync, C1r(M1 → P1), C1r(M2 → P2), C1r(M3 → P3)
Step 5: C1w(P1 → M0), C1w(P2 → M0), C3w(P3 → M1, M2), sync, C1r(M0 → P0), C1r(M0 → P0), C1r(M1 → P1), C1r(M2 → P2)
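To make the producer-consumer ordering concrete, the following is a minimal software-only sketch of one converted step: a mailbox write, the activate-sync by the sender, the wait-sync by the receiver, and then the read. It only emulates the ordering semantics with threads and an event; the thesis assumes a hardware barrier scheme on the interprocessor interrupt bus, and all names here are hypothetical.

# Minimal sketch of the write / sync / read ordering used in steps A-C above.
import threading

class Mailbox:
    def __init__(self):
        self.data = None
        self.sync = threading.Event()      # stands in for the hardware sync line

    def write_and_activate(self, message):
        self.data = message                # memory-write access by the sender
        self.sync.set()                    # activate-sync after the write

    def wait_and_read(self):
        self.sync.wait()                   # wait-sync on the sender
        return self.data                   # memory-read access by the receiver

# Step 1 of the example: C1w(P0 -> M0,1), sync, C1r(M0,1 -> P1)
mailbox_0_1 = Mailbox()
sender   = threading.Thread(target=mailbox_0_1.write_and_activate, args=("msg from P0",))
receiver = threading.Thread(target=lambda: print(mailbox_0_1.wait_and_read()))
receiver.start(); sender.start()
sender.join(); receiver.join()             # prints: msg from P0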
5.4 Communication Bandwidth

We analyze the communication bandwidth of vectorized shared-variable communication between the processors of an n-processor OMP. Consider a message pattern having a maximum of d destination processors from any source processor, 1 ≤ d ≤ n. Assume that the processors exchange messages with a uniform length of l bytes each. This message pattern can be defined as a subset of the all-to-all pattern with vector length d. The memory subsystem design is assumed to support vector memory accesses with vector length d, 1 ≤ d ≤ n. Hence, this message pattern can be implemented as a block-vector-write access followed by a block-vector-read access. Since an n-processor OMP supports parallel memory data transfer over n buses, a maximum of nd messages with a total capacity of ndl bytes can be transferred with an overhead of 2(α + β + (dl - 1)τ + (l - 1)δ) + γ, as derived in Table 2.1. The parameter γ reflects bus switching and synchronization overhead. This leads to a communication bandwidth of ndl / (2(α + β + (dl - 1)τ + (l - 1)δ) + γ) bytes/sec.

Consider the design parameters used for the orthogonal multiprocessor prototype at USC [HDP+90]. Table 5.3 shows the available bandwidth on a 32-processor OMP for varying message lengths and numbers of destinations; it has the potential to provide a raw bandwidth of 284 Mbytes/sec for a message length of 4K bytes. The above expression has a fixed overhead of (α + β - τ - δ). For large values of l, the pipelined transfer overhead (dτ + δ)l becomes significant. This leads to an upper bound on the available bandwidth:

Peak OMP bandwidth = ndl / (2((α + β - τ - δ) + (dτ + δ)l) + γ) ≈ nd / (2(dτ + δ))  for large l    (5.2)

Table 5.3: Estimated vectorized communication bandwidth in Mbytes/sec on a (32 x 32) orthogonal multiprocessor for varying message lengths (d = maximum number of destinations in a message pattern).

  d     message length l in bytes
        16      64      256     1024    4096
  1     52.24   60.59   63.11   63.78   63.94
  2     89.82   101.89  105.43  106.36  106.59
  4     140.27  154.57  158.61  159.65  159.91
  8     195.05  208.45  212.09  213.02  213.26
  16    242.37  252.45  255.10  255.78  255.94
  32    275.82  282.24  283.89  284.31  284.41

Since d ≤ n, the peak bandwidth is proportional to n^2 / (2(nτ + δ)). The communication bandwidth increases with the message length l and with the number of destinations d in a message pattern. Communication bandwidths for different OMP sizes are estimated in Table 5.4. The peak bandwidth increases as O(n) for large n. For every d pipelined cycles, there is an overhead of δ for changing to the next address; with a fixed δ, vectorized communication therefore becomes more efficient for larger values of d.

Table 5.4: Estimated communication bandwidth in Mbytes/sec for different OMP sizes (n = OMP size, d = maximum number of destinations in a message pattern, d ≤ n, NA = not applicable).

  d     OMP size n
        4       8       16      24      32
  1     8.00    16.00   32.00   48.00   64.00
  2     13.33   26.67   53.33   80.00   106.67
  4     20.00   40.00   80.00   120.00  160.00
  8     NA      53.33   106.67  160.00  213.33
  12    NA      NA      120.00  180.00  240.00
  16    NA      NA      128.00  192.00  256.00
  20    NA      NA      NA      200.00  266.67
  24    NA      NA      NA      205.71  274.29
  28    NA      NA      NA      NA      280.00
  32    NA      NA      NA      NA      284.44
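The large-l limit of Eq. (5.2) is easy to evaluate. The sketch below does so; τ = 50 ns matches the conservative 20 MHz bus transfer rate mentioned later in this chapter, while δ = 200 ns is an assumption made here (chosen so that the printed values track Table 5.4), not a figure taken from Table 2.1.

# Sketch of the large-message peak bandwidth of Eq. (5.2):
#   peak ~= n*d / (2*(d*tau + delta))  bytes/sec.
TAU   = 50e-9     # pipelined bus transfer time per byte (assumed, 20 MHz)
DELTA = 200e-9    # address-change overhead (assumed for illustration)

def peak_bandwidth_mbytes(n, d, tau=TAU, delta=DELTA):
    """Peak bandwidth in Mbytes/sec for an n-processor OMP with at most
    d destinations per source, in the large-message limit."""
    return n * d / (2.0 * (d * tau + delta)) / 1e6

for n in (4, 8, 16, 24, 32):
    print(f"n={n:2d}:", [round(peak_bandwidth_mbytes(n, d), 2) for d in (1, 2, 4) if d <= n])
print(round(peak_bandwidth_mbytes(32, 32), 2))   # ~284.44, the d = n = 32 entry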
Chapter 6

Program Conversion From Multicomputers to Multiprocessors

6.1 Introduction

Communication vectorization provides a mechanism to implement the interprocessor communication steps of a parallel program as vectorized shared-variable memory access steps. This allows a multiprocessor system to convert and implement distributed-memory parallel programs by using shared-memory mailbox communication. It is true that such program conversion makes a shared-memory system a flexible architecture by allowing it to support both the shared-memory and message-passing models of parallel computation. However, the question arises whether such program conversion using communication vectorization is beneficial in terms of reducing communication complexity. In this chapter, we emphasize this aspect of communication complexity reduction. First, we discuss the issues in such program conversion. Since the hypercube is a popular multicomputer architecture, we take representative hypercube programs, convert them onto crossbar-connected and orthogonally-connected multiprocessors, and compare the respective communication complexities. We present analytical and simulation results which confirm a significant reduction in communication complexity for programs using dense interprocessor communication. Besides the hypercube, we also discuss converting mesh programs to run on the orthogonally-connected multiprocessor.

6.2 Converting Hypercube Programs

6.2.1 Vectorizing Communication Patterns

We consider a hypercube with circuit-switched routing [Bok91b]. A typical hypercube system supporting this routing scheme is the Intel iPSC-860 [Bok91a]. Since a single-bus multiprocessor demonstrates limited vectorization capability, we consider only crossbar-connected and orthogonally-connected multiprocessors. For simplicity, we refer to the computational units of a hypercube as nodes and those of a multiprocessor as processors. Programs developed to run on an m-node hypercube are converted to run on an n-processor multiprocessor, where m ≥ n. For m = n, the computational task of each hypercube node is mapped to a single processor of the multiprocessor, and inter-node message-passing steps are vectorized and mapped onto shared-memory mailboxes.

For m > n, we group the m hypercube nodes into n clusters with (m/n) nodes assigned to each cluster. Assume m = 2^p and n = 2^q for some integers p and q. Consider a mapping of the computational tasks (with their associated data allocation) of each cluster of 2^(p-q) nodes onto a single processor. This collapses the hypercube communication corresponding to (p - q) dimensions into intra-cluster communication. These intra-cluster communication steps are implemented as memory-write and memory-read accesses to the local memory attached to the processor. By restructuring the computational tasks of (m/n) processes into a single process, these local-memory communication steps get integrated into the computation. This leaves the inter-cluster communication steps belonging to the remaining q dimensions to be implemented using shared memory. Figure 6.1 illustrates different clustering options. Based on the program characteristics, the most frequently used (p - q) dimensions may be mapped as intra-cluster communication to achieve minimal communication complexity.

[Figure 6.1: Different clustering options for a 32-node hypercube whose program is converted to run on a 16-processor multiprocessor. The link dimensions are labeled +x0/-x0 through +x4/-x4, and each option keeps one dimension pair inside the clusters:
  Option   Intra-cluster    Inter-cluster
  1        +x0, -x0         +x1, -x1, +x2, -x2, +x3, -x3, +x4, -x4
  2        +x1, -x1         +x0, -x0, +x2, -x2, +x3, -x3, +x4, -x4
  3        +x2, -x2         +x0, -x0, +x1, -x1, +x3, -x3, +x4, -x4
  4        +x3, -x3         +x0, -x0, +x1, -x1, +x2, -x2, +x4, -x4
  5        +x4, -x4         +x0, -x0, +x1, -x1, +x2, -x2, +x3, -x3 ]
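A small sketch of this clustering, with hypothetical names: given a choice of (p - q) intra-cluster dimensions, it maps a hypercube node to a (processor, local index) pair and classifies each link dimension as intra- or inter-cluster, in the spirit of the options listed in Fig. 6.1.

# Hypothetical sketch of node clustering for converting an m = 2**p node
# hypercube program to an n = 2**q processor multiprocessor.  The bits of a
# node id in the chosen "intra" dimensions become the local index inside a
# cluster; the remaining q bits select the processor.

def make_mapping(p, q, intra_dims):
    """intra_dims: the (p - q) hypercube dimensions collapsed into a cluster."""
    assert len(intra_dims) == p - q
    inter_dims = [d for d in range(p) if d not in intra_dims]

    def node_to_processor(node):
        local = sum(((node >> d) & 1) << i for i, d in enumerate(intra_dims))
        proc  = sum(((node >> d) & 1) << i for i, d in enumerate(inter_dims))
        return proc, local

    def link_kind(dim):
        return "intra-cluster" if dim in intra_dims else "inter-cluster"

    return node_to_processor, link_kind

# Option 1 of Fig. 6.1: a 32-node hypercube (p = 5) on 16 processors (q = 4),
# keeping dimension x0 inside the clusters.
node_to_processor, link_kind = make_mapping(p=5, q=4, intra_dims=[0])
print(node_to_processor(13))          # node 13 -> (processor 6, local index 1)
print(link_kind(0), link_kind(3))     # intra-cluster inter-cluster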
We consider the hypercube messages to have a uniform length of l bytes each. Communication vectorization implements message passing through shared-memory mailboxes. As derived in Table 5.2, each primitive pattern gets implemented as a pair of memory-write and memory-read accesses. For m = n, these accesses are done in blocks of l bytes. For m > n, intra-cluster communication accounts for all possible message exchanges between the nodes associated with a cluster; thus the effective message length becomes a multiple of l bytes. This effective message length depends on the primitive pattern and is derived later in this section.

For larger message lengths, we use block accesses as shown in Table 2.1. The total time to implement a memory graph depends on whether the access can be implemented as a single step or as multiple steps. For example, consider the memory graphs corresponding to the all-to-all personalized pattern. As shown in Table 5.2, the write graph gets implemented as a parallel-vector access on both the crossbar-connected and the orthogonally-connected multiprocessor. Since all processors work in parallel, the time to implement this access is the same as that of a vector access, as shown in Table 2.1. The corresponding read graph gets implemented as a parallel-scalar access on the crossbar-connected multiprocessor. Since (n - 1) processors are involved in this parallel-scalar access, it gets implemented as (n - 1) scalar accesses, so the total time to implement it is (n - 1) times that of a scalar access. For each memory graph, we use this approach in determining the total access time. Table 6.1 shows the equivalent memory access steps to implement the primitive message patterns on the crossbar-connected and orthogonally-connected multiprocessors.

Table 6.1: Equivalent memory accesses to implement primitive message patterns on multiprocessors by communication vectorization (the entry against each access type gives the number of required memory-access steps and whether the processors access in parallel).

  One-to-one:
    Crossbar:    block-scalar-write (1) + block-scalar-read (1)
    Orthogonal:  block-scalar-write (1) + block-scalar-read (1)
  Permutation:
    Crossbar:    block-scalar-write (1, par) + block-scalar-read (1, par)
    Orthogonal:  block-scalar-write (1, par) + block-scalar-read (1, par)
  One-to-all (broadcast):
    Crossbar:    block-broadcast (1) + block-scalar-read (n-1)
    Orthogonal:  block-broadcast (1) + block-scalar-read (1, par)
  One-to-all (personalized):
    Crossbar:    block-vector-write (1) + block-scalar-read (n-1)
    Orthogonal:  block-vector-write (1) + block-scalar-read (1, par)
  All-to-one:
    Crossbar:    block-scalar-write (1, par) + block-scalar-read (n-1)
    Orthogonal:  block-scalar-write (1, par) + block-vector-read (1)
  All-to-all (broadcast):
    Crossbar:    block-broadcast (1, par) + block-scalar-read (n-1, par)
    Orthogonal:  block-broadcast (1, par) + block-vector-read (1, par)
  All-to-all (personalized):
    Crossbar:    block-vector-write (1, par) + block-scalar-read (n-1, par)
    Orthogonal:  block-vector-write (1, par) + block-vector-read (1, par)

6.2.2 Hypercube Communication Complexity

Consider the primitive patterns being implemented on an m-node hypercube using circuit-switched communication. We consider message transfers with unblocked-send and blocked-receive characteristics [Bok91a]. The time complexity to communicate a message of l bytes to a node at a distance of d hops is:

T_h = t_h + l·τ_h + d·v    (6.1)

where t_h is the message start-up time, τ_h is the data transfer time per byte, and v is the average circuit-switching time per hop. These parameters on the iPSC/860 hypercube, empirically derived by Bokhari [Bok91a], are t_h = 95.0, τ_h = 0.394, and v = 10.3 microseconds. We use these parameters in our analysis.
This communication time T_h can be expressed as a function (c_h + d_h·l), where c_h reflects the fixed overhead independent of the message length l, and d_h reflects the variable overhead per byte of message.

Table 6.2 shows the communication complexities for the primitive patterns. For the one-to-one pattern, we assume the communicating nodes to be farthest apart, at a distance of log m hops. At least two communicating nodes at a distance of log m hops are assumed in the case of permutation. One-to-all communication for broadcast-type messages is implemented as a tree structure of height log m. One-to-all communication for personalized message exchange is implemented as a sequence of (m - 1) one-to-one message-passing substeps; the message length remains the same in each hop of this pattern. The total number of hops traversed in the (m - 1) substeps is m·log m / 2, so the average number of hops per substep is m·log m / (2(m - 1)).

The all-to-one pattern also gets implemented as a hierarchical tree structure. Nodes at a level merge the messages from their two children and send the result to their respective parents. Depending on the computation involved, merging may not increase the message length in each substep; one such example is histogramming, which we consider in our simulation experiments in the following section. There are log m communication substeps involved in implementing this pattern, and each substep covers a distance of only one hop.

An optimal circuit-switched procedure, developed by Bokhari [Bok91b], is used to implement all-to-all communication for both broadcast and personalized types of message exchange. This procedure is implemented in (m - 1) substeps. During the i-th substep, 1 ≤ i ≤ m - 1, node j, 0 ≤ j ≤ m - 1, sends a message to node j XOR i, where XOR is the bitwise exclusive-or operation. This procedure leads to conflict-free communication in each substep. The message length remains unchanged in each substep, and all pairs of processors remain at identical distances from each other. This leads to an average of m·log m / (2(m - 1)) hops per transmission substep.

Table 6.2: Communication complexities for primitive patterns on an m-node hypercube using circuit-switched communication (m = number of nodes, t_h = communication start-up time, τ_h = data transfer time per byte, v = circuit-switched communication time per hop, l = uniform message length, c_h = fixed overhead, d_h = variable overhead per byte of message transfer).

  One-to-one:
    Total complexity:  t_h + l·τ_h + log m · v
    c_h = t_h + log m · v,   d_h = τ_h
  Permutation:
    Total complexity:  t_h + l·τ_h + log m · v
    c_h = t_h + log m · v,   d_h = τ_h
  One-to-all (broadcast):
    Total complexity:  log m · (t_h + l·τ_h + v)
    c_h = log m · (t_h + v),   d_h = log m · τ_h
  One-to-all (personalized):
    Total complexity:  (m - 1)·(t_h + l·τ_h + (m·log m / (2(m - 1)))·v)
    c_h = (m - 1)·(t_h + (m·log m / (2(m - 1)))·v),   d_h = (m - 1)·τ_h
  All-to-one:
    Total complexity:  log m · (t_h + l·τ_h + v)
    c_h = log m · (t_h + v),   d_h = log m · τ_h
  All-to-all (broadcast and personalized):
    Total complexity:  (m - 1)·(t_h + l·τ_h + (m·log m / (2(m - 1)))·v)
    c_h = (m - 1)·(t_h + (m·log m / (2(m - 1)))·v),   d_h = (m - 1)·τ_h
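The rows of Table 6.2 are easy to evaluate numerically. The short sketch below does so using the iPSC/860 parameters quoted above (t_h = 95.0 μs, τ_h = 0.394 μs/byte, v = 10.3 μs/hop); the formulas are transcribed from the table, and the m and l values in the example call are arbitrary illustration values.

# Sketch evaluating the hypercube communication complexities of Table 6.2
# (all times in microseconds).
from math import log2

T_H, TAU_H, V = 95.0, 0.394, 10.3

def hypercube_time(pattern, m, l):
    """T_h for a primitive pattern on an m-node circuit-switched hypercube
    with l-byte messages."""
    lg = log2(m)
    avg_hops = m * lg / (2 * (m - 1))
    if pattern in ("one-to-one", "permutation"):
        return T_H + l * TAU_H + lg * V
    if pattern in ("one-to-all-broadcast", "all-to-one"):
        return lg * (T_H + l * TAU_H + V)
    # one-to-all-personalized, all-to-all-broadcast, all-to-all-personalized
    return (m - 1) * (T_H + l * TAU_H + avg_hops * V)

for p in ("one-to-one", "one-to-all-broadcast", "all-to-all-personalized"):
    print(p, round(hypercube_time(p, m=16, l=1024), 1), "us")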
This mapping allocates data associated with ( ^ ) nodes to a single pro- Jcessor. Shared-variable communication between processors now reflect inter-cluster lypercube communication. This leads to different message lengths for > 1). We denote this new message length as effective message length L. For example, we have L = (~ )2l for one-to-all, personalized communication. Similar to expressing hypercube communication complexity by two param eters C h and dh, we introduce param eters cc,dc,c0, and da for crossbar-connected and orthogonally-connected multiprocessors, respectively. The param eters cc and c0 re- jflect fixed overheads; while dc and d0 reflect variable overhead per byte of of message length. This leads to Tc = cc + dcl and Ta = c0 + dQ l, where Tc and T0 denote total tim e complexity to implement a pattern on crossbar-connected and orthogonally- connected multiprocessors, respectively. Table 6.3 shows complexities in implement ing prim itive patterns on the two multiprocessors with communication vectorization. The entries are computed from the results derived in Tables 2.1, 5.2, and 6.1. 6 .2 .4 R e d u c tio n in C o m m u n ic a tio n C o m p le x ity 'Now we compare communication complexities of hypercube communication with jthose of vectorized shared-variable communication. The following theorem deter mines reduction in communication complexities associated with vectorized shared- variable communication. T h e o re m 8 Vectorizing primitive patterns of an n-node hypercube program to run on n-processor crossbar-connected and orthogonally-connected multiprocessors leads \to asymptotic relative reductions in communication complexity of {{dh — dc)/ dh) and [{dh — d0)/dh), respectively. For shorter messages, respective relative reductions are ((ch - cc)/c h) and {{ch - c0)/c h)). P ro o f: Compare communication complexities derived in Tables 6.2 and 6.3. Each time complexity has a fixed overhead and a variable overhead. The param eters Ch, cc, 8 L „ r i i i ! i t i ,Table 6.3: Time Complexities and Associated Param eters to Im plement Prim itive, Patterns on Two Multiprocessors using Communication vectorization (m = size of. hypercube, n = size of multiprocessors, I = message length in hypercube, L = effective message length on multiprocessors, cc, c0 = fixed overhead, dc, d0 = variable overhead, a = constant access tim e, 0 = memory access and bus transfer tim e, r = pipelined bus transfer tim e, 6 = address change overhead in block access, and 6 = .synchronization and bus-switching tim e). Prim itive Patterns L Crossbar-connected Multiprocessors Orthogonally-connected Multiprocessors One-to-one I cc = 2a + 7 dc = 2 0 c0 = 2 a + 7 do = 20 Perm utation m f n cc = 2 a + 7 dc = 2 ^ 0 c0 = 2 a + 7 dQ = 2 */J One-to-all (broadcast) —l n cc = n a + 7 dc = rnfd c0 = 2 a + 7 = 2 ^ 0 One-to-all (personalized) n* cc = n a + 0 — r — 5 + 7 4 = ^ r + 3 ^ + ( n - 1 ) ^ / 3 ca = 2a + r — <5 + 7 J » 7l 2 — I m 2 I T O 2 /3 = — r + -^ 6 + -zr/ 3 All-to-one l cc = -na + 7 dc = n0 c0 = 2a — t — 6 + j d0 = 0 + n r + 6 All-to-all (broadcast) m i n cc = n a + 7 dc = m 0 c0 = 2a + 0 + j — r — 8 do = ^ 0 + m r + £ 6 All-to-all (personalized) m* 1 n 2 1 cc = na + 0 — t — 6 + 7 4 = * ^ + £ « + ( » - ! ) £ / * c0 = 2 (a + 0 - r — 6 ) + 7 d0 = 2 — r + 2 ^ 6 0 n 1 W* 8 8 . . . and c0 reflect fixed overheads for hypercube, crossbar-connected multiprocessor, and orthogonally-connected multiprocessors, respectively. 
The parameters d_h, d_c, and d_o, multiplied by the message length, determine the respective variable overheads. Consider vectorizing a hypercube message-passing step on the orthogonally-connected multiprocessor. The total reduction in communication complexity after vectorization is ((c_h - c_o) + (d_h - d_o)·l), and the relative reduction is ((c_h - c_o) + (d_h - d_o)·l)/(c_h + d_h·l). For long messages (large l), we have d_h·l >> c_h, d_o·l >> c_o, and (d_h - d_o)·l >> (c_h - c_o). This leads to an asymptotic relative reduction of ((d_h - d_o)/d_h). For short messages (small l, l ≤ 16), the fixed overhead dominates and leads to a relative reduction of ((c_h - c_o)/c_h). Similar arguments hold for the crossbar-connected multiprocessor.

Table 6.4 estimates the percentage reductions in complexity for m = n = 16. The hypercube communication parameters used to calculate these reductions are based on the iPSC-860 hypercube system design [Bok91a]. The memory access parameters used in this estimation are the conservative parameters used in the prototype design of an orthogonally-connected multiprocessor [HDP+90]. The comparison results are sensitive to the memory access parameters α, β, δ, and τ; for large l, however, they depend entirely on τ. We have taken a conservative approach in using τ = 50 nsec, corresponding to 20 MHz data transfer on the bus. Current bus technology supports considerably higher data transfer rates, and such high-performance bus designs will lead to still higher reductions.

Table 6.4: Analytically estimated asymptotic percentage reductions (negative entries correspond to an increase) in communication complexity by converting a 16-processor hypercube program onto 16-processor multiprocessors (CCM = crossbar-connected multiprocessor, OMP = orthogonally-connected multiprocessor, l = message length).

  Primitive Pattern             large l            small l
                                CCM      OMP       CCM      OMP
  One-to-one                    -1.5     -1.5      94.7     94.7
  Permutation                   -1.5     -1.5      94.7     94.7
  One-to-all (broadcast)        -81.0    74.6      87.0     98.5
  One-to-all (personalized)     36.5     80.5      96.0     98.9
  All-to-one                    -90.4    27.0      86.3     95.3
  All-to-all (broadcast)        49.0     79.8      96.7     98.9
  All-to-all (personalized)     32.0     66.0      95.8     98.2

For all patterns, significant reductions are indicated for small l. This reduction is mainly due to the large communication start-up time (t_h) in the hypercube compared to the small memory-access start-up time in the multiprocessors. The reductions vary for large l. Both one-to-one and permutation take almost equal overheads on the hypercube and on the multiprocessors. The one-to-all-broadcast pattern shows a reduction on the orthogonally-connected multiprocessor and an increase on the crossbar-connected multiprocessor; the increase is due to the memory-read access being implemented as (n - 1) scalar accesses on the crossbar-connected multiprocessor. A similar explanation holds for the personalized type of one-to-all message exchange. Communication vectorization leads to significant reductions for the all-to-all pattern. For personalized messages, both the write and the read steps can be vectorized on the orthogonally-connected multiprocessor, leading to a 66% reduction in communication complexity. On the crossbar-connected multiprocessor, the write step is vectorized but the read step is implemented as a parallel-scalar access, leading to a more limited 32% reduction. For the broadcast message pattern, these reductions are 79.8% and 49.0%, respectively. This indicates that this pattern can be implemented on the multiprocessors with a communication overhead two to five times smaller than that of a hypercube system.
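The percentage entries in Table 6.4 follow from the ratio used in the proof of Theorem 8. A small sketch of that calculation is given below; the overhead values passed in the example are placeholder numbers chosen only for illustration, not the actual entries of Tables 6.2 and 6.3.

# Sketch of the relative reduction used in Theorem 8:
#   reduction(l) = ((c_h - c_x) + (d_h - d_x) * l) / (c_h + d_h * l),
# which tends to (d_h - d_x)/d_h for large l and to (c_h - c_x)/c_h for
# very short messages.

def relative_reduction(c_h, d_h, c_x, d_x, l):
    """Fractional reduction of a vectorized implementation (c_x, d_x)
    over the hypercube implementation (c_h, d_h) at message length l."""
    return ((c_h - c_x) + (d_h - d_x) * l) / (c_h + d_h * l)

def asymptotic_reduction(d_h, d_x):
    return (d_h - d_x) / d_h

# Placeholder overheads (microseconds and microseconds/byte):
c_h, d_h = 1500.0, 5.91     # hypothetical dense hypercube pattern
c_x, d_x = 10.0, 2.0        # hypothetical vectorized counterpart
for l in (16, 1024, 65536):
    print(l, round(100 * relative_reduction(c_h, d_h, c_x, d_x, l), 1), "%")
print("asymptotic:", round(100 * asymptotic_reduction(d_h, d_x), 1), "%")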
Table 6.4 illustrates the advantage of shared-variable vectorized communication for different message-passing patterns.

Next we consider the reduction in communication complexity when converting hypercube programs for m >> n. Consider vectorizing the communication steps of an m-node hypercube program to run on an n-processor orthogonally-connected multiprocessor. A communication reduction exists iff, for a given n, ((c_h - c_o) + (d_h - d_o)·l) > 0; a similar argument holds for the crossbar-connected multiprocessor iff ((c_h - c_c) + (d_h - d_c)·l) > 0. In this conversion, the computational tasks of (m/n) nodes are mapped to a single processor of the multiprocessor system. This leads to an increase in computational complexity, which affects the total complexity according to the ratio of computation to communication steps in the program. Hence, the effectiveness of communication vectorization for m > n varies with the message-passing patterns. We emphasize this aspect in the following section.

6.3 Simulation Experiments and Results

We have carried out simulation experiments to support our theoretical findings. Several programs involving primitive patterns were executed on a simulated hypercube system. These programs were then converted to run on multiprocessors using communication vectorization, and the converted programs were executed on a simulated crossbar-connected and on a simulated orthogonally-connected multiprocessor. The respective communication complexities obtained from simulation are reported below.

6.3.1 Simulation Experiments Performed

The following five problems were considered in our experimentation:

• Row-shuffle permutation of a (P x P) matrix on an m-node hypercube system. The problem uses permutation communication. Initially, (P/m) consecutive rows of the matrix are assigned to each node. The rows associated with the processors are then redistributed among the nodes based on a perfect shuffle of the node indices.

• Solving AY = B on an m-node hypercube system using the Gaussian-elimination with back-substitution algorithm. The vector Y consists of P variables. Each node is associated with (P/m) linear equations and (P/m) components of Y. The communication steps used in this algorithm, for both the elimination and the back-substitution phase, are of one-to-all-broadcast type.

• Computing the histogram of a (1024 x 1024) image with B grey levels on an m-node hypercube system. Each node is associated with (P/m) rows of the image. Partial histograms are first computed at each node, in parallel. These partial histograms are then merged by performing all-to-one communication. We use a tree-structured communication to implement this merge, involving log m substeps of near-neighbor communication (a schematic of this merge appears after this list).

• Multiplying two matrices A x B = C on an m-node hypercube system. The matrices are of dimension (P x P) each. (P/m) rows of A and (P/m) columns of B are distributed to each node initially. Partial results are first computed at each node, in parallel. The columns of B are then communicated to all other nodes in an all-to-all-broadcast manner.

• Transposition of a (P x P) matrix on an m-node hypercube system. (P/m) rows of the matrix are distributed initially to the nodes. After transposition, each node receives (P/m) columns.
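The tree-structured all-to-one merge used in the histogramming problem can be sketched as follows. This is only a schematic of the communication structure (log m merge substeps across the hypercube dimensions); the names and the element-wise merge are illustrative, not the simulator code.

# Schematic of the log(m)-substep tree merge for partial histograms: in
# substep k, the node at distance 2**k sends its partial histogram to its
# tree parent across dimension k.  After log2(m) substeps node 0 holds the
# full histogram.

def tree_merge(partial_histograms):
    """partial_histograms: list of m equal-length lists, one per node
    (m is assumed to be a power of two).  Returns the merged histogram."""
    hist = [h[:] for h in partial_histograms]          # per-node state
    m = len(hist)
    k = 0
    while (1 << k) < m:                                # one substep per dimension
        step = 1 << k
        for node in range(step, m, 2 * step):          # senders at this substep
            parent = node - step                       # neighbor across dimension k
            hist[parent] = [a + b for a, b in zip(hist[parent], hist[node])]
        k += 1
    return hist[0]                                     # node 0 holds the result

# Example with m = 4 nodes and 3 grey levels:
print(tree_merge([[1, 0, 2], [0, 1, 1], [2, 2, 0], [1, 1, 1]]))   # [4, 4, 4]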
Two sets of experiments were carried out. The first set considered identical system sizes and different problem sizes. The objective was to observe the reduction in communication complexity for varying message lengths while keeping the computational complexity the same for all systems. The programs were executed on a simulated 16-node hypercube, a simulated 16-processor crossbar-connected system, and a simulated 16-processor orthogonally-connected system. Experiments were conducted for varying matrix sizes P = 32, 64, 128, 256, 512, and 1024; for the histogramming problem, we varied the number of grey levels B in a similar manner. Smaller P (B) resulted in smaller message lengths during the communication steps. For the 16-processor systems, a value of P = 1024 led to a message length of 64K bytes for several message patterns, reflecting the effect of communication vectorization on longer messages. In total, 84 experiments were performed in this set.

During the second set of experiments, we kept the problem size the same and varied the system sizes. We chose the largest problem size, P (B) = 1024, and observed the effect of communication vectorization in converting programs from large hypercube systems to run on smaller multiprocessors. For each problem, we executed programs on simulated hypercubes with 32, 64, 128, 256, 512, and 1024 processors. The converted programs were executed on simulated crossbar-connected and orthogonally-connected multiprocessors with 16 and 32 processors. An additional 38 experiments were performed in this set. Associated with these program conversions, we observed the tradeoff between the reduction in communication complexity and the increase in computational complexity.

6.3.2 Simulation Results and Implications

Figure 6.2a shows the effect of communication vectorization for permutation-type communication. Since this pattern is inherently scalar in nature, we could not vectorize it; the communication complexities were found to be almost equal for all three systems. Figure 6.2b compares communication complexities for one-to-all-broadcast communication. Compared to the hypercube, the OMP performed very well in reducing communication complexity. As the message length increased with problem size, the communication complexity on the crossbar-connected multiprocessor increased. This was due to conflict in the memory-read access to the column-oriented vector mailboxes, which was implemented as a sequence of (n - 1) scalar accesses. For longer messages, communication vectorization was not found effective on the crossbar-connected multiprocessor for this message pattern.

The comparison of communication complexities for all-to-all-broadcast communication is shown in Figure 6.2c. Both multiprocessors implemented the memory-write access as a parallel-broadcast operation. The corresponding memory-read access was implemented on the OMP in a vectorized manner, leading to a significant reduction in communication complexity. The crossbar-connected multiprocessor implemented this memory-read as a parallel-scalar access, leading to a smaller reduction. Figure 6.2d compares communication complexities for the dense all-to-all-personalized message pattern. This pattern was implemented as a parallel-vector-write followed by a vector-read access on the orthogonally-connected multiprocessor, and as a parallel-vector-write followed by a parallel-scalar-read access on the CCM. A significant reduction in communication complexity was observed on the OMP compared to the CCM.
Figure 6.2e compares communication complexities for the all-to-one message pattern. This was implemented as a single-step parallel-scalar-write followed by a vector-read access on the orthogonally-connected multiprocessor, and a significant reduction in communication complexity was observed in this case. For the CCM, the receiving processor performed (n - 1) scalar-read accesses to read the messages from the (n - 1) column-oriented mailboxes; as the message length increased, this resulted in higher communication complexity.

[Figure 6.2: Comparison of communication complexities (communication time in milliseconds versus matrix size P or grey levels B) on 16-processor systems for various problem sizes: (a) row-shuffle permutation (permutation), (b) gaussian-elimination (one-to-all-broadcast), (c) matrix multiplication (all-to-all-broadcast), (d) matrix transpose (all-to-all-personalized), and (e) histogramming (all-to-one); OMP = orthogonal multiprocessor, CCM = crossbar-connected multiprocessor, Hyp = hypercube.]

The results obtained in the first set of simulation experiments are summarized in Table 6.5. Experiments with P (B) = 32 represented problems with smaller message lengths; experiments with P (B) = 1024 involved considerably larger message lengths. Consider T_h^s, T_c^s, and T_o^s representing the communication complexities derived from the simulation experiments for the hypercube, crossbar-connected, and orthogonally-connected multiprocessors, respectively. We calculated the percentage reductions as (T_h^s - T_c^s)/T_h^s and (T_h^s - T_o^s)/T_h^s. These reductions, summarized in Table 6.5, closely match the asymptotic relative reductions estimated analytically in Table 6.4. However, there was a variation with the Gaussian-elimination example.

While deriving the results in Table 6.4, we assumed that the effective message length becomes (m/n)·l after clustering (m/n) hypercube nodes onto a single processor. This corresponds to combining the messages from all (m/n) nodes into a single large message. However, the Gaussian-elimination algorithm works differently: the messages corresponding to each of the (m/n) nodes are sent separately during the row-elimination and back-substitution phases, so there were (m/n) broadcast operations. The broadcast operation corresponding to the j-th row elimination broadcasts a message of length (P - j + 2), 2 ≤ j ≤ P - 1, while during the back-substitution phase the message length remains restricted to one byte. The combination of several non-uniform message communications led to this variation from the analytically predicted result.

Table 6.5: Percentage reductions (negative entries correspond to an increase) in communication complexity derived by simulation while converting a 16-processor hypercube program to run on 16-processor multiprocessors (CCM = crossbar-connected multiprocessor, OMP = orthogonally-connected multiprocessor, l = message length).

  Primitive Pattern             large l            small l
                                CCM      OMP       CCM      OMP
  Permutation                   -1.3     -1.3      81.8     81.8
  One-to-all (broadcast)        -8.1     87.7      89.1     98.7
  All-to-one                    -72.2    39.5      88.4     97.1
  All-to-all (broadcast)        42.7     77.6      88.8     95.5
  All-to-all (personalized)     52.4     67.2      98.2     99.3

6.3.3 Tradeoffs in Computations and Communications

Consider converting programs for an m-node hypercube to run on an n-processor multiprocessor, for m > n.
As we discussed earlier, this conversion leads to an increase in computational complexity by a factor of (m/n). The question is whether such conversion reduces communication complexity enough to reduce the total execution time. We observed this tradeoff in our second set of simulation experiments.

Consider the matrix-row-shuffle problem requiring permutation communication. Figure 6.3a shows the time complexities for various system sizes. This problem is communication-intensive, and the permutation was observed to run faster on larger hypercubes. Hence, the program conversion was found not to be advantageous in this case.

Figure 6.3b shows the tradeoffs for the Gaussian-elimination problem requiring one-to-all communication. With larger hypercubes, there was an increase in communication complexity and a decrease in computational complexity. The sixteen-processor OMP and CCM performed better than the 32-processor hypercube. Compared to a 1024-node hypercube, OMP-16 reduced the communication complexity by a factor of 21, while the computational complexity increased by a factor of 59. This was expected, because each processor on OMP-16 performed the computational tasks of 64 hypercube nodes. However, the total execution time increased by a factor of only 3.9. Considering a (processor x time) complexity measure, the program conversion in this case was found to be very effective.

The tradeoff in implementing all-to-one communication in our histogramming problem is shown in Figure 6.3c. Converting programs for a 1024-node hypercube onto the 16-processor OMP led to a reduction in communication complexity by a factor of 4.1. Since this problem is computationally intensive, this communication reduction did not help in reducing the total execution time.

[Figure 6.3: Comparison of timing complexities (computation, communication, wait time on the hypercube, and synchronization on OMP and CCM) for three problems on different hypercube, orthogonal multiprocessor, and crossbar-connected multiprocessor configurations: (a) permuting the rows of a (1024 x 1024) matrix in a shuffle fashion, (b) Gaussian elimination with back-substitution for 1024-variable linear equations, and (c) histogramming a (1024 x 1024) image with 1024 grey levels. H16 = 16-processor hypercube, O = OMP, C = CCM.]

The tradeoff for all-to-all-broadcast communication, used in the matrix multiplication problem, is shown in Fig. 6.4a. This tradeoff was found to be similar to that of the histogramming example: though there was a significant reduction in communication complexity, it was hidden by the increase in computational complexity. Considering the (processor x time) complexity measure, however, this program conversion was found to be beneficial. Fig. 6.4b shows the tradeoffs for matrix transposition involving all-to-all-personalized communication. This problem is more communication-intensive. For transposing a (1024 x 1024) matrix, a 64-node hypercube was found to be optimal. Both OMP-16 and CCM-16 performed well compared to this optimal hypercube size, and OMP-32 performed still better.
Compared to a 1024-node hypercube, OMP-16 reduced the communication complexity by a factor of 17.7 and the computational complexity by a factor of 7.8, leading to an overall reduction of the total execution time by a factor of 17.6. Program conversion was found to be most effective for this dense message pattern.

[Figure 6.4: Comparison of timing complexities (computation and communication) for two problems on different hypercube, orthogonal multiprocessor, and crossbar-connected multiprocessor configurations: (a) multiplying two (1024 x 1024) matrices (all-to-all-broadcast) and (b) transposing a (1024 x 1024) matrix (all-to-all-personalized). H16 = 16-processor hypercube, O = OMP, C = CCM.]

6.4 Converting Mesh Programs

In this section, we present the conversion of mesh programs to run on the OMP. We emphasize mapping computational tasks to processors and vectorizing communication steps by allocating mailboxes to shared-memory modules. Reductions in communication complexity are not analyzed here; they can be determined analogously to the hypercube program conversion.

6.4.1 Mesh with Boundary Wrap-around

We consider mapping a mesh with m nodes to an n-processor OMP. For m = n, the mapping is straightforward: there is a one-to-one mapping between the mesh nodes and the OMP processors. Memory modules Mi,j, 0 ≤ i, j ≤ n - 1, are used as mailboxes for communication between processors Pi and Pj.

Conversion of mesh programs for m = n^2 is more interesting. Consider an (n x n) mesh with wrap-around torus interconnections. Four primitive communication steps are identified: east, west, north, and south shift with rotation. The n^2 nodes are first grouped into n clusters, either by rows or by columns, and each cluster is allocated to a single processor of the OMP. Figure 6.5(a) shows an example of clustering by columns: the 16 nodes are grouped into 4 clusters and allocated to the 4 processors of the OMP. If the nodes belonging to a column (or row) are grouped into a single cluster, all intra-column (or intra-row) communication reduces to intra-cluster communication on the OMP, while the other communication patterns are implemented as inter-cluster communication. With the example shown, all north and south communications reduce to intra-cluster communication, and the east and west communication steps are implemented as inter-cluster communication.

The similarity of a mesh structure to the orthogonally-connected memory organization leads to a simplified scheme for implementing these inter-cluster communication steps. Similar to the IRW step discussed in Section 2.4, we introduce an Interleaved-Write-Read (IWR) access, consisting of an interleaved write followed by an interleaved read. Figure 6.5(b) illustrates an example of implementing a +2 east communication in two IWR steps (a sketch of this two-step shift appears at the end of this section). Let each mesh node contain a single data item identified by its node number. In column-oriented clustering, each orthogonal processor contains the n data items associated with its column. During the first IWR step, the processors, in parallel, perform an interleaved-write column access followed by an interleaved-read row access; we identify these interleaved accesses as column-write and row-read accesses, respectively. During the second step, the processors perform a row-write access
During this row-write 101 [access, on-the-fly indexing is used to m anipulate data based on the desired east or, west communication. Cluster 0 Cluster 1 Cluster 2 Cluster 3 (a) Grouping nodes in a column into a cluster C p C p C p C ^ ) 4 0 1 1 4 2 4 3 4 4 45 46 47 | 4 8 4 * 41 0 4 1 1 I1 2 J l 3 j l 4 4 1 5 p ©- G> ©- d>- 0 1 2 3 2 3 0 1 4 V 6 7 6 7* “t 5 * 10 10 11 8 ~9* 12 13 14 T s — * 14 15 12 13 + ( p ( p C j D ( p t 2 1 3 t o l l t 6 p t 4 t 5 f l O t 1 1 T 8 t 9 T l 4 f ! 5 1 1 2 1 1 3 cotumn-write row-read row-write column-read IW R step 1 IW R step 2 (b) Implementing a +2 ea st communication in two IW R step s Figure 6.5: Converting a (4 x 4) mesh with wrap-around connections onto a 4-' processor OMP (P = processor node). [ i This vectorized communication provides k, 1 < | k |< n — 1, east or west commu- i ! i jnication steps to be implemented with two IW R steps. During column-write access I. •in first IW R step, the processors also have flexibility to use on-the-fly indexing. This' I ; allows up to k column and k row operations, k, 1 < | k |< n — 1 , to be combined in i [two IW R steps. i ■ Communication vectorization allows converting variations of mesh architectures’ i such as mesh with broadcast [Bok84, Sto83], mesh with m ultiple broadcast [KR87], •and generalized mesh [BA84], Meshes with broadcast capability can be directly' converted to OMP due to the flexibility of data duplication associated with on-the- fly index m anipulator. 1 0 2 6 .4 .2 M e sh w ith G e n era lized W rap -arou n d ^ Consider a generalized mesh architecture as shown in Fig. 6 .6 . In addition to regular' m esh links, the architecture provides all possible interconnections between the nodes i jwithin each row and within each column. This generalized mesh architecture is a •special case of the generalized hypercube, developed by Bhuyan and Agrawal [BA84]. • i i Consider partitioning a (n x n) generalized mesh into n clusters by columns las shown in Fig. 6 .6 . Let these n clusters be allocated to n processors on O M P.' j The communication on extra column links collapses to intra-cluster com m unication.; j jUsing on-the-fly indexing associated with data buffer, communication on extra row jlinks can be emulated in two IW R steps as discussed in previous subsection for m esh.' j Similar results also hold good for partitioning and clustering the generalized mesh i _ _ . 'by rows. Thus, a (n x n) generalized mesh program can be converted efficiently to run on an n-processor OMP using communication vectorization. i Cluster 0 Cluster 2 Cluster 1 Cluster 3 12 Figure 6 .6 : Converting a (4 x 4) generalized mesh program onto a 4 processor OMP iby mapping each column of 4 nodes to run on a single processor of the OMP. I 103 ___j C h a p ter 7 C o n clu sio n s an d S u g g e ste d F u tu re R ese a r ch i \ i I j 7.1 S u m m a ry o f R esea rch C o n trib u tio n s j Shared-memory multiprocessors with memory-interleaving support vectorized data access between processor and memory subsystem. Besides restricting the use of this t access for computation, our goal in this research has been to explore possibility of I supporting data m anipulation and interprocessor communication using vectorized i memory access. The key original contributions of this thesis are summarized below:. • Dem onstrating the feasibility of memory-based vectorized interprocessor com -' munication. 
Chapter 7

Conclusions and Suggested Future Research

7.1 Summary of Research Contributions

Shared-memory multiprocessors with memory interleaving support vectorized data access between the processor and memory subsystems. Rather than restricting the use of this access to computation, our goal in this research has been to explore the possibility of supporting data manipulation and interprocessor communication through vectorized memory access. The key original contributions of this thesis are summarized below.

• Demonstrating the feasibility of memory-based vectorized interprocessor communication. Similar to vectorizing computational steps on vector supercomputers, we demonstrated that the interprocessor communication steps of a parallel program can be implemented as vector read and write steps on shared-memory multiprocessors with interleaved memory organizations. Chapter 5 emphasized this aspect. (A minimal sketch of this idea appears after this list.)

• Demonstrating that communication vectorization is more efficient with the two-dimensional interleaved memory organization of the orthogonally-connected multiprocessor than with the one-dimensional organization of crossbar-connected and single-bus-based systems. This conclusion is based on the analytical and simulation results presented in Chapter 6.

• Developing a communication vectorization procedure to convert and implement distributed-memory multicomputer programs on shared-memory multiprocessors with significant reductions in communication complexity. This procedure is described in Section 5.3.2.

• Introducing the new concept of atomic vector-read-modify-write access in multiprocessing. In Chapter 4, we demonstrated its use with an interleaved memory organization to implement efficient processor-memory data movement.

• Design and development of an on-the-fly index manipulator that works with an interleaved bus organization. This data manipulator, working with interleaved memories, implements the functionality of the generalized interconnection switch box proposed by Thompson [Tho78], as presented in Chapters 3 and 4.

• Development of a new hardware-based approach to fast data movement and manipulation in shared-memory multiprocessors, demonstrating that a two-dimensional interleaved memory organization is more capable of memory-to-memory data movement than a one-dimensional organization. We presented theoretical, analytical, and simulation results in Chapter 4 to support this claim.

• Design and development of vector register windows to work as data buffers attached to processors. These data buffers have the potential to be used as a replacement for caches in partially-shared or non-uniform shared-memory systems. Chapter 3 discusses their organization, functionality, and potential.
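To illustrate the first contribution in the simplest possible terms, the following sketch replaces a message-passing send/receive pair with a vector write and a vector read on a shared mailbox. It is plain, serial C written for this summary, not the CSIM model of the appendices; the mailbox layout, the flag-based handshake, and all identifiers are illustrative assumptions. On the OMP, the VLEN words of the message would be spread across VLEN interleaved memory modules, so each memcpy below stands for one pipelined vector access.

#include <stdio.h>
#include <string.h>

#define VLEN 8   /* message length in words; one word per interleaved module */

/* One mailbox per (sender, receiver) pair, allocated in shared memory. */
struct mailbox {
    int full;          /* handshake flag (set after the vector write) */
    int data[VLEN];    /* message body                                 */
};

static void vector_send(struct mailbox *mb, const int *msg)
{
    memcpy(mb->data, msg, sizeof mb->data);   /* vector write         */
    mb->full = 1;                             /* announce the message */
}

static int vector_recv(struct mailbox *mb, int *msg)
{
    if (!mb->full)
        return 0;                             /* nothing to read yet  */
    memcpy(msg, mb->data, sizeof mb->data);   /* vector read          */
    mb->full = 0;
    return 1;
}

int main(void)
{
    struct mailbox mb = {0};
    int out[VLEN] = {1, 2, 3, 4, 5, 6, 7, 8}, in[VLEN];
    int i;

    vector_send(&mb, out);                    /* "send" from processor Pi  */
    if (vector_recv(&mb, in)) {               /* "receive" at processor Pj */
        for (i = 0; i < VLEN; i++)
            printf("%d ", in[i]);
        printf("\n");
    }
    return 0;
}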
7.2 Suggestions for Future Research

No research is ever complete. During the course of this work, we encountered several interesting problems, some of which have been addressed in this thesis. At this stage, we provide a list of suggestions for future research. Some can be treated as a continuation of the present work; the remaining ones are long-term problems that need substantial further research.

A. Short-Term Problems:

• Proving the concept of communication vectorization over a large set of applications. In Chapter 6, we used representative applications involving primitive message patterns to demonstrate the feasibility and efficiency of communication vectorization. It will be interesting to observe program behavior and develop a vectorized communication model. Such a model could be used to predict the reduction in communication complexity for programs using mixed and irregular message patterns.

• Investigating the use of page-mode DRAMs to support virtual memory interleaving, on-the-fly data manipulation, and communication vectorization. A page-mode DRAM supports pipelined data transfer similar to an interleaved bus. This indirectly leads to virtual memory interleaving, where the degree of memory interleaving equals the page size. Consider a multiprocessor with n processors, n buses, and n memory modules built from page-mode DRAMs. This architecture is significantly less complex than a crossbar-connected multiprocessor, yet we predict that data manipulation and communication vectorization can be implemented on it with performance identical to that of a crossbar-connected multiprocessor.

• Analyzing the feasibility and the associated gain of communication vectorization on hierarchical multiprocessor systems with hierarchical buses or cluster-based organizations. Both data manipulation and communication vectorization can be implemented on these multiprocessors using local and global data exchanges.

B. Long-Term Problems:

• Using the vector-read-modify-write cycle to achieve register- or memory-based vectorized synchronization in multiprocessors. In a large-scale multiprocessor system, processors can write multiple synchronization variables to shared memory modules. These variables can be compared against count variables to implement fast barrier synchronization. Using a few barrier variables, this scheme would allow arbitrary many-to-many synchronization efficiently. (A counter-based sketch of this idea appears after this list.)

• Considering the use of vector register windows and the associated data manipulator to provide vector support for RISC processors. Present-day RISC processors depend on the internal cache organization to support indirect vector computation; however, the block size of an internal cache limits the vector length. The vector register windows presented in this thesis support large vector lengths owing to their reconfigurability, and can therefore be used to alleviate the problems associated with internal caches.

• Using the on-the-fly indexing scheme with link-based data transfer. Our proposed index manipulator in Chapters 3 and 4 indexes the data buffers associated with the processors instead of the memory modules. Hence, it can be used wherever data transfer takes place in a streaming manner. It will be interesting to analyze the capability of this manipulator to work with multicomputers such as hypercubes, meshes, and pyramids.

• Investigating the potential of mixed-mode communication (shared memory and message passing) in scalable shared-memory systems. The design of scalable shared-memory systems is taking a shape in which clusters of shared-memory multiprocessors are linked through communication links to achieve scalability. Our proposed communication vectorization fits well into this research domain. The vectorization technique allows a user to start with a multicomputer program and break it into inter-cluster and intra-cluster communication steps. Inter-cluster communication steps can be implemented by message passing, while intra-cluster steps can be converted to vectorized shared-variable communication; another alternative is to implement intra-cluster steps as message passing and inter-cluster steps as vectorized communication. It will be interesting to observe the interplay between the design of scalable shared-memory systems and their communication models.
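The following counter-comparison sketch, written in plain serial C purely for illustration, shows the intent of the first long-term suggestion: each processor updates every barrier variable it participates in, and a barrier completes when its arrival count matches the expected membership count. The data structures and names are assumptions, not part of the thesis; an actual implementation would update the whole row of barrier variables with the atomic vector-read-modify-write cycle of Chapter 4 rather than the ordinary loop shown here.

#include <stdio.h>

#define NPROC 8      /* processors participating                       */
#define NBARRIER 4   /* independent barrier variables in shared memory */

/* arrivals[b] counts processors that have reached barrier b; expected[b]
 * is the membership count it is compared against. */
static int arrivals[NBARRIER];
static int expected[NBARRIER];

/* Processor p announces its arrival at every barrier listed in member[].
 * A vector-read-modify-write cycle would perform the whole loop as one
 * interleaved access. */
static void vector_arrive(int p, const int member[NBARRIER])
{
    int b;
    for (b = 0; b < NBARRIER; b++)
        if (member[b])
            arrivals[b]++;        /* one element of the "vector" update */
    (void)p;
}

/* A barrier is complete when its arrival count matches the expected count. */
static int barrier_done(int b)
{
    return arrivals[b] == expected[b];
}

int main(void)
{
    int member[NBARRIER] = {1, 0, 1, 0};   /* this group uses barriers 0 and 2 */
    int p, b;

    for (b = 0; b < NBARRIER; b++)
        expected[b] = NPROC;

    for (p = 0; p < NPROC; p++)
        vector_arrive(p, member);

    for (b = 0; b < NBARRIER; b++)
        printf("barrier %d: %s\n", b, barrier_done(b) ? "complete" : "waiting");
    return 0;
}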
Reference List

[BA84] L. N. Bhuyan and D. P. Agrawal. Generalized Hypercube and Hyperbus Structures for a Computer Network. IEEE Transactions on Computers, C-33(4):323-333, Apr 1984.

[Bai87] D. H. Bailey. Vector Computer Memory Bank Contention. IEEE Transactions on Computers, C-36:293-298, Mar 1987.

[Bal84] R. V. Balakrishnan. The Proposed IEEE-896 Futurebus - A Solution to the Bus Driving Problem. IEEE Micro, pages 23-27, Aug 1984.

[BC90] I. Y. Bucher and D. A. Calahan. Access Conflicts in Multiprocessor Memories: Queuing Models and Simulation Studies. In Proc. of ACM International Conference on Supercomputing, Amsterdam, The Netherlands, pages 428-438, Jun 1990.

[BM89] A. Baratz and K. McAuliffe. A Perspective on Shared-memory and Message-memory Architectures. In J. L. C. Sanz, editor, Opportunities and Constraints of Parallel Processing, pages 9-10. Springer-Verlag, 1989.

[Bok84] S. H. Bokhari. Finding Maximum on an Array Processor with a Global Bus. IEEE Transactions on Computers, C-33(2):133-139, Feb 1984.

[Bok91a] S. H. Bokhari. Complete Exchange on the iPSC-860. Technical Report 91-4, Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, Jan 1991.

[Bok91b] S. H. Bokhari. Multiphase Complete Exchange on a Circuit Switched Hypercube. Technical Report 91-5, Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, Jan 1991.

[CE90] S. Chittor and R. Enbody. Performance Evaluation of Mesh-Connected Wormhole-Routed Networks for Interprocessor Communication in Multicomputers. In Proceedings of Supercomputing '90, New York, pages 647-656, Nov 1990.

[Gha89] K. Gharachorloo. Towards More Flexible Architectures. In J. L. C. Sanz, editor, Opportunities and Constraints of Parallel Processing, pages 49-53. Springer-Verlag, 1989.

[HC91] K. Hwang and C. M. Cheng. Simulated Performance of a RISC-Based Multiprocessor with Orthogonal-Access Memory. Journal of Parallel and Distributed Computing, Sep 1991.

[HDP+90] K. Hwang, M. Dubois, D. K. Panda, S. Rao, S. Shang, A. Uresin, W. Mao, H. Nair, M. Lytwyn, F. Hsieh, J. Liu, S. Mehrotra, and C. M. Cheng. OMP: A RISC-based Multiprocessor using Orthogonal-Access Memories and Multiple Spanning Buses. In Proc. of ACM International Conference on Supercomputing, Amsterdam, The Netherlands, pages 7-22, Jun 1990.

[HJ89] C. T. Ho and S. L. Johnsson. Optimum Broadcasting and Personalized Communication in Hypercubes. IEEE Transactions on Computers, 38(9):1249-1268, Sep 1989.

[HK88] K. Hwang and D. Kim. Generalization of Orthogonal Multiprocessor for Massively Parallel Computations. In Proceedings of the Conference on Frontiers of Massively Parallel Computations, Fairfax, Virginia, Oct 1988.

[HP91] K. Hwang and D. K. Panda. The USC Orthogonal Multiprocessor for Image Understanding. In V. K. Prasanna Kumar, editor, Parallel Architectures and Algorithms for Image Understanding. Academic Press, 1991.

[HTK89] K. Hwang, P. S. Tseng, and D. Kim. An Orthogonal Multiprocessor for Parallel Scientific Computations. IEEE Transactions on Computers, C-38(1):47-61, Jan 1989.

[Kar87] A. H. Karp. Programming for Parallelism. IEEE Computer, pages 43-57, May 1987.

[KR87] V. K. Prasanna Kumar and C. S. Raghavendra. Array Processor with Multiple Broadcasting. Journal of Parallel and Distributed Computing, 4:173-190, 1987.

[Kum88] Manoj Kumar. Supporting Broadcast Connections in Benes Networks. Technical Report RC 14063, IBM Research, May 1988.

[LA81] B. Lint and T. Agerwala. Communication Issues in the Design and Analysis of Parallel Algorithms. IEEE Transactions on Software Engineering, SE-7(2):174-188, Mar 1981.
[LEN90] Y. Lan, A. H. Esfahanian, and L. M. Ni. Multicast in Hypercube Multiprocessors. Journal of Parallel and Distributed Computing, 8:30-41, 1990.

[LN90] X. Lin and L. M. Ni. Multicast Communication in Multicomputer Networks. In Proc. International Conference on Parallel Processing, pages III:114-118, 1990.

[LS90] S. Lee and K. G. Shin. Interleaved All-to-all Reliable Broadcast on Meshes and Hypercubes. In Proceedings of the International Conference on Parallel Processing, pages III:110-113, Aug 1990.

[Map90] C. Maples. A High-Performance, Memory-Based Interconnection System for Multicomputer Environments. In Proceedings of Supercomputing '90, New York, pages 295-304, Nov 1990.

[MCH+90] S. Mehrotra, C. M. Cheng, K. Hwang, M. Dubois, and D. K. Panda. Algorithm-Driven Simulation and Projected Performance of the USC Orthogonal Multiprocessor. In Proc. of ICPP, St. Charles, IL, pages III:244-253, Aug 1990.

[NS81] D. Nassimi and S. Sahni. Data Broadcasting in SIMD Computers. IEEE Transactions on Computers, C-30(2):101-106, Feb 1981.

[OD90] M. T. O'Keefe and H. G. Dietz. Hardware Barrier Synchronization: Static Barrier MIMD (SBM). In Proceedings of the International Conference on Parallel Processing, pages I:35-42, Aug 1990.

[PH90] D. K. Panda and K. Hwang. Reconfigurable Vector Register Windows for Fast Matrix Manipulation on the Orthogonal Multiprocessor. In Proceedings of the International Conference on Application Specific Array Processors, Princeton, New Jersey, pages 202-213, Sep 1990.

[PH91a] D. K. Panda and K. Hwang. Communication Vectorization in Multiprocessors with Interleaved Shared Memories. IEEE Transactions on Parallel and Distributed Systems, 1991. (under review).

[PH91b] D. K. Panda and K. Hwang. Fast Data Manipulation in Multiprocessors Using Parallel Pipelined Memories. Journal of Parallel and Distributed Computing, Special Issue on Shared-Memory Systems, pages 130-145, Jun 1991.

[PH91c] D. K. Panda and K. Hwang. Message Vectorization for Converting Multicomputer Programs to Shared-Memory Multiprocessors. In International Conference on Parallel Processing, pages I:204-211, Aug 1991.

[PHRH90] D. K. Panda, F. Hsieh, S. Rao, and K. Hwang. OMP Processor Board Design Specification. Technical report, Laboratory of Parallel and Distributed Computing, Dept. of Electrical Engineering-Systems, Univ. of Southern California, Los Angeles, CA, Mar 1990.

[RK86] C. S. Raghavendra and V. K. Prasanna Kumar. Permutations on Illiac IV-Type Networks. IEEE Transactions on Computers, C-35(7):662-669, Jul 1986.

[Sch86] H. D. Schwetman. CSIM: A C-Based, Process-Oriented Simulation Language. In Proceedings of the 1986 Winter Simulation Conference, pages 387-396, 1986.

[SM89] I. D. Scherson and Y. Ma. Analysis and Applications of the Orthogonal Access Multiprocessor. Journal of Parallel and Distributed Computing, 7(2):232-255, Oct 1989.

[Sto83] Q. F. Stout. Mesh-Connected Computers with Broadcasting. IEEE Transactions on Computers, C-32(9):826-830, Sep 1983.

[Tho78] C. D. Thompson. Generalized Connection Networks for Parallel Processor Interconnections. IEEE Transactions on Computers, C-27(12):1119-1125, Dec 1978.

[WB91] K. H. Warren and E. D. Brooks. Gaussian Elimination: A Case Study on Parallel Machines. In Compcon, pages 57-61, 1991.

[YTL87] P. C. Yew, N. F. Tzeng, and D. H. Lawrie. Distributed Hot-spot Addressing in Large-scale Multiprocessors. IEEE Transactions on Computers, pages 388-395, Apr 1987.

[Zho90] J. X. Zhou. A Parallel Computer Model Supporting Procedure-Based Communication. In Proceedings of Supercomputing '90, New York, pages 286-294, Nov 1990.
Appendix A

Architecture Modeling CSIM Macros:

1. Bus-based Crossbar-Connected Multiprocessor and Orthogonally-Connected Multiprocessor
2. Hypercube multicomputer

/* CSIM Modeling macros for OMP and CCM */

#include <stdio.h>
#include <math.h>
#include "csim.h"

/* Time Constants - Worst case values
   Units of simulated time = microseconds */

#define LM_ACCESS      0.150   /* Local Memory access time */
#define SCAL_ACC_COST  1.0     /* Scalar access */
#define INCR_ACC_COST  0.050   /* Time for accessing successive items */
                               /* after a FIXED_ACC_COST */
#define FIXED_ACC_COST 0.800   /* fixed memory access cost */
#define ACC_COST       0.2     /* memory read/write access cost */
#define BLK_ACC_COST   0.2     /* cost for changing memory address in */
                               /* block vector access */
#define SBUS_BRDCST    0.300   /* Time for doing an SBUS broadcast; */
                               /* reflects hardware synchronization */
#define RISC_OP        0.0303  /* Time for int or fp RISC op on i860; */
                               /* 33 Mips rating for 40 MHz chip */
#define BUF_ACC        0.050   /* Time to read (write) one datum from (to) */
                               /* the processor board data buffer or */
                               /* communication data buffer */

/* CSIM Representation for OMP components and machine state */

FACILITY om[NCPUS*NCPUS];  /* orthogonal memory */
FACILITY s_bus;            /* synchronization bus */
FACILITY r_bus[NCPUS];     /* row buses for crossbar-connected multiprocessor */
EVENT row_mode[NCPUS];     /* EVENT var for row-access mode */
EVENT col_mode[NCPUS];     /* EVENT var for col-access mode */
EVENT g_done;              /* global EVENT decl to detect end */

#define CNTSIZ 16          /* Number of synchronization counters */

/* Array of pointers to Synchronization semaphore structure */
struct t_COUNTER {
    int c_num;             /* number of processes to be synch'ed */
    int init;              /* flag to show initialization status */
} *cntr[CNTSIZ];

EVENT c_set[CNTSIZ];       /* EVENT variable to enforce synch */

/* Simulator variables */
; int act, i; float float float float #ifdef #define #define #endif #ifdef #define #define #endif /* structure for accumulating execution statistics for a process (in addition to CSIM’ s built-in mechanisms) */ struct pstat { long num_om; /* orthogonal memory accesses */ long numjm; /* local memory accesses */ long num_buf; r data buffer accesses */ long num_inst; * number of RISC instructions executed */ long num_inst_comp; /* number of inst used for computation */ long num inst comm; /* number of inst used for communication */ long num_inst_sync; /* number of inst used for synchronization */ long num_synch; /* number of synchronizations done */ float p_time; /* total time spent by the processor */ float p_synctim; /•process synchronization time */ float p_commtim; /* process communication time */ float p_comptim; /* process computation time */ float p_wtcommtim; /* process write communication time */ float p_rdcommtim; / * process read communication time */ } pstat_ar[NCPUS]; /A A .A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A CSIM macros - Verbs for simulation only A .A A A A A A A A A AA^A.AM A A A A A A A A A A A A .K A A * , ^ I* access Data Buffer for single datum on proc x #define _ACC_BUF(x) asm(CLOCK_OFF); \ ts1[NCPUS], ts2[NCPUS]; tc1[NCPUS], tc2[NCPUS]; tsum; t_hold[NCPUS]; /* variable for consolidating holds */ SUN4 CLOCKON “!#clock_on" CLOCK_OFF “!#dock_of' SUNS CLOCKON “|#clock_on” CLOCK_OFF “ |#clock_or 114 if (NCPUS > 1) - £ \ hold(BUF_ACC); \ pstat_ar[x].p_compfim += BUF_ACC; \ pstat_ar[x].num_inst_comp += 1; \ pstat_ar(x].num_buf += 1;} \ else {\ hold(LM_ACCESS);\ pslat_ar[x],p_complim += LM_ACCESS; \ pstat_ar[x].num_inst_comp += 1; \ pstat_ar[x].num_lm += 1;}\ asm(CLOCKON) /* access Local Mem for processor x * [ #define _ACC_LM (x) asm(CLOCK_OFF); \ hold(LM_ACCESS); \ pstat_ar[x].p_comptim += LM_ACCESS; \ pstat_ar[x].num_lm + = 1; \ pstat ar[x].num inst_comp += 1; \ asm(CLOCKON) /* account for Local Mem data fetch */ #define _FETCH_LM (x) asm (CLOCKOFF); \ t_hold[x] += LMACCESS; \ pstat_ar[x].p_comptim += LM_ACCESS; \ pstat_ar[x].num_lm += 1; \ pstat_ar[x].num_inst_comp += 1;\ asm(CLOCK_ON) I * account for VRW data fetch */ #define _FETCH_VRW(x) asm(CLOCK_OFF);\ t_hold[xJ += BUF_ACC; \ pstat_ar[x].p_comptim += BUF_ACC; \ pstat_ar[x].num_buf += 1; \ pstat_ar[x].num_inst_comp += 1; \ asm(CLOCK_ON) /* advance simulation time by n to account for compute activity on proc x */ #define ADV TM (n,x) asm(CLOCK_OFF); \ t~ hoid[x] += n*RISC_OP; \ pstat_ar[x].p_comptim += n*RISC_OP; \ pstat_ar[x].num_inst_comp += n; \ asm(CLOck_ON) /* advance simulation time by n to account for communication activity on proc x */ #define ADV TWC(n,x) asm(CLOCK OFF); \ t~ hold[x] += n*RISC_OP” \ pstat_ar[x).p_commtim += n*RISC_OP; \ pstat ar[x].num inst comm += n; \ asm(CLOCK_ON) /* advance simulation time by n to account for synchronization activity on proc x */ Adeline _ADV_TMS(n,x) asm(CLOCK OFF); \ t_hold[x]+=n*RISCOP~\ pstat_ar[x].p_synctim += n*RISC_OP; \ pstat_ar[x].num_inst_sync += n; \ asm(CLOCKON) I * increment index i for synchronization modulo CNTSI2 */ #define _INC(i) asm(CLOCK_OFF); \ i * (i s s (CNTSIZ -1)) ? 
0 ; i+1; \ asm(CLOCK_ON) I * initiate simulation run with model name mn */ #define _SIM JNIT(mn) asm(CLOCK_OFF); \ simO { \ int $$idxx; \ createfsim’ ); \ set_model_name(“ mn’ ) ; \ csim_initO; \ asm(CLOCK_ON) Sdefine REPORT reportO; \ omp_reportO ;\ mdlstatO;\ omp_sum_commtimeO { int $$idxx; tsum = 0.0; for ($$idxx = 0; $$idxx < NCPUS; $$idxx++) tsum += pstat_ar[$$idxx].p_wtcommtim; } tsum += pstat_ar[NCPUS-1].p_rdcommtim; printf(“ \n\n Total communication time\n” ); printf (“ % 12,3f”, tsum);_________________ 115 ccm_sum_commtime() < int $$idxx; tsum = 0.0; for ($$idxx = 0; $$idxx < NCPUS; $$idxx++) { tsum += pstat ar[$$idxx].p wtcommtim; } tsum += pstat_ar[NCPUS-l].p_rdcommtim; tsum += pstat_ar[0].p_rdcommtim; tsum += ((4 * SIZE) + 2) * SBUS_BRDCST; printf(“ \n\n Total communication time\n” ); printf<“ %12.3f’, tsum); omp reportO { int $$idxx; for ($$idxx = 0; $$idxx < NCPUS; $$idxx++) { pstat_ar[$$idxx].num_inst = pstat_ar[$$idxx].num_inst_comp + pstat_ar[$$idxx].num_inst_comm + pstat_ar[$$idxx].num_inst_sync; pstat_ar[$$idxx].p_time = pstat_ar[$$idxx].p_comptim + pstat_ar[$$idxx].p_commtim + pstat_ar[$$idxx].p_synctim; } ; printf(“ \n\n---------------------Execution Statistics--------------------“ ); printf(‘ \n\n PID totaljnst com pjnst com minst sync inst o m a c c lm_acc buf acc no_synch\n’ }; for ($$idxx = 0; $$idxx < NCPUS; $$idxx++) { printf(“%3d %8d %10d %8d %9d %8d %Sd %6d %6d \n", $$idxx, pstat_ar[$$idxx].num_inst, pstat_ar[$$idxx].num_inst_comp, pstat_ar[$$idxx] .num _inst_com m, pstat_ar[$$idxx].num_inst_sync, pstat_ar[$$idxx].num_om, pstat_ar[$$idxx].num_lm, pstat_ar[$$idxx].num_buf, pstat_ar[$$idxx].num_synch); } printf(‘ \n” ); printf(‘ Vi” ); printf(“ \n\n PID totaljim e com pjim e synch_time com mjim e wtcommjime rdcomm_time\n” ) for ($$idxx = 0; $$idxx < NCPUS; $$idxx++) { printff%3d %13.3f %13.3f %12.3f %12.3f %12.3f %12.3f\n”, $$idxx, pstat_ar[$$idxx).p_time, pstat_ar[$$idxx].p_comptim, pstat_ar[$$idxx].p_synctim, pstat_ar[$$idxx).p_commtim, pstat_ar[$$idxx].p_wtcommtim, pstat_ar[$$idxx].p_rdcommtim); } printf('V)” ); } CSIM macros - Verbsr /* create a parallel code segment named “k” for execution on i860 */ #define CRT_TSK(k) asm(CLOCK_OFF); \ create (“ k’ ) ; \ asm(CLOCKON) I * terminate parallel code segment on processor pn, record simulation time for process, and set global completion flag if needed */ #define END_TSK(pn) asm(CLOCK_OFF); \ hold(t_ho!d[pn]);\ t_hold[pn) = 0.0;\ act-; \ if (act == 0) set(g_done); \ asm(CLOCK_ON) I * set processor x to row-mode; count instructions and expend time, “set” or “clear” =2 i860 instructions*/ ,#define SET_ROWACC(x) asm(CLOCK_OFF);\ if (NCPUS >1)(\ set(row_mode[x]); \ _ADV_TMC(2,x); \ clear(col_mode[x]); /* comment */ \ _ADV_TMC(2,x);}\ asm(CLOCK_ON) I * set processor x to col-mode; count instructions and expend time “ set" or “ clear” =2 i860 instructions. */ #define SET_COLACC(x) asm(CLOCK OFF); \ if (NCPUS >1){\ set(col_mode[x]); \ _ADV_TMC(2,x); \ 116 I ______ clear(row_mode[x]); \ ADV TMC(2,x);}\ asm(CLOCK_ON) r clear processor x from row-mode */ ;#define CLR_ROWACC(x) asm<CLOCK_OFF); \ if (NCPUS >1){\ clear(row_mode[x]); \ _ADV_TMC(2,x);} \ asm(CLOCK_ON) I * clear processor x from col-mode */ #define CLR_COLACC(x) asm(CLOCK_OFF); \ if {NCPUS >1H\ clear(col_mode[x]); \ _ADV_TMC(2,x);}\ asm(CLOCK_ON) /* set semaphore structure #k’ s counter to n and clear assoc EVENT flag if no one else has done so yet (irtit = 0), then signal this fact (by setting inft = 1). Do this for process x. 
*/ #define TSETCNT(k, n, x) asm(CLOCK_OFF); \ if (NCPUS >1){\ int $$i; \ hold(SBUS_REQ+t_hold(x]); \ t_hold[x] = 0.0;\ pstat_ar[x3.num_inst_sync += 1; \ pstat_ar[xj.p_synctim += RISC_OP; \ $$i = reserve(sbus); \ hold(SBUS_RMW); \ pstat_ar[x].num_inst_sync += 2; \ pstat_ar[x].p_synctim += 2*RISC_0P; \ if (cntr[k]->init == 0) { \ hold(SBUS_RMW); \ pstat_ar[x].num_inst_sync += 2; \ pstat_ar[x].p_synctim += 2*RISC_0P; \ cntr[k]->c_num = n; \ pstat_ar[x].num_inst_sync += 2; \ pstat_ar[x].p_synctim += 2*RISC_0P; \ clear(c_set[k]); \ cntr[k]->init = 1; \ } ;\ release(s_bus);\ } ;\ asm(CLOCK_ON) /*synchronize processes (or processors) on synch structure #k, Once synchronization completes, this structure will be re-initialized and available for reuse (init = 0, c_set[k] cleared). To be used for process x. “ hold (SBUS_REQ)” = 1 i860 instruction (stretched), instruction count for “ wait(c_set)”. These are computed by timing the wait instruction counting during synchronization time. */ #define SYNCH(k,x) asm(CLOCK_OFF); \ if (NCPUS > 1) \ n int$$i;\ hold(SBUS_REQ+1_hold[x]); \ t_hold[x] = 0.0; \ pstat_ar[x].num_synch += 1 ;\ $$i = reserve(s_bus); \ hold(SBUS_RMW);\ cntr[k]->c_num-; \ pstat_ar[x].num_inst_sync += 2; \ pstat_ar[x].p_synctim += 2*RISC_OP; \ if (cntr[k]->c_num == 0) \ (\ set(c_set[k]); \ pstat_ar[x].num_inst_sync += 2; \ pstat_ar(x].p_synctim += 2*RISC_OP; \ hold(SBUS_BRDCST); \ pstat_ar(x].num_inst_sync += 2; \ pstat_ar[x].p_synctim += 2*RISC_OP; \ cntr[k]->inH = 0; \ release (s bus);\ } \ e lse \ { \ release(s_bus); \ ts1[x] = simtime(); \ wait (c_set[k]); \ ts2[x] = simtimeO; \ pstat_ar[x].p_synctim += (ts2[x) -ts1[xj); \ } \ asm(CLOCK_ON) I * simulate a column oriented scalar read operation from OM in a Crossbar-connected multiprocessor system. Similar to PIPEJN operation of the OM P. rb indicates requests to the row bus ‘rb’ and mm indicates access to the memory module ‘ mm’ on the row bus. #define CROSS_ROW_IN(x,rb,mm) asm(CLOCKOFF); \ hold(t_hold[x]); \ t_hold[x] = 0; \ 117 psta1_ar[x].numJnst_comm += 4; \ pstat_ar[x].p_commtim += 4*RISC_0P; \ pstat_ar[x].num_om += 1; \ reserve(r_bus[rb]); \ reserve(om[rb*NCPUS+mm]); \ hold(SCAL_ACC_COST); \ pstat_ar[x).p_commtim += SCAL_ACC_COST; \ release(om[rb*NCPUS+mm]); \ release(r_bus[rb]); \ asm(CLOCK_ON) /* simulate a column oriented blk_scalar read operation from OM in a CRossbar-connected multiprocessor system. Similar to CROSS_ROW_IN operation except that it reads Ig data elements from a memory module in a single access, rb indicates requests to the row bus 'rb’ and mm indicates access to the memory module ‘ mm’ on the row bus. #define CROSS_ROW_BLK_IN(x,rb,mm,lg) asm(CLOCK_OFF); \ hold(t_ho!d[x]); \ t_hold[x] = 0; \ pstat_ar[x].num_inst_comm += 4; \ pstat_ar[x].p_commtim += 4*RISC_OP; \ pstat_ar[x].p_rdcommtim +- 4*RISC_OP; \ pstat_ar[x].num_om += 1; \ tc1[x] = simtime();\ reserve(r_bus[rb]); \ reserve(om[rb*NCPUS+mm]); \ hold(FIXED ACC COST);\ hold(lg*ACC_COST); \ release(om[rb*NCPUS+mm]); \ release(r_bus[rb]); \ tc2[x] = simtimeO; \ pstat_ar[x].p_commtim += (tc2[x] - tc1 [x]); \ pstat_ar[x].p_rdcommtim += (tc2[x] - tc1[x]); \ asm(CLOCK_ON) I * simulate a column oriented scalar write operation to OM in a Crossbar-connected multiprocessor system. Similar to PIPE_OUT operation of the OMP. rb indicates requests to the row bus ‘ rb’ and mm indicates access to the memory module ‘ mm’ on the row bus. 
#define CROSS_ROW_OUT(x,rb,mm) asm (CLOCKOFF); \ hold (t_hold[x|); \ t_hold[x] = 0.0; \ pstat_ar[x].num_inst_comm += 4; \ pstat_ar[x].p_commtim += 4*RISC_OP; \ pstat_ar[x].num_om += 1; \ reserve(r_bus[rb]); \ reserve(om[rb*NCPUS+mm]); \ ________ hold(SCAL ACC COST); \ _______ _______________________ pstat_ar[x].p_commtim += SCAL_ACC_COST; \ release(om[rb*NCPUS+mm]); \ release(r_bus[rb]); \ asm(CLOCK_ON) I * simulate a row-read vector operation from OM in a Crossbar-connected multiprocessor system. Similar to VECTOR _IN operation with the exception that processors can access row buses in a permutation manner with out any conflict, rb indicates requests to the row bus ‘ rb’. ffdefine CROSS_VECTOR_IN(x,rb) asm(CLOCKOFF); \ hold(t_hold[xJ); \ t_hold[x] = 0,0;\ { \ int $$idxx;\ pstat_ar[x].num_inst_comm += 3; \ pstat_ar[x].p_commtim += 3*RISC_OP; \ reserve(om{rb*NCPUS+0]); \ hold(FIXED_ACC_COST+ACC_COST); \ pstat_ar[x].p_commtim += (FIXED_ACC_COST + ACC_COST); \ release(om[rb*NCPUS+0]); \ pstat_ar[x].num_om += 1; \ pstat_artx].num_inst_comm += 1; \ pstat_ar[x].p_commtim += RISC_OP; \ for ($$idxx=1; $$idxx < NCPUS; $$idxx++) \ U reserve(om[rb*NCPUS+$$idxx]); \ hold(INCRACCCOST); \ pstat_ar[x].p_commtim += INCR_ACC_COST; \ release(om[rb*NCPUS+$$idxxl); \ } ; \ y,\ asm(CLOCKON) /* simulate a row-read operation from OM in a Crossbar-connected multiprocessor system. Similar to VECTOR_OUT operation with the exception that processors can access row buses in a permutation manner with out any conflict, rb indicates requests to the row bus ‘rb’. #define CROSS_VECTOR_OUT(x,rb) asm(CLOCK_OFF);\ hold(t_hold[x]);\ t hold[x] = 0.0;\ f\ int $$idxx;\ pstat_ar[x],num_inst_comm += 3; \ pstat_ar[x].p_commtim += 3*RISC_OP; \ reserve(om[rb*NCPUS+0]); \ ________________________________ hold (FIXED_ACC_CO STf ACC_COST); \ _ _ _ _ _ _ _ _ _ 118 pstat_ar[x].p_commtim += (FIXED_ACC_COST + ACC_COST); \ release(om[rb*NCPUS+0]); \ pstat_ar[x].num_om += 1; \ pstat_ar[x].numJnst_comm += 1; \ pstat_ar[x].p_commtim += RISC_OP; \ for ($$idxx=1; $$idxx < NCPUS; $$idxx++) \ { \ reserve(om[rb*NCPUS+$$idxx]); \ hold(INCR_ACC_COST); \ pstaf_ar[x].p_commtim += INCR_ACC_COST; \ release(om[fb*NCPUS+$$idxx]); \ } ;\ } ;\ asm(CLOCKON) /* simulate a row-read block vector operation from OM in a Crossbar-connected multiprocessor system. Similar to BLK_VECTOR_IN operation with the exception that processors can access row buses in a permutation manner with out any conflict. x=processor, rb=requests to the row bus 'rb', l=number of vectors, NCPUS=degree of memory interleaving. 7 #define CR0SS_BLKVECTOR_IN(x,rb,l) asm(CLOCK OFF); \ hold(t_hold[x]); \ t hold[x] = 0.0;\ f t int $$idxx,$$idyy;\ pstat_ar[x].numJnst_comm += 3; \ pstat_ar[x].p_commtim += 3*RISC_0P; \ hold (FIXED_ACC_COST + ACC_COST); \ pstat_ar[x].p_commtim += FIXED_ACC_COST + ACC_COST; \ for ($$idxx=0; $$idxx < NCPUS; $$idxx++) \ { \ reserve(om[rb*NCPUS+$$idxx]); \ hold{INCR_ACC_COST); \ pstat_ar[x].p_commtim += INCR_ACC_COST; \ release(om[rb*NCPUS+$$idxx]); \ } \ pstat_ar[x].num_om += I ; \ for ($$idyy=1; $$idyy < I ; $$idyy++) \ ft for ($$idxx=0; $Sidxx < NCPUS; $$idxx++) \ {\ reserve(om[rb*NCPUS+$$idxx]); \ hold(INCR_ACC_COST); \ pstat_ar[x].p_commtim += INCR_ACC_COST; \ release(om[rb*NCPUS+$$idxx]); \ } \ hold(BLK_ACC_COST); \ __________________________ pstat arfxj.p commtim + = BLK ACC COST; \ _________________ } ;\ asm(CLOCK_ON) r simulate a row-write block vector operation to OM in a Crossbar-connected multiprocessor system. 
Similar to BLK_VECTOR_OUT operation with the exception that processors can access row buses in a permutation manner with out any conflict. x=processor, rb=requests to the row bus ‘rb’, l=number of vectors, NCPUS=degree of memory interleaving*/ #define CROSS_BLK_VECTOR_OUT(x,rb,l) asm(CLOCK_OFF); \ hold(t_hold[x]); \ t_hold[x] = 0.0; \ int $$idxx,$$idyy; \ pstat_ar[x],num_inst_comm += 3; \ pstat_ar[x].p_commtim += 3*RISC_OP; \ hold (FIXED_ACC_COST + ACCCOST); \ pstat_ar[x].p_commtim 4= (FIXED_ACC_COST 4 ACC_COST); \ for ($$idxx=0; $$idxx < NCPUS; $$idxx44) \ { \ reserve(om[rb*NCPUS4$$idxx]); \ hold(INCR_ACC_COST); \ pstat_ar[x].p_commtim 4= INCR_ACC_COST; \ release(om[rb*NCPUS4$$idxx]); \ } ;\ pstat_ar[x].num_om 4= I ; \ for ($$idyy=i; $$idyy < I; $$idyy44) \ ft for ($$idxx=0; $$idxx < NCPUS; $$idxx44) \ reserve(om[rb*NCPUS4$$idxx]); \ hold(INCR_ACC_COST); \ pstat_ar[x].p_commtim 4= INCR_ACC_COST; \ release{om[rb*NCPUS4$$idxx]); \ } ;\ hold(BLK_ACC_COST); \ pstat_ar[x].p_commtim + = BLK_ACC_COST; \ } ;\ asm(CLOCK_ON) /* simulate a BLK_SCALAR_OUT access to the OM on row-mode, ‘ mm’ indicates the memory module to be written into. ‘ Ig’ indicates message length in words */ (fdefine BLK_SCALAR_OUT(x,mm,lg) asm(CLOCK_OFF); \ hold (t_hoid [x]); \ tholdfx] = 0; \ pstat_£ir[x].num_inst_comm + = 4; \ pstat_ar(x].p_commtim 4= 4 * RISC_OP; \ 119 pstat_ar[x].num_om += 1; \ reserve(om[x*NCPUS+mm]); \ hold(FIXED_ACC_COST+lg*ACC_COST);\ release(om[x*NCPUS+mm]); \ pstat_ar[x].p_commtim += (FIXED_ACC_COST+lg*ACC COST); \ asm(CLOCK_ON) /* simulate a BLK_SCALAR_IN access from the OM in column-mode, ‘mm’ indicates the memoty module from which data is read.'lg' indicates m essage length in words */ #define BLK_SCALAR_IN(x,mm,lg) asm(CLOCK_OFF); \ hold (t_hold[x]); \ t_hold[x] = 0; \ pstat_ar[x].num_inst_comm += 4; \ pstat_ar[x].p_commtim += 4 * RISC_OP; \ pstat_ar[x].p_rdcommtim += 4 * RISC_OP; \ pstat_ar[x].num_om +=1; \ reserve(om[mm*NCPUS+x]); \ hold(FIXED_ACC_COST+lg*ACC_COST); \ release(om[mm*NCPUS+x]); \ pstat_ar[x].p_commtim += (FIXED_ACC_COST+lg*ACC_COS'T); \ pstat_ar[x].p_rdcommtim += (FIXED_ACC_COST+lg*ACC_COST); \ asm(CLOCK_ON) /* simulate a BROADCAST write to a bus from proc x in row-mode or col-mode. A BROADCAST is defined as a vector write of a single value. For a BROADCAST, data buffer access has to be accounted for separately. 
“ hold(SCAL_ACC_COST)” = 1 i860 instruction (stretched) checking stale flag = 3 i860 instructions */ #define BROADCAST^) asm(CLOCK_OFF); \ hold(t_hold[x]);\ t hold[x] = 0.0;\ u int $$idxx; \ if (state(row_mode(x])==OCC) \ { \ pstat_ar[x].numJnst_comm += 3; \ pstat_ar[x].p_commfim += 3*RISC_OP; \ pstat_ar[x].p_wtcommtim += 3*RISC_OP; \ reserve(om[x*NCPUS+0]); \ hold (FIXED_ACC_COST+ACC_C0ST); \ pstat_ar[x).p_commtim += (FIXED_ACC_COST+ACC_COST); \ pstat_ar[x].p_wtcommtim += (FIXED_ACC_COST+ACC_COST); \ release(om[x*NCPUS+0]); \ pstat_ar[x].num_inst_comm+= 1;\ pstat_ar[x].p_commtim += RISC_OP; \ pstat_ar[x].num_om += 1; \ [_ for ($$idxx=1; $$idxx < NCPUS; $$idxx++) \_______________ reserve(om[x*NCPUS+$$idxx]); \ release(om[x*NCPUS+$$idxx]); \ )\ }\ else if (state(col_mode[x])==OCC) { \ pstat_ar[x].num _inst_comm += 3; \ pstat_ar[x].p_commtim += 3*RISC_OP; \ pstat_ar[x].p_wtcommtim += 3*RISC_OP; \ reserve{om[0*NCPUS+x]); \ hold(FIXED_ACC_COST+ACC_COST); \ pstat_ar[x].p_commtim += (FIXED_ACC_COST+ACC_COST); \ pstat_ar[x].p_wtcommtim += (FIXED_ACC_COST+ACC_COST); \ release(om[0*NCPUS+x]); \ pstat_ar[x].num_inst_comm += 1; \ pstat_ar[x].p_commtim += RISCOP; \ pstat_ar(x].num_om += 1; \ for ($$idxx=1; $$idxx < NCPUS; $$idxx++) \ { \ reserve(om[$$idxx*NCPUS+x]); \ release(om[$$idxx*NCPUS+x]); \ } ;\ } ;\ } ;\ asm(CLOCK_ON); /* simulate a Block BROADCAST write to a bus from proc x in row-mode or col-mode. The write con sists of I vectors. A BROADCAST is defined as a vector write of a single value. For a BROADCAST data buffer access has to be accounted for separately. “ hold(SCAL_ACC_COST)” = 1 i860 instruction (stretched) checking state flag = 3 i860 instructions */ #define BLK_BROADCAST(x,lg) asm(CLOCKOFF); \ hold (t_hold [x]); \ t_hold[x] = 0.0; { \ int $$idxx; \ if (state(row_mode[x])==OCC) \ {\ pstat_ar[x],num_inst_comm += 3; \ pstat_ar[x].p_commtim += 3*RISC_OP; \ pstat_ar[x].p_wtcommtim += 3*RISC_OP; \ reserve(om[x*NCPUS+0]); \ hold (FIXED_ACC_COST); \ pstat_ar[x].p_commtim += FIXED_ACC_COST; \ ______________ pstat_ar[x].p_wtcommtim += FIXED_ACC_COST; \ 120 release(om[x*NCPUS+OJ); \ pstat_ar[x].num_inst_comm += 1; \ pstat_ar[x).p_commtim += RISC_OP; \ pstat_ar[x].p_wtcommtim += RISC_OP; \ pstat_ar[xJ.num_om += Ig; \ for ($$idxx=0; $$idxx < Ig; $$idxx++) { \ reserve(om[x*NCPUS+0]); \ hold(ACC_COST); \ pstat_ar[x].p_commtim += ACCCOST; \ pstat_ar[x].p_wtcommtim +=ACC_COST; \ release(om[x*NCPUS+0]); \ } ;\ } ;\ if (stafe(col_mode[x])==OCC) \ < \ pstat_ar[x].numJnst_comm += 3; \ pstat_ar[x].p_commtim += 3*RISC_OP; \ pstat_ar[x3.p_wtcommtim += 3*RISC_0P; \ reserve(om[x*NCPUS+0]); \ hold(FIXED_ACC_COST); \ pstat_ar[x].p_commtim += FIXED_ACC_COST; \ pstat_ar[x].p_wtcommtim += FIXED_ACC_COST; \ release(om[x*NCPUS+0]); \ pstat_ar[x].num_inst_comm+= 1;\ pstat_ar[x].p_commtim += RISC_OP; \ pstat_ar[x].num_om += Ig; \ for ($$idxx=0; $$idxx < Ig; $$idxx++) { \ reserve(om[x*NCPUS+0]); \ hold(ACC_COST); \ pstat_ar[x].p_commtim += ACC_COST; \ pstat_arlx].p_wtcommtim += ACC_COST; \ release(om[x*NCPUS+0]); \ } ;\ } ;\ asm(CLOCKON) /* simulate a PIPELINED READ from proc x in row or column mode. For a VECTOR_IN, we have to account for data buffer accesses separately. 
“hold(SCAL_ACC_COST) = 1 i860 instruction (stretched) checking state flag = 3 i860 instructions */ ffdefine VECTOR_IN(x) asm(CLOCK_OFF); \ hold(t_hold[x]);\ t_hold[x] = 0.0; \ int $Sidxx; \ if (state(row_mode[x])==OCC) \ {\ pstat_ar[x].num_inst_comm += 3; \ pstat_ar[x].p_commtim += 3*RISC_OP; \ reserve(om(x*NCPUS+0]); \ hold(FIXED_ACC_COST+ACC_COST); \ pstat_ar[x].p_commtim += (FIXEDACCCOST+ACCCOST); \ release(om[x*NCPUS+0]); \ pstat_ar[x].num_om += 1; \ pstat_ar(x].num_inst_comm += 1; \ pstat_ar[x].p_commtim += RISC_OP; \ for ($$idxx=1; $$idxx < NCPUS; $$idxx++) \ { \ reserve(om[x*NCPUS+$$idxx]); \ hold(INCR_ACC_COST); \ pstat_ar[x].p_commtim += INCR_ACC_COST; \ release(om[x*NCPUS+$$idxx]); \ } ;\ } \ else if (state(col_mode[x))==OCC) \ { \ pstat_ar[x].num_inst_comm += 3; \ pstat_ar[x).p_commtim += 3*RISC_0P; \ reserve(om[0*NCPUS+x]); \ hold(FIXED_ACC_COST+ACC_COST); \ pstat_ar[x].p_commtim += (FIXED_ACC_COST+ACC_COST); \ retease(om[0*NCPUS+x]); \ pstat_ar[x].num_om += 1; \ pstat_ar[x].num_inst_comm += 1; \ pstat_ar[x].p_commtim += RISC_OP; \ for ($$idxx=1; $$idxx < NCPUS; $$idxx++) \ { \ reserve(om[$$idxx*NCPUS+x]); \ hold(INCR_ACC_COST); \ pstat_ar[x].p_commtim += INCR_ACC_COST; \ release(om[$$idxx*NCPUS+x]); \ } ;\ > :\ asm(CL0CK_0N) /* simulate a PIPELINED WRITE from proc x in row or column mode. For a VECTOR_OUT, we have to account for data buffer accesses separately, “hold(SCAL_ACC_COST) = 1 i860 instruction (stretched) checking state flag = 3 i860 instructions * 1 ffdefine VECTOR_OUT(x) asm (CLOCKOFF); \ hold(t_hold[x]);\ t_hold(x] = 0.0;\ int $$idxx; \ if (state (row_mode[xj)==OCC) ( \ pstat_ar[x].num_inst_comm += 3 ;\ pstat_ar[x].p_commtim += 3*RISC_OP; \ 121 reserve(om[x*NCPUS+0]); \ hold(FIXED_ACC_COST+ACC_COST); \ pstat_ar[x].p_commtim += (FIXED_ACC_COST+ACC_COST); \ release(om[x*NCPUS+0]); \ pstat_ar[x].num_om += 1; \ pstat_ar[x].num_inst_comm += 1; \ pstat_ar[x].p_commtim += RISC_OP; \ for ($$idxx=l; $$idxx < NCPUS; $$idxx++) \ < \ reserve (om[x*NCPUS+$$idxx]); \ hold(INCRACCCOST); \ pstat_ar[x].p_commtim += INCR_ACC_COST; \ release(om[x*NCPUS+S$idxx]); ^ > ;\ > \ else if (state(col_mode|x])==OCC) \ { \ pstat_ar[x].num_inst_comm += 3; \ pstat_ar[x].p_commtim += 3*RtSC_0P; \ reserve(om[0*NCPUS+x]); \ hold{FIXED_ACC_COST+ACC_COST); \ pstat_ar[x].p_commtim += (FIXED_ACC_COST+ACC_COST); \ release(om[0*NCPUS+x]); \ pstat_ar[x].num_om += 1; \ pstat_ar[x].num_inst_comm += 1; \ pstat_ar[x].p_commtim += RISC_OP; \ for ($$idxx=1; $$idxx < NCPUS; $$idxx++) \ { \ reserve(om[$$idxx*NCPUS+x]); \ hold(INCR_ACC_COST); \ pstat_ar[x].p_commtim += INCR_ACC_COST; \ release(om[$$idxx*NCPUS+x]); \ } ;\ asm(CLOCK_ON) /* Simulate a BLOCK PIPELINED READ from proc x in row or column mode. For a BLK_VECTOR_IN, we have to account for data buffer accesses separately. “hold(SCAL_ACC_COST) = 1 i860 instruction (stretched) checking state flag = 3 i860 instructions x= processor, lg=number of vectors, NCPUS=degree of memory interleaving. 
*/ #define BLK_VECTOR_IN(x,lg) asm(CLOCK_OFF);\ ~ hold (t_hold[x]); \ t_hold[x] = 0.0; { \ int $$idxx,$$idyy; \ if (state(row_mode[x])==OCC) \ (\ _____________________ pstat ar[x].num inst comm + -3 ; \ ___________ ____ _____________ pstat_ar[x].p_commtim += 3*RiSC_OP; \ reserve(om[x*NCPUS+0]); \ hold (FIXED_ACC_COST); \ hold(ACC_COST); \ release(om[x*NCPUS+0]); \ pstat_ar[x].p_commtim += FIXED_ACC_COST; \ pstat_ar[x].p_commtim += ACC_COST; \ for ($$idxx=0; $$idxx < NCPUS; $$idxx++) \ (\ reserve(om[x*NCPUS+$$idxx]); \ hold(INCR_ACC_COST); \ pstat_ar[x].p_commtim += INCR_ACC_COST; \ release(om[)?NCPUS+$$idxx]); \ } ;\ pstat_ar[x].num_om += Ig; \ pstat_ar[x].num_inst_comm += 1; \ pstat_ar[x].p_commtim += RISC_OP; \ for ($$idyy=1; $$idyy < Ig; $$idyy++) \ { \ for ($$idxx=0; $$idxx < NCPUS; $$idxx++) \ (\ reserve(om[x*NCPUS+$$idxxJ; \ hold (I NCR_ACC_C0 ST); \ pstat_ar[x].p_commtim += INCR_ACC_COST; \ release(om[x*NCPUS+$$idxx]); \ } ;\ hold (BLK_ACC_COST); \ pstat_ar[x].p_commtim += BLK_ACC_COST; \ } ;\ if (state{col_mode(x])==OCC) \ {\ pstat_ar[x].num_inst_comm += 3; \ pstat_ar[x].p_commtim += 3*RISC_OP; \ reserve(om[0*NCPUS+x]); \ hold (FIXED_ACC_COST+ACC_COS‘ r); \ release(om[0*NCPUS+x]); \ pstat_ar[x].p_commtim += FIXED_AGC_COST + ACC_COST; \ for ($$idxx=0; $$idxx < NCPUS; $$idxx++) \ (\ reserve(om[$$idxx*NCPUS+x|); \ hold(INCR_ACC_COST); \ pstat_ar[x].p_commtim += INCR_ACC_COST; \ release(om[$$idxx*NCPUS+x]); \ } ;\ pstat_ar[x].num_om += Ig; \ pstat_ar[x).num_inst_comm += 1; \ pstat_ar[x].p_commtim += RISC_OP; \_____________ 122 for ($$idyy=1; $$idyy < Ig; $$idyy++) \ (\ for ($$idxx=0; $$idxx < NCPUS; $$idxx++) \ { \ res8rve(om[$$idxx*NCPUS+x]); \ hold(INCR_ACC_COST); \ pstat_ar[x].p_commtim += INCR_ACC_COST; \ release(om[$$idxx*NCPUS+x]); \ } ;\ hold (BLK_ACC_COST); \ pstat ar[x].p_commtim += BLKjACC_COST; \ } ;\ ~ } ;\ } ;\ asm(CLOCKON) /* simulate a BLOCK PIPELINED WRITE from proc x in row or column mode. For a BLK_VECTOR_OUT, we have to account for data buffer accesses separately. “ hold(SCAL_ACC_COST) = 1 i860 instruction (stretched) checking state flag = 3 i860 instructions x= processor, lg=number of vectors, NCPUS=degree of memory interleaving. 
*/ #define BLK_VECTOR_OUT(x,lg) asm(CLOCKOFF); \ hold(t_hold[x]);\ t_hold[x] = 0.0; \ < \ int $$idxx,$$idyy; \ if (slate(row_mode[x])==OCC) \ { \ pstat_ar[x].num_inst_comm +=3; \ pstat_ar[x].p_commtim += 3*RISC_OP; \ reserve(om[x*NCPUS+0]); \ hold(FIXED_ACC_COST);\ hold(ACC_COST); \ release(om[x*NCPUS+0]); \ pstat_ar[x].p_commtim += FIXED_ACC_COST; \ pstat_ar[x].p_commtim += ACC_COST; \ for ($$idxx=0; $$idxx < NCPUS; $$idxx++) \ { \ reserve(om[x*NCPUS+$$idxx]); \ hold(INCR_ACC_COST); \ pstat_ar[x].p_commtim += INCR_ACC_COST; \ release(om{x*NCPUS+$$idxx]); \ pstat_ar[x].num_om +=lg; \ pstat_ar(x].num_inst_comm +=1; \ pstat_ar[x].p_commtim += RISC_OP; \ for ($$idyy=1; $$idyy < Ig; $$idyy++) \ < \ for ($$idxx=1; $$idxx < NCPUS; $$idxx++) \ { \ reserve(om[x*NCPUS+$$idxx]}; \ hold(INCR_ACC_COST); \ pstat_ar[x].p_commtim += INCR_ACC_COST; \ release(om[x*NCPUS+$$idxx]); \ } ;\ hold(BLKACCCOST); \ pstat ar[x].p commtim += BLK_ACC COST; \ > ;\ “ } ;\ if (state(col_mode[x])==OCC) \ (\ pstat_ar[x].num_inst_comm +=3; \ pstat_ar[x].p_commtim += 3*RISC_OP; \ reserve(om[0*NCPUS+x]); \ hold(FIXED_ACC_COST+ACC_COST); \ release{om[0*NCPUS+x]); \ pstat_ar[x].p_commtim += FIXED_ACC_COST + ACC_COST; \ for ($$idxx=0; $$idxx < NCPUS; $$idxx++) \ (\ reserve(om($$idxx*NCPUS+x]); \ hold(INCR_ACC_COST); \ pstat_ar[x],p_commtim += INCR_ACC_COST; \ release(om[$$idxx*NCPUS+x]); \ > ; \ pstat_ar[x].num_om += Ig; \ pstat_ar[x].num_inst_comm += 1; \ pstat_ar[x].p_commtim += RISC_OP; \ for ($$idyy=1; $$idyy < Ig; $$idyy++) \ U for ($$idxx=0; $$idxx < NCPUS; $$idxx++) \ (\ reserve(om[$$idxx*NCPUS+x]); \ hold(INCR_ACC_COST); \ pstat_ar[x].p_commtim += INCR_ACC_COST; \ release(om[$$idxx*NCPUS+x]); \ } ;\ hold(BLK_ACC_COST); \ pstat ar[x].p_commtim += BLK_ACC_COST; \ } ;\ " } \ asm(CLOCKON) INITIAUZATION routines ' I 123 i ---------------------------------------------- ------------------------------- | /* routine to initialize facilities */ i csimjnitO j { ' int i; | /* declare extra event variables if necessary */ I if (NCPUS <= 64) { I i = max_events(200); i i = max_facilities(5000); i i = max_servers(5000); } else { i = max_events(550); i = max_facilities(20000); i = max_servers (20000); } ; I * declare facilities and events */ s_bus = facility(“ s_bus” );' facility_set(om,”om” ,NCPUS*N CPUS); facility_set(r_bus,”r_bus”,NCPUS); I * inluded for crossbar-connected */ g_done = event(“ g_done” ); event_set(row_mode,” row_mode”,NCPUS); event_set(col_mode,” col_mode”,NCPUS); event_set(c_set,” c_set”,CNTSIZ); I * instantiate and initialize semaphore structures */ for (i = 0; i < CNTSIZ; i++) { cntr[i) = (struct ^COUNTER * ) malloc(sizeof(structt_COUNTER)); cntr(i]->c_num = 0; cntr(i]->init = 0; } ; I * initialize the instruction counting structures and holds array */ for (i = 0; i < NCPUS; i++) { pstat_ar[i].num_om = 0; pstat_ar[i].num_lm = 0; pstat_ar[i].num_buf = 0; pstat_ar[i].num_inst = 0; pstat_ar[i].num_inst_comp = 0; I pstat_ar(il.num_inst_comm = 0; j pstat_ar[i].num_inst_sync = 0; I pstat_ar[i].num_synch = 0; pstat_ar[i].p_time = 0.0; pstat_ar[i].p_comptim = 0.0; pstat_ar[i].p_synctim = 0.0; pstat_ar[i].p_commtim = 0.0; pstat_ar[i].p_wtcommtim = 0.0; pstat_ar[i].p_rdcommtim = 0.0; t hold[i] = O X ); } ; } /* csim-init */ / * “.... CSIM Modeling of Hypercube #include <stdio.h> #include <math.h> #include “ csim.h” / ..... 
Time Constants - Worst case values Units of simulated time = microseconds #define LM_ACCESS 0.150 #define RISC_0P 0.0303 #define BUF_ACC 0.050 #define START_UP 95.0 #define INCR_COMM 0.394 #define DIM_COST 10.3 #define CNTRDY 3 CSIM Representation for OMP components and machine state FACILITY link[NCPUS*DIM]; /* communication links in hypercube */ EVENT g_done; /* global EVENT dec! to detect end */ EVENT data_rdy[CNTRDY][NCPUS]; /* event variable to indicate data ready situation */ Simulator variables int act, i; float ts 1 [NCPUS], ts2[NCPUS]; float tsum; float tc 1 [NCPUS], tc2 [NCPUS]; float t_hold[NCPUS]; /* variable for consolidating holds */ #ifdef SUN4 #define CLOCK_ON “!#clock_on" #define CLOCK_OFF “!#clock_off” V /* Local Memory access time */ /* Time for int or fp RISC op on i860,*/ r 33 Mips rating for 40MHz chip */ I* Time to readfwrite) one datum from(to) */ /* processor board */ I* data buf or communication data buffer */ /* start-up time for communication */ /* increment access time * 1 r incremental cost for each dim */ I* number of ready signals */ #endif #ifdef SUN3 #de1ine CLOCK_ON “ |#clock_on" #de1ine CLOCK_OFF “|#clock_off" ffendif /* structure for accumulating execution statistics for a process (in addition to CSIM’ s built-in mechanisms) */ struct pstat { long num_om; /* orthogonal memory accesses */ long numjm; I * local memory accesses */ long num_buf; /* data buffer accesses */ long numjnst; /* number of RISC instructions executed */ long num_inst_comp; /* number of inst used for computation */ long num_inst_comm; 1 * number of inst used for communication */ long num_inst_sync; 1 * number of inst used for synchronization */ long num_synch; r number of synchronizations done */ float p_time; 1 * total time spent by the processor */ float psynctim; I * process synchronization time */ float p_commtim; 1 * process communication time */ float p_comptim; I * process computation time */ long numjink; I * number of link communication */ long num_bytes; 1 * number of bytes transfered */ } pstat_ar(NCPUS]; CSIM macros - Verbs for simulation only /* access Local Mem for processor x */ #define ACC_LM(x) asm(CLOCK_OFF); \ hold(LM_ACCESS);\ pstat_ar[x].p_comptim += LM_ACCESS; \ pstat_ar[x].numjm += 1; \ pstat_ar[x].num_inst_comp += 1; \ asm(CLOck_ON) I * account for Local Mem data fetch */ ffdefine _FETCH_LM(x) asm(CLOCK_OFF); \ t_hold[x] += LM_ACCESS; \ pstat_ar[x].p_comptim += LM_ACCESS; \ pstat_ar[x].num_lm += 1; \ pstat_ar[x].num_inst_comp += 1; \ asm(CLOCK_ON) /* advance simulation time by n to account for compute activity on proc x * 1 #define _ADV_TM(n,x) asm(CLOCK_OFF); \ t_hold[x] += n*RISC_OP; \ pstat_ar[x].p_comptim += n*RISC_OP; \ pstat_ar[x].num_inst_comp += n; \ asm(CLOCK_ON) r advance simulation time by n to account for communication activity on proc x */ #define _ADV_TMC(n,x) asm(CLOCK OFF); \ t_hold[x] += n*RISC_OP“ \ pstat_ar[x].p_commtim += n*RISC_OP; \ pstat_ar[x].num_inst_comm += n; \ asm(CLOCK_ON) /* advance simulation time by n to account for synchronization activity on proc x */ #define _ADV_TMS(n,x) asm(CLOCKOFF); \ t_hold[x] += n*RISC_OP; \ pstat_ar[x].p_synctim += n*RISC_OP; \ pstat_ar[x].num_inst_sync += n; \ asm(CLOCKON) #detine REPORT reportO; \ hyp_reportO ;\ mdlstatQA hyp_sum_commtimeO int $$idxx; tsum = 0.0; for ($$idxx = 0; $$idxx < NCPUS; $$idxx++) tsum += pstat_ar[$$idxx].p_commtim; } printffVAn Total communication time\n” ); printff%12.3P, tsum); } count_start(x) intx; { td [x] = pstat_ar[x].p_commtim; ) count_end(x) intx; { tc2[x] = 
pstat_ar[x],p_commtim; tsum += tc2[x] - tel [x]; } hyp_sum_com reportO { printf(“\n\n Total communication time\n” ); printf(“ %12.3f”, tsum); printffVi” ); } hypreportft { int $$idxx; for ($$idxx = 0; $$idxx < NCPUS; $$idxx++) { pstat_ar[$$idxx].numjnst = pstat_ar|$$idxx].numJnst_comp + pstat_ar[$$idxx].num_inst_comm + pstat_ar[$$idxx].num_inst_sync; pstat_ar[$$idxx].p_time = pstat_ar[$$idxx].p_comptim + pstat_ar[$$idxx].p_commtim + pstat_ar[$$idxx].p_synctim; y . printffVAn---------------------Execution Statistics---------------------“ ); printf(“ \n\n PID totaljnst com pjnst commjnst lm_acc buf_acc link_acc bytes_xfered\n” ) for ($$idxx = 0; $$idxx < NCPUS; $$idxx+4) { printf(“ %3d %8d %10d %8d %9d %8d %8d %8d \n”, $$idxx, pstat_ar[$$idxx].numjnst, pstat_ar[$$idxx].numJnst_comp, pstat_ar[$$idxx].num_inst_comm, pstat_ar[$$idxx].num_lm, pstat_ar[$$idxx].num_buf, pstat_ar[$$idxx].num_link, pstat ar[$$idxx].num bytes); > printf(“ \n” ); printffV i”); printf(“ \n\n PID total_time com pjim e waitjim e commJimeVi” ); for ($$idxx = 0; $$idxx < NCPUS; $$idxx++) { printf(“ %3d %13.3f %13.3f %12.3f %12.3f\n", $$idxx, pstat_ar[$$idxx] .p_time, pstat_ar[$$idxx].p_comptim, pstat_ar[$$idxx].p_synctim, pstat_ar[$$idxx].p_commtim); printffAn” ); } ***CSIM macros - Verbs /* create a parallel code segment named “k” for execution on i860 */ #define CRT_TSK(k) asm(CLOCK_OFF); \ create (“ k” );\ asm(CLOCKON) /* terminate parallel code segment on processor pn, record simulation time for process, and set global completion flag if needed */ #define END_TSK(pn) asm(CLOCK_OFF); \ hold (t_hold[pn]> ;\ t_hold[pn] = 0.0;\ act-; \ if (act == 0) set(g_done); \ asm(CLOCKON) /* Macro to take care of hypercube link communication. Processor x sends message to processor y on dimension d. The length of the m essage is Ig bytes */ #define COMMUNICATE(x,y,d,lg) asm(CLOCKOFF); \ hold(t_hold[x]); \ t_hold(x] = 0.0; \ pstat_ar[x].num_inst_comm += 1; \ pstat_ar[x].num_link +=1; \ reserve(link[x*DIM+d]); \ hold(START_UP); \ hold(lg*INCR_COMM); \ hold(DIM_COST);\ pstat_ar[x].p_commtim += START_UP; \ pstat_ar(x].p_commtim += (lg*INCR_COMM); \ pstat_ar[x].p_commtim += DIM_COST; \ pstat_ar[x].num_bytes += Ig; \ release(link[x*DIM+d]); \ asm(CLOCK_ON); /* Macro to take care of hypercube link communication in a circuit switched manner. x=source, y = destination, Ig = length of message. 
*/ #define COMMUNICATE_CIRCUIT(x,y,lg) asm(CLOCK_OFF); \ hold(t_hold[x]); \ t_hold[x) = 0.0; \ pstat_ar(x].num_inst_comm +=1; \ pstat_ar[x).num Jink +=1; { \ int $$idxx,S$idyy,$$idzz,$$idww; \ $$idxx = 0; \ $$idzz = x; \ for ($$idyy=0; $$idyy<=DIM; $$idyy++) \ { \ if ($$idzz != y) \ {\ $$idxx += 1; \ $$idww = FRO UTE_D[x][$Sidzz]; \ reserve(link[$$idzz*OIM+S$idww]}; \ S$idzz = FROUTE[x][$$idzz]; \ y.\ hold(START_UP+$$idxx*DIM_COST+lg*INCR_COMM); \ pstat_ar[x].p_commtim += START_UP + $$idxx*DIM_COST+ lg*]NCR_COMM; \ pstat_ar[x].num_bytes +=lg; \ $$idzz = y; \ for ($$idyy=0; $$idyy<D!M; $$idyy++) \ { \ if ($$idzz != x) \ { \ $$idzz = BROUTE[x][$$idzz]; \ $$idww = FROUTE_D[x][$$idzz]; \ release(link[SSidzz*DIM+S$idww]); \ } ;\ } ;\ > \ asm(CLOCK_ON) /* waits for data to be received in its input buffer */ #define WAIT_DATA_RDY(y,x) asm(CLOCK_OFF); \ hold(t_hold[x]); \ t_hold"[x] = 0.0; \ ts1[x] = simtime0;\ wait(data_rdy[y][x]); \ ts2[x] = simtimeO; \ pstat ar[x].p_synctim +=(ts2[x] -ts1[x]);\ asm(CLOCK_ON); #define CLR_DATA_RDY(y,x) asm(CLOCK_OFF); \ hold(t_hold[x]);\ t_hold[x] = 0.0; \ clear(data_rdy|yj[x]); \ asm(CLOCK_ON); #define SET_DATA_RDY(y,x) asm(CLOCK_OFF); \ hold (t_hold [x]); \ t_hold[x] = 0.0; \ set(data_rdy[y][x]);\ asm(CLOCK_ON); INITIALIZATION routines /* routine to initialize facilities */ csimj'nitO { int i; /* declare extra event variables if necessary */ if (NCPUS <= 64) { i = max_events(200); i = max_facilities(5000); i = max_servers(5000); } else { if (NCPUS <= 512) { i = max_events(2200); i = max_facilities(20000); i = max servers (20000); } else { i = max_processes(2000); i = max_events(4000); i = max_facilities(20000); i = max_servers(20000); > } ; tsum = 0.0; /* tsum initialization */ r declare facilities and events s_bus = facility(“ s_bus” ); facility_set(om,” om” ,NCPUS*NCPUS); facility_set(r_bus,”r_bus”,NCPUS); /* inluded for crossbar-connected */ facility_set(link,’1ink”,NCPUS*DIM); I * communication links */ g_done = event(“g_done” ); event_set(data_rdy,’ ’ data_rdy”,CNTRDY*NCPUS); I * instantiate and initialize semaphore structures */ for (i = 0; i < CNTSIZ; i++) { 127 cntrp] = (struct t_COUNTER * ) malloc(sizeof(structt_COUNTER)); cntr(i]->c_num = 0; cntr(i]->init = 0; } ; /* initialize the data_ready signals */ for (i=0; i < NCPUS; i++) { clear(data_rdy[0][i]); clear(data_rdy[1][0); clear(data rdy[2][i]); } /* initialize the instruction counting structures and holds array */ for (i = 0; i < NCPUS; i++) { pstat_ar[i].num_om = 0; pstat_ar[i].num_lm = 0; pstat_ar[i].num_buf = 0; pstat_ar[i].num_inst = 0; pstat_ar[t].num_inst_comp = 0; pstat_ar[i].num_inst_comm = 0; pstat_ar(0.num_inst_sync = 0; pstat_ar[i].num_synch = 0; pstat_ar[i].p_time = 0,0; pstat_arp].p_comptim = 0.0; pstat_ar[i].p_syndim = 0.0; pstat_ar[i].p_commtim = 0.0 ; pstat_ar[i].num_link = 0; pstat_ar[i].num_bytes = 0; t_hold[i] = 0.0; } ; } /* csim-init */ Appendix B Simulation Program Listings: 1. Image shifting by columns and rows on OMP 2. Image shifting by columns and rows on OMP using on-the-fly indexing 3. Image shifting by columns and rows on CCM 4. Image shifting by columns and rows on CCM using on-the-fly indexing 5. M atrix transposition and m ultiplication on Hypercube 6. M atrix transposition and multiplication on OMP 7. M atrix transposition and multiplication on CCM 8. M atrix row shuffle on Hypercube 9. M atrix row shuffle on OMP and CCM 10. Histogramming on Hypercube 11. Histogramming on OMP 12. Histogramming on CCM 13. Gaussian-elimination on Hypercube 14. 
Gaussian-elimination on OMP 15. Gaussian-elimination on CCM i ! 128 | 129 Translates a PxP image on an NxN OMP without using index manipulation. The image is shifted by ‘col-shift’ number of columns and ‘ row-shift’ number of rows. ^define NCPUS 16 #define SIZE 1024 #detine CNTSZ 16 #define cof_shift 5 #define row shift 3 #include “ omp simS.h" #define N NCPUS #define P SIZE #define LEN (P/N) #define CAP (LEN*LEN) FILE *fptr, *outptr; int OMDI[CAP][N][N]; int OMDO[CAP][N][N]; int VRW[N][P]; int TEMP[N](P]; input_arraysO { int i,j,k,row,offset1 .offset; fptr = fopen (“ Image”, “r ”); for (i=0; i <P; i++) ( row = i % N; offset 1 = i/N; for (j=0;j <P;j++) { k = j % N; offset = (offsetl *(P/N)) + (j/N); fscanfffptr, “ %d”, &OMDl[offset][row][k]); } } fclose(fptr); trans(pn) int pn; { int temp, i, j, k, 11, ind,y1 ,y2,y3, cnt; cnt=0; _ADV_TM(1,pn); create(“ trans"); SET_ROWACC(pn); cnt s (cnt+1) % CNTSZ; _ADV_TM(8,pn); TSETCNT(cnt,N,pn); _ADV_TM(5,pn); for (i=0; i < LEN; i++) ( _ADV_TM(5,pn); for (j=0; j < LEN; j++) { PIPEJN(pn); 1 1 = (i *LEN) +j; _ADV_TM(3,pn); for (k=0; k<N; k++) { ind = (j*N) + k ; VRW[pn][ind] = OMDI[l1][pn][k]; } _ADVTM(5,pn); } _ADV_TM(5,pn); for (yl=0; yi < col shift; y1++) { _FETCH_VRW(pn); _ADV_TM(7,pn); TEMP[pn](y1] =0; _ADV TM(5,pn); } y2 = P-coi_shift; _ADV_TM^2,pn); _ADV_TM(5,pn); for (yi =0; yi <y2;yl++) < y3=y1+col_shift; _FETCH_VRW(pn); FETCH VRW(pn); TEMP[pnj[y3] = VRW [pn][yl]; _ADV_TM(17,pn); _ADV~TM(5,pn); } _ADV_TM(5,pn); for (j=0; j < LEN; j++) ( PIPE_OUT(pn); 1 1 =<i*LEN)+j; r overhead for mod operation */ /* loop initialization */ /* loop initialization */ /* Index manipulator init */ /* loop increment */ I * loop initialization */ I * VRW access */ I * instr overhead for filling pattern */ /* fills upO*/ r loop increment */ I * overhead for subtraction */ I * loop initialization */ I * VRW access*/ /* VRW access*/ /* overhead for duplicating */ I * loop increment */ /* loop initialization */ _ADV_TM(3,pn); for (k=0; k<N; k++) < ind = (j*N) + k ; 0 M DO P1 ][pn][k] = TEMP[pn][ind]; } ADV_TM(5,pn); > _ADV_TM(5,pn); } C LR_RO WACC (pn); SYNCH (cnt, pn); SET_COLACC(pn); cnt=(cnt+i) % CNTSZ; _ADV_TM(8,pn); TSETCNT(cnt,N,pn); ADV_TM(5,pn); for (i=0; i < LEN; i++) ( _ADV_TM(5,pn); for (j=0; j < LEN; j++) { PIPEJN(pn); 1 1 = { j *LEN) +i; _ADV_TM(3,pn); for (k=0; k<N; k++) { ind = (j*N) + k ; VRW[pn][ind] = OMDO(l1][k][pn]; > _ADV_TM(5,pn); } _ADV_TM(5,pn); for (yi =0; yi < rowshift; yi ++) { _FETCH_VRW(pn); _ADV_TM(7,pn); TEMP[pn][y1] = 0; _ADV_TM(5,pn); } y2 = P-row_shift; ADV TM(2,pn); ~ADV~TM(5,pn); for (yl=0; yi < y2; y1++) { y3=y1 +row_shift; FETCH VRW(pn); _FETCH_VRW(pn); /* Index manipulator tn rf */ I * loop initialization */ r loop increment */ I * overhead for mod operation * 1 t * loop initialization * 1 I * loop initialization */ /* Index manipulator init */ / * loop increment */ t * loop initialization */ /* VRW access */ /* instr overhead for filling pattern */ /* fills up 0 */ /* loop increment */ r overhead for subtraction */ /* loop initialization */ /* VRW access */ /* VRW access */ TEMP[pn][y3] = VRW [pn](y1J; _ADV_TM(17,pn); _ADV_TM(5,pn); } _ADV_TM(5,pn); for (j=0; j < LEN; j++) ( PIPE_OUT(pn); 1 1 =(j*LEN)+i; _ADV_TM(3,pn); for (k=0; k<N; k++) { ind = (j*N) + k ; OMDI[il][k][pn] = TEMP[pn][ind); } ADV TM(S,pn); F _ADV_TM(5,pn); /* loop increment */ } CLR_COLACC (pn); SYNCH(cnt.pn); END_TSK(pn); simO { int i; createf'sim"); csiminitO; input_arraysO; act = N; for (i=0; i<N; i++) trans(i); wait(g_done); REPORT; Translates a PxP image on an 
NxN OMP without using index manipulation. The image is shifted by ‘ cot-shift’ number of columns and ‘row-shift’ number of rows. */ #define NCPUS 16 /* overhead for duplicating */ /* loop increment */ /* Index manipulator init */ I* loop initialization */ 131 #define SIZE 1024 ffdefine CNTSZ 16 #define col_shift 5 #define row_shift 3 ffinclude “ omp_sim6.h” #define N NCPUS #define P SIZE Adeline LEN (P/N) Adeline CAP (LEN*LEN) FILE *fptr, *outptr; int 0MDI[CAP][N][N]; int OMDO[CAP][N][N]; int VRW[N][P]; int TEMP[N][P]; inputarraysO { int i,j,k,row,offsetl .offset; fptr = fopen (“ Image", “r ” ); for (i=0; i <P; i++) { row = i%N; offsetl = i/N; for (J=0; j <P; j++) { k = j%N; offset = (offsetl *(P/N)) + (j/N); fscanf(fptr, “ %d”, &OMDI[offset][row][k]); } } fclose(fptr); trans(pn) int pn; { int temp, i, j, k, 11, ind.yl ,y2,y3, cnt; cnt=0; _ADV_TM(1,pn); create(“ trans” ); SET_ROWACC(pn); cnt = (cnt+1) % CNTSZ; _ADV_TM(8,pn); TSETCNT (cnt,N,pn); _ADV_TM(5,pn); I * overhead for mod operation */ /* loop initialization */ for (i=0; i < LEN; i++) { _ADV_TM(S,pn); for (j=0; j < LEN; j++) { PIPEJN(pn); 1 1 = (i *LEN) +j; _ADV_TM(3,pn); for (k=0; k<N; k++) { ind = (j*N) + k ; VRW[pn][ind] = OMDI[l1](pn][k]; } _ADV_TM(5,pn); } _ADV_TM(5,pn); for (yi =0; yi < colshift; yl++) TEMP[pn][y1] = 0; y2 = P-col_shrft; for(y1=0; y1 <y2;y1++) { y3=y1 +col_shift; TEMP[pn|[y3] = VRW [pn](y1]; } _ADV_TM(5,pn); for (j=0; j < LEN; j++) { PIPE_OUT(pn); _ADV_TM(10,pn); 1 1 = (i*LEN)+j; _ADV_TM(3,pn); for (k=0; k<N; k++) ind = (j*N ) + k; 0MD0[l1][pn][k] = TEMP[pn][ind]; _ADV_TM(5,pn); /* loop initialization */ > _ADV_TM(5,pn); /* loop increment */ CLR_ROWACC(pn); SYNCH (cnt,pn); SET_COLACC(pn); cnt=(cnt+1) % CNTSZ; _ADV_TM(8,pn); TSETCNT (cnt.N ,pn); /* loop initialization */ I * Index manipulator init */ /* loop increment */ /* loop initialization */ r No overhead due to OIM */ /* fills up 0*/ I * loop initialization */ r additional overhead for */ r selecting appropriate index set */ /* Index manipulator init */ /* overhead for mod operation */ _ADV_TM(5,pn); for (i=0; i < LEN; i++) { _ADV_TM(5,pn); for (j=0; j < LEN; j++) { PIPEJN(pn); 1 1 = (j *LEN) +i; _ADV_TM(3,pn); for (k=0; k<N; k++) { ind = (j*N ) + k; VRW[pn][ind] =0MD0[l1][k][pn]; } _ADV_TM(5,pn); } _ADV_TM(5,pn); for (y1 =0; y1 < row_shift; y1++) TEMP[pn][y1] = 0; /* fills up 0 */ y2 = P-row_shift; for (y1=0; y1 < y2; y1++) { y3=y1+row_shift; TEMP[pn][y3] = VRW [pn][y1]; > _ADV_TM(5,pn); for (j=0; j < LEN; j++) { PIPE_OUT(pn); _ADV_TM(10,pn); 1 1 = (j *LEN) +i; _ADV_TM(3,pn); F or (k=0; k<N; k++) { ind = (j*N) + k; 0MDI[l1][k][pn] =TEMP[pn][ind]; } ADV TM(5,pn); > ADV_TM(5,pn); } CLR_COLACC(pn); SYNCH (cnt,pn); END_TSK(pn); /* loop initialization */ /* loop initialization */ /* Index manipulator init */ /* loop increment */ /* loop initialization */ /* no overhead for OIM */ /* loop initialization */ I * additional overhead for */ /* selecting appropriate index set */ /* Index manipulator init */ /* loop initialization */ /* loop increment */ simQ { int i; create(“sim” ); csimJnitO; input_arraysO; act = N ; for (i=0; i<N ; i++) trans(i); wait(gdone); REPORT; r .... . Translates a PxP image on an NxN CCM without using index manipulation. The image is shifted by ‘ col-shift’ number of columns and ‘ row-shift’ number of rows. 
Uses scalar-oriented operations for simulating column interleaving*/ #define NCPUS 1G #define SIZE 1024 #define CNTSZ 16 #define col_shift 5 #define row shift 3 #include “ omp_sim6.h” #define N NCPUS #define P SIZE #define LEN (P/N) #define CAP (LEN*LEN) FILE *fptr, ‘outptr; int OMDI[CAP][N][NJ; int OMDO[CAP][N][N]; int VRW[N][P]; int TEMP[N][P]; input_arraysO { int i,j,k,row,offsetl,offset; fptr = fopen (“Image”, “r ” ); for (i=0; i <P; i++) { row = i % N; offsetl = i/N; for (j=0; j <P; j++) { k = j % N; offset - (offsetl *<P/N)) + (j/N); fscanfffptr, “ %d” , &0MDl[offset][row][k]); } } fclose(fptr); } trans(pn) int pn; { int temp, i, j, k, 1,11,12, ind,y1 ,y2,y3, cnt,rowb,row sh.rl ,i1; cnt=0; _ADV_TM(1,pn); create(“ trans” j; SET_ROWACC(pn); cnt = (cnt+1) % CNTSZ; _ADV_TM(8,pn); TSETCNT(cnt,N,pn); _ADV_TM(5,pn); for (i=0; i < LEN; i++) { _ADV_TM(5,pn); for (j=0; j < LEN; j++) { PIPEJN(pn); II =(i*LEN)+j; _ADV_7M(3,pn); for (k=0; k<N; k++) ind = (j*N) + k; VRW[pn][indJ = 0MDI[l1][pn][k]; } _ADV_TM(5,pn); } _ADV_TM(5,pn); for (y1 =0; y1 <col shift; y1++) { _FETCH_VRW(pn); _ADV_TM(7,pn); TEMP[pn][y1J = 0; _ADV_TM(5,pn); } y2 = P-col shift; _ADV_TM(2,pn); _ADV_TM(5,pn); for (yi =0; y1 < y2; y1++) /* overhead for mod operation */ I * loop initialization */ /* loop initialization */ I * Index manipulator init */ /* loop increment */ /* loop initialization */ I * VRW access */ /* instr overhead for filling pattern */ /* fills up 0 */ /* loop increment */ /* overhead for subtraction */ /* loop initialization */ y3=y1+col_shift; FETCH VRW(pn); ~FETCH~VRW(pn); TEMP[pn][y3] = VRW Ipn][y1J; _ADV_TM(l7,pn); ~ADV~TM(S,pn); } _ADV_TM(5,pn); for (j=0; j < LEN; j++) { PIPE_OUT(pn); 1 1 = (i *LEN) +j; _ADV_TM(3,pn); for (k=0; k<N; k++) { ind = (j*N ) + k; OMDO[H][pn][k] = TEMP[pn][ind]; _ADV_TM(5,pn); } _ADV_TM(5,pn); } CLR_ROWACC(pn); SYNCH (cnt,pn); SET_COLACC(pn); cnt=(cnt+1) % CNTSZ; _ADV_TM(8,pn); TSETCNT(cnt,N,pn); _ADV_TM(5,pn); for (i=0; i < LEN; i++) { _ADV_TM(5,pn); for (j=0; j < LEN; j++) { _ADV_TM(5,pn); for (1-0; l<N; I++) { rowb = (l+pn) % N; _ADV_TM(8,pn); CROSS_ROW_IN(pn,rowb,pn); SYNCH(cnt.pn); cnt=(cnt+1) % CNTSZ; _A0V_TM(8,pn); TSETCNT (cnt,N ,pn); 1 1 = (j *N) +rowb; _A0V TM(4,pn); 1 2 = (jTEN) + i; T VRW access */ rVRW access*/ f overhead for duplicating */ /* loop increment */ / * loop initialization */ I * Index manipulator init */ I * loop initialization */ /* loop increment */ /* overhead f or mod operation */ /* loop initialization */ I * loop initialization */ I * synchronization */ r Index manipulator init */ VRW[pn][M] = OMDO[l2][rowb](pn]; ADV TM(5,pn); T _ADV_7M(5,pn); } for (yi =0; y1 < row shift; y1++) { _FETCH_VRW(pn); _ADV_TM(7,pn); TEMPlpn][y1] = 0; _ADV_TM(5,pn); } y2 = P-row_shift; _ADV_TM(2,pn); _ADV_TM(5,pn); for (y1=0;y1 <y2;y1++) { y3=y1+row_shift; _FETCH_VRW(pn); _FETCH_VRW(pn); TEMP[pn][y3] = VRW [pn](yl]; _ADV_TM(17,pn); ADV_TM(5,pn); } _ADV_TM(5,pn); for (j=0; j < LEN; j++) { _ADV_TM(5,pn); for <l=0; l<N;!++) i rowb = (l+pn) % N; ADV TM(8,pn); CROSS_ROW_OUT (pn,rowb,pn); SYNCH (cnt,pn); /* synchronization */ cnt=(cnt+1) % CNTSZ; ADV TM(8,pn); TSETCNT (cnt,N ,pn); 1 1 = (j *N) +rowb; ADV_TM(4,pn); 1 2 = (j*LEN) +i; OMDI(l2](rowb][pn] = TEMP[pn][l1]; _ADV_TM(5,pn); > _ADV_TM(5,pn); } } CLR_COLACC(pn); SYNCH (cnt,pn); /* loop increment */ /* VRW access */ /* instr overhead for filling pattern */ /•fills u p 0*/ /* loop increment */ 1 * overhead for subtraction */ /* loop initialization */ /* VRW access */ /* VRW access */ /* overhead for duplicating */ /* loop 
increment */ /* Index manipulator init */ I * loop increment */ END_TSK(pn); sim O ( int i; createfsim” ); csim_init(); input_arraysQ; act = N ; for (i=0; i<N; i++) trans(i); wait(g_done); REPORT; Translates an PxP image on an NxN OMP without using index manipulation. The image is shifted by ‘ col-shift’ number of columns and ‘ row-shift’ number of rows. Uses scalar-oriented operations for simulating column interleaving . #define NCPUS 16 #define SIZE 1024 Sdefine CNTSZ 16 #define col_shift 5 #define row_shift 3 #include “ omp_sim6.h" #define N NCPUS #define P SIZE #define LEN (PIN) ffdefine CAP (LEN*LEN) FILE *fptr, *outptr; int OMDI[CAP](N][N]; int OMDO[CAP][N][N]; int VRW[N][P]; int TEMP[N][P]; inputarraysO { int i,j,k,row,offsetl .offset; fptr = fopen (“ Image”, “ r " ); for (i=0; i <P; i++) { row = i % N; offsetl = i/N; for (j=0; j <P; j++) { k = j% N; offset = (offsetl *(P/N» + (j/N); fscanf(fptr, “ %d”, &0 M DI [offset] [row] [k]); } ) fclose(fptr); trans(pn) intpn; { int temp, i, j, k, 1,11,12, ind,y1 ,y2,y3, cnt,rowb,row sh,r1 ,i1; cnt=0; _ADV_TM(1,pn); create (“ trans” ); SET_ROWACC(pn); cnt = (cnt+1) % CNTSZ; _ADV_TM(8,pn); TSETCNT(cnt ,N,pn); _ADV_TM(5,pn); for (i=0; i < LEN; i++) { _ADV_TM(5,pn); for (j=0; j < LEN; j++) { PIPEJN(pn); 1 1 = (i *LEN) +j; _ADV_TM(3,pn); for (k=0; k<N; k++) { ind = (j*N) + k ; VRW[pn][ind] = 0MDI[l1][pn][k]; } _ADV_TM(5,pn); } for (y1=0; y1 < col shift; y1++) { TEMP[pn][y1] =0; } y2 = P-col_shift; _ADV_TM(2,pn); for (y1 =0; y1 < y2; y1++) /* overhead for mod operation */ /* loop initialization */ I* loop initialization */ /* Index manipulator init */ /* loop increment */ /* no overhead for OIM */ /•fills up 0*/ I* overhead for subtraction */ y3=yHcol shift; TEMP[pn][y3] = VRW [pn][y1]; } _ADV_TM(5,pn); for 0=0; j < LEN; j++) PIPE_OUT(pn); _ADV_TM(10,pn); II = (i *LEN) +j; _ADV_TM(3,pn); for (k=0; k<N; k++) { ind = 0*N) + k ; OMDO[l1][pn][k] = TEM P[pn][ind]; } ADV TM(5,pn); } _ADV_TM(5,pn); } CLRROWACC(pn); SYNCH (cnt,pn); SET_COLACC(pn); cnt=(cnt+1) % CNTSZ; _ADV_TM(8,pn); TSETCNT (cnt,N ,pn); _ADV_TM(5,pn); for (i=0; i < LEN; i++) { _ADV_TM(5,pn); for (j=0; j < LEN; j++) { _ADV_TM(5,pn); for (1=0; l<N; I++) { rowb = (l+pn) % N; _ADV_TM(8,pn); CROSS_ROW_IN (pn,rowb,pn); SYNCH(cnt.pn); cnt=(cnt+1) % CNTSZ; _ADV_TM(8,pn); TSETCNT(cnt,N,pn); 1 1 = (j *N) +rowb; ADV_TM(4,pn); £ = (j*LEN) + i; VRW[pn)[l1] = OMDOP2][rowb][pn]; ADV_TM(5,pn); } /* loop initialization */ I* additional overhead for */ r selecting appropriate index set */ /* Index manipulator init */ /* loop initialization */ /* loop increment */ /* overhead for mod operation */ I* loop initialization */ /* loop initialization •/ /* synchronization */ I* Index manipulator init */ ADV TM(5,pn); } for (yl=0; yl < row shift; yl++) { _FETCH_VRW(pn); ~ADV_TM(7,pn); TEMP[pn][y1] = 0; _ADV_TM(5,pn); } y2 = P-row_shift; _ADV_TM(2,pn); _ADV_TM(5,pn); for (yl=0; yl < y2; yl++) { y3=y1+row shift; _FETCH_VRW(pn); _FETCH_VRW(pn); TEMP[pn][y3] = VRW [pn][y1]; _ADV TM(17,pn); _ADV'TM(5,pn); } _ADV_TM(S,pn); for (j=0; j < LEN; j++) { _ADV_TM(5,pn); for (l=0; l<N; I++) { rowb = (l+pn) % N; _ADV_TM(8,pn); CROSS_ROW_OUT{pn,rowb,pn); SYNCH (cnt.pr^; cnt=(cnt4l) % CNTSZ; _ADV_TM(8,pn); TSETCNT (cnt,N ,pn); 1 1 = ( j *N) +rowb; _ADV_TM(4,pn); 1 2 = (j*LEN) +i; OMDI[l2][rowb][pn] = TEMP[pn][l1]; _ADV_TM(5,pn); } _ADV_TM(5,pn); } } CLR_COLACC(pn); SYNCH(cnt,pn); END_TSK(pn); r loop increment */ /* VRW access */ I* instr overhead for filling pattern */ /* fills up 0 */ I* loop increment */ /* overhead for 
subtraction */ /* loop initialization */ /»VRW access*/ /* VRW access*/ /* overhead for duplicating */ /* loop increment */ /* synchronization */ /* Index manipulator init */ /* loop increment */ sim0 { int i; createfsim” ); csim_initO; input_arrays(); act = N; for (i=0; i<N; i++) trans(i); wait(g_done); REPORT; Multiplies two matrices A and B on a hypercube using circuit switched link communication. M essages are sent using the pairwise algorithm proposed by Bokhari. ---- #de1ine DIM 4 Adeline NCPUS 16 #de1ine SIZE 5 12 #de1ine CNTSZ 16 #include “hyp_sim2.h” #define M NCPUS #define M 1 NCPUS/2 #define D DIM #define P SIZE #define LEN (P/M) FILE *fplr, *outptr; int A[M][LEN][P]; int 8[M][P][LEN]; int 8T[M][LEN][P]; int C(M][LEN][P]; int BUF(M][P][LEN]; int FROUTE[M]EM]; int FROUTE D[M][M]; int BROUTE[M][M]; int BIND[M]; input arrays 0 { int i,j,k; fptr = fopen (7home/aloha/panda/simtemp/matrix_A”, “r ” ); for (i=0; i <M; i++) for (j=0; j <LEN; j++) for (k=0; k < P; k++) { fscanf(fptr, “ %d”, & A[i][j][k]); BT[G Ifl[k] = A[i][j][k]; } fclose(fptr); for (k=0; k < P; k++) Cti]ffl[kl = 0; > trans(pn) int pn; { int ni,j,k,H,l2,l3,lg; int z1 ,z2,z3,z4,z5,z6,msg_length,dst; create(“ trans"); msgjength = LEN*LEN; ADV TM(7,pn); ~ADV~TM(5,pn); for (ni=1; ni<M; ni++) { dst = pn A ni; _ADV_TMC(7,pn); routing(pn,pn,dst); COMMUNICATE_CIRCUIT(pn,dst .msglength); for (z2 = 0; z2 < LEN; z2++) { for (z3 = 0; z3 < P; z3++) { z4 — Z3/LEN; z5 = z3 % LEN; z6 = (pn*LEN) + z2; B[z4][z6][z5] = BT[pn][z2][z3]; } } ; SET_DATA_RDY (dst); W AI T_DATA_RD Y (pn); _ADV_TM(57pn); } END TSK(pn); } /* determining message length */ /* loop initialization */ /* determining the destination */ /* determining routing path */ I * performs the communication */ /* copies data from BT to B */ I * loop exit */ compute(pr) int pr; { int x1 ,x2,x3,x4,bindex,temp, y1, y2; _ADV_TM (S,pr); /* loop initialization */ for (x2=0; x2<LEN; x2++) { _ADV_TM(5,pr); /* loop initialization */ for (x3=0; x3<LEN; x3++) { r temp=0; _ADV_TM(1 ,pr); _ADV_TM(5,pt); for (x4=0; x4<P; x4++) { temp += A[pr][x2][x4] * BUF[pr][x4][x3]; _ADV_TM(2,pr); _FETCH_LM(pr); _FETCH_VRW(pr); _ADV_TM(5,pr); }; yl =x2; _A0V_TM(2,pr); y2 = BIND[pr]*LEN+x3; _ADV_TM(4,pr); C[pr][y1)[y2] = temp; _FETCH_LM(pr); _ADV_TM(5,pr); }; ADV_TM(5,pr); I* initializing temp */ /* loop initialization */ /* multiplication and add */ I * local mem access */ /* data buffer access */ /* loop increment */ /* assignment of y1 */ /* assignment of y2 */ /* storing result */ /* loop increment */ /* loop increment */ } copy_buffer(x) intx; { int X 1 ,x2x3; for(x1 =0; xt<P;xl++) for(x2=0; x2<LEN; x24+) { BUF[x ][x1][x2] = B[x][x1][x2]; } ; } /* initially copies matrix B to buffer */ L co o o power(x,n,pr) intx,n,pr; { int i,p; P=1; for (i=t; i <=n; i++) { p = p*x; } return(p); } routing (p,x,y) I * raise x to n-th power; n > 0 */ /* can be kept in a look-up table */ /* so no timing overhead */ /* determines routing links */ int p,x,y; { int i,j,k,pr1 ,step,s_node,d_node,tempsrc,tempdst,cj; step = 0; s_node = x; d_node = x; for(i=0; i<D; i++) if (d_node != y) { pr1=p; j = power(2,i,pr1); tempsrc = s node & j; tempdst = y & j; if (tempsrc != tempdst) { if (tempsrc == 0) d node = s_node | j; else { q = -i; d_node = s_node & cj; step += 1; FROUTE[p][s_node] = d_node; FROUTE_D[p](s_node] = i; BROUTE[p][d_node) = s_node; s node = d_node; > ; } ; } ; } mult(pn) int pn; { int temp, ni, j, k, src_blk_length, dest_blk_index; int z1, z2, z3, y1, y2, y3, msgjength, cpn, dst; createf'mult’ ); _ADV_TM(3,pn); 
copy_buffer(pn); BIND[pn] = pn; _ADV_TM(5,pn); compute(pn); _ADV_TM(8,pn); msg_length = P*LEN; _ADV_TM(7,pn); _ADV_TM(5,pn); /* computing the complement */ I* sets own identified block */ I* sets the identification */ /* initiating compute operation */ /* determining message length */ I* loop initialization */ for (ni= 1; ni < M ; ni++) { dst = pn A ni; _ADV_TMC(7,pn); I* determining the destination */ routing (pn,pn,dst); /* determines the routing path */ COMMUNICATE_CIRCUIT(pn,dst,msg_length); /* performs the communication */ for (z2 = 0; z2<P; z2++) { for (z3=0; z3<LEN; z3++) { BUF[dst][z2][z3] = B[pn][z2][z3]; } ; } ; BIND[dst] = pn; /* marks the block sent */ _ADV TMC(7,pn); /* overhead for marking the block *t SET_DATA RDY(dst); WAIT_DATA_RDY (pn); compute(pn); } END_TSK(pn); } simO { int xi; createfsim” ); csimJnitO; input_arrays(); init_cO; act = M ; for (xi=0; xi<M ; xi++) { trans(xi); mult(xi); } wait(g_done); REPORT; Matrix multiplication and transposition on OMP. Uses BLK_BROADCAST operation for writing data to OM and BLK_VECTOR_IN to read data from the OM. Allows matrix transpose and then multiplication. #define NCPUS 16 #define SIZE 256 #define CNTSZ 16 #include “ bus_sim2.h” ffdefine N NCPUS #define P SIZE #define LEN (P/N) FILE *fptr, *outptr; int A[N][LEN][P]; int B[N][LEN][P]; int BT[N][P][LEN1; int C[N1[LEN][P]; int OMD[P*LENJ[Nj[NJ; int VRWIN][P]; input_arraysO { int i,j,k; fptr = fopen fmatrix_A”, Y); for (i=0; i <N; i++) for (j=0; j <LEN; j++) for (k=0; k < P; k++) { fscanf(fptr, “ %d”, &A[i][j][k]); B T[i][k]G ] = A0G][k]; > fclose{fptr); ) init CO { “ int i,j,k; for (i=0; i <N; i++) for 0=0; j <LEN; ]++) for (k=0; k < P; k++) C[i][fl[k] = 0; trans(pn) int pn; { int i,j,k,H,l2,l3,lg,cnt; cnt=0; _ADV_TMS(1,pn); create(“pn"); SET_ROWACC(pn); 140 TSETCNT (cnt.N ,pn); Ig = LEN*LEN; BLK_VECTOR_OUT (pn ,lg); _ADV_TMC(10,pn); for (i=0; i < LEN; i++) { _ADV_TMC(lO,pn); for (j=0; j < P; j++) < 12 = j / LEN; 1 1 = (i*(P/N))+(j % LEN); for (k=0; k<N; k++) 0MD[l1][pn][l2] = BT [pn][fl[i]; } } ; SYNCH (cnt.pn); SET_COLACC(pn); lg=LEN*LEN; _ADV_TMC(3,pn); BLK_VECTOR IN(pn,lg); _ADV_TM C (1 oTpn); cnt = (cnt+1) % CNTSZ; TSETCNT (cnt,N ,pn); for (j=0; j<LEN; j++) { _ADV_TMC(10,pn); for (k=0; kcLEN; k++) { l1=(k*LEN)+j; _ADV_TMC(4,pn); for {(3=0; I3<N; I3++) { l2=(l3*LEN)+k; B[pn][fl[l2j = 0MD[l1][l3][pn]; SYNCH (cnt,pn); END TSK(pn); } /* number of vectors */ r writes data to the OM */ ! 
* index manipulator initialization */ /* copies data from matrix B to OM */ /* VRW initialization */ /* synchronizes the processors */ /* number of vectors */ /* multiplication overhead */ /* reads data from OM to B */ /* index manipulator initialization */ I* VRW index set initialization */ I* index calculation */ mult(pn) int pn; { int temp, i, j, k, II, 12,13, cnt.lg; cnt=0; _ADV_TMS(l,pn); createfmult'); SET_ROWACC(pn); cnt = (cnt+1) % CNTSZ; TSETCNT(cnt,N,pn); Ig = P*LEN; BLK_BROADCAST (pn ,lg); _ADV_TMC(10,pn); for (i=0; i < P; i++) { for (j=0; j < LEN; j++) { 1 1 = (i*LEN)+j; for (k=0; k<N; k++) OMD[l1][pn][k] = B [pn][j][i]; } } SYNCH (cnt.pn); SET_COLACC(pn); lg=P*LEN; BLK_VECTOR_IN(pn,lg); _ADV_TMC(10,pn); cnt = (cnt+1) % CNTSZ; TSETCNT (cnt.N ,pn); _ADV_TM(5,pn); for (j=0; j<P; j++) { _ADV_TMC(10,pn); for (k=0; k<LEN; k++) { l1=(j*LEN)+k; _ADV_TMC(4,pn); for (13=0; I3<N; I3++) { l2=(l3*LEN)+k; VRW[pn][l2] = OMD[l1][l3][pn]; } } for (i=0; i<LEN; i++) { temp = 0; _ADV_TM(1 ,pn); _ADV_TM(5,pn); for (11=0; 1 1 <P; 11++) { _FETCH_LM(pn); _FETCH_VRW(pn); _ADV_TM(2,pn); temp += A[pn][i][l 1 J*VRW[pn][l1 ]; ADV TM(5,pn); /* broadcasts data to the OM */ f* index manipulator initialization */ t* copies data from matrix B to OM */ I* synchronizes the processors * 1 I* reads data from OM to VRW */ /* index manipulator initialization */ /* VRW index set initialization */ /* index calculation */ /* var initialization */ I* loop initialization */ /* loop increment */ } _FETCH_LM(pn); C[pn][iJD] = temp; _ADV_TM(5,pn); } _ADV_TM(5,pn); > SVNCH(cnt,pn); END_TSK(pn); } simO { int i,i1; create(“ sim” ); csim_init(); input_arraysO; init_C; a d = N; for (i=0; i<N; i++) { trans(i); } ; wait(g_done); a d = N; for (i1 =0; i1<N; i1++) { mult(i1); } ; wait(g_done); REPORT; ^********************** *************** ********************* Matrix multiplication and transposition on COM. ---- #de1ine NCPUS 16 #define SIZE 256 #define CNTSZ 16 #include “ bus_sim2.h” #de1ine N NCPUS #de1ine P SIZE #define LEN (P/N) p loop increment */ FILE *fptr, *outptr; int A[N][LEN][P]; int B[N][LEN][P]; int BTtN][P)tLEN]; int C[N][LEN][P]; int OMD[P*LEN][N][N]; int VRW[N][P]; input arraysO { int i,j,k; fptr = fopen (“matrix_A”, “ r"); for (i=0; i <N; i++) for (j=0;j <LEN; j++) for <k=0; k < P; k++) { fscanf(fptr, “ %d”, & A [i]Q ][k]); BT[0[k]0] = A [O B ][k]; } fclose(fptr); } init C O { int i,j,k; for (i=0; i <N; i++) for 0=0; j <LEN; j++) for (k=0; k < P; k++) C[i][D[k] = 0; } trans(pn) intpn; int i,j,k,H,l2,l3,lg,cnt,rowb; cnt=0; _ADV_TMS(1,pn); createfpn” ); SET_ROWACC(pn); TSETCNT (cnt,N ,pn); Ig = LEN*LEN; _ADV_TMC(3,pn); BLK_VECTO R_0 U T(pn ,lg); _ADV_TMC(10,pn); for (i=0; i < LEN; i++) { _ADV_TMC(10,pn); for 0=0; j < P; j++) P number of vedors */ I* length determination *f I* writes data to the OM */ /* index manipulator initialization * 1 I* copies data from matrix B to OM */ /* VRW initialization * ! 
142 { 1 2 = j / LEN; 1 1 = (i*(P/N))+(j % LEN); for (k=0; k<N; k++) OMD[l1][pn][l2] = BT[pn][j][i]; ) } ; SYNCH (cnt,pn); SET_COLACC(pn); lg=LEN*LEN; TSETCNT (cnt,N ,pn); _ADV_TM(5,pn); for (i=0; i<N; i++) { CROSS_ROW_BLK_IN(pn,pn,pn,lg); } ; for (i=0; i<N; i++) _ADV_TMC(10,pn); TSETCNT (cnt.N ,pn); _ADV_TMC(5,pn); for (i=0; i<N; i++) { _ADV_TMC(5,pn); rowb=(i + pn) % N; _ADV_TMC(8,pn); for (k=0; k<lg; k++) { 1 1 =k / LEN; I2=k% LEN; 13= (rowb * LEN) + II; _ADV_TMC(8,pn); B[pn]ti2][l3] = OMD[k][rowb][pn]; _ADV_TMC(5,pn); } I ADV_TMC(5,pn); SYNCH (cnt,pn); END_TSK(pn); } /* synchronizes the processors */ /* number of vectors */ /* synchronization */ /* multiplication */ f holds time for BLKJN operation */ /* rest simulates data movement */ /* incorporating overheads for */ /* index manipulator initialization */ /* overhead for mod operation */ f* index calculation */ r mult(pn) int pn; { int temp, i, j, k, II, 12,13, cnt, I, rowb, mem,lg; cnt=0; _ADV_TMS(1,pn); create(“mult’ ); SET ROWACC(pn); TSETCNT (cnt,N ,pn); Ig = P*LEN; BLK_BRO ADCAST (pn.lg); _ADV_TMC(10,pn); for (i=0; i < P; i++) { for (j=0; j < LEN; j++) ( 1 1 = (i*LEN)+j; for (k=0; k<N; k++) OMD[l1][pn][k] = B [pn][j][i]; } } SYNCH(cnt.pn); SETCOLACC(pn); cnt = (cnt+1) % CNTSZ; _ADV_TMS(8,pn); Ig = P*LEN; for (i=0; i < N; i++) { CR O SSR O W_BLK_I N (pn ,pn ,pn ,lg); ADV TMC(10,pn); ) " TSETCNT (cnt.N ,pn); _ADV_TM(5,pn); for (j=0; j<P; j++) { _ADV_TM(5,pn); for (k=0; k<LEN; k++) { 1 1 =(j*LEN)+k; ADV TM(4,pn); _ADV_TM(5,pn); for (l=0; l<N; I++) { rowb=(l+pn) % N; _ADV_TM(8,pn); 1 2 = (rowb*LEN)+k; _ADV_TM(4,pn); VRW[pn][l2] = OMDpi][rowb][pn); _ADV_TM(5,pn); } ADV_TM(5,pn); } ADV TM(5,pn); j* broadcasts data to the OM * 1 f index manipulator initialization */ /* overhead for mod operation */ /* holds time for BLKJN operation */ I * index manipulator initialization */ I * index calculation */ ! * loop initialization */ /* overhead for mod operation */ I * overhead for indexing */ /* loop increment */ for (i=0; icUEN; i++) { temp = 0; _ADV_TM(1,pn); _ADV_TM(5,pn); for (11=0; 1 1 <P; 11++) { _FETCH_LM(pn); _FETCH_VRW(pn); _ADV_TM(2,pn); temp += A[pn][i][l1]*VRW[pn][l1]; _ADV_TM(5,pn); } _FETCH_LM(pn); C[pn][i]D l = temp; _ADV_TM(5,pn); > _ADV_TM(5,pn); } SYNCH (cnt.pn); ENDTSK(pn); > simO int i,il; createf'sim” ); csimJnitQ; input_arraysO; init_C; act = N ; for (i=0; i<N; i++) { trans(i); > : wait(g_done); act = N; for (il =0; i1<N; it++) { mult(i1); } ; wait(g_done); REPORT; /* var initialization */ I * loop initialization */ /* loop increment */ /* loop increment */ / * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * Shuffle permutes the rows of a (PxP) matrix on an M-processor hypercube using circuit switched link communication. Messages are sent using the pairwise algorithm proposed by Bokhari. 
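The destination address computed by shuffle_determine() in the listing that follows amounts to a one-bit left rotation of the log2(M)-bit processor number: the address is shifted up by one position and the old most significant bit wraps around into the least significant bit. A minimal standalone sketch of that address computation, assuming only that M is a power of two (the simulator macros and message passing are omitted):

#include <stdio.h>

#define M 16                        /* number of processors; must be a power of two */

/* Perfect-shuffle destination: rotate the log2(M)-bit address left by one bit. */
int shuffle_dest(int i)
{
    int msb  = i & (M / 2);         /* isolate the most significant address bit      */
    int dest = (i << 1) % M;        /* shift the remaining bits up by one position   */
    if (msb != 0)
        dest = dest | 1;            /* wrap the old MSB around into the LSB          */
    return dest;
}

int main(void)
{
    int i;
    for (i = 0; i < M; i++)
        printf("row block %2d -> processor %2d\n", i, shuffle_dest(i));
    return 0;
}

Because the rotation is a bijection on the address space, every processor has exactly one source and one destination, which is what lets the listing below exchange all row blocks with a single pairwise transfer per node.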
#detine NCPUS 1024 #detine D IM 10 #define SIZE 1024 #define CNTSZ 16 #include “ hyp_sim2.h” #define M NCPUS #define M 1 NCPUS/2 #define D D IM #define P SIZE #define LEN (P/M) FILE *fptr, *outptr; int A[M][L£N][P]; int BUF[M](LEN](P]; int FROUTE[M][M]; int FROUTE_D(M][M]; int BROUTE[M][M); int BIND[M]; int DESTJM]; int SRC[M]; input arraysO { int i,j,k; fptr = fopen (“matrix_A”, “ r'); for (i=0; i <M; i++) for (j=0; j <LEN; j++) for (k=0; k < P; k++) fscanf(fptr, “ %d”, & A[i][j][k]); } fcloseffptr); } print_matrix_R(T) int T[M][LEN][P]; { intxl ,x2,x3; for(x1=0; x1<M;x1 ++) for (x2=0; x2<LEN; x2++) 144 { for (x3=0; x3<P; x3++) printf f%8d”, T[x1][x2][x3]); prinlf{“ \n” ); printf On” ); } shuffledetermineO { int i,j, k, src, dest; for (i=0; i < M ; i++) { k = M/2; j = i & k ; dest = (i« 1) % M ; if (j != 0) { dest = dest | 1 ; > DESTfi] = dest; SRC[dest] = i; } /* determines the MSB */ /* multiplies the src address */ /* adjusts the LSB */ /* stores destination addresses */ /* stores source address */ shuffle(pn) int pn; { int i,j, 11,12, src, dst, msgjength; createfshuffle” ); m sgjength = LENT; dst = DESTjpn); _ADV_TM(6,pn); /* overhead for initialization */ if (dst T= pn) { routing(pn,pn,dst); /* determining routing path */ COMMUNICATE_CIRCUIT(pn,dst,msgjength); I* performs the communication */ for (i=0; icLEN; i++) /* copies data from matrix A to Buffer */ for (j=0 ; j<P; j++) BUF[dst][i][0 = A[pn][i]lj; SET_DATA_RDY (dst); WAIT_DATA_RDY(pn); for (i=0; i<LEN; i++) for (j=0; j < P; j++) A[pn][i][j] = BUF[pn][i][j]; } END_TSK(pn); } /* raise x to n-th power; n > 0 /* initializing p */ /*for loop initialization */ /* loop inaement */ I* returning the value */ /* determines routing links */ int p,x,y; { int i,j,k,pr1,step,s_node,d_node,tempsrc,tempdst,cj; step = 0; s_node = x; d_node = x; for(i=0; i<D ; i++) { if (d_node != y ) { p n = p ; j = power(2,i,pr1); tempsrc = s_node & j; tempdst = y & j; if (tempsrc != tempdst) { if (tempsrc == 0) d_node = s_node | j; else { cj = -j; d_node = s_node & cj; } ; step+= 1; FROUTE[p][s_node] = d_node; FROUTE_D[p](s_node] = i; BROUTE[p][d_nodeJ = s_node; s_node = djtode; } ; power(x,n,pr) int x,n,pr; { int i,p; p=f; _ADV_TM(2,pr); _ADV_TM(5,pr); for (i=l; i <=n; i++) { P = p*x; _ADV_TM(5,pr); } _ADV_TM(3,pr); return(p); routing(p,x,y) 145 > ; } simO { int xi; create(“ sim"); csim_initO; input_arrays(); shuffle_determine(); act = M ; for (xi=0; xi<M; xi++) { shuffle(xi); } wait(g_done); REPORT; Shuffle permutes the rows of a (P xP) matrix on an N-processor OMP and CCM U ses BLK_SCALAR_OUT operation for writing data to OM and BLK_SCALAR_IN to read data from the OM. 
Performs a perfect shuffle on P/N rows associated with the processors #define NCPUS 16 #define SIZE 512 ffinclude “ bus_sim4.h” #define N NCPUS #define P SIZE ffdefine LEN (P/N) FILE *fptr, *oulptr; int A[N][LEN][PJ; int OMD[P*LEN][N][N]; int VRW(N][P]; int DESTIN]; int SRC[N]; input_arrays() { int i,j,k; fptr = fopen (“matrix_A”, “ r" ); for (i=0; i <N; i++) for (j=0; j <LEN; j++) for (k=0; k < P; k++) { fscanf(fptr, “ % d", & A[i][j][k]); } fclose(fptr); } shuffle_determine() < int i,j, k, src, dest; for (i=0; i < N; i++) { k = N/2; j = i & k ; dest = (i« 1) % N; * 0 1 = O J ( dest = dest 11; } OESTfi] = dest; SRC[dest] = i; } shuffle (pn) int pn; { int i,j,k,H ,l2,l3,lg,cnt,dst,src; cnt=0; _ADV_TMS(l,pn); createfshuffle’ ); SET_ROWACC(pn); TSETCNT(cnt,N ,pn); Ig = LEN*P; dst = DESTlpn); _ADV_TM(14,pn); if (dst != pn) { BLK_SCALAR_OUT(pn,dst,l } _ADV_TMC(10,pn); for (i=0; i < LEN; i++) { for (j=0; j < P; j++) { /* determines the MSB */ /* multiplies the src address */ /* adjusts the LSB */ I * stores destination addresses */ I * stores source address */ /* number of words */ I* notes down the destination */ /* PE number */ r overhead for initialization */ /* writes data to the OM */ /* index manipulator initialization */ /* copies data from matrix A to OM */ 1 2 = dst; 1 1 = (i*P)+j; 0MD[l1][pn][l2] = A [pn][i][j]; } } SYNCH (cnt,pn); SET_COLACC(pn); TSETCNT (cnt,N,pn); lg=P; src = SRC[pn]; _ADV_TM(6,pn); if (dst != pn) { for (i=0; i <LEN; i++) { _ADV_TMC(10,pn); BLK_SCALAR_IN (pn,src,lg); for (j=0; j<P; j++) { 1 1 = (i*P)+j; VRW[pn)[j] = 0MD[l1][src][pn); ) for (j=0; j<P; j++) < A[pn][i][j] = VRW[pn]Q]; ) } } SYNCH (cnt,pn); ENDTSK(pn); > simO { int i; createfsim” ); csim_initO; input_arraysO; shuffle_determine(); act = N; for (i=0; i<N; i++) { shuffle(i); } ; wait(g done); REPORT; I * synchronizes the processors */ I * number of words */ /* notes down the source PE number */ I * overhead for assignment * 1 /* VRW initialization */ I * reads data from OM */ /* copies data from OM to VRWs */ I* copies data from VRW to matrix */ Histogramming on a hypercube using circuit switched link communication. Histogramming is performed on a SIZE x SIZE image data with variable GREY levels LEVELS. M essages are sent using the pairwise algorithm proposed by Bokhari. 
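In the hypercube histogramming version that follows, each node first histograms its own LEN x P slice of the image, and the partial counts are then combined dimension by dimension so that after D pairwise steps a single node holds the global histogram. A minimal serial sketch of that combining pattern; this is a sketch only, merging toward node 0 for concreteness (the listing's own pairing accumulates toward a different node), with the circuit-switched transfers replaced by direct array additions:

#include <stdio.h>

#define M 16                          /* number of hypercube nodes    */
#define D 4                           /* hypercube dimension, M = 2^D */
#define B 8                           /* number of grey levels (bins) */

int H[M][B];                          /* per-node partial histograms  */

/* Combine partial histograms dimension by dimension: in step d every   */
/* node whose bit d is set (and whose lower bits are clear) adds its    */
/* counts into the partner with bit d cleared, so after D steps node 0  */
/* holds the global histogram.                                          */
void reduce_histograms(void)
{
    int d, node, partner, b;
    for (d = 0; d < D; d++)
        for (node = 0; node < M; node++)
            if ((node & (1 << d)) != 0 && (node & ((1 << d) - 1)) == 0) {
                partner = node ^ (1 << d);        /* neighbour across dimension d */
                for (b = 0; b < B; b++)
                    H[partner][b] += H[node][b];  /* merge counts into the partner */
            }
}

int main(void)
{
    int node, b;
    for (node = 0; node < M; node++)              /* fake local histograms: one   */
        for (b = 0; b < B; b++)                   /* count per bin on every node  */
            H[node][b] = 1;
    reduce_histograms();
    for (b = 0; b < B; b++)                       /* node 0 now holds M counts/bin */
        printf("bin %d: %d\n", b, H[0][b]);
    return 0;
}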
#define NCPUS 16 #define DIM 4 #define SIZE 1024 ffdefine LEVELS 1024 #define CNTSZ 16 #include “ hyp_sim3.h” #define M NCPUS #define M 1 NCPUS/2 #define D D IM #define P SIZE #define B LEVELS #define LEN (P/M) FILE *fptr, *outptr; int A[M][LEN][P]; int H[M][B]; int BUF[M][BJ; int FROUTE(M)[M); int FROUTE D[M ][M ]; int BROUTE[MJ[M]; int BIND[M]; input_arrays() int i,j,k; fptr = fopen (“ matrix_A”, Y); for (i=0; i <M; i++) for (j=0; j <LEN; j++) for (k=0; k < P; k++) { fscanf(fptr, “ %d”, & A [i][D [k]); > fdose(fptr); } in'rt_H0 { int i,j,k; for (i=0; i <M; i++) for (j=0; j <B; j++) histg(pn) int pn; { int temp,i,j, m sgjength, di, index, oindex, nindex, prl, src, dst, tpn; int finish; createfhistg” ); _ADV_TM(5,pn); for (i=0; i<B; i++) { H[pn][i] = 0; _ADV_TM{7,pn); _ADV_TM(5,pn); } _ADV_TM(5,pn); for (i=0; i<LEN; i++) { _ADV_TM(5,pn); for (j=0; j<P; j++) { _ADV_TM(5,pn); temp = A[pn][i](B; if (temp < B) H[pn][temp] +=1; else H[pn][B-1]+=1; _ADV_TM(25,pn); _ADV TM(5,pn); } _ADV_TM (5,pn); } prl = pn; oindex = 0; m sgjength = B; di = 0; finish = 0; _ADV_TM(7,pn); _ADV_TM(5,pn); for (di=0; di<D; di++) { _ADV_TM(3,pn); if (finish == 0) { nindex = power(2,di,pr1); index = oindex + nindex; tpn = pn & index; I* loop initialization */ /* initilaizes histogram values */ I * overhead for initialization */ /* loop condition checking */ /* loop initialization */ I * loop initialization */ I * loop initialization */ /* histogram on local data */ I* overhead for local histogramming */ I * loop exiting */ /* loop exiting */ I* initializes the variables */ /* overhead for initializing the variables */ I* for loop initialization * 1 /* condition check */ /* determine the power */ _ADV_TM(14,pn); _ADV_TM(4,pn); if (tpn == oindex) { dst = pn + nindex; _ADV_TM(4,pn); routing(pn,pn,dst); ADV_TM(3,pn); if (di != 0) { WAIT_DATA_RDY(0,pn); } COMMUNICATE_CIRCUIT(pn,dst,msg finish = 1; _ADV TM(2,pn); SET_DATA_RDY(1 ,dst); _ADV_TM(2,pn); ) else { _ADV_TM(4,pn); if (tpn == index) { WAIT_DATA_RDY(1,pn); src = pn - nindex; _ADVJM(4,pn); _ADV_TM(5,pn); for (i=0; i<B; i++) ( H(pn)[i) += H[src][i]; _ADV_TM(1S,pn); } SET_DATA RDY(0,pn); } } index = index; _ADV_TM(2,pn); } END_TSK(pn); t * overhead for above 3 instructions */ /* condition check */ /* determines the destination */ I * overhead */ I * determining routing path */ I * overhead for condition check */ I * wait for receiving the message */ length); /* performs the communication * 1 I* the processor’ s job is over */ I* overhead for setting the flag */ /* condition check */ I* condition check */ I* determines the source */ I* overhead for determining the source */ /* loop initialization * 1 I* adding up the counts */ I* overhead */ I* maintains the old index */ I* overhead */ power(x,n,pr) int x,n,pr; { inti.p; p=f; _ADV_TM(2,pr); I* raise x to n-th power; n > 0 */ /* initializing p */ 148 _ADV_TM(5,pr); for (i=1; i <=n; i++) < p = p*x; _ADV_TM(5,pr); } _ADV_TM(3,pr); return(p); routing (p,x,y) int p,x,y; i int i,j,k,pr1,step,s_node,d_node,tempsrc,tempdst,cj; step = 0; s_node = x; d_node = x; for(i=0; i<D; i++) { if (d_node != y) ( pn =p; j = power(2,i,pr1); tempsrc = s_node & j; tempdst = y & j; if (tempsrc != tempdst) { if (tempsrc == 0) d_node = s_node | j; else { cj = -j; d_node = s_node & cj; } ; step += 1; FRO UTE[p][s_node] = d_node; FRO UTE_D[pHs_node] = i; BROUTE[p][d_node] = s_node; s_node = d_node; } ; } ; > ; } simO { int xi; create fsim ” ); /* for loop initialization */ /* loop increment */ I * returning the value */ /* determines routing links 
csim_inH0; input_arraysO; init_H(); act = M ; for (xi=0; xi<M; xi++) { histg(xi); } wait(g_done); REPORT; ^ .a a a a . , . . ^ . ******** *.*, . . **. .*.*............. Histogramming on OMP. U ses BLK_BROADCAST operation for writing data to OM and BLK_VECTOR_IN to read data from the OM. #define NCPUS 32 #define SIZE 1024 #define LEVELS 1024 #detine CNTSZ 16 #include “ bus_sim3.h” #define N NCPUS #define P SIZE #define B LEVELS ffdefine LEN (P/N) FILE *fptr, *outptr; int A[N][LEN][P]; int H[N][B]; int OMD[B][N][N]; int VRW[N](B][N]; input_arraysO { int i,j,k; fptr = fopen (“matrix_A”, Y); for (i=0; i <N; i++) for (j=0; j <LEN; j++) for (k=0; k < P; k++) { fscanf(fptr, "% d”, & A[ijfil[k]); } fclose(fptr); 149 init_HO { int i,j,k; for (i=0; i <N; i++) for (j=0; j <B; j++) H [i]fi] = 0; } histg(pn) int pn; { int temp, i, j, cnt, length; cnt =0; _ADV_TMS(1,pn); createfhistg’ } ; ADV_TM(5,pn); for (i=0; i<B; i++) { H[Pn][i] = 0; _ADV_TM(7,pn); _ADV_TM(5,pn); } _ADV_TM(5,pn); for (i=0; i<LEN; i++) { _ADV_TM(5,pn); for (j=0; j<P; j++) { _ADV_TM(5,pn); temp = A[pn][i][fl; if (temp < B) H[pn][temp] +=1; else H[pn][B-1] +=1; _ADV_TM(25,pn); _ADV_TM(5,pn); } _ADV_TM(5,pn); > S ET_R0WACC (pn); cnt = (cnt+1) % CNTSZ; _ADV_TMS(8,pn); TSETCNT (cnt,N,pn); length = B; _ADV_TMC(2,pn); BLK_SCALAR_OUT(pn,0, length); P loop initialization */ p initilaizes histogram values */ P overhead for initialization */ p loop condition checking */ p loop initialization */ p loop initialization */ P loop initialization */ p histogram on local data */ p overhead for local histogramming */ P loop exiting */ P loop exiting */ P overhead for mod operation */ P length assignment */ P writes a block of data into a single */ _ADV_TMC(10,pn); for (i=0; i<B; i++) OMDp][pn][0] = H [pn][0; SYNCH(cnt.pn); SET_COLACC(pn); cnt = (cnt+1) % CNTSZ; _ADV_TMS(8,pn); TSETCNT(cnt,N,pn); length=B; _ADV_TMC(2,pn); _ADV_TM(3,pn); if (pn == 0) ( BLK_VECTOR_IN(pn,length); OM*/ _ADV_TMC(10,pn); for (i=0; i<B; i++) for (j=0;j<N; j++) VRW[pn][i][j] = 0MD[i)[j][0]; _ADV_TM(5,pn); for (i=0; i<B; i++) { _ADV_TM(5,pn); for(j=1;j<N; j++) { _ADV_TM(5,pn); temp = VRW [pn][0QJ; _FETCH_LM(pn); _FETCH_VRW(pn); H(pn][i] +=temp; _ADV_TM(21 ,pn); ~ADV TM(5,pn); _ADV_TM(5,pn); } ADV_TM(3,pn); } " SYNCH(cnt,pn); END TSK(pn); } simO ( int i,i1; createfsim’ ); csimjnitO; input_arrays(); P memory module in OM */ p index manipulator initialization */ p copies local data to OM */ P synchronizes the processors */ p overhead for mod operation */ p number of vectors */ p length assignment */ p condition check */ p reads partial histogram results from p index manipulator initialization */ p loop initialization */ P loop initialization */ p computes global histogramming */ p overhead for computation */ p loop exiting */ P condition check */ 150 init_H; act = N; for (i=0; i<N; i++) { histg(i); } ; wait(g done); REPORT; * ******** ****************************** A * * * * * * * * Histogramming on COM Uses BLK_BROADCAST operation for writing data to OM and BLKVECTORJN to read data from the OM. . 
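In the bus-based versions (the OMP listing above and the CCM listing that follows), each processor writes its B-entry partial histogram into the orthogonal memory and a single processor then reads all N partial histograms back and adds them up. A minimal serial sketch of that final merge; this is a sketch only, with the BLK_SCALAR_OUT / BLK_VECTOR_IN block transfers modelled as ordinary array copies and the OM reduced to one plane:

#include <stdio.h>

#define N 4                           /* number of processors               */
#define B 8                           /* number of grey levels (bins)       */

int Hpart[N][B];                      /* partial histograms, one per PE     */
int OMD[B][N];                        /* one OM column per contributing PE  */
int Hglob[B];                         /* global histogram built by one PE   */

int main(void)
{
    int pe, b;

    for (pe = 0; pe < N; pe++)        /* fake partial counts                */
        for (b = 0; b < B; b++)
            Hpart[pe][b] = pe + 1;

    for (pe = 0; pe < N; pe++)        /* each PE deposits its counts into   */
        for (b = 0; b < B; b++)       /* its own column of the OM           */
            OMD[b][pe] = Hpart[pe][b];

    for (b = 0; b < B; b++) {         /* one PE reads the columns back and  */
        Hglob[b] = 0;                 /* accumulates the global histogram   */
        for (pe = 0; pe < N; pe++)
            Hglob[b] += OMD[b][pe];
        printf("bin %d: %d\n", b, Hglob[b]);
    }
    return 0;
}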
#define NCPUS 8 #define SIZE 1024 #define LEVELS 1024 #define CNTSZ 1 G #include “ bus_sim3.h" # define N NCPUS #define P SIZE #define B LEVELS #define LEN (P/N) FILE *fptr, ‘outptr; int A[N][LEN](P]; int H(N][B]; int OMD[B][N][N]; int VRW[N)[B][N]; inputarraysO { int i,j,k; fptr = fopen (“ matrix_A", Y ); for (i=0; i <N; i++) for (j=0; j <LEN; j++) for (k=0; k < P; k++) { fscanf(fptr, “ %d”, & A[0(D[k]); } fclose(fptr); } init_H0 int i,j,k; for (i=0; i <N; i++) for (j=0; j <B; j++) H P lO l = 0; } histg(pn) int pn; { int temp, i, j, cnt, length, bus; cnt =0; _ADV_TMS(1,pn); createfhistg'); _ADV_TM(5,pn); for (i=0; i<B; i++) H[pn][i] = 0; ADV_TM(7,pn); ~ADV_TM(5,pn); } _ADV_TM(5,pn); for (i=0; i<LEN; i++) { _ADV_TM(5,pn); for (j=0; j<P; j++) { _ADV_TM(5,pn); temp = A(pn][i][j|; if (temp < B) H[pn][temp] +=1; else H[pn][B-1]+=1; _ADV_TM(25,pn); _ADV_TM(5,pn); } _ADV_TM(5,pn); } SET_ROWACC(pn); cnt = (cnt+1) % CNTSZ; ADV TMS(8,pn); TSETCNT (cnt,N ,pn); length = B; _ADV_TMC(2,pn); BLK_SCALAR_OUT(pn,0,length); /* loop initialization */ /* initilaizes histogram values */ /* overhead for initialization */ /* loop condition checking */ /* toop initialization */ /* loop initialization */ I* loop initialization */ /* histogram on local data */ /* overhead for local histogramming */ I* loop exiting */ /* loop exiting */ /* overhead for mod operation */ I* length assignment * ! I* writes a block of data into a single 7 /* memory module in OM 7 _ADV_TMC(10,pn); for (i=0; i<B; i++) OMDp][pn][0] = H [pn][i]; SYNCH(cnt,pn); SET_COLACC (pn); cnt = (cnt+1) % CNTSZ; ADV TMS(8,pn); TSETCNT(cnt,N,pn); _ADV_TM(3,pn); if (pn == 0) ( _ADV_TMC(5,pn); for(bus=0; bus < N; bus++) { CROSS_ROW_BLK_IN(pn,bus,0,length); _ADV_TMC(10,pn); ADV TMC(5,pn); f for (i=0; i<B; i++) for (j=0;j<N; j++) VRW[pn][i)fi] = O M D [i](a[0); _ADV_TM(5,pn); for (i=0; i<B; i++) _ADV_TM(5,pn); for(j=1; j<N; j++) { _ADV_TM(5,pn); temp = VRW[pn][i][j]; _FETCH_LM(pn); _FETCH_VRW(pn); H[pn][i] +=temp; _ADV_TM(21,pn); _ADV_TM(5,pn); } _ADV TM(5,pn); } ADV_TM(3,pn); } SYNCH (cnt.pn); END_TSK(pn); /* index manipulator initialization */ /* copies local data to OM 7 /* synchronizes the processors * 1 /* overhead for mod operation */ /* condition check */ /* loop initialization */ /* reads partial histogram results */ /*from OM */ /* index manipulator initialization */ ! * loop exiting */ /* loop initialization */ /* loop initialization */ /* computes global histogramming */ /* overhead for computation */ /* loop exiting */ /* condition check */ simO { createfsim1 ) ; csim_inilO; input_arraysQ; init_H; act = N; for (i=0; i<N; i++) ( histg(i); } ; wait(g_done); . REPORT; } . Performs Gaussian Elimination and Back Substitution of P linear equations on an N-processor Hypercube. P rows are distributed across the processors in column-wrapped-mapping manner. Uses one-to-all broadcast (tree-structure) communication using circuit switched link communication. Messages are sent using the pairwise algorithm proposed by Bokhari. 
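The one-to-all broadcast of the pivot row in the hypercube Gaussian-elimination listing below proceeds dimension by dimension: in each step every node that already holds the data forwards it to the node obtained by complementing one more address bit, so all M nodes are covered after D = log2(M) steps. A minimal serial sketch of one common realization of this pattern, a binomial tree rooted at node 0; this is a sketch only, with the circuit-switched transfer modelled as setting a flag:

#include <stdio.h>

#define M 16                              /* number of hypercube nodes */
#define D 4                               /* dimension, M = 2^D        */

int has_data[M];                          /* 1 once a node holds the pivot row */

/* Binomial-tree broadcast from node 0: after step b, every node whose  */
/* address uses only bits 0..b holds the data.                          */
void broadcast_from_zero(void)
{
    int b, node, partner;
    has_data[0] = 1;
    for (b = 0; b < D; b++)
        for (node = 0; node < M; node++)
            if (has_data[node] && (node & (1 << b)) == 0) {
                partner = node ^ (1 << b);       /* neighbour across dimension b */
                has_data[partner] = 1;           /* partner now holds the data   */
            }
}

int main(void)
{
    int node;
    broadcast_from_zero();
    for (node = 0; node < M; node++)
        printf("node %2d: %s\n", node, has_data[node] ? "covered" : "missed");
    return 0;
}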
#define NCPUS 1024 #define DIM 10 #define SIZE 1024 #define CNTSZ 16 #include “ hyp_sim2.h” #define M NCPUS #define M 1 NCPUS/2 #define D DIM #define P SIZE #define LEN (P/M) FILE *fptr, *outptr; float A[M] [LEN) [P]; float BUF[M][P+2]; float B[M][LEN); float X[M][LEN]; 1loat RB[M][LEN); 1loat PSUM[M][LEN]; int FROUTE[M][M]; int FROUTE D[M][M]; int BROUTE(M](M); 152 int BIND [M ]; int BRDCNT[M ]; input_arraysfl { int i,j,k; fptr = fopen (“ matrixA”, “ r” ); for (i=0; i <LEN; i++) for (j=0; j <M; j++) for (k=0; k c P; k++) { fscanf(fptr, "%f, & A[j][i][kl); } for (i=0; i<M; i++) for (j=0;j<LEN; j++) { k = (j*M )+i; Bti][fl = A[0][o][k]; } fclose{fptr); } init_X0 { int i,j; for (i=0; i < M ; i++) for (j=0; j <LEN; j++) { X f0(0 = 0.0; PSUM [i]G] = 0.0; } } matrix_array_mul() int i,j,k,pe,ind; float temp; for (i=0; i < LEN; i++) for 0=0; j < M ; j++) temp = 0.0; for (k=0; k < P; k++) { ind= k/M; pe = k% M ; temp += AG]ti][k]*X[pe][ind]; RB[j][i] = temp; } } gauss (pn) int pn; { int i,j,k,kl ,1 1 ,l2,tl,tpn,length,src,cnt; int bent,brdindex,next; float temp,tempi; cnt=0; tpn=pn; _ADV_TMS(2,pn); create(“ gauss” ); _ADV_TM(5,pn); for (i=0; i < LEN; i++) _ADV_TM(5,pn); for (j=0; j < M ; j++) k = (i*M ) + j; BRDCNT[pn] = 0; _ADV_TM(5,pn); _ADV TM(4,pn); if (k < P) { _ADV_TM(4,pn); if (tpn == j) { k = (i*M ) + j; temp = A[tpn](i][k); _ADV_TM(l6,pn); ~ADV~TM(7,pn); _ADV_TM(4,pn); if (k != 0) { WAIT DATA_RDY(1,tpn); } _ADV_TM(4,pn); if (temp != 0) { count_start(tpn); length"= P-k+2; _ADV_TMG(5,pn); _ADV_TMC(5,pn); ~ADV_TMC(6,pn); while (BRDCNT[tpn] < DIM ) { /* initialization overhead * 1 /* initialization overhead */ /* initializes broadcast count */ t* overhead *( ! * condition check overhead */ r condition check overhead */ /* select the pivot element */ I * computational overhead */ I * additional overhead */ I * condition check */ r condition check */ /* start measuring comm time */ I* assignment overhead */ /* initialization overhead */ I * while condition checking */ 153 bent = BRDCNTflpn]; _ADV_TMC(5,pn); brdindex = power (2, bent, tpn); tl = tpnA brdindex; _ADV_TMC(4,pn); routing (pn.pn.tl); COMMUNICATE_CIRCUIT(pn,t1,l BUF[t1][0) = temp; for (I2=k; I2<P; I2++) BUF[t1][l2+1] = A[tpn][i][l2); BUF[t1][P+1] = B[tpn][i]; bent += 1; BRDCNT[tpn] = bent; BRDCNT[tl] = bent; _ADV_TMC(13,pn); SET_DATA_RDY(0,t1); } count_end(tpn); } _ADV_TM(5,pn); for (I1=i+1; I1<LEN; 11++) { tempi = A[tpn][l1][k]Aemp; _ADV_TM(16,pn); _ADV_TNI (5,pn); for (I2=k; I2<P; I2++) { A[tpn]pi][l2] = A[tpn][l1][l2l - (tempi * _ADV_TM(38,pn); _ADV_TM(5,pn); } B[tpn][M] = B[tpn)[l1) - (tempi *B[tpn][i]) _ADV_TM(28,pn); ADV_TM(5,pn); f for (tl=0;tl<M;t1++) { if (t1 != tpn) { SET_DATA_RDY(2,t1); } } next = (j +1) % M ; if ( k < P-1) { /* assignment overhead */ /* determining destn overhead */ /* determine routing path */ I; /* performs the communication */ /* broadcasts temp, the A[pn][i]th row, */ /*and B[pn](i]; one-to-all broadcast */ /* copies temp */ /* copies the row vector to OM */ /* copies element from B */ /* increment broadcast count */ /* cnt maintaining overhead */ /* stops counting */ I* assignment overhead */ /* initialization overhead */ A[tpn][i]P2]); /* performs elimination */ r on its own block */ /* multiply and sub overhead */ r loop exit */ /* assignment of B overhead */ I* loop exit */ /* releases the processors from wait loop */ /* assignment overhead */ I* condition checking */ /* assignment overhead */ I* determine routing path */ SET_DATA_R D Y(1 ,next); } ) else ( W AI T_D ATAR DY(O.tpn); length = P-k+2; 
ADV_TMC(2,pn); ~ADV_TMC(6,pn); while (BRDCNT[tpn] < D IM ) ( bcnt=BRDCNT[tpn]; _ADV_TMC(5,pn); brdindex = power(2,bent,tpn); H = tpn A brdindex; _ADV_TMC(4,pn); routing (tpn,tpn,t1); COMMUNICATE_CIRCUIT(pn,t1 .length); BUF[t1][0] = BUF[tpn][0]; for (I2=k; I2<P+1; 12++) BUF(t1][l2+1J = BUF[tpn][l2+1]; bent +=1; BRDCNT[tpn] = bent; BRDCNT(t1] = bent; _ADV_TMC(13,pn); SET D A T A RDY(0,t1); } temp = BUF[pn][0]; _ADV_TM(8,pn); _ADV_TM(5,pn); for (11=0; 1 1 <LEN; 11++) { kl= (I1*M ) + tpn; _ADV_TM(5,pn); ADV_TM(5,pn); if (k1 > k ) { tempi = A[pn][l1][k]/temp; ADV_TM(l6,pn); ~ADV_TM(5,pn); for (I2=k; I2<P; I2++) { A[pn][l1][l2] = A[pn][l1][l2] - (tempi* BUF[pn][l2+1]); /* performs elimination */ _ADV_TM (38,pn); /* computation overhead */ _ADV_TM (5 ,pn); I* loop exiting */ } B[pn][l1] = B[pn](l1] - (tempi *BUF[pn][P+1]); _ADV_TM(28,pn); I* determining B overhead */ /* increment broadcast count */ /* cnt maintaining overhead */ r receives temp */ /* assignment overhead */ /* initialization overhead */ /* assignment overhead */ /* condition check overhead */ /*tempi initialization*/ /* loop initialization */ } _ADV_TM(5,pn); } WAIT DATA_RDY(2,tpn); } } _ADV_TM(5,pn); } _ADV_TM(5,pn); /* loop exiting */ } END_TSK(pn); } backsubst(pn) int pn; < int i,j,k,H ,l2,cnt,k1,tpn,length,t1; int bent,brdindex,next; float temp, tem pi, x_update; tpn=pn; _ADV_TMS(1,pn); create(“back_subst"); _ADV_TM(5,pn); for (i=LEN-1; i>=0; i-) { _ADV_TM(5,pn); for (j=M-1; j>=0; j-) { k = (i*M ) + j; BRDCNT[pn] = 0; _ADV_TM(5,pn); _ADV_TM(4,pn); if (tp n = = i) { _ADV_TM(4,pn); if (k == P-1) { X [D [i] = B[j]P]/Afi][i][k]; temp = X[j][i]; ADV_TM(36,pn); f else { tempi = B[tpn][i] - PSUM[tpn][i]; temp = tempi / A[tpn][i][k]; X Q ][i] = temp; ADV_TM(43,pn); r /* loop exiting */ I* wait condition for the loop * 1 /* loop exiting */ /* loop initialization */ /* loop initialization */ /* initialization overhead */ /* condition check */ /* condition check * / /* determines last element */ /* data to be broadcasted */ I* computational overhead */ /* computational overhead */ _ADV_TM(4,pn); if (k != P-1) ( W AIT_D ATAR D Y ( 1 ,tpn); } length =1; count_start(tpn); _ADV_TMC(2,pn); ADV TMC(S,pn); while (BRDCNT[tpn) < DIM ) ( bent = BRDCNT[tpn]; _ADV_TMC(5,pn); brdindex = power (2, bent, tpn); t1 = tpnA brdindex ; _ADV_TMC(4,pn); routing{pn,pn,t1); COMMUNICATE_CIRCUIT(pn,t1 .length) BUF[t1][0] = temp; bent += 1; BRDCNT[tpn] = bent; BRDCNT[t1] = bent; _ADV_TWC(13,pn); SET_DATA_RDY (0,t1); } count_end(tpn); _ADV_TM(5,pn); for (II ~ = 0; 1 1 < LEN; 11++) { k1 = (I1*M ) + tpn; _ADV_TM(5,pn); _ADV_TM(5,pn); if (kl < k ) { PSUM[pn][l1] += A[pn][h][k] ‘ temp; _ADV_TM(29,pn); _ADV TM(5,pn); for (t1=0;t1<M;t1++) { if (t1 != tpn) { SET DATA_RDY(2,t1); } /* condition check */ r assignment overhead */ /* while condition checking */ /* assignment overhead */ I* determining destn overhead */ /* determine routing path */ I* performs the communication */ /* broadcasts temp, the A[pn][i]th row, */ I* and B[pn][i]; one-to-all broadcast */ I* copies temp */ /* increment broadcast count */ I * cnt maintaining overhead */ /* stops count */ f loop initialization overhead */ /* computational overhead */ I * condition check */ /* computes partial sum for other rows*/ /* within that block */ /* computational overhead */ /* loop exiting */ /* releases all processors from wait-loop */ 155 } ■ 0 = o ) i next = M-1; } else { next = j-1; > if (k != 0) { SET_DATA_RDY(1 ,next); } } else { WAIT_DATA_RDY(0,pn); length = 1; _ADV TMC(2,pn); _ADVJMC(6,pn); while 
(BRDCNTttpn] < DIM) { bcnt=BRDCNT[tpn]; _ADV_TMC(5,pn); brdindex = power(2,bent,tpn); t1 = tpn A brdindex; _ADV_TMC(4,pn); routing (tpn,tpn,t1); COMMUNICATE_CIRCUIT(pn,t1 .length); BUF[t1][0] = BUF[tpn][0]; bent +=1; BRDCNT[tpn] = bent; BRDCNT[t1] = bent; _ADV_TMC (13,pn); SET_DATA_RDY(0,t1); } x_update = BUF[pn](0]; _ADV_TM(8,pn); _ADVTM(5,pn); for (11=0; I1<LEN; 11++) { k1 = (I1*M ) +tpn; _ADV_TM(5,pn); _ADV_TM(5,pn); if (k1 < k ) { PSUM[pn][l1] += A[pn][l1][k] * x_update; _ADV_TM(29,pn); /* assignment overhead */ /* while condition checking */ /* assignment overhead */ I* determining destn overhead */ /*cnt maintaining overhead */ /* receives data from Buffer */ r overhead for update */ I* loop initialization */ /* computational overhead */ I* condition check *f /* summation overhead */ > I _ADV_TM (5,pn); I * loop exiting */ } WAIT_DATA_RDY(2,pn); /* wait for processors to synchronize */ } _ADV_TM(5,pn); /* loop exiting */ } _ADV_TM (5,pn); I* loop exiting */ } END_TSK(pn); } power(x,n,pr) I* raise x to n-th power; n > 0 */ intx,n,pr; ( int i,p; p=f; for (i=1; i <=n; i++) { p = p*x; } return(p); routing(p,x,y) /* determines routing links */ int p,x,y; { int i,j,k,pr1,step,s_node,d_node,tempsrc,tempdst,cj; step = 0; s_node = x; d_node = x; for(i=0; i<D; i++) { if (d node != y) f prl =p; j = power(2,i,pr1); tempsrc = s_node & j; tempdst = y & j; if (tempsrc != tempdst) { if (tempsrc == 0) d_node = s_node | j; else { cj = -j; d_node = s_node & cj; } ; step += 1; FROUTE[p][s_node] = d_node; FROUTE_D[p][s_nodeJ = i; BROUTE[p][d_node] = sn o d e ; s_node = dnode; h } ; } ; > simO { int xi; create (“ sim” ); csimjnitO; input_arraysQ; init_X0; act = M ; foi (xi=0; xi<M ; xi++) { auss(xi); } wait(g_done); act = M ; for (xi=0; xi<M; xi++) { back_subst(xi); } wait(g_done); REPORT; hyp_sum_com_report(); matiix_anay_mul 0; Performs Gaussian Elimination and Back Substitution of P linear equations on an N-processor OMP. P rows are distributed across the processors in a column-wrapped-mapping manner. U ses one-to-all broadcast communication. U ses BLK_BROADCAST operation for writing data to OM and BLK_SCALAR IN to read data from the OM. 
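In the OMP Gaussian-elimination listing that follows, global row k of the system resides on processor k mod N at local index k / N (the column-wrapped mapping), and after each pivot row is broadcast every processor eliminates that pivot from whichever of its local rows lie below the pivot. A minimal serial sketch of one elimination step under this mapping; this is a sketch only, with the pivot row arriving in a plain array rather than through the orthogonal memory and with no pivoting for stability:

#include <stdio.h>

#define N 4                            /* number of processors             */
#define P 8                            /* number of equations (P % N == 0) */
#define LEN (P / N)                    /* rows stored per processor        */

float A[N][LEN][P];                    /* locally stored coefficient rows  */
float B[N][LEN];                       /* locally stored right-hand sides  */

/* One elimination step: processor pn removes pivot column k from every   */
/* local row whose global index (l1*N + pn) lies below the pivot row k.   */
void eliminate_step(int pn, int k, const float pivrow[P], float pivb, float piv)
{
    int l1, l2;
    for (l1 = 0; l1 < LEN; l1++) {
        int global_row = l1 * N + pn;              /* column-wrapped mapping  */
        if (global_row > k) {
            float factor = A[pn][l1][k] / piv;     /* multiplier for this row */
            for (l2 = k; l2 < P; l2++)
                A[pn][l1][l2] -= factor * pivrow[l2];
            B[pn][l1] -= factor * pivb;
        }
    }
}

int main(void)
{
    int k, pn;
    float pivrow[P];
    for (k = 0; k < P; k++)            /* owner and local slot of each row */
        printf("row %d -> processor %d, local index %d\n", k, k % N, k / N);

    for (pn = 0; pn < N; pn++)         /* trivially filled first rows so   */
        for (k = 0; k < P; k++) {      /* the update step can be exercised */
            A[pn][0][k] = 1.0f;
            pivrow[k]   = 1.0f;
        }
    for (pn = 0; pn < N; pn++)
        eliminate_step(pn, 0, pivrow, 1.0f, 1.0f);
    printf("A[1][0][0] after step 0: %g\n", A[1][0][0]);   /* 1 - 1*1 = 0 */
    return 0;
}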
ffdefine NCPUS 32 ffdefine SIZE 1024 ffinclude “ bus_sim4.h” #define N NCPUS #define P SIZE #define LEN (P/N) FILE *fptr, *outptr; float A[N][LEN][P]; float B[N][LEN]; float X[N][LENJ; float RB[N](IEN]; float OMD[P+2][N][NJ; float VRW[N][P+2]; float PSUM[N][LEN); inputarraysO { int i,j,k; fptr = fopen (“ matrix_A”, “ t” ) ; for (i=0; i <LEN; i++) for (j=0; j <N; j++) for (k=0; k < P; k++) { fscanf(fptr, “ % f, & A(fl[i][k]); } for (i=0; i <N; i++) for (j=0;j < LEN; j++) { k = (j*N)+i; B [fl[D = A[0](0][k]; } fclose(fptr); } init_X0 { inti.j; for (i=0; i < N; i++) for (j=0;j < LEN; j++) < X [i][D = 0.0; PSUM [i](j] = 0.0; } matrix array m ulO { int i,j,k,pe,ind; float temp; for (i=0; i < LEN; i++) for (j=0; j < N; j++) { temp = 0.0; tor (k=0; k < P; k++) ' { ind = k/N ; pe = k % N; temp += A0][i][k] * X[pe][ind]; } RB[j][i] = temp; } gauss (pn) int pn; { int i,j,k,k1,11,12,tpn,length,src,cnt; float temp,tempi; cnt=0; tpn=pn; _ADV_TMS(2,pn); createf gauss’ ); SET_ROWACC(pn); TSETCNT (cnt,N,pn); _ADV_TM(5,pn); for (i=0; i < LEN; i++) ( _ADV_TM(5,pn); for (j=0; j < N; j++) { k = (i*N) + j; _ADV_TM(5,pn); _ADV_TM(4,pn); if (k < P) { _ADV_TM(4,pn); if (tpn == j) { k = (i*N) + j; temp = A[tpn](i][k]; _ADV_TM(16,pn); _ADV_TM(7,pn); if (temp != 0) { length = P-k+2; _ADV_TMC(5,pn); BLK_BROADCAST(pn,length); /* initialization overhead */ I * initialization overhead */ /* overhead */ f condition check overhead */ /* condition check overhead */ /* select the pivot element */ I* computational overhead */ /* additional overhead */ r assignment overhead */ I* broadcasts temp, the A[pn][i]th row, */ /* and B[pn][0 , one-to-all broadcast */ for (11=0; 1 1 <N; 11++) { OMD[0][pn][l1] = temp; for (I2=k; I2<P; I2++) OMD[l2+1][pn][l1] = A[tpn][i][l2]; OMD[P+1][pn][l1] = B[tpn][i]; } } SYNCH(cnt.pn); SETCOLACC(pn); TSETCNT(cnt,N,pn); _ADV TM(5,pn); for (I1=i+1; McLEN; I1++) ( tempi = A[tpn][t1][k]Aemp; _ADV_TM(16,pn); ~ADV_TM(5,pn); for (I2=k; I2<P; I2++) { A[tpn][l1][l2] = A[tpn][l1][l2] - (tempi * own block */ _ADV_TM(38,pn); ~ADV_TM(5,pn); r B[tpn][l1] = B[tpn][l1] - (tempi * B[tpn][i]) _ADV_TM(28,pn); _ADV_TM(5,pn); } SYNCH(cnt,pn); SET_ROWACC(pn); TSETCNT(cnt,N,pn); > else { SYNCH (cnt.pn); SET_COLACC(pn); src = j; length = P-k+1; _ADV_TMC(7,pn); BLK_SCALAR_IN (pn,j,length); for (11=0; 1 1 <N; 11++) { VRW [I1][0] = OM D[0][g[pn]; for (I2=k; I2<P; I2++) VRW[I1][I2+1] = 0MD[I2+1]P1]; VRW[I1][P+1] = OMD[P+1)(fl(l1]; } temp = VRW[pn][0]; /* copies temp to O M */ /* copies the row vector to OM */ /* copies element from B */ I* synchronizes with other PEs */ /* switch to column mode */ /* next synchronization */ /* assignment overhead */ /* initialization overhead */ A[tpn][i][l2]); /* performs elimination on its /* multiply and sub overhead */ /* loop exit */ /* assignment of B overhead */ /* loop exit */ /* initialization overhead */ /* reads data from OM, scalar read */ I * receives the pivot element */ /•receives the pivot row */ /* receives B(fl[i] */ 158 _ADV_TM(8,pn); _ADV_TM(5,pn); for (11=0; IKLEN; 11++) < k1 = (I1*N) + tpn; _ADV_TM(5,pn); _ADV_TM(5,pn); if (k1 > k ) { tempi = A[pn][l1 ][k] / temp; _ADV_TM(16,pn); _ADV_TM(5,pn); for (I2=k; I2<P; I2++) < A(pn][l1l[l2l = A[pnJ[l1][l2] _ADV_TM(38,pn); _ADV_TM(5,pn); } B[pn][l1] = B[pn)[l1] - (tempi _ADV_TM(28,pn); } _ADV_TM(5,pn); } SYNCH(cnt.pn); SET_ROWACC(pn); } } _ADV_TM(5,pn); } _ADV_TM(5,pn); } SYNCH(cnt.pn); END_TSK(pn); back_subst(pn) int pn; { int i,j,k,H ,l2,cnt,k1,tpn,length; float temp, tem pi, x_update; cnt=0; tpn=pn; _ADV_TMS(1,pn); 
_ADV_TM(3,pn); createfback subst” ); SET_ROWACC(pn); TSETCNT(cnt,N,pn); _ADV_TM(5,pn); /* assignment overhead 7 /* initialization overhead 7 /* assignment overhead 7 /* condition check overhead 7 /* tempi initialization 7 r loop initialization 7 (tempi* VRW[pn][l2+ 1]); /* performs elimination 7 I * computation overhead 7 /* loop exiting */ VRW[pn][P+1]); I* determining B overhead */ I* loop exiting 7 /* loop exiting 7 /* loop exiting 7 I* loop initialization 7 for (i=LEN-l; i>=0; i-) _ADV_TM(5,pn); for (j=N-1; j>=0; j-) { k = (i*N) + j; _ADV TM(5,pn); _ADV_TM(4,pn); if (tpn ==j) { _ADV_TM(4,pn); if (k == P-1) { X (D (i] = BB][i]/A[fl[i][k]; temp = X [jjfi] ; _ADV TM(36,pn); } else { tempi = B[tpn][i] - PSUM[tpn][i); temp = tempi /A[tpn][i][kJ; X[fl[i] = temp; _ADV_TM (43 ,pn); } BROADCAST^; for (11=0; 1 1 <N; 11++) OMD[0][fl[l1] = temp; SYNCH(cnt,pn); SET_CO LACC(pn); TSETCNT(cnt,N,pn); _ADV_TM(5,pn); for (1 1 = 0; 1 1 < LEN; 11++) { k1 = (I1*N) +tpn; _ADV TM(5,pn); _ADV_TM(5,pn); if (kl < k ) { PSUM[pn)[l1] += A[pn][l1][k] * _ADV_TM(29,pn); } ADV_TM(5,pn); } ” SYNCH(cnt,pn); SET_ROWACC(pn); TSETCNT(cnt,N,pn); r loop initialization 7 r initialization overhead 7 /* condition check 7 /* condition check 7 /* determines last element 7 /* data to be broadcasted 7 /* computational overhead 7 /* computational overhead 7 I* broadcasts last element 7 I * copies data to OM 7 /* loop initialization overhead 7 I* computational overhead 7 I* condition check 7 i; /* computes partial sum for other 7 I* rows within that block 7 /* computational overhead 7 I* loop exiting 7 159 else { SYNCH(cnt.pn); S ETCO LACC (pn); length = 1; _ADV_TMC(2,pn); BLK_SCALAR_IN(pn,j,length); xupdate = OMD[O][0[pn]; ADV TM(8,pn); _ADV_TM(5,pn); for (11=0; I1<LEN; 11++) { k1 = (I1*N) + tpn; _ADV_TM(5,pn); _ADV_TM(5,pn); if (k1 < k ) < PSUM[pn][l1] += A[pn][l1][k] * _ADV TM(29,pn); } ADV TM(5,pn); } SYNCH(cnt,pn); SET_ROWACC(pn); } ADV TM(5,pn); f _ADV_TM(5,pn); } SYNCH (cnt.pn); END_TSK(pn); } /* length initialization 7 /* reads the data from OM */ /* moves data from OM to buffer 7 /* overhead for update 7 /* loop initialization 7 f computational overhead 7 I* condition check 7 update; I* summation overhead 7 /* loop exiting 7 /* loop exiting 7 /* loop exiting 7 simO int i; createfsim” ); csim_initO; input_arraysO; init_X0; act = N; for (i=0 ; i<N; i++) { gauss (i); } ; wait(g_d°ne); act = N; for (i=0; i<N; i++) { back subst(i); ) wait(g_done); REPORT; omp_sum_commtime(); matrix_array_mulO; Performs Gaussian Elimination and Back Substitution of P linear equations on an N-processor CCM. P rows are distributed across the processors in a eolumn-wrapped-mapping manner. Uses one-to-all broadcast communication. U ses BLK_BROADCAST operation for writing data to OM and CROSS_ROW_BLK_IN to read data from the OM. . #detine NCPUS 32 #define SIZE 1024 #include “ bus_sim4.h" #define N NCPUS #define P SIZE #define LEN (P/N) FILE *fptr, *outptr; float A[N][LEN][P]; float B[N][LEN]; float X[N](LEN]; float RB[N][LEN]; float OMD[P+2)(N)[N]; float VRW[N][P+2]; float PSUM[N][LEN); inputarraysO { int ij.k; fptr = fopen (“ matrix_A”, Y); for (i=0; i <LEN; i++) for (j=0;j <N; j++) for (k=0; k < P; k++) { fscanfffptr, “ % f, & A[j][i][k]); ) for (i=0; i <N; i++) for (j=0; j < LEN; j++) { k = (j*N)+i: 160 [ B [i][G = A[0][0][k]; I > I fclose(fptr); ! > j inK_X0 ! < " . 
| int i ,j ; ] for (i=0; i < N; i++) i for 0=0; j < LEN; j++) I { X[0[fl = 0.0; ; PSUMIOD] = 0.0; ; } matrix_array_mulO int i,j,k,pe,ind; float temp; for (i=0; i < LEN; i++) for (j=0; j < N; j++) < | temp = 0.0; I for (k=0; k < P; k++) ; { j ind = k/N ; pe = k%N; temp += A[j][i][k] * X[pe][ind]; RB[j][i] = temp; } } gauss (pn) int pn; { int i,j,k,k1,11,12,tpn,length,src,cnt; float temp,tempi; cnt=0; tpn=pn; _ADV_TMS(2,pn); create(“gauss” ); SET_ROWACC(pn); TSETCNT(cnt,N,pn); _ADV_TM(5,pn); for (i=0; i < LEN; i++) ADV_TM(5,pn); I* initialization overhead */ I * initialization overhead */ (j=0;j<N;j++) { k = (i*N) + j; _ADV_TM(5,pn); _ADV_TM(4,pn); if (k < P) { _ADV_TM(4,pn); if (tpn == j) { k = (i*N) + j; temp = A[tpn][i][k]; _ADV_TM(16,pn); _ADV_TM(7,pn); if (temp != 0) { length = P-k+2; _ADV_TMC(5,pn); BLK_BROADCAST(pn,length); for (11=0; I1<N; 11++) { OMD[0][pn][H] =temp; for (I2=k; I2<P; I2++) OMD[l2+1][pn][l1] = A[tpn](i][l2]; OMD[P+1][pn][l1] = B[tpn][i]; } } SYNCH (cnt.pn); SET_COLACC(pn); TS ETCNT (cnt.N ,pn); _ADV_TM(5,pn); for (I1=i+1; I1<LEN; 11++) { tempi = A[tpn][l1][k]/temp; _ADV_TM(16,pn); _ADV_TM(5,pn); for (I2=k; I2<P; I2++) I* overhead */ I * condition check overhead */ /* condition check overhead */ /* select the pivot element */ I * computational overhead */ /* additional overhead */ /* assignment overhead */ r broadcasts temp, the A(pn][i]th row, */ I * and B[pn][i], one-to-all broadcast */ /* copies temp to OM */ I * copies the row vector to OM */ I * copies element from B */ r synchronizes with other PEs */ I* switch to column mode */ I * next synchronization */ I * assignment overhead */ /* initialization overhead */ { A[tpn][l1][l2] = A[tpn][l1]02] - (tempi * A[tpn](i][l2]); I* performs elimination */ /* on its own block */ _ADV_TM(38,pn); f multiply and sub overhead */ _ADV_TM(5,pn); /* loop exit*/ } B[tpn][l1] = B[tpn][l1] - (tempi * B[tpn][i]); _ADV_TM(28,pn); I * assignment of B overhead */ _ADV_TM(S,pn); f loop exit*/ SYNCH(cnt.pn); SET_RO WACC (pn); TSETCNT(cnt,N,pn); } else { SYNCH(cnt,pn); SET_COLACC(pn); src = j; length = P-k+1; _ADV_TMC(7,pn); CROSS_ROW_BLK_IN (pn,j,pn,length) ; for (1 1 =0; 1 1 <N; I1++) < VRW[I1][0] = 0MD[0][fl[pn]; for (I2=k; I2<P; 12++) VRW[I1][I2+1] = OMD[l2+1][fl[H]; VRW(M][P+1J = 0MD[P+1][fl[l1]; } temp = VRW[pn](0]; _ADV_TM(8,pn); _ADV_TM(5,pn); for (11=0; I1<LEN; 11++) { k1= (I1*N) + tpn; _ADV_TM(5,pn); _ADV_TM(5,pn); if (k1 > k ) < tempi = A[pn][l1][k]/temp; _ADV_TM(16,pn); _ADV_TM(5,pn); for (I2=k; I2<P; 12++) { A[pn](l1][l2] = A(pn][l1][l2] _ADV_TM(38,pn); _ADV_TM(5,pn); } B[pn][l1] x B[pn][l1] - (tempi _ADV_TM(28,pn); } ADV TM(5,pn); F SYNCH(cnt.pn); SET ROWACC(pn); } } ADV_TM(5,pn); /* initialization overhead */ /* reads data from OM, scalar read */ /* receives the pivot element */ /*receives the pivot row */ f* receives B fiJ[i] */ I * assignment overhead */ I* initialization overhead */ I * assignment overhead */ /* condition check overhead * J I*tempi initialization*/ /* loop initialization */ (tempi* VRW[pn][l2+1]); /* performs elimination */ /* computation overhead */ /* loop exiting */ VRW[pn][P+1]); I* determining B overhead */ /* loop exiting */ /* loop exiting */ } _ADV_TM(5,pn); } SYNCH(cnt.pn); END TSK(pn); } back_subst(pn) int pn; { int i,j,k,H ,l2,cnt,k1,tpn,length; float temp, tem pi, x_update; cn1=0; tpn=pn; _ADV_TMS(1 ,pn); _ADV_TM(3,pn); create(“back_subs1” ); SET_ROWACC(pn); TSETCNT (cnt,N ,pn); _ADV_TM(5,pn); for (i=LEN-1; i>=0; i-) { _ADV_TM(5,pn); for (j=N-1; j>=0; j-) { k = (i*N) + j; _ADV_TM(5,pn); 
_ADV_TM(4,pn); if (tpn ==j) _ADV_TM(4,pn); if (k == P-1) { X[fl[i] = BG ][i]/A[fl[i][k]; temp = X G lfil; _ADV TM(36,pn); } else ( tempi = B(tpn][i] - PSUM[tpn][i]; temp = tempi / Appn][i][k]; X D J[i} = temp; _ADV TM(43,pn); } BROADCAST^); for (1 1 =0; 1 1 <N; 11++) OMD[0]Q][I1] =temp; SYNCH(cnf,pn); /* loop exiting */ /* loop initialization */ I * loop initialization */ /* initialization overhead */ /* condition check */ /* condition check */ I * determines last element */ /* data to be broadcasted */ /* computational overhead */ /* computational overhead */ I* broadcasts last element */ /* copies data to OM */ /* loop initialization overhead */ /* computational overhead */ /* condition check */ SET COLACC(pn); TSETCNT(cnt,N,pn); ADV_TM(5,pn); T o r (1 1 =0; 1 1 < LEN; I 1++) { k1 = (I1*N) +tpn; _ADV_TM(5,pn); ADV_TM(5,pn); if (kl < k ) ( PSUM[pn][l1l += A[pn][l1][k] * temp; /* computes partial sum for other */ /* rows within that block */ _ADV_TM(29,pn); /* computational overhead */ ADV_TM (5,pn); f loop exiting */ f SYNCH(cnt.pn); SET_ROWACC(pn); TSETCNT(cnt,N,pn); } else { SYNCH(cnt,pn); SET_CO LACC (pn); length = 1; _ADV_TMC(2,pn); CROSS_ROW_BLK_IN (pn,j,pn, length); x update = OMD[0][fl[pn]; _ADV_TM(8,pn); _ADV_TM(5,pn); for (11=0; I1<LEN; 11++) { kl = (I1*N) + tpn; _ADV_TM(5,pn); ~ADV TM(5,pn); if (kl < k ) { PSUM(pn][l1] += A[pn][l1][k] * x update; /* length initialization */ f reads the data from OM */ I* moves data from OM to buffer */ f* overhead for update */ r loop initialization */ /* computational overhead * 1 f* condition check */ _ADV_TM(29,pn); } ADV TM(5,pn); } “ SYNCH (cnt.pn); SETROWACC(pn); } _ADV_TM(5,pn); } _ADV_TM(5,pn); /* summation overhead */ /* loop exiting */ /* loop exiting * 1 /* loop exiting */ } SYNCH(cnt,pn); END_TSK(pn); } simO { int i; createfsim” ); csim_initO; input_arraysO; init_X 0; act = N; for (i=0; i<N; i++) { gauss(i); }; wait(g_done); act = N; for (i=0; i<N; i++) ( back_subst(i); } wait(g_done); REPORT; ccm_sum_commtimeO; matrix_array_mul();
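Each Gaussian-elimination driver above finishes by calling matrix_array_mul(), which recomputes the product of the original coefficient matrix and the computed solution into RB, presumably so the result can be checked against the stored right-hand side. A minimal serial sketch of such a residual check, using the same column-wrapped layout as above; the identity test system in main() is only for illustration:

#include <stdio.h>
#include <math.h>

#define N 4                            /* number of processors               */
#define P 8                            /* number of equations                */
#define LEN (P / N)                    /* rows per processor                 */

float A[N][LEN][P];                    /* coefficient rows, wrapped over PEs */
float B[N][LEN];                       /* original right-hand sides          */
float X[N][LEN];                       /* computed solution, same layout     */

/* Recompute A*X under the column-wrapped layout and report the largest     */
/* deviation from the stored right-hand side.                               */
float max_residual(void)
{
    int pe, i, k;
    float worst = 0.0f;
    for (pe = 0; pe < N; pe++)
        for (i = 0; i < LEN; i++) {
            float sum = 0.0f;
            for (k = 0; k < P; k++)                   /* element k of x lives on */
                sum += A[pe][i][k] * X[k % N][k / N]; /* PE k%N, local slot k/N  */
            if (fabsf(sum - B[pe][i]) > worst)
                worst = fabsf(sum - B[pe][i]);
        }
    return worst;
}

int main(void)
{
    int pe, i, k;
    for (pe = 0; pe < N; pe++)          /* identity system: A = I, B = X = 1 */
        for (i = 0; i < LEN; i++) {
            for (k = 0; k < P; k++)
                A[pe][i][k] = (k == i * N + pe) ? 1.0f : 0.0f;
            B[pe][i] = 1.0f;
            X[pe][i] = 1.0f;
        }
    printf("max residual = %g\n", max_residual());    /* expected: 0 */
    return 0;
}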