COMMUNICATION-EFFICIENT ARCHITECTURES FOR MASSIVELY PARALLEL PROCESSING

by

Joydeep Ghosh

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (Computer Engineering)

August 1988

Copyright 1988 Joydeep Ghosh

This dissertation, written by Joydeep Ghosh under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Dedication

To my parents and my teachers

Acknowledgments

I remember ... coming to the distinct conclusion that there were only two things really worth living for - the glory and beauty of Nature, and the glory and beauty of human love and friendship ...
— Alan M. Turing, quoted in Alan Turing: The Enigma (1983)

I express my sincerest gratitude to my advisor, Dr. Kai Hwang, for his constant guidance, support and encouragement. During the course of my graduate studies, he has been a fountain of inspiration and insights. His unflagging enthusiasm for new ideas and adventures was invaluable in shaping this thesis. I also thank Prof. Les Gasser and Prof. Dan Moldovan for serving on my dissertation committee, and Prof. C. S. Raghavendra and Dr. Anujan Varma for imbibing in me the spirit of scholarly ethics.

My parents carefully nurtured me with boundless love and affection. My dear sister Jhumki has been a source of great cheer and support. June and May mean more to me than the warmth of summer. I thank them all for giving me the joys of life.

Several friends and fellow graduate students helped me through my thesis. Raymond Chowkwanyun greatly assisted my forays into programming and debugging, and has been a wonderful partner for bouncing ideas and witticisms. Zhiwei Xu, Dongseung Kim and H. C. Wang accompanied me in burning midnight oil on numerous occasions. Prof. Alice C. Parker, Sarma Sastry, Fadi Kurdahi and Nohbyung Park provided a very congenial atmosphere during my formative years at USC. The drudgery of routine administrative tasks was made bearable by the highly spirited and helpful staff of Marlena Libman, Diane Demetras, Christine Estrada, Jenine Abarbanel and others.

I was financially supported for four years by an All-University Predoctoral Fellowship from the Dean of the Graduate School. My research has been partially funded by NSF grant DMC-84-2102 and ONR contract No. N00014-86-K-0559.
Contents

Dedication
Acknowledgments
List of Figures
List of Tables
Abstract

1 Introduction
  1.1 The Advent of Massively Concurrent Systems
  1.2 Design Considerations
    1.2.1 Interprocessor Communication Characteristics
    1.2.2 Toward Communication-Efficient Architectures
    1.2.3 Top-Down and Bottom-Up Approaches
  1.3 Overview of the Thesis

2 Principles of Hypernet Architectures
  2.1 Synthesis of Hypernets
    2.1.1 The Building Blocks
    2.1.2 The Construction Methodology
    2.1.3 Cubelet-based Hypernets
    2.1.4 Hypernets built with Treelets and Buslets
  2.2 Message Routing in Hypernets
    2.2.1 A Distributed Routing Strategy
    2.2.2 Broadcasts and Partial Broadcasts
  2.3 I/O and Fault Tolerance Issues
  2.4 Partially Constructed Hypernets

3 Computational and Communication Properties of Hypernets
  3.1 Topological Properties
    3.1.1 Network Diameter and Link Complexity
    3.1.2 Average Distances
  3.2 Average Path Delays
  3.3 Natural Communication Patterns
  3.4 Performance under Localized Communication
  3.5 Concluding Remarks

4 Parallel Processing on Hypernets
  4.1 Introduction
  4.2 Mapping of Data-Parallel Algorithms
    4.2.1 Hypernets as Hierarchical Structures
    4.2.2 Hypernets as Virtual Hypercubes
    4.2.3 Speedup Performance Analysis
    4.2.4 Memory Based Reasoning
  4.3 Data-Dependent Processing
    4.3.1 General Principles
    4.3.2 Dynamic Allocation and Load Balancing
    4.3.3 Simulation Results
  4.4 Concluding Remarks

5 Mapping Neural Networks onto Multicomputers
  5.1 Introduction
  5.2 Connectionist Models and Mechanisms
    5.2.1 What is Connectionism
    5.2.2 Physical Structure of Neural Networks
  5.3 A Structural Model for Mapping Neural Networks
    5.3.1 General Model
    5.3.2 Characterizing a Layered Network
  5.4 Multiprocessors for Neural Network Simulation
    5.4.1 Full versus Virtual Implementation
    5.4.2 Processor and Memory Requirements
    5.4.3 Mapping of Neural Networks
    5.4.4 Choice of Interconnection Network

6 Communication-efficient Connectionist Machines
  6.1 Introduction
  6.2 Communication Bandwidth Requirements
    6.2.1 Theoretical Estimates
    6.2.2 Simulation Results
  6.3 Suitability of Existing Parallel Computers
  6.4 Optical Interconnects
  6.5 Execution of Connectionist Models on the Hypernet
    6.5.1 Communication Support
    6.5.2 Packaging and VLSI Considerations
    6.5.3 Hierarchical Operation
  6.6 Discussion and Summary

7 Conclusions
  7.1 Main Contributions
  7.2 Suggestions for Further Research

Bibliography

Appendices

A Hypernet Simulations
  A.1 Source code listings
    A.1.1 Initialization
    A.1.2 Hypernet description
    A.1.3 Process migration
    A.1.4 Process suspension and awakening
    A.1.5 Template of a TAK process
    A.1.6 Execution of TAK
    A.1.7 Display of results
  A.2 An execution trace for TAK(18,12,6)

B Neural Network Simulations
  B.1 Source Code Listings
    B.1.1 Logical communication traffic for random connections
    B.1.2 Logical communication traffic for clustered connections
    B.1.3 Description of hypothetical neural nets
    B.1.4 Specification of multicomputer topologies
    B.1.5 Bandwidth requirements calculation
  B.2 Execution Traces
    B.2.1 Logical traffic for random connections
    B.2.2 Logical traffic for localized connections
    B.2.3 Function arguments for bandwidth estimation
    B.2.4 Influence of packet size

List of Figures

1.1 A bottom-up approach to designing communication-efficient architectures
1.2 A top-down case study of connectionist architectures
2.1 A computer node
2.2 Various building blocks for constructing hypernets
2.3 Construction of a (3,3)-net with eight (3,2)-nets
2.4 Construction of a (3,3)-net with 32 treelets
2.5 A (3,2)-net built using buslets
2.6 Proof of Lemma 2.1
3.1 Diameters of various multicomputer networks
3.2 Link complexity of various multicomputer networks
3.3 Average distances for various multicomputer topologies
3.4 Comparison of average path delay for various multicomputer topologies (W = 2)
3.5 Average path delays in (4,h)-nets
3.6 Models for specifying interprocessor traffic patterns
3.7 Average distances under localized traffic (Threshold Model, α = 0.9, r = 16)
4.1 Data movements during execution of the MIN algorithm
4.2 Viewing a (3,2)-net as a two-tiered structure
4.3 Data movements during execution of the FFT algorithm
4.4 Communication costs for algorithms of the Ascend, Descend and Recursive Doubling classes
4.5 Average path delays in hypernets for three algorithmic classes
4.6 Varying load distribution through bias control
4.7 Effect of bias on communication profile
5.1 Critical issues in developing a connectionist architecture
5.2 Architecture of a cognitive system
5.3 The universe of a processing cell
5.4 Characterization of a multilayered network
5.5 Mapping of neural nets onto parallel computers
5.6 A generic processing cell of a neural network
5.7 A multicomputer architecture for implementing connectionist networks
5.8 Detailed processor organization
6.1 Effect of clustering on bandwidth requirements
6.2 Bandwidth demands for different system sizes
6.3 Interprocessor communication demands for linear speedup
6.4 Bandwidth required for different nets
6.5 Effect of packet size on bandwidth (p_i = 1/8, p_r = 1/16)
6.6 Effect of packet size on bandwidth (p_i = p_r = 1/32)
6.7 Bandwidth requirements for different multicomputer topologies
6.8 Bandwidth demands in implementing Net C on hypernets and hypercubes

List of Tables

2.1 Component Count of Hypernets
4.1 Complexity of Algorithms on Various Multicomputer Topologies
4.2 Execution of TAK(18,12,6) on a (3,3)-net
5.1 Structural Characteristics of Various Connectionist Networks
5.2 Commercially Available Neural Network Simulator Packages
6.1 Specifications of Five Hypothetical Neural Nets
6.2 Values of p_t for Various Nets
6.3 Architectural Requirements for Simulating a Neural Network with 1 Million Cells
6.4 Ability of Existing Parallel Computers to Simulate Large Neural Networks
6.5 Packaging Requirements for a Connectionist Machine with 64K Processors

Abstract

This dissertation investigates architectural techniques for supporting interprocessor communication in highly parallel multicomputer systems. A family of hierarchical networks called hypernets is proposed as a communication-efficient architecture for parallel applications that exhibit a mix of localized and global communication among processes. The modular construction of hypernets from VLSI building blocks is illustrated, and their architectural potential is analyzed in terms of message routing complexity, cost-effective support for global as well as localized communication, I/O capabilities, and fault tolerance. Hypernets are seen to be superior to many existing architectures for large system sizes.

Techniques for mapping a variety of algorithms onto hypernets are presented. A novel dynamic process allocation scheme is illustrated by simulating data-dependent programs on a 256-node hypernet. The interprocessor messages generated during massively parallel processing are examined. The message patterns observed are often close to the abstract characterization of interprocessor communication most suited to hypernets.

Architectural issues in simulating artificial neural networks on multicomputer systems are addressed in a detailed case study of massively parallel computing. First, models are presented for characterizing the structure of large neural networks and the functional behavior of network elements. A distributed processor/memory organization based on these models is developed for efficiently simulating asynchronous, value-passing architectures. The volume of messages that must be exchanged among physical processors to simulate the weighted connections of a neural network is estimated from the structural description and mapping policy. This determines the interprocessor communication bandwidth required, and the optimal number and granularity of processors needed to meet a particular cost/performance goal. The suitability of a hypernet-based system for simulating large neural networks is assessed in light of the estimated architectural demands.
The m odular construction of hyper­ nets from VLSI building blocks is illustrated, and their architectural potentials are analyzed in term s of message routing complexity, cost-effective support for global as well as localized communication, I/O capabilities, and fault tolerance. Hypernets are seen to be superior to m any existing architectures for large system sizes. Techniques for mapping a variety of algorithms onto hypernets are pre­ sented. A novel dynamic process allocation scheme is illustrated by simulating data-dependent programs on a 256-node hypernet. The interprocessor messages generated during massively parallel processing are examined. The message p at­ terns observed are often close to the abstract characterization of interprocessor communication most suited to hypernets. Architectural issues in simulating artificial neural networks on m ulti­ com puter systems are addressed in a detailed case study of massively paral­ lel computing. First, models are presented for characterizing the structure of large neural networks and the functional behavior of network elements. A dis­ tributed processor/m emory organization based on these models is developed for efficiently simulating asynchronous, value-passing architectures. The volume of messages th at need to be exchanged among physical processors for simulating xii the weighted connections of a neural network is estim ated from the structural description and m apping policy. This determines the interprocessor communi­ cation bandw idth required, and the optim al num ber and granularity of proces­ sors needed to meet a particular cost/perform ance goal. The suitability of a hypernet-based system for simulating large neural networks is assessed in light of the estim ated architectural demands. 1 Chapter 1 Introduction T he aim s o f scientific thou gh t are to se e the general in the particular and the eternal in the transitory. - ALFRED N. WHITEHEAD, Science and the Modern World (1925) 1.1 The Advent of Massively Concurrent Systems Massively parallel processing has emerged as a promising research area with the advent of new hardware and software technologies coupled w ith the development of models of com putation th at emphasize fine-grained parallel and distributed processing. The highly concurrent m ulticom puter systems used for such processing contain a large number of computing nodes th at are intercon­ nected by a multicomputer network. Each node comprises of a processor with local memory and a network interface. The processor-memory elements in these systems operate asynchronously with distributed programs and d ata sets, and use mess age-passing or value-passing mechanisms [FH83] for internode com­ munication over the multicom puter network. They can alternatively be run in sim ple-instruction/m ultiple-data (SIMD) mode to perform massively data- parallel operations, or as multiple-SIMD machines to exploit parallelism at both control and data levels [SM86). This is in contrast with tightly coupled com- 2 puter organizations such as the Ultracom puter [GGK*83] where the processors communicate through shared memory. The principal driving forces behind the resurgence of interest in mas­ sively parallel architectures include: • Hardware advances: The availability of 32-bit single-chip microprocessors and inexpensive memory has already lead to the commercial production of medium-scale hypercube m ulticom puter systems by Intel, Ametek, NCube and Floating Point Systems [SF88,RF87]. 
These systems typically employ O(10^2) nodes, each operating at O(1) million instructions per second (MIPS), to exploit medium-grain parallelism in programs. If simpler computing elements are used, then it is possible to fabricate an array of nodes on monolithic chips using current VLSI and packaging technology. One can now conceive of economically using thousands of such processor/memory chips and their supporting hardware to construct an extremely powerful multicomputer system contained in a few cabinets.

• Parallel software support: Recent advances in software support for concurrent processing include the development of special language constructs for concurrent programming [Hoa78], message communication protocols [LGFR82], strategies for cooperative computation [Smi80,FH87], and mechanisms for ensuring the security and integrity of data and programs [Kle85]. Language paradigms such as object-oriented programming provide message-passing schemes for cooperation, communication and control among concurrent processes [Lan82,Dal86].

• Application demands: Besides the ever increasing demands of numeric supercomputing [Hwa87], massively parallel systems are deemed indispensable for a number of artificial intelligence (AI) applications [WL86,HGC87]. In particular, they can be employed to alleviate the von Neumann processor/memory bottleneck caused by the intense and irregular memory access patterns characteristic of many symbolic processing applications, and to make large problems tractable [HG87b]. Promising computational paradigms such as connectionism [FB82] and data-parallelism [HS86] are intrinsically based on distributed representations and are amenable to massively parallel processing.

1.2 Design Considerations

1.2.1 Interprocessor Communication Characteristics

A key issue in the design of large-scale multicomputers is their ability to provide adequate support for interprocessor communication. The significance of the multicomputer interconnection network increases with the system size. Indeed, for very large systems, interconnection networks can account for over 30% of the total cost (in terms of time and money) of design and development [Pfi85], and occupy over 50% of the system volume.

In this dissertation, we study networks that are limited in extent to a few cabinets rather than being geographically distributed. The communication delays in such systems should not exceed a few microseconds. In contrast, delays in milliseconds may be tolerated in a loosely coupled distributed system such as a network of workstations. The stringent conditions on network performance necessitate hardware support for message formation, routing, buffering and flow control. Most commercial multicomputers include a communications coprocessor or controller in every node that is capable of routing messages independent of the computation processor [RF87]. Thus improved performance is obtained at the expense of increased hardware complexity. The selection of a suitable trade-off point becomes a critical design decision.

The nature and volume of interprocessor communication depends on the grain of computation, defined as:

Definition 1.1 The grain of computation denotes the size of unit tasks, and is measured by the amount of computation between process interactions.

A process may comprise several unit tasks. A unit task is executed on a single computer node, and can be completed between two consecutive process interactions.
Massively parallel processing emphasizes fine-grained computation, where a unit task typically involves fewer than a hundred machine cycles and processes communicate very frequently using messages that are only a few bytes long. Medium-grained parallelism entails larger unit tasks such as subroutines, less frequent communication among processors, and larger message sizes.

Numeric processing is largely performed on regular data structures such as vectors and arrays. It often involves simple, repetitive patterns of communication among interacting processes. A host of researchers have investigated varied techniques for reducing communication among nodes, improving processing speed, and providing suitable interconnection networks for numeric applications. Techniques such as pipelining, vectorization, multitasking and systolization have been developed over the years [HB84].

Symbolic processing for AI applications differs from numeric processing in several fundamental ways [HG87b]; consequently it has different architectural requirements. Symbolic computation is often parallel and nondeterministic [Coh79], which makes good dynamic allocation, load-balancing and garbage collection policies imperative [Coh81]. It often involves changes in granularity and/or data formats for different applications. More important, it can be highly irregular in terms of memory access patterns as well as in the volume and destination of messages in the system [Mol86]. For large systems, interprocessor communication can be the critical bottleneck when parallel and distributed processing is practiced for such applications [Wit81].

Criteria for performance evaluation of interconnection networks have mostly been based on the assumption of a globally uniform distribution of message traffic, where each processor communicates with every other with equal probability [AJP86]. This model may be modified for special circumstances such as "hot-spot" conditions in shared-memory multiprocessors [PN85]. The globally uniform model yields misleading results in distributed symbolic processing environments requiring both local and global communication. This motivates us to consider alternative models which can approximate the anticipated interprocessor communication patterns more closely and still be amenable to analytical examination [RS83].

Some studies of message-based parallel processing on medium-scale multicomputer systems have been reported recently. Of particular note is a book by Reed and Fujimoto which examines medium-grain parallelism on hypercube systems [RF87]. However, to our knowledge, little investigation of interprocessor communication characteristics in highly parallel and distributed processing environments has been carried out, even though this is a crucial issue for large-scale multicomputers with over a thousand nodes. This dissertation attempts to fill this void by exploring both hardware provisions and software solutions to the above problem. We are especially interested in developing a theory of large-scale architectures that can provide efficient communication support for highly concurrent symbolic processing and connectionist computations.
1.2.2 Toward Communication-Efficient Architectures

Definition 1.2 A communication-efficient multicomputer architecture is one that provides adequate interprocessor communication bandwidth using a minimum of hardware resources such as communication links and interface hardware, message buffers, and processor cycles.

Communication-efficiency entails a balance among processor speed, memory bandwidth, system input/output capabilities and interprocessor communication bandwidth, such that none of these factors becomes a bottleneck in system performance. It simultaneously aims at maximum utilization of the interconnection network, to yield a cost-effective solution.

The design of communication-efficient architectures uses the theory of interconnection networks [WF84] to cater to the anticipated communication requirements. These requirements depend not only on the proposed application domain and grain of computation, but also on implementational issues such as memory allocation, process scheduling, load balancing, choice of computing model, mapping of algorithms, and techniques for synchronization and deadlock avoidance [RS83].

The evolution of a communication-efficient architecture is influenced by all the factors mentioned above, as well as by technological and commercial considerations. The most crucial factor, however, is the choice of an appropriate interconnection network. Networks with small diameters and low latency are favored. VLSI emphasizes regular interconnections with short wires and few crossovers [Ull84]. Technological and cost considerations limit the fanout and link complexity of interconnection networks. Issues of message routing and control, reconfigurability, and fault tolerance are also encompassed [Fen81,RV88].

Several efforts at designing large multicomputer networks have been reported by researchers [WL81,Wit81]. Some of the schemes advocate bus-based architectures or reconfigurable multistage networks. They attempt to link up all the processor/memory nodes as a pseudo-complete graph. Rings, busses and multistage interconnection networks are viable for small or medium-sized systems with hundreds of nodes. A point-to-point topology offers direct connections among the computer modules, simpler protocols and more efficient support for local communication. If internode communication is largely local, then the connectivity provided by a pseudo-complete graph is superfluous. So for very large systems with thousands of nodes and localized communication patterns, direct-interconnect architectures are more attractive [Uhr87].

1.2.3 Top-Down and Bottom-Up Approaches

The design of AI machines can be approached using a top-down, middle-out or bottom-up methodology [Wah87]. The top-down methodology begins with requirements specification for the target applications, and gradually works down to finer details about knowledge representation and execution model until the hardware design level is reached. In our case, a top-down approach first deals with design issues at the knowledge representation and control levels for the projected symbolic processing environment. The communication requirements for the chosen representation and control mechanisms are then determined. Finally, a suitable machine-level architecture is specified for implementing the execution model.

The bottom-up methodology bases the hardware design primarily on technological and cost considerations.
A basic computational model such as dataflow, control-flow or reduction is selected [TBH82]. Architectural considerations such as instruction set design, memory organization and interconnection network theory have a significant impact on the hardware design. The software is then built on top of the hardware to support the application.

The middle-out methodology is a short-cut to the top-down design methodology. It starts with a knowledge-representation scheme such as a semantic network, or a well-established language such as Lisp. Working upwards, this language or representation scheme is modified to adapt to the application requirements. In the downward direction, hardware supports, such as a tailored instruction set for the chosen language, are designed.

Our interpretation of the bottom-up approach for the design and analysis of communication-efficient architectures is shown in Fig. 1.1. The initial synthesis of a multicomputer network is primarily determined by architectural and technological considerations. Applications are represented only by abstract communication models at this stage. The behavioral model represents an application problem as a number of logical processes and specifies the dependencies among them. The translation from this intermediate form to the virtual network involves choosing the grain of computation. These virtual networks are task precedence graphs whose nodes are unit tasks, and whose links represent data or control dependencies. Network analysis for evaluating the synthesized architecture considers the mapping of virtual networks onto it. Network evaluation is used to iteratively improve the architectural design.

[Figure 1.1: A bottom-up approach to designing communication-efficient architectures (applications, communication models, behavioral models, virtual networks and the physical network, linked by network synthesis and network analysis under architectural considerations and VLSI technology)]

1.3 Overview of the Thesis

This dissertation is divided into two distinct parts:

• In Chapters 2, 3 and 4, we approach communication-efficient architectures using a bottom-up design methodology.

• Chapters 5 and 6 reflect a top-down approach to the same problem, using connectionist processing such as artificial neural network simulation as a case study.

The theme that underlies and unifies the two complementary approaches is the exposition of a new family of multicomputer networks called hypernets. Hypernets are developed through the bottom-up approach, and are later shown to be suitable for efficiently supporting the simulation of large neural networks.

In Chapter 2, we introduce hypernet architectures that are essentially based on a point-to-point topology. These architectures can be enhanced or fine-tuned by incorporating global connections and/or local busses. The design reflects considerations of the proposed application domains as well as of the underlying VLSI/WSI technology. A construction methodology and a distributed message routing scheme are given, and basic architectural issues such as I/O capabilities and fault tolerance are examined.

In Chapter 3, we compare hypernets with existing direct-interconnect architectures such as hypercubes, meshes and tree-based systems, using appropriate communication models. The suitability of these architectures for different computer sizes, localized communication and technological constraints is assessed.
Evaluation criteria include network diameter, link complexity, average path lengths, network congestion and average message delays for various communication patterns.

A variety of algorithms are mapped onto hypernets in Chapter 4. Though distributed processing implies multiple-instruction, multiple-data operation, we also consider some algorithms which are executed in SIMD or multiple-SIMD mode. Goals aimed for during the mapping process include minimizing interprocessor communication, localizing communication by limiting interaction with distant processors, maximum parallelism, balanced resource utilization, and ease of inputting data and accessing results.

An in-depth case study is carried out in Chapters 5 and 6 using the top-down methodology outlined in Fig. 1.2. Here, we focus on connectionist models for knowledge representation and inferencing. These models are based on an abstraction of the information processing properties of neurons in the biological world [HT86] and are amenable to massively parallel and distributed processing [RM86]. We first characterize the processing and communication requirements for executing these models. Major classes of parallel architecture are assessed regarding their suitability for implementing these models. A processor/memory organization for simulating large neural networks is presented, and a policy for efficiently mapping connectionist networks onto multicomputers is given.

Though most researchers are not occupying themselves with hardware realizations of cognitive computers at present [HMT86], some proposals such as the use of reconfigurable hashnets, PLA-like structures for dynamic interconnections [Fel82], and bus-based neurocomputers [TRW86] have been made recently. We examine how direct-interconnect architectures compare with the above approaches in supporting the communication demands. In particular, we explore ways of efficiently implementing connectionist models on hypernets, and determine the hardware requirements. Implementation of neurocomputers using optical or electro-optical technology is also considered [Hec86,FP85]. We hope to provide the foundations of a design methodology for communication-efficient cognitive computers.

[Figure 1.2: A top-down case study of connectionist architectures (from applications and connectionist models, through communication requirement analysis of artificial neural networks, to a communication-efficient connectionist architecture)]

Chapter 2
Principles of Hypernet Architectures

What shall I call my dear little dormouse? His eyes are small, but his tail is e-nor-mouse.
— A. A. Milne, The Christening (1950)

2.1 Synthesis of Hypernets

In this section, we present a general methodology for synthesizing various hypernets. These ideas are elaborated through case studies of hypernets built out of identical cubelets, treelets and buslets.

2.1.1 The Building Blocks

Hypernets are constructed incrementally by methodically putting together a number of basic modules. Each module consists of a set of interconnected nodes. A node is an abstraction for a processor element with local memory and a communication switch interface [RS83], as shown in Fig. 2.1. A node has multiple ports which connect to other nodes through dedicated bidirectional communication links or via shared busses. If a link is used for connecting to another node within the same module, it is an internal link; otherwise it is called an external link.
Node Components \ Node Symbol (M: Memory; P: Processor; S: Switch) Figure 2.1: A computer node called an external link. W hen a basic module is incorporated as p art of a bigger network, the external links are used for direct connections to other modules, or for interfacing with the external world. Figure 2.2 presents three types of basic modules. A cubelet is formed by augmenting a hypercube w ith one extra “external” link per node. Figure 2.2.b shows a 3-dimensional cubelet. Each cubelet has a special node, called an I/O node, th at is used for direct com­ m unication with the outside world. Its external link is used as an I/O channel (darkened line) and not for connecting to other nodes. The rem aining nodes are processing nodes, whose external links are used for a direct linkage w ith other cubelet modules, or for interfacing w ith the external world. In the sequel, a d- dimensional cubelet is simply called a d-cubelet. For example, of the 8 external links in a 3-cubelet, one serves as a dedicated I/O channel for the I/O node while the others are attached to the 7 processing nodes. Each node also has 3 internal links which form the 3-cube connections. Among other choices for the basic module are treelets, which augment complete binary trees with extra links for each node (Fig. 2.2.b), and buslets consisting of a set of dual-ported nodes sharing a common bus (Fig. 2.2.c). The basic modules are targeted to be implemented w ith VLSI/W SI technologies. If each node has a fine-grained, reduced instruction set computer 14 Internal Link *» External Link O Processing Node O I/O Node — ^ I/O Channel Jill Bus w (d) A buslet Figure 2.2: Various building blocks for constructing hypernets (RISC) architecture with some symbolic m anipulation capabilities [Hil85], then a module comprising of such processing nodes can be implemented on a single chip. A medium-sized hypernet built out of these modules can be fabricated on a few PC boards. Higher dimensional cubelets may be possibly constructed as VLSI modules and large hypernets may be constructed with WSI technology in the future. 2.1.2 T h e C o n stru ctio n M eth o d o lo g y A hypernet is characterized by a quadruple (JB, d, h, G). B represents the set of basic modules used for constructing the network. Each basic m odule in B has 2d external links. The num ber of levels in the hierarchical construction is given by h, while G measures the global connectivity of the network. A particular choice for B and G yields a family of hierarchical networks. An instance of a family formed w ith identical basic modules is specified by its (d,h) (b) A 3-cubelet (c) A treelet 15 value, and will be called a (d,h)-net. A smaller hypernet w ith i < h levels, that forms a substructure of a (d,h)-net, is called a (d,i)-subnet. A (d,l)-subnet is nothing but a basic module with 2d external links. A hypernet is constructed “bottom -up” as follows: A (d,2)-net, i.e. a hypernet with 2 levels of hierarchy is formed by interconnecting a set of (d,l)- subnets, such that: (i) There are G direct links between any two basic modules. These links are called level-1 links or Ajs. (ii) In each basic module, G external links are dedicated as I/O chan­ nels and 2d~1 — G links are used as AiS to realize the direct connections specified by rule (i). From (i) and (ii), it follows th at 2d~1/G basic modules are required to construct a (d,2)-net. If each module is considered as a vertex of a graph, then a (d,2)-net is a complete graph on 2d~l / G of such vertices. 
Furtherm ore, exactly half of the external links in each basic module are still available for higher level interconnections or as extra I/O links. The actual choice of the node within a subnet for connecting to another subnet depends on the nature of the basic modules used. A distributed routing mechanism is desired in which the desti­ nation address suffices to determine the neighboring node to which a message should be forwarded to. This is facilitated if the addresses of neighboring nodes by bit-permute-com plem ent (BPC) [NS82] perm utations. In Section 2.1.3, we give the selection rules th at are used when cubelets serve as the building blocks, and illustrate the construction methodology in detail. A (d,h)-net is formed by interconnecting (d,/i^l)-subnets such that there are G direct links between any two (d, A-1)-subnets. These links are called level-(h-l) links or Xh-is. In the process, each of these subnets acquires G dedicated I/O links to be shared by all the nodes in th a t subnet. Further­ more, exactly half of its unallocated external links rem ain available for further connections. Let N i be the num ber of nodes in a basic module. A (d,h)~net has the following numbers of components when G = 1 :_____________________________ N = # (Nodes) M = #((d,h — I ) — subnets) C = # (basic modules) I = # ( / / 0 Nodes) P = # ( Processing Nodes) 16 _ jyl22' h ~l(d-2)+k+l-d = 22k-2(<i-2)+1 N 22 fc_1(d-2)+A +1-d (2.1.1) No t=h * ' = f c . ( f l)(2f c -2 * ')+ fc-» * = 1 = N - I In particular, if JVj = 2d and G = 1, we get: N = 22ft_1(d- 2)+h+1 (2.1.2) Equation 2.1.2 is obtained by solving the recurrence relations: rih = 2rih-i — (h — 1) ; ni = d where n .,- = log(# of nodes in a (d,t)-net). Table 2.1 summarizes the component count for this case. The size of a hypernet increases exponentially with h. For example, a (4,2)-net has 128 nodes, a (4,3)-net has 4096 nodes and a (4,4)-net has over 2 million nodes, as shown in Table 2.1. Interm ediate network sizes can be obtained by increasing G, or by using partial hypernets, as elaborated in Section 2.4. A layout of a hypernet aims to keep all modules belonging to the same subnet as physically close as possible. This goal is autom atically m et in the process of bottom -up construction of a hypernet. A hypernet is constructed level by level by adding basic modules until the hierarchical level is high enough to yield a network of the desired size. The construction methodology is m odular. Links once formed need not be altered as the network grows in size. The hypernet architecture can be applied to build m ulticom puters w ith sizes ranging from hundreds to millions of nodes. 2.1.3 C u b elet-b a sed H y p e r n e ts In this section, we study hypernets m ade out of identical cubelets as representative examples of the broad class of hypernet architectures. The con- 17 Table 2.1: C o m p o n e n t C o u n t o f H y p e rn e ts d h N M C P I L nodes subnets cubelets PNs INs ext. links 2 2 8 2 2 6 2 4 2 3 16 2 4 10 6 4 3 2 32 4 4 28 4 16 3 3 256 8 32 216 40 64 3 4 8192 32 1024 6880 1312 1024 4 2 128 8 8 120 8 64 4 3 4096 32 256 1760 288 1024 4 4 22 1 512 21 7 1949184 147968 21 8 4 5 23 8 131072 23 4 0(15 x 234) 0 (234) 23 4 5 2 512 16 16 496 16 256 5 3 65536 128 2048 63360 2176 16384 5 4 22 9 8192 224 0(31 x 224) 0 (224) 22 6 5 5 25 4 22 S 24 9 0(31 x 249) 0 (249) 25° 5 6 2 1 03 249 298 0(31 x 298) 0 (298) 29 S Note: PN = Processing node and IN = 1 /0 node. 
The construction process is best illustrated by the (3,3)-net shown in Fig. 2.3.a. This hypernet has 3 levels of hierarchy. The (3,3)-net is constructed by interconnecting eight (3,2)-subnets. Each (3,2)-subnet is in turn constructed from four 3-cubelets. A (3,3)-net thus uses a total of 32 cubelets to yield a multicomputer network with 256 nodes. Note that some external links have not been used for interconnecting nodes within the (3,3)-net. These links are available for making connections with other (3,3)-nets to form a (3,4)-net, or as extra I/O links. Since each cubelet has 2^d nodes, the total number of nodes in a (d,h)-net built out of such cubelets is given by Eq. 2.1.2. In the example hypernet (Fig. 2.3.a), we have I = 40 and P = 216; 64 links are available for further connections. As mentioned earlier, each d-cubelet has a dedicated I/O node. In addition, each smaller (d,i)-net for 1 < i < h, which forms a subnet of the (d,h)-net, has a special node with a spare external link that can be used as a dedicated I/O node for that subnet.

[Figure 2.3: Construction of a (3,3)-net with eight (3,2)-nets. (a) Topology of a (3,3)-net consisting of eight (3,2)-subnets; (b) the (3,2)-subnet with prefix 110; (c) address mapping: a source address b7 b6 b5 | b4 b3 b2 | b1 b0 ((3,2)-net address | cubelet address | address within cubelet) connects to the target address b4 b3 b2 | b7 b6 b5 | b1 b0.]

Each node in a (d,h)-net is specified by a binary number B = b_{n-1}, b_{n-2}, .., b_1, b_0, where n = \log_2 N. The most significant m = \log_2 M bits of B identify the (d,h-1)-subnet that this node belongs to. The least significant d bits are used to distinguish among nodes within the same cubelet. The middle n-m-d bits are used to identify the cubelet within a (d,h-1)-subnet, as illustrated in Fig. 2.3.b. Note that 2m + h - 1 = n, as can be inferred from Eq. 2.1.2.

The following rules are applied recursively to obtain a (d,h)-net from M (d,h-1)-subnets:

Qualification Rule: The external link of a node B qualifies as a link between two different (d,h-1)-subnets if and only if the least significant h-1 bits of B are identical to the binary sequence 011..1; that is, b_i = 1 for i = 0, 1, .., h-3, and b_{h-2} = 0.

Connection Rule: Let D be the node whose address is obtained by swapping the most significant m bits of B with the next most significant m bits of B, i.e. D = b_{n-m-1}, .., b_{n-2m}, b_{n-1}, .., b_{n-m}, 0, 1, .., 1. A qualified link stemming from a node B is connected to D, provided B ≠ D (Fig. 2.3.c).

Exception Rule: If B = D above, then B serves as a special I/O node, and its qualified link is considered an I/O channel of the (d,h-1)-subnet.

From the qualification and connection rules we see that if b_{i-1} is the least significant "zero" bit in the address of a node in a (d,h)-net, then its external link connects directly to a node belonging to a different (d,i)-subnet if i < h, and is unallocated if h ≤ i. Furthermore, exactly half of the available links qualify at each level.

As an example, consider the node B = 10111001 in a (3,3)-net (Fig. 2.3.a). This node belongs to the (3,2)-subnet identified by the first 3 bits, 101, of B. This node satisfies the qualification rule since the last h - 1 = 2 bits are 01. We connect this node to the node identified by D = 11010101, belonging to the (3,2)-subnet identified by 110.
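At the top level, the three rules amount to simple bit manipulation on node addresses. The sketch below (an added illustration with an invented helper name; the same rules are reapplied inside each (d,h-1)-subnet with that subnet's own parameters) reproduces the worked example:

def external_partner(B: int, h: int, n: dict):
    """Qualification, connection and exception rules at the top level
    of a cubelet-based (d,h)-net, G = 1.

    n[i] = log2(size of a (d,i)-net); an address has n[h] bits laid out
    as [m-bit subnet id][m-bit id within subnet][h-1 low bits], where
    2m + h - 1 = n[h].
    """
    nbits, m, low = n[h], n[h] - n[h - 1], h - 1
    # Qualification rule: the least significant h-1 bits must be 01..1.
    if B & ((1 << low) - 1) != (1 << (low - 1)) - 1:
        return None                     # this link belongs to another level
    top = (B >> (nbits - m)) & ((1 << m) - 1)
    nxt = (B >> low) & ((1 << m) - 1)
    # Connection rule: swap the two m-bit fields, keep the low bits.
    D = (nxt << (nbits - m)) | (top << low) | (B & ((1 << low) - 1))
    # Exception rule: a fixed point of the swap is the subnet's I/O node.
    return ('io', B) if D == B else ('link', D)

# The example from the text: node 10111001 of a (3,3)-net links to 11010101.
n = {1: 3, 2: 5, 3: 8}
kind, D = external_partner(0b10111001, 3, n)
print(kind, format(D, '08b'))           # -> link 11010101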
The above procedure is applied recursively to obtain a complete specification of the connectivity in the hypernet. We note that in each (d,h-1)-subnet there is exactly one node for which the most significant m bits are identical to the next m bits. This particular node is chosen as the special I/O node for that subnet, in accordance with the exception rule. For the (3,2)-subnet with address 110, this node has an address of 11001 within that subnet (see Fig. 2.3.b), since m = 3. If the subnet is a cubelet (i.e. h = 2), then this special node is nothing but the I/O node of that cubelet. For these nodes, b_0 = 0, and the bits b_{d-1}, .., b_1 are identical to the next most significant d-1 bits.

Exactly 2^{d-i} of the 2^d external links of a cubelet qualify at level i; they are used as λ_i's for connecting to different (d,i)-subnets in order to form a (d,i+1)-net, or as I/O channels. Thus the connectivity among nodes in a hypernet decreases at higher levels, unlike in a hypercube where the connectivity is constant for all dimensions. From a macro-level view, a hypernet looks like a completely connected graph, as in Fig. 2.3. The increasing sparseness in connectivity is apparent only when one zooms into each subnet to view its internal structure.

2.1.4 Hypernets built with Treelets and Buslets

Figure 2.4 shows the construction of a (3,3)-net from identical treelets such as the one shown in Fig. 2.2.c. Figure 2.5 illustrates how identical buslets such as the one in Fig. 2.2.d are linked to form a (3,2)-net. The topology of a (3,3)-net constructed from treelets or buslets looks the same as that of a cubelet-based hypernet at a macro level. The difference is revealed only upon examining the internal structure of each basic module.

[Figure 2.4: Construction of a (3,3)-net with 32 treelets. (a) Topology of the (3,3)-net; (b) the (3,2)-subnet with prefix 110; (c) address mapping, as in Fig. 2.3.c.]

[Figure 2.5: A (3,2)-net built using buslets]

A treelet-based (d,h)-net is built from treelets of depth d. Such a treelet has 2^d - 1 nodes. Two external links are allocated to the root, while all other nodes have one external link each, yielding a total of 2^d external links per treelet. Hypernets built from treelets have a degree of 4 for all d and h, and are candidates for synthesis using Transputer chips [Whi85]. For an inorder labeling of the tree nodes, the root is numbered 2^{d-1} - 1. A virtual node numbered 2^d - 1 is also mapped onto the root and allocated one of its two external links; the root thus acts as a pair of logical nodes. An interesting consequence of the inorder labeling of the nodes is that for nodes at depth d (i.e. the leaf nodes), b_0 = 0; for nodes at depth d-1, b_1 b_0 = 01; for nodes at depth d-2, b_2 b_1 b_0 = 011; and so on.
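The trailing-bit pattern of the inorder labels is easily verified. A small sketch (added for illustration) labels a depth-d complete binary tree in inorder and groups the labels by depth:

def inorder_labels_by_depth(d: int):
    """Inorder labels 0 .. 2**d - 2 of a complete binary tree whose root
    is at depth 1 and whose leaves are at depth d."""
    labels = {}

    def walk(depth: int, start: int, size: int):
        # The subtree rooted here owns the label range [start, start+size).
        if size == 0:
            return
        half = size // 2
        walk(depth + 1, start, half)                        # left subtree
        labels.setdefault(depth, []).append(start + half)   # this node
        walk(depth + 1, start + half + 1, half)             # right subtree

    walk(1, 0, 2 ** d - 1)
    return labels

# For d = 3: leaves end in 0, depth-2 nodes end in 01, and the root ends
# in 011; in general a node at depth d-k ends in 0 followed by k ones.
for depth, labs in sorted(inorder_labels_by_depth(3).items()):
    print(depth, [format(x, '03b') for x in labs])
# 1 ['011']
# 2 ['001', '101']
# 3 ['000', '010', '100', '110']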
In hypernets constructed with buslets, each node has only two links: an internal link connected to the shared bus inside the basic module, and an external link th at connects to a different buslet. This simplifies broadcasting of messages considerably. A broadcast message received via the external link of a node is immediately sent over its only internal link and vice versa. Similarly, a node receiving a message not m eant for it ju st transfers th a t message through the other link. If the bandw idth is adequate, the internal bus provides complete con­ nectivity for all the nodes within the basic module at the expense of a more involved comm unication protocol. The bus can additionally be used to access a locally shared memory, I/O controllers and/or secondary storage units. Thus bus-structured hypernets provide more flexibility and greater coupling between nodes within the basic module, and are most favored for extremely localized interprocessor communication and partial broadcasts. The connectivity within a building block is the sparsest for treelets. Connectivity among different build­ ing blocks is the same for all three families. Particular comm unication traffic patterns among the nodes of a hypernet may be tailored for one specific family. However, cubelet-based hypernets are expected to exhibit characteristics m id­ way between the other two families for general comm unication patterns. Keep­ ing this observation in m ind, we will restrict our study of hypernets to those built with cubelets in all subsequent sections, unless m entioned otherwise. 2.2 Message Routing in Hypernets The nodes in a hypernet communicate with one another by exchang­ ing messages among themselves. These messages m ight be complex enough to need. packers witching mechanism s, or they m ight be simple ones using marker 24 or value-passing mechanisms [FH83]. Modules w ith fine-grain nodes for pro­ cessing of semantic networks or low-level vision problems can use value-passing mechanisms, while applications, such as distributed problem solving [GBH87a] and high-level symbolic processing of images [MT85], th at require m edium or coarse-grain parallelism will need more complex message passing protocols. The nature of interprocessor message traffic depends not only on the network topology, allocation and scheduling policies and the types of program s being executed, but also on the message routing strategy employed. This strat­ egy determines the sequence of links traversed if a message is sent from a source node S to a given target node T. A good routing scheme should select the short­ est path from S to T, and also find alternative paths if a chosen p ath is not viable due to congestion or link/node failures. Furtherm ore, the source and destination addresses should be adequate to determine the routing path. We first present a distributed routing scheme for the hypernet, and then examine schemes for broadcasting messages. The effectiveness of the hypernet topology in providing efficient support for various interprocessor communication patterns is analyzed in C hapter 3. 2.2.1 A D istr ib u te d R o u tin g S tra te g y The distributed routing scheme for the hypernet yields a preferred p ath between any two nodes S and T. This scheme can be easily modified to provide alternate paths in the face of congestion or network faults. D e fin itio n 2.1 Two nodes A and B are i-eq u iv ale n t, denoted A =,• B , if they belong to the same (d,i)-su6net. 
A B if either i > h or A and B belong to two distinct (d,i) -subnets. D e fin itio n 2 .2 Node B is a n e ig h b o r of node A, denoted A — * B , if there is a direct link between A and B. From the bidirectional nature of the links, this also implies B — > A ._______________________________________________________ 25 The routing scheme follows the hierarchical nature of the network. F irst, the hierarchy level t is determined such th a t S =,- T and S % T . Since the (d,»-i)-subnets th at form a (d,t)-net are connected as a complete graph, we can find nodes C and E such th at S =,_x C , T = ,_ i E and C — * ■ E . Node S sends the message to node C which directly transm its it to node E. The message is then routed from E to its final destination T. Thus the problem of finding a p ath between two nodes w ithin a (d,i)-net is reduced to finding paths between nodes within a (d,i-J)-net. The decomposition is carried out recursively until all source-destination pairs are neighbors. Routing w ithin a basic m odule is determined by its structure. Note th at this scheme ensures th a t if S =,• T , only links at the ith level or lower are used for routing. We now illustrate this routing scheme for hypernets built out of cubelets. First, the m ost significant bit position where the addresses of S and T differ is used to determine the hierarchical level t such th a t S and T belong to the same (d,»)-net but different (d,t-l)-subnets. This happens if this bit position is be­ tween rii — 1 and ra,_i, where 2ni is the num ber of nodes in a (d,t)-subnet for * = 2 ,3 ,.., h. For simplicity, we denote n* as n. Since the (d,i-l)-subnets th at form a (d,i)-net are connected as a complete graph, we can find a node C which belongs to the {d,i-l)-subnet containing S and whose external link connects to a node E belonging to the (d,*-J)-subnet containing T. From the connection and qualification rules, we see th at the address of C = c„_i,..Ci,Co satisfies cn,_!-1 = *«,— 15.....; c*-i = *»,•_! and ct-_2, ~co = 0 ,1 ,1 , ..1. The recursive decom­ position of the routing p ath is carried out until all source-destination pairs are neighbors, following the general methodology. Routing within a d-cubelet is the same as th at for a d-cube and takes an average of d/2 steps. The routing strategy shows two im portant properties: Property 1: The path taken by a message traveling between any two nodes of a (d,ti)-net is confined within th at net, and traverses at most one link connecting two (d,t-i)-subnets. Property 2: If 5 and T belong to the same (d,i)-net, then the bits at positions rii — 1 through n — 1. rem ain unchanged along the routing path. They 26 can therefore be discarded. The destination address shrinks monotonically as the routing progresses. If message traffic is largely localized, the average size of the destination address will be much less than n bits long. The path taken by a message at each step can be determ ined by looking up a hierarchical routing table. We note th a t the address of C is the same for all nodes in the (d,i-l)-subnet containing S th a t axe sending a message to any node w ithin the -subnet containing T. Thus a hierarchical organization whose levels correspond to the levels of the hypernet results in substantial savings in table size [Kam76], 2 .2 .2 B ro a d ca sts an d P a r tia l B ro a d ca sts D e fin itio n 2.3 A message is c o m p le te ly b ro a d c a s t over a multicomputer network if every node receives exactly one copy of that message. 
2.2.2 Broadcasts and Partial Broadcasts

Definition 2.3 A message is completely broadcast over a multicomputer network if every node receives exactly one copy of that message.

A complete broadcasting scheme can completely broadcast messages originating from any node. The paths taken by copies of a broadcast message form a spanning broadcast tree with the source node as the root.

Definition 2.4 A message is partially broadcast over a multicomputer network if it is completely broadcast within a specified subnetwork, but is not sent to any node outside this subnet.

Broadcasting Method: Complete broadcasting can be achieved in a (d,h)-net using a vector P of h − 1 bits, p_{h−1}, .., p_1, attached to each message header. First we choose an appropriate method for broadcasting within a basic module. For cube-structured hypernets, the hypercube broadcasting schemes given in [HJ86] are suitable. This method is used for replicating and forwarding messages over internal links. The vector P determines whether a message should also be forwarded over an external link. At the originating node, P = 0. A node receiving the message forwards it over a λ_i link if and only if p_i = 0. In this case, the node sets p_i = 1 and resets p_{i−1}, .., p_1 = 0, .., 0.
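The header-vector discipline is compact enough to state directly in code. A minimal Python sketch (ours; the list encoding of P and the function names are assumptions):

    def should_forward(p, i):
        # a node forwards a copy over a lambda_i link iff p_i = 0
        return p[i] == 0

    def cross(p, i):
        # header after traversing a lambda_i: set p_i, clear p_1 .. p_{i-1}
        q = p[:]
        q[i] = 1
        for j in range(1, i):
            q[j] = 0
        return q

    def initial_p(h, confine_to=None):
        # all zeros for a complete broadcast; for a partial broadcast confined
        # to the (d,i)-net of the source, set p_i .. p_{h-1} to 1
        # (see Corollary 2.2.1 below)
        p = [0] * h                      # slot 0 unused, so p[j] holds p_j
        if confine_to is not None:
            for j in range(confine_to, h):
                p[j] = 1
        return p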
Lemma 2.1 Let A and B be two arbitrary nodes of a (d,h)-net. Then A ≡_i B and A ≢_{i−1} B for some 1 < i ≤ h. Any broadcast path from A to B traverses only one λ_{i−1} link. Furthermore, this λ_{i−1} is the unique link L that directly connects the two (d,i−1)-subnets to which A and B belong.

Proof: The existence and uniqueness of L follows from the construction methodology for G = 1. Suppose there are two λ_{i−1}'s on some path from A to B, as shown in Fig. 2.6. When the message is forwarded by node C to D, p_{i−1} is set to 1. Since no link higher than a λ_{i−1} is traversed from D to E, p_{i−1} is still 1 when the message is received by E. Therefore it will not be forwarded to F if EF is also a λ_{i−1}. Any path from A to B that does not traverse link L must visit some node in a (d,i−1)-subnet not containing A or B. Such a path will therefore traverse more than one λ_{i−1}, which is a contradiction. Therefore a valid broadcast path from A to B must traverse L.

Figure 2.6: Proof of Lemma 2.1

Theorem 2.1 The broadcasting method given above yields a complete broadcasting scheme.

Proof: (a) Reachability: Let S be the node that originates the message. First we prove by induction that a copy of the message is received by every node. Since we are using a standard broadcasting scheme for internal links, all nodes within the (d,1)-subnet containing S receive the message. Assume that all nodes within the (d,i)-subnet containing S receive the message. Since no λ_j's are traversed in reaching these nodes for i ≤ j ≤ h − 1, p_j = 0. So the message is sent over all λ_i links emanating from this (d,i)-subnet. From the construction methodology, the (d,i)-subnets of the (d,i+1)-net containing S are connected as a complete graph. Thus some node in each of these subnets will receive the message, and in the process become the originating node for broadcasting within that subnet. Therefore, all nodes of the (d,i+1)-net containing S will receive a copy of the message.

(b) Unique Paths: To show that no message is received more than once by any node, it suffices to show that the path traversed from S to another node is unique. Applying Lemma 2.1 to all pairs of nodes on a path such that these nodes are not within the same (d,1)-subnet, we see that all external links traversed along the path are uniquely determined. Thus, if multiple paths are traversed, then these paths must differ only within the same (d,1)-subnet while traversing internal links. This is not possible if a complete broadcasting scheme is used within each basic module. Thus the broadcast path from S to any other node is unique.

The broadcasting scheme can easily be extended to cater to partial broadcasts within a subnet at any level, as indicated by the following corollary:

Corollary 2.2.1 If p_{h−1}, .., p_i = 1, .., 1 and p_{i−1}, .., p_1 = 0, .., 0 at the originating node S, then a message is partially broadcast within the (d,i)-net containing S if the broadcast method given above is followed.

Proof: Since originally p_{h−1}, .., p_i = 1, .., 1, a link higher than a λ_{i−1} will never be traversed. Therefore, the message will be fully contained within the (d,i)-net containing S. From Theorem 2.1, we are guaranteed a complete broadcast within this subnet.

2.3 I/O and Fault Tolerance Issues

A distinct feature of hypernets is that they provide numerous I/O nodes with dedicated channels for direct access to the external world, as can be observed from Table 2.1 and Eq. 2.1.2. These nodes can link with secondary storage devices, a host computer or other multicomputer systems. Not only does every cubelet have a dedicated I/O node, but there is an additional node with a dedicated I/O channel for each (d,i)-subnet (2 ≤ i ≤ h) of a (d,h)-net. This is in contrast with most other networks, which do not explicitly reserve some links for I/O. The hypernet thus provides a high I/O bandwidth.

As the network size increases, the total number of I/O channels may grow towards unmanageable proportions. This problem can be circumvented by combining several such channels in a methodical way. For example, we can use concentrator switches to concentrate relatively few messages on many I/O channels onto fewer channels [MGN79]. This process is repeatedly applied till the number of direct channels to the external world is manageable. Time-division multiplexing can also be used. Alternatively, the I/O nodes of a hypernet can form the leaves of a set of fan-in/out trees. The roots of these trees will be accessed directly by the outside world. These trees would be similar to "fat-trees" [Lei85] in the sense that the capacity of the communication channel between a parent and child increases towards the root. However, the required capacity increase is not as much as that proposed by Leiserson. This is because all nodes of a multicomputer system are not expected to demand I/O operations at the same time if the system is executing a number of independent tasks in a distributed fashion. The process of grouping together the I/O channels and external links should follow the hierarchical structure of the network. Thus all links going out of a cubelet can be grouped into two sets: one linking to the external world and the other to nodes at higher levels of the hierarchy.

For purposes of arbitration, fault diagnosis/recovery and the regulation and routing of messages, a controller might be necessary for every cubelet of an actual system, as well as for larger subnets. The I/O nodes of a hypernet are a natural choice for doubling as controllers. If buslets are used as building blocks, the controller can be an extra node attached to the internal bus. Since a controller for a (d,i)-subnet is shared by all the nodes in that subnet, the overhead incurred per node will not be prohibitive. The controllers can carry a condensed image of the rest of the system.
Major changes in the location of data/processes/system resources need not be transmitted to every node in the system that is affected by such changes. Instead, they are conveyed to controllers at appropriate levels. The controllers use this extra knowledge to reroute messages and requests, reschedule jobs, and so on.

In the hypercube, any faulty link can be sidestepped by two extra hops [AG81]. The fault tolerance for internal link failures in a d-cubelet is the same as that of a d-cube if we do not use external links in the rerouting. What is the tolerance towards the failure of a λ_i? We first observe that (d,h−1)-subnets are connected as a complete graph to form a (d,h)-net. Thus a (d,h−1)-subnet can be isolated from the rest of the system only if all its λ_{h−1}'s are faulty. If only some of these links are faulty, then a node within a (d,h−1)-subnet which has lost its direct link to the (d,h−1)-subnet containing the destination node can still route to the destination in two stages: it sends the message to the nearest node that belongs to a third (d,h−1)-subnet whose direct link with the (d,h−1)-subnet containing the destination is still intact. This node then routes to the destination in the normal way. The information about a λ_i failure need only be kept with the two controllers which represent the two affected (d,i)-subnets. The average extra path which needs to be traversed by all routes affected by a single λ_i failure equals 1 + i/2, for i > 2. Thus, on the average, the extra path length incurred for the hypernet is about the same as that for a hypercube if the failed link is λ_1 or λ_2.
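The two-stage detour is equally simple to express. A minimal Python sketch (ours; the subnet numbering and names are illustrative), choosing where to forward a message whose direct top-level link has failed:

    def next_subnet(src, dst, failed, K):
        """Forward to dst directly if the lambda link is alive; otherwise pick
        any third subnet still connected to both src and dst (two stages).
        failed is a set of frozensets naming the broken subnet pairs."""
        if frozenset((src, dst)) not in failed:
            return dst
        for via in range(K):
            if via not in (src, dst) \
               and frozenset((src, via)) not in failed \
               and frozenset((via, dst)) not in failed:
                return via
        raise RuntimeError("subnet pair disconnected at this level")

With K = 8 subnets and the 0-3 link down, next_subnet(0, 3, {frozenset((0, 3))}, 8) returns 1, incurring the extra hops accounted for above.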
2.4 Partially Constructed Hypernets

Hypernets constructed out of identical d-cubelets have definite sizes, as determined by Eq. 2.1.2. One would like to have a wider range of sizes available, as well as the ability to increase the size of an already existing configuration. Suppose we want a network with 2^w nodes. It is possible that for some other choice of the cubelet size, we can obtain a (d,h)-net of the desired size. However, we may not want to change the size of the cubelets selected, because they are already in stock or in position in the current multicomputer configuration. So we choose the largest (d,j)-subnet that has fewer than 2^w nodes. These subnets are now interconnected with δ = 2^{n_{j+1}−w} direct links between any two (d,j)-subnets. This results in a network with stronger global connectivity. Finer growth of the network can be achieved by attaching smaller hypernets to the main system using the available links. The resulting structures will, however, be less symmetric, and require more involved schemes for the numbering of nodes, routing and fault tolerance. This is observed in many other topologies which have mathematically desirable features [BDQ86], but are yet to gain wide acceptance among designers of multiprocessor and multicomputer systems.

In the description of a hypernet, each level can be looked upon as a complete graph of black-boxes, where each black-box is a unit at the next lower level. If identical modules are used as basic modules, then each black-box at a given level has the same internal structure. However, the internal structure of a black-box does not affect the working of the rest of the system, so long as it provides the same interface to the external world. The internal structure of a unit at any level can therefore be altered to optimize for the class of tasks assigned to it. This does not necessitate modifications in either the hardware or the operating system and communication software of other units, so long as the interface constraints are met. Moreover, the resulting heterogeneous system can still be analyzed using the techniques developed for the general hypernet.

For example, some of the cubelets in a hypernet can change their internal form. This might affect the logical numbering of the nodes as well as the interconnection patterns within the transformed cubelet. As was suggested in Section 2.3, each basic module should have a controller to monitor its interface with the rest of the system. This controller needs to record the remapping of addresses in a table with entries of the form (old address, new address(es)). This table is kept in an associative memory, and accessed for modifying the destination address, if necessary, whenever a message is sent/received across the module interface. The rest of the system still sees a virtual cubelet, and requires no modification. In a similar way, a (d,i)-subnet can also be modified in a way transparent to the external world, and presented as a virtual subnet. Thus a hypernet can evolve into a process-structured, heterogeneous architecture [Uhr84] with units of possibly different sizes, each tailored to its own particular requirements.

Chapter 3

Computational and Communication Properties of Hypernets

We need a dream world in order to discover the features of the real world we think we inhabit.
— Paul Feyerabend, Against Method (1975)

3.1 Topological Properties

Among point-to-point topologies [Fen81], hypercubes and (augmented) binary trees have been particularly favored for efficiently supporting a large class of symbolic processing requirements often encountered in AI applications [HGC87]. The strong connectivity, regularity and symmetry of the hypercube make it a powerful candidate for general-purpose applications [Fox86]. Distributed routing and broadcasting can be implemented efficiently in a hypercube [HJ86]. However, the hypercube is not easily expandable, as the degree of each node increases with the size of the network. Thus both the hardware configuration and the communication software of each node have to be altered if a hypercube needs to be expanded at some future time. Also, the number of links used for interconnections is large. A tree, on the other hand, uses the fewest possible links to connect a set of nodes. But it is susceptible to faulty links and to message congestion towards the root. To alleviate these problems, several researchers have proposed alternative architectures based on trees augmented with additional links and/or nodes [DP78,GS81,HZ81].

The hypercube and complete binary tree (CBT) are complementary in the sense that a weakness in one of them is a strength in the other. Ideally, we would like to construct a network which incorporates the good features of both topologies. We therefore compare the performance of hypernets with hypercubes and binary trees of similar sizes in subsequent sections. Furthermore, we assume that:

• Message arrivals at different nodes are independent, and at each node the arrival times follow a Poisson distribution.
• The nodes have unlimited buffer space.
• Routing through each node is deterministic.
• Transmission error rates are negligible.

Low transmission error rates obviate the need for error checking and retransmission on a hop-by-hop basis.
Instead, an end-to-end protocol is deemed sufficient for multicomputer networks [RF87], thus resulting in a substantial reduction in routing delays.

3.1.1 Network Diameter and Link Complexity

The diameter of a network is the maximum number of links that need to be traversed to communicate between any two nodes. If we denote the diameter of a (d,h)-net by D(h), then, from the routing scheme given in Section 2.2, we see that the diameter is bounded above by 1 + 2D(h − 1). Solving this recursion, we obtain:

    D(h) ≤ 2^{h−1}(D(1) + 1) − 1        (3.1.1)

where D(1) is the diameter of the basic module. For a d-cubelet, D(1) = d, so that D(h) ≤ 2^{h−1}(d + 1) − 1. Figure 3.1 compares the diameters of (5,h)-hypernets with those of hypercubes, CBTs and toruses. The diameter of the torus increases as O(√N), while the other topologies exhibit an O(log N) increase in diameter with system size.

Figure 3.1: Diameters of various multicomputer networks

Definition 3.1 The link complexity of a multicomputer network is the average number of communication links per processor/memory node in the network.

A comparison of diameters should be accompanied by a comparison of link complexity, since a higher connectivity is expected to lead to smaller diameters. In a hypernet, each node has d+1 links, including I/O channels and external links. So hypernets form families with constant node degree. In a hypercube, however, each node has log₂N links, and the link complexity becomes significantly greater for large networks. The link complexity of a CBT serves as a reference, since it provides a lower bound for a connected network. Link complexities of several networks are shown in Fig. 3.2.

Figure 3.2: Link complexity of various multicomputer networks

3.1.2 Average Distances

Let d̄_i be the average distance between any two nodes within a (d,i)-net, measured in terms of the average number of hops between any two nodes. It follows from the routing method given in Section 2.2 that d̄_h ≤ 1 + 2d̄_{h−1}. A conservative estimate of the actual average distance is obtained if we assume equality to hold above. This yields:

    d̄_h = 2^{h−1}(d̄_1 + 1) − 1        (3.1.2)

For a d-cubelet, d̄_1 = d/2, so that:

    d̄_h = 2^{h−2}(d + 2) − 1        (3.1.3)

The average distance traveled by a message is a popular criterion in evaluating a multicomputer topology. For a meaningful comparison between networks with different numbers of output ports per node, some normalization is desired. After considering realistic limitations on the number of pins and the amount of power available to drive communication lines, it has been proposed that, in the context of single-chip computers, a constant bandwidth per chip be assumed [DP78]. If the total bandwidth available per node is fixed, then the bandwidth available per port is inversely proportional to the number of ports for that node. Therefore, we normalize the average distance d̄ by multiplying it by the number of ports per node [GS81,AJP86].

Figure 3.3: Average distances for various multicomputer topologies
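The normalized comparison of Fig. 3.3 can be reproduced from Eqs. 3.1.2-3.1.3 and the port counts above. The following Python sketch is ours, not the dissertation's: the address-width recursion n_{i+1} = n_i + 2^{i−1}(d−2) + 1 is inferred from the network sizes quoted elsewhere in the text (e.g. a (4,3)-net with 2^12 nodes and a (3,3)-net with 256), and the CBT line uses the common 2·log₂N − 6 approximation.

    from math import log2

    def n_bits(d, h):
        # address width of a (d,h)-net, i.e. log2 of its node count (inferred)
        n = d
        for i in range(1, h):
            n += 2 ** (i - 1) * (d - 2) + 1
        return n

    def norm_dist_hypernet(d, h):
        return (d + 1) * (2 ** (h - 2) * (d + 2) - 1)   # d+1 ports, Eq. 3.1.3

    def norm_dist_hypercube(n):
        return n * n / 2                                # n ports, average n/2

    def norm_dist_cbt(n):
        return 3 * (2 * n - 6)                          # 3 ports, approximate

    n = n_bits(5, 3)                                    # a (5,3)-net: 2^16 nodes
    print(n, norm_dist_hypernet(5, 3), norm_dist_hypercube(n), norm_dist_cbt(n))
    # -> 16 78 128.0 78: at about 64K nodes the hypernet already matches the CBT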
The simplified "constant bandwidth per chip" model is based on the assumption that pinout limitations constrain the chip design. The model is also shown to be adequate if power consumption is taken as the limiting factor [Fuj83]. The assumption of bandwidth being proportional to the number of pins ignores the effects of data skew in a parallel communication link, which forces the receiver to wait for all arriving bits to reach a stable value before clocking in the data. However, by letting the bit lines operate autonomously, this problem can be circumvented.

Figure 3.3 shows a plot of the normalized average distance against the network size for hypernets, hypercubes and CBTs. For networks with fewer than a thousand nodes, all three topologies have similar characteristics. However, for larger networks, the hypernet remains comparable with the CBT, while the performance of the hypercube deteriorates rapidly. If normalization is not done, the hypernet has an average distance of less than twice that of the hypercube and less than half that of the CBT for d > 2.

3.2 Average Path Delays

In practice, the cost of sending a message across a link is a function of the physical length of that link. This cost is particularly significant in two situations: (i) In VLSI, longer wires take up more area in silicon, need more driving power to transmit a signal, increase propagation time, are more difficult to route, and adversely affect reliability and chip yield. (ii) In a network containing tens of thousands of nodes, the physical distance between two arbitrarily chosen nodes can vary by two or more orders of magnitude. For example, if the methodology for laying out a hypernet given in Section 2.1 is followed, then we expect that a λ_i will be more costly and will require more path delay on the average than a λ_j, i > j. This is due not only to a progressive increase in the physical length of links, but also to a possible qualitative change in the nature of these links. For example, links within a 3-cubelet may be confined within a chip, while links connecting these cubelets into a (3,2)-net might run across a PC board, and the links interconnecting different (3,2)-nets involve off-the-board communication. Therefore, the traditional measures of communication costs based on an equal penalty incurred for every link traversal need to be revised in order to achieve a better approximation to the actual costs.

To show the effect of taking physical distances into consideration in the performance analysis of the hypernet, let l_i be the average path delay incurred in traversing a λ_i. Let us assume that l_i is related to the delay in a λ_{i−1} link as:

    l_i = W·l_{i−1}        (1 ≤ i < h)        (3.2.4)

Internal links (λ_0's) have unit delays. W measures the impact of increasing link lengths on the path delay. We constrain W to be greater than one, so that links at higher levels show more average delay than lower-level links. A high value of W signifies that a small increase in the length of a link will result in a significant increase in the time taken to traverse that link.
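Since a route at level i consists of two level-(i−1) routes joined by one cross-subnet link of delay W^{i−1}, the average path delay obeys the recursion PD_i = 2·PD_{i−1} + W^{i−1}, with PD_1 = d/2 inside a cubelet. A minimal Python check (ours, for illustration) of this recursion against the closed forms given below as Eq. 3.2.5 (before normalization by the d+1 ports):

    def path_delay(d, h, W):
        # unroll the recursion PD_i = 2*PD_{i-1} + W^(i-1), PD_1 = d/2
        return d / 2 if h == 1 else 2 * path_delay(d, h - 1, W) + W ** (h - 1)

    def closed_form(d, h, W):
        if W == 2:
            return 2 ** (h - 2) * (d + 2 * (h - 1))
        return 2 ** (h - 2) * d + W * (W ** (h - 1) - 2 ** (h - 1)) / (W - 2)

    assert path_delay(4, 3, 2) == closed_form(4, 3, 2) == 16
    assert abs(path_delay(4, 3, 3) - closed_form(4, 3, 3)) < 1e-9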
The following expressions are obtained for the average path delay (PD) (after normalization) in a (d,h)-net:

    PD_hypernet = (d + 1)[2^{h−2}·d + W(W^{h−1} − 2^{h−1})/(W − 2)]    for W ≠ 2
    PD_hypernet = (d + 1)[2^{h−2}(d + 2(h − 1))]                       for W = 2        (3.2.5)

For an n-cube with a weight of W^i on links between the dimensions n_i and n_{i+1}, we obtain:

    PD_hypercube = (n/2)[(d − 2)·W((2W)^{h−1} − 1)/(2W − 1) + W(W^{h−1} − 1)/(W − 1) + d]        (3.2.6)

The average path delays in a network depend on the layout methodology used [HZ81,Lei83]. For a CBT with N = 2^n − 1 nodes, and using the layout scheme given by Leiserson, the average path delay can be approximated by:

    PD_CBT = 6[(d − 2)·W((2W)^{h−1} − 1)/(2W − 1) + W(W^{h−1} − 1)/(W − 1) + d − 2 − W^{h−1}]        (3.2.7)

Figure 3.4 plots these average path delays as a function of the network size for W = 2. Figures 3.5.a and 3.5.b show the path delays for (4,h)-nets, taken as representative examples.

Figure 3.4: Comparison of average path delay for various multicomputer topologies (W = 2)

Figure 3.5: Average path delays in (4,h)-nets

For this performance metric, the hypernet is clearly superior to the other two topologies. The improvement is more marked when the level h becomes large, or when a larger weight W is chosen. This is not surprising, because the routing algorithm for the hypernet ensures that longer links are traversed less frequently than shorter ones. In fact, the algorithm is proven optimal as W → ∞.

3.3 Natural Communication Patterns

The previous analyses were all based on the assumption of uniform traffic, where the probability of a message being sent from any node to any other is the same for all pairs of nodes. This is not expected to be the case in most practical situations in a large multicomputer environment. In many distributed AI systems, a process is taken as the basic unit of computation. This results in a coarse granularity of parallel computation [KL84]. A process which is spawned by another process usually runs on the same computational node as its parent, or on a nearby node [GBH87b]. Static and dynamic allocation techniques, as well as distributed load balancing schemes, try to assign related processes to nearby nodes. Communicating processes that are being executed on distant nodes may even migrate towards one another [Hil85]. All these attempts at minimizing interprocessor traffic lead to message traffic patterns which are much more localized than a uniform distribution. Consequently, the probability of two nodes communicating at any given time shows an overall downward trend with increasing distance between the nodes.

We now look at two models of internode communication. These models are used to evaluate the performance of large networks under localized traffic. Let Prob.(S,T) = p if the probability that a message originating at a node S has node T as its final destination is p. Suppose that Prob.(S,T) depends only on the distance x between S and T. This implies that all nodes at the same distance from a source have equal chances of receiving a message from that source.
If this message distribution pattern is the same for all sources, then we can simply use Prob.(x) to denote the probability that a message from some source is meant for a particular destination at a distance of x from that source. Prob.(x) can be used to specify the internode message traffic in a multicomputer network. If this traffic is globally uniform, then Prob.(x) = 1/N for all x. For more localized traffic, Prob.(x) will decrease overall as x increases. Little work has been done in characterizing the locality of messages. However, the following models have been suggested:

Threshold Model [RS83]: With every node we associate a "sphere of locality" consisting of all nodes within a distance t (the threshold) of that node. A fraction α of all message destinations are uniformly distributed within the local region of the source. The remaining destination addresses are uniformly distributed over the entire network. Figure 3.6.a illustrates this model using the Prob.(x) function described above. If α = 0, we obtain the uniform global distribution, while α = 1 means that messages are never addressed to a node at a distance greater than t from the source. For the same α, a lesser value of t corresponds to a more local distribution pattern.

Geometric Distribution [Lan82]: For every source S, the nodes of the multicomputer system are divided into regions R₁, R₂, .. of increasing distance from S. A fraction β of all messages are destined for region R₁ of S, β of the remaining messages go to region R₂, and so on. Within each region, the distribution is uniform. If each region contains only one element, then this model corresponds to Prob.(x) decaying geometrically with a decay constant ln β. Figure 3.6.b shows a local uniform distribution superimposed on a geometric distribution.

Figure 3.6: Models for specifying interprocessor traffic patterns
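Both models are easy to sample from, which is convenient for simulation. A minimal Python sketch (ours; dist_hist, the histogram of node counts by distance, is an assumed input):

    import random

    def threshold_dest_distance(dist_hist, alpha, t):
        # threshold model: with probability alpha choose uniformly among the
        # nodes within distance t, otherwise uniformly over the whole network
        local = {x: c for x, c in dist_hist.items() if x <= t}
        pool = local if random.random() < alpha else dist_hist
        xs, ws = zip(*pool.items())
        return random.choices(xs, weights=ws)[0]

    def geometric_region(beta, n_regions):
        # geometric model: region R_1 with probability beta, R_2 with beta of
        # the remainder, and so on; the tail is lumped into the last region
        for r in range(1, n_regions):
            if random.random() < beta:
                return r
        return n_regions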
Definition 3.2 A message distribution pattern is a natural communication pattern for a given network topology and its associated routing scheme if it causes each link of the network to be utilized in direct proportion to its capacity.

The capacity of a link is the maximum rate at which data can be reliably transmitted over it. In particular, if all links have the same capacity, a natural pattern results in each link having the same average load. For a tree, the natural communication pattern is extremely localized: each node communicates only with its immediate neighbors, at a fixed rate per neighbor. Due to the complete symmetry of the hypercube, many distribution patterns turn out to be natural for the hypercube. If the choice among all links which reduce the path distance to the destination is random, then any isotropic distribution will be natural. This includes the two distributions shown in Fig. 3.6 for all values of α and β, if unweighted distances are considered.

We now determine the natural communication pattern for the hypernet. From the recurrence relation for the distance given in Section 3.1.2, we see that for every λ_i that is traversed, two λ_{i−1}'s are traversed on the average. Also, for a (d,h)-net, the number of λ_i's is 2^{n−i} for 1 ≤ i < h. So, for every λ_i, there are two λ_{i−1}'s for i > 1. This means that if a message is sent between two nodes belonging to distinct (d,h−1)-subnets, the probability of a link being used is the same for all external links, and is equal to 1/2^{n−h+1}. However, links within a cubelet will be only half as loaded as links among larger subnets. This leads to the following properties:

(i) A globally uniform message distribution is natural for the hypernet if the capacity of a link within the cubelet is half that of any external link.

(ii) If a source node sends 2^{h−1} messages to nodes within its cubelet for every message sent to some arbitrarily chosen external node, then every link in the hypernet is used with equal probability.

Suppose the local region t of a node consists of all nodes which share the same cubelet, and all links have the same capacity. Then property (ii) means that the natural distribution for a hypernet is given by the threshold model with α = 2^{h−1}/(2^{h−1} + 1). Such a distribution is reasonable in several situations. For example, let the local memory of a processor be distributed uniformly within its cubelet, and the global memory be stored uniformly over the entire network. Then the threshold model, with t defined as above, will capture the expected processor-memory communication pattern for a shared memory system with favorite or home memories.

3.4 Performance under Localized Communication

Figure 3.7 shows the average distances for the hypernet, hypercube and CBT under localized message traffic specified by the threshold model with α = 0.9 and t = 16. For purposes of comparison, an n-dimensional hypercube is divided into local regions of size t = 2^τ, defined by the most significant n − τ address bits. For the CBT, the local region of a node is taken as the t nodes of the building block containing that node. This leads to the following expressions for the localized average distance LAD:

    LAD_hypercube = ατ/2 + (1 − α)n/2        (3.4.8)

    LAD_CBT ≈ α(2τ − 6) + (1 − α)(2n − 6)        (3.4.9)

where correction terms that vanish exponentially with τ and n have been dropped. If the local region is chosen as the cubelet to which the source belongs, we get:

    LAD_hypernet = αd/2 + (1 − α)(2^{h−2}(d + 2) − 1)        (3.4.10)

For other choices of t, the solution is not obtained in a simple closed form. From Fig. 3.7, we observe that the hypernet is superior to the CBT before normalization, and to the hypercube after normalization. The links in a hypercube or hypernet are more evenly loaded than links in a CBT. If the capacity of a link is fixed, then the maximum volume of traffic the CBT can support turns out to be much less than that for the hypercube or the hypernet.

Figure 3.7: Average distances under localized traffic (Threshold Model, α = 0.9, t = 16)

The above observations have been made for various values of α, t and d, and compared with the results for a globally uniform distribution (α = 0). The hypernet shows a relative improvement in performance over the hypercube. There is virtually no difference, however, between the performance of the hypernet and the CBT so long as t is confined within the cubelet. If t is larger, then there is a little degradation in the hypernet's performance, since the locality of message distribution does not nicely fit the locality inherent in the architecture.
For example, in Fig. 3.7 the threshold t was 16, so that the region of locality exactly fitted a 4-cubelet, was contained within a 5-cubelet, but spilled over for 3-cubelets. This choice of t leads to a relatively poor performance for the (3,h)-nets. However, we see that this crossover phenomenon is nowhere near as severe as in bus-based hierarchical systems [GJS82].

3.5 Concluding Remarks

We have seen that hypernets form families of networks with constant degree and O(log N) diameter. Several other networks with constant degree and O(log N) diameter have been proposed by researchers. The most notable of them are the cube-connected cycles [PV81], the shuffle-exchange graph [Sto71], and the DeBruijn networks [SP85]. Indeed, these three networks have better topological properties than hypernets. The attractiveness of hypernets emerges when we also consider architectural issues such as VLSI implementation and I/O capabilities. In particular, modular expansion in hypernets involves using the spare links only. No reconfiguration of the other links is required. This property is not observed in the other three topologies mentioned above. For example, we cannot form a larger cube-connected cycle from two smaller ones by the mere addition of extra links. The same holds for the DeBruijn and shuffle-exchange graphs.

The high performance of the hypernet under localized message patterns is not surprising. We have shown that the natural message distribution for the hypernet approximately follows the threshold model for equal-capacity links, and the geometric distribution model if longer links have lesser capacities. The suitability of a hypernet for localized communications can be intuitively seen from its topology. Nodes within the same cubelet are most tightly linked, while the number of higher-order links available falls as we go up the hierarchy. Thus more connectivity is provided among neighboring nodes than among distant ones.

Instead of assuming all links to have the same capacity, we can model the capacity of a link as inversely proportional to its delay, so that longer links support a lesser amount of traffic. The natural message distribution pattern would then be even more localized. For the hypernet, the natural distribution in this case turns out to be close to a geometric distribution with β = 1 − 1/(2W), superimposed on a uniform local distribution.

Chapter 4

Parallel Processing on Hypernets

These are the days of miracle and wonder
This is the long distance call
— Paul Simon, Boy in the Bubble (1986)

4.1 Introduction

Communication patterns among processors of a multicomputer system are greatly influenced by the mapping of algorithms onto them [Bok81,Mol83]. The mapping policy incorporates the choice of grain size, communication primitives and synchronization. Two-dimensional meshes are often adequate for supporting many algorithms for low-level vision, where only nearest-neighbor communications are needed. On the other hand, consider the branch and bound technique for combinatorial search problems [WLY85]. Whenever a processor finds a solution better than the current "best solution", it needs to transmit this solution to all other processors so that they know when to "bound". Similarly, end-of-computation signals (success/failure) need to be broadcast globally. This can be facilitated by having a global bus or a shared data register.
An alternative is to have global messages passed from one processor to its neighboring ones in a distributed manner [MT85].

The first step in mapping a parallel algorithm onto a multicomputer is to choose an appropriate system size for executing that algorithm. For example, to find the minimum of 2^13 numbers given a (4,3)-net with 2^12 nodes, one can either use the entire network by initially allocating 2 numbers per node, or use a (4,2)-subnet with 64 numbers per node, or any eight (4,2)-subnets with 8 numbers per node, and so on. The hypernet provides a range of natural subsystem sizes to choose from. The choice may depend on the expected size of the problem, the degree of parallelism and granularity of execution desired, and the existing load on the network. When data-parallel algorithms [HS86] are assigned to a subnet for execution in SIMD mode, the special I/O node for that subnet can efficiently serve as a controller for broadcasting commands and gathering results, as it has a direct interface to the external world. Such subsystems can operate independently, so that the hypernet as a whole runs as a multiple-SIMD machine. Data-dependent problems, on the other hand, may use a variable subsystem size as they grow and shrink during execution, and require more interaction among subsystems.

4.2 Mapping of Data-Parallel Algorithms

Data-parallel algorithms have been proposed as the most suitable class of algorithms for fine-grained parallel computers. The key feature of these algorithms is that

    ... their parallelism comes from simultaneous operations across large sets of data, rather than from multiple threads of control [HS86].

Most of these algorithms have been proposed for execution on the Connection Machine in an SIMD fashion. However, this programming style is not synonymous with the hardware design issue of MIMD versus SIMD computers. Indeed, MIMD computers can be well suited to the execution of such programs, particularly if duplication of code drastically reduces synchronization costs.

We illustrate the execution of data-parallel algorithms on the hypernet through algorithms for finding the minimum (MIN) and performing the fast Fourier transform (FFT) over N sample points. The first is based on viewing hypernets as hierarchical networks, while the second illustrates how hypernets may be used to simulate hypercube connections. In what follows, we assume that the numbers are initially stored in memory locations A(k), 0 ≤ k ≤ N − 1, where A(k) is located in the kth processor/memory node. Moreover, the instruction A(k) ← A(λ_1(k)) results in the transfer of data to processor k from a memory location of the processor which is directly connected to the kth processor by a λ_1. This particular instruction is ignored by processors which do not have a λ_1.

4.2.1 Hypernets as Hierarchical Structures

The following MIN algorithm calculates the minimum of N = 2^{2d−1} elements for a (d,2)-net using the recursive doubling technique [GR84]:

    line   procedure MIN(d)
    1        for j := 1 to d do
    2          for all k in parallel do
    3            if k mod 2^j = 0 then
    4              A(k) ← min{A(k), A(k + 2^{j−1})}
    5            endif
    6          enddo
    7        enddo
    8        for all k in cubelet 0 in parallel do
    9          A(k) ← A(λ_1(k))
    10       enddo
    11-17    (repeat lines 1-7 for all k in cubelet 0 only)
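A sequential Python emulation (ours, not the dissertation's) of lines 1-7 makes the recursive-doubling step easy to check; the parallel for-all simply becomes a loop over k:

    import random

    def min_phase(A, d):
        # after step j, A[k] with k mod 2^j = 0 holds the minimum of the
        # 2^j consecutive elements starting at k (lines 1-7 of MIN)
        for j in range(1, d + 1):
            for k in range(0, len(A), 2 ** j):
                if k + 2 ** (j - 1) < len(A):
                    A[k] = min(A[k], A[k + 2 ** (j - 1)])
        return A[0]

    data = [random.randrange(1000) for _ in range(8)]   # one 3-cubelet
    assert min_phase(list(data), 3) == min(data)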
Figure 4.1 shows the data movements when the MIN algorithm is executed on a (3,2)-net. Figures 4.1.a-4.1.c highlight the links used for the data transfer step (line 4) for j = 1, 2 and 3 respectively. The processors which potentially contain the minimum element after the completion of that step are darkened for emphasis. In lines 1-7, each cubelet independently computes the minimum of all the numbers that it contains in parallel. The "winners" are then transferred to cubelet 0 in one step (lines 8-10). This is the only step which involves communication over a λ_1, as can be seen from Fig. 4.1.d. This enables the search for the minimum of the remaining candidates (lines 11-17) to be confined to a single cubelet, leaving the rest free for other computations. Therefore we have shown the data transfer step for this part of the computation (line 14) only for cubelet 0 (Figs. 4.1.e-4.1.g).

Figure 4.1: Data movements during execution of the MIN algorithm

The MIN function is representative of binary commutative and associative operations such as MAX, ADD and COUNT, all of which can be implemented on hypernets in O(log N) time. Such algorithms have a natural mapping on hierarchical networks such as trees. The hypernet can be viewed as a two-tiered structure while executing the MIN algorithm. As shown in Fig. 4.2, all cubelets form the ground level, and the nodes of cubelet 0 with b_0 = 0 duplicate as logically higher-level nodes. In general, a (d,h)-net can emulate a hierarchical structure with h logical levels. The entire network serves as the ground level. The next level is logically represented by nodes of any one (d,h−1)-subnet. This process is carried out recursively till we reach the highest level, which is mapped onto nodes belonging to a single cubelet. If pipelining is desired, however, then we cannot allow a node to duplicate as logical nodes at various levels. This problem can be avoided if we reserve a subnet at each level for emulating higher-level nodes only.

Figure 4.2: Viewing a (3,2)-net as a two-tiered structure

4.2.2 Hypernets as Virtual Hypercubes

The "decimation-in-frequency" FFT algorithm [HB84] for an N = 2^{2n} point sequence is mapped below for implementation on a subnet of an (n+1,2)-net formed by those nodes with b_0 = 0, i.e. only even-numbered nodes are used. Initially, the ith number is stored in A(2i). Index(j,k) returns the appropriate power to which z = e^{2πi/N} has to be raised during the various computations. The data movements for each step are shown in Fig. 4.3, where active links are darkened. The following FFT algorithm computes the FFT of N = 2^{2n} points:

    line   procedure FFT(n)
    1        for all k in parallel do A(k) ← A(λ_1(k)) enddo
    2        for j := n to 1 by -1 do
    3          for all k in parallel do
    4            if (b_j of k = 0) then A(k) ← A(k) + A(k + 2^j)
    5            else A(k) ← [A(k − 2^j) − A(k)]·z^{Index(j,k)}
    6            endif
    7          enddo
    8        enddo
    9-16     (same as lines 1-8)
    17       for all k in parallel do A(k) ← A(λ_1(k)) enddo
    18       Relabel (cubelet addresses)

Figure 4.3: Data movements during execution of the FFT algorithm

This algorithm executes in O(log N) time. If X is the vector sequence representing the FFT of the input sequence, then at the end of line 15, A(2k) = X(reverse(k)), i.e. the outputs are available in bit-reversed order. This can be rectified in one more data-exchange step (line 17), as shown in Fig. 4.3.g.
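The butterfly in lines 4-5 is the standard decimation-in-frequency step, and its output order can be checked with a short sequential emulation. The following Python sketch is ours and purely illustrative; bit reversal at the end plays the role of lines 17-18:

    import cmath

    def dif_fft(a):
        # in-place decimation-in-frequency FFT; the results come out in
        # bit-reversed order, mirroring lines 2-8 (and 9-16) above
        N = len(a)
        span = N // 2
        while span >= 1:
            for start in range(0, N, 2 * span):
                for k in range(start, start + span):
                    u, v = a[k], a[k + span]
                    w = cmath.exp(-2j * cmath.pi * (k - start) / (2 * span))
                    a[k], a[k + span] = u + v, (u - v) * w
            span //= 2
        return a

    def bit_reverse(a):
        n = len(a).bit_length() - 1
        return [a[int(format(i, '0%db' % n)[::-1], 2)] for i in range(len(a))]

    xs = [complex(i) for i in range(8)]
    got = bit_reverse(dif_fft(xs[:]))
    ref = [sum(x * cmath.exp(-2j * cmath.pi * i * k / 8)
               for i, x in enumerate(xs)) for k in range(8)]
    assert all(abs(g - r) < 1e-9 for g, r in zip(got, ref))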
The reader can verify that if the cubelets are relabeled with their addresses in reverse, then the address of an internal node with the same position relative to the cubelet's I/O node is also bit-reversed. So by simply considering the I/O channel of cubelet 01 to be that of cubelet 10, and vice versa, while extracting the output values, one obtains the values in proper sequence.

When compared with a straightforward implementation of the decimation-in-frequency FFT algorithm on a 2n-cube, one can see that the above procedure is based on the emulation of boolean 2n-cube connections on the hypernet. The data is initially loaded onto 2^n n-cubes. Step 1 does a parallel exchange of data such that numbers which would have been adjacent in some dimension between n and 2n of a 2n-cube now become physically adjacent within an n-cube. In lines 2-8, data is exchanged between data which are logically adjacent in some dimension between n+1 and 2n. Thus we see that a kth-dimension link serves as a virtual (k+n)th-dimension link in the first part of the algorithm. After the exchange in step 9, physical adjacency is restored for logical links between 1 and n.

FFT is a transitive function [GR84], since every output is a nontrivial function of every input. One would expect such functions to generate globally uniform traffic. However, we are able to localize most of the communications on the hypernet by a few extra data-exchange steps (steps 1 and 9 in the given algorithm). By paying this overhead, we achieve a better match between interprocessor traffic patterns and the capabilities of the underlying network.

4.2.3 Speedup Performance Analysis

In the previous section, algorithms were presented for hypernets of fixed sizes. The MIN algorithm is based on the recursive-doubling technique, where the problem size is halved after every computational step. For a problem size of 2^k, an algorithm is in the Descend class if it performs a sequence of basic operations on pairs of data that are successively 2^{k−1}, 2^{k−2}, ..., 2^0 = 1 (logical) locations apart [PV81]. The FFT algorithm presented above is based on this class of algorithms.

A dual class of Descend is Ascend, where the operations are performed on data pairs that are successively 1, 2, .., 2^{k−1} locations apart. The FFT algorithm can be transformed into an Ascend algorithm if the input sequence is permuted according to the bit-reversal permutation. Many fundamental parallel algorithms, such as bitonic merging, sorting, convolution, matrix multiplication and cyclic shift, are either instances of these classes or simple combinations of such instances. Further details and examples are given in [PV81,DNS81].

We now analyze the mapping of these three classes of parallel algorithms onto hypernets of arbitrary sizes. We assume that initially one data element is assigned to each computational node. Thus the size of the problem is equal to the network size. The lower bound on the number of communication steps required is log(N) for each of these three algorithmic classes. This bound
The process is continued until all rem aining candidates converge to a single cubelet. The num ber of steps R(d,h) required for a (d,h) net w ith one element per node initially, are determ ined by the recursion R (d, 2) = 2d ; R {d, h ) = 2R (d, h - l ) - h + 3 for h> 2 (4.2.1) From this and Eq. 2.1.2, the algorithm s run in 0(log N) tim e. The specification and execution of an Ascend algorithm on a (d,h)~net is done recursively. Each (d,h-l)-subnet independently executes the algorithm for a [d,h-l)-net. This results in the completion of operations on d ata pairs th at are up to 2nh_1-1 distances apart. The d ata now m igrate to the nearest Aj,_iS and traverse them . This transfers data, which originally differed in only one dimension between n and n*_i, to the same (d,h-l)-subnet, where the algorithm is applied again. The num ber of d ata communication steps involved, ASC(d,h), is given by: A S C {d , 1) = d ; A S C (d ,i) = 2A S C {d ,i - 1) + 1, for i > 1 (4.2.2) To execute a problem in the Descend class, d ata is first transferred over Aa_is to bring pairs differing only in a fc-dimensional link (ra*-i < k < n^) w ithin the same (d,h-l)~subnet. The algorithm for a {d,h-l}-net is then run concurrently on all such nets. Finally d ata is transferred back to the original (d,h-l)-subnets and the algorithm applied again. We obtain: D E S {d , 1) = d ; D E S {d , i) = 2 D E S {d , i - 1) + 3, for i > 1. (4.2.3) As seen from Eq. 2.1.2,4.2.1-4.2.3, the d ata com m unication complexity is close to log N - the lower bound - for all the three classes of algorithm s. The extra steps are due to traversals over Ais or higher level links, which are needed 60 so th a t d ata pairs th a t are logically far apart are m ade physically adjacent. Thus a tradeoff is m ade between locality of com m unication, link complexity and total com putation tim e. To have a single figure of m erit, we define the communication cost for a given problem and network topology as follows: D e fin itio n 4 .1 The c o m m u n ic a tio n c o st of a problem on a given multicom­ puter network is the product of the link complexity of the network and the number o f data communication steps involved in solving the problem using the multicom­ puter network. The hypercube provides direct connections for the comm unication de­ m ands of algorithm s in the Ascend/Descend classes, and so is naturally suited to them . However, it has a high link complexity. For both hypernets and hy­ percubes, only links along a single “dimension” are active at every step. Since hypercubes have higher degree nodes, a greater fraction of the links are idle at any m oment. In Fig. 4.4, we compare the comm unication costs of hypernets of various sizes w ith th a t of the hypercube for the three class of algorithm s dis­ cussed above. We see th a t hypernets are superior for large network sizes. They also provide a wider selection in the trade-off point between link complexity and processing tim e. For the same network size, a hypernet based on smaller sized cubelets has fewer links, b u t takes more num ber of steps to implement any al­ gorithm in the abovementioned classes, as com pared to a hypernet constructed w ith larger cubelets. For com putations which can be fully pipelined on the hypercube [HJ86], a (d,h)-net is about h times slower. This is because a physical link of the hyper­ net has to em ulate links in up to h different dimensions, requiring an A-way mul­ tiplexing of data. 
For such com putations, the com m unication cost of hypernets become more than th a t of hypercubes of the same size. By slight modification of the algorithm s however, the speedup degradation can be reduced to within a factor of two, so th a t communication costs rem ain com parable. If we analyze the m apping of Ascend, Descend as well as recursive doubling algorithm s given previously, we see th a t each category involves log N 61 700 600 -- 500-- ^ 400 - 300 I 200 100 - n = log (# of network nodes ) — > Figure 4.4: Communication costs for algorithm s of the Ascend, Descend and Recursive Doubling Classes 6 2 200' 15CL_ t io a _ -p r—1 'O n = log_ (# of network nodes ) Figure 4.5: Average path delays in hypernets for three algorithm ic classes 63 Table 4.1: C o m p le x ity o f A lg o rith m s o n V a rio u s M u ltic o m p u te r T op o lo g ies Algorithm Complexity Hypernets Hypercube Binary Tree 2-D Mesh Vector Sum 0(log N ) O(logiV) O (log A T ) 0(AT1/2) Bitonic Merge O(log N ) O (log A T ) O(N) o (at1/2) Convolution O (A/2 + log A T ) 0 ( M 2 + logiV) O(JV’ ) 0 ( M J) M atrix M ultiply O(N) O(N) 0(JV3) O(N) Bitonic Sort O(logz N) 0 (lo g z N) O(N) OfAT1 /2) comm unication steps which use only internal links. Higher level links are used in the hypernet only for shifting comm unicating data/processes to w ithin the same cubelet. In contrast, each dimension of a hypercube is traversed once when these three classes of algorithm s are implem ented on it. Fig. 4.5 plots the average p ath delay (W = 2) for hypernets of various sizes. T he curves are based on the analysis above and in Section 3.1.1. Again, hypernets have an edge over hypercubes as well as cube-connected-cycles for large network sizes. Table 4.1 lists the complexity of several algorithm s implemented on hypernets of arbitrary sizes. M ost of these results follow from the hypernet’s capability to em ulate hypercubes. The analyses assume the num ber of proces­ sors available to be equal to the problem size. Convolution is perform ed using an MxM window on NxN sample points, and assumes O(M) space/processor. The m atrix m ultiplication result is based on the algorithm in [DNS81] for an NxN m atrix using N 2 PEs. Specific substructures of hypernets can be more efficient for certain problems. For example, matrix transpose can be achieved on the structure de­ fined for implementing the F F T algorithm in Section 4.2.2 in one step. D ata stored in row m ajor order are transferred to the column m ajor positions di­ rectly over the A,s. Due to the hierarchical nature of hypernets, they are also deemed. suitable.for_efficient.support of decomposable.a.nd_almost_decomposable_ 64 problems [SM86], block algorithm s for tridiagonal system s [KLK84], finite ele­ m ent analysis based on substructuring [AV84], m ultigrid algorithm s [CS86], and Al-oriented processing [HGC87]. We now focus on mem ory based reasoning [SW86,Wal87,SK86] as an application of highly parallel processing using hypernets. The algorithm s for the various steps involved in this form of reasoning are either instances of the three algorithm ic classes m entioned above or simple com binations of such instances. 4 .2 .4 M em o ry B a se d R ea so n in g At present, m ost expert systems are rule based or logic based. An alternative approach to machine reasoning is to use similarity-based induction, which is a form of memory based reasoning. Here, decisions are based on a direct inspection of the memory w ithout using rules as interm ediate structures. 
The memory is searched for past situations that are similar to the current one, and the response to the current situation is determined from the successful responses that had been made under similar circumstances.

For example, suppose our goal is to diagnose the illness of a new patient, given a large database of medical histories. We use the known characteristics of the patient, such as symptoms, age, sex and medical record, as predictors to search for other patients with similar backgrounds. The diagnoses for the selected patients are weighted by their closeness to the current case, and a majority count is taken to decide the most probable illness for the new patient. Since searching the database for the best match with the current situation is highly memory intensive, von Neumann machines are not suitable for supporting memory-based reasoning. A massively parallel fine-grain system with distributed memory, on the other hand, can perform concurrent manipulation of records to yield results in an acceptable time.

Let O(N) records or cases be distributed uniformly among N processor/memory nodes. Memory is accessed for four basic operations. First, a count is made of the number of occurrences of each predictor-value/goal-value pair. This is the most involved step and takes O(log² N) time, as detailed later. These counts are used to produce a joint metric, which is then used to calculate the dissimilarity between each case in the memory and the current one. These two steps can be performed on all records in parallel in constant time using their local processors. Finally, the best matches are retrieved in O(log N) time.

In the first step, counting is done in parallel for all predictor-value/goal-value pairs as follows: A predictor (age) is chosen, and each record is sorted using the value of the predictor as the key. The sorting takes O(log² N) steps. This sorted database is segmented into regions where both the predictor-value (age range: 60-70) and the goal-value (illness: Alzheimer's disease) are uniform. Finally, the length of each of these segments is calculated in O(log N) time.

Conceptually, these operations are of the SIMD type, and they are performed over the entire database. However, by carefully partitioning the memory according to its contents, one can often localize the searches and matches. By allocating the memory partitions to appropriate subnets of a hypernet according to their sizes, several hypotheses can be pursued in parallel. Furthermore, each result is obtained more quickly, since a smaller database is involved. The interpretation of data may be context dependent. In such cases, the allocation of data relevant to mutually disjoint contexts to different subnets also serves to localize computation. In both cases, hypernets provide the option of multiple-SIMD processing for further speedup.
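The sort/segment/count step has a direct sequential analogue. A minimal Python sketch (ours; the record fields are invented for illustration):

    from itertools import groupby

    def pair_counts(records, predictor, goal):
        # sort on the predictor, then measure the uniform segments; the
        # parallel version does the same with an O(log^2 N) sort
        key = lambda r: (r[predictor], r[goal])
        return {k: len(list(g))
                for k, g in groupby(sorted(records, key=key), key)}

    cases = [{'age': 65, 'illness': 'A'}, {'age': 67, 'illness': 'A'},
             {'age': 40, 'illness': 'B'}]
    print(pair_counts(cases, 'age', 'illness'))
    # -> {(40, 'B'): 1, (65, 'A'): 1, (67, 'A'): 1}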
4.3 Data-Dependent Processing

4.3.1 General Principles

Data-dependent computation is often encountered in the dynamic environments characteristic of many AI applications. These problems are typically nondeterministic. The amount of computation required may be unknown at compile time, and may vary significantly with the input data. For example, the time taken for a parallel search through a knowledge base depends on the item being sought. When functional or logic programs are executed on a multicomputer, a static scheduling of processes executable in parallel can result in poor resource utilization if the computational and I/O requirements of these processes cannot be estimated accurately. Such problems can be alleviated by dynamic allocation of processes and load balancing at run time [CH88].

4.3.2 Dynamic Allocation and Load Balancing

The viability of distributed, dynamic task scheduling was shown by Reed [Ree84], who introduced the concept of an event horizon. Here, the scheduler at each node is assumed to have complete knowledge of network activity for all nodes within its horizon, but no knowledge of the rest of the network. The schedulers thus use only local status information, and prefer to allocate newly created tasks to nodes near the point of task creation. Reed showed that a small horizon was adequate for good performance, and demonstrated the absence of processor thrashing. Lin and Keller proposed a class of distributed scheduling algorithms, based on gradient planes, for parallel functional programs [KL84]. They assume tasks to run independently once created, and that a task's results are consumed by a single recipient task. Van Tilborg and Wittie have developed a more general distributed scheduling technique called wave scheduling, which can cater to structured sets of cooperating tasks [vW84].

We now present a dynamic task allocation strategy for mapping data-dependent problems onto the hypernet. This strategy takes into account the hierarchical structure of hypernets. An event horizon of one is used, and the tasks are assumed to interact only at the instants of creation and completion. To illustrate the mapping technique, consider the TAK program given below:

    procedure TAK(x, y, z)
        if (x ≤ y) then return z
        else return TAK(TAK(x−1, y, z), TAK(y−1, z, x), TAK(z−1, x, y))
        endif
    (end TAK)

The TAK program has been used as a benchmark for testing function calls and recursion in Lisp machines [Gab85]. The number of function calls made by TAK depends on its initial arguments, and ranges from 1 to over a million even for small integer arguments. For example, TAK(12, 10, 8) makes 53 function calls, while TAK(18, 12, 6) makes 63,609 function calls and performs 47,706 subtractions by 1 before returning the result.

A TAK process evaluates a single TAK function. Initially, a single (parent) process is created with the given arguments. If the termination condition is not satisfied, this TAK process spawns three TAK child-processes which return its three new arguments. Once it has received all three arguments from its child-processes, it will again spawn three new processes if the termination condition is still not satisfied. The processes are not independent, and can be linked as a precedence tree. This limits the amount of parallelism that can be exploited in a TAK program. As the computation proceeds, the depth of recursion grows and shrinks. The computation tree is usually not very high at any time. For example, even though TAK(18, 12, 6) generates 63,609 processes, the depth of recursion is never greater than 18.
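A direct Python transcription (ours) should reproduce the call counts quoted above:

    def count_calls(x, y, z):
        n = 0
        def tak(x, y, z):
            nonlocal n
            n += 1                      # one call = one TAK process
            if x <= y:
                return z
            return tak(tak(x - 1, y, z), tak(y - 1, z, x), tak(z - 1, x, y))
        tak(x, y, z)
        return n

    assert count_calls(12, 10, 8) == 53
    assert count_calls(18, 12, 6) == 63609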
Such programs are better suited to cluster-based or hierarchical systems. Initially, the computation is confined to a small cluster. If the computation tree becomes too large for this subsystem, other clusters are incrementally enlisted to share its load. Later on, when the computational requirements shrink to more manageable proportions, the enlisted clusters are gradually freed, and processing of the program is confined to the original cluster again.

The execution of a TAK program based on the above principles can easily be carried out on a hypernet. The "root" process is initiated on one node. The children of this process are distributed among the neighboring nodes which have the lowest loads, as measured by the number of executable processes in them. To maintain the locality of communication and a modular use of resources, we try to achieve the following ideal: at any time during the computation, all the busy nodes should belong to the smallest possible subnet of the hypernet, without compromising the extra speedup which can be achieved by using more processors. Initially, the computation is confined to one cubelet. If more processors are required, nodes of other cubelets within the same (d,2)-subnet are favored in the enlistment process. Further spread of computation among processor nodes is also modular, and so is the shrinkage. Thus nodes belonging to the same (d,2)-subnet as the root node are favored over nodes in other subnets if further enlistment is required. They are also released later than nodes in other (d,2)-subnets. Towards the end of the computation, all active nodes are again confined to the cubelet of the root node.

4.3.3 Simulation Results

The execution of TAK has been simulated on a (3,3)-net with 256 nodes, using a distributed sender-initiated load balancing algorithm [CH88]. To approximate a modular spread of computation, we bias the links so that higher-level links have a larger bias. Each node sees a virtual load on each of its neighboring processors, which is the sum of the actual load and the bias on the link connecting the two nodes. The node uses this virtual load to determine the neighbors to which it should send the processes spawned by its own processes. In case of a tie, a neighbor with a lower bias is preferred. This selection rule is sketched below.
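Here is a minimal sketch of the neighbor-selection rule, assuming each node knows its neighbors' actual loads and the static bias of the link leading to each of them; the node identifiers and data values are illustrative.

def pick_neighbor(neighbors):
    """neighbors: list of (node_id, actual_load, link_bias) triples."""
    def virtual_load(n):
        _, load, bias = n
        return load + bias          # virtual load = actual load + link bias
    best = min(virtual_load(n) for n in neighbors)
    tied = [n for n in neighbors if virtual_load(n) == best]
    # On a tie, prefer the neighbor reached over the lower-bias (more
    # local) link, which keeps the computation modular.
    return min(tied, key=lambda n: n[2])[0]

# Neighbors over a level-0 link (bias 0), a level-1 link (bias 2) and a
# level-2 link (bias 4); all three have virtual load 5, so the level-0
# neighbor wins the tie.
print(pick_neighbor([("n0", 5, 0), ("n1", 3, 2), ("n2", 1, 4)]))   # -> n0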
Table 4.2 summarizes simulation results for various bias settings.

Table 4.2: Execution of TAK(18,12,6) on a (3,3)-net

  Bias (level1, level2) | Execution time (steps) | Subnet load profile, % of processes executed (net 4 / nets 0,1 / others) | Link traversals, % of total (level0 / level1 / level2)
  (0, 0)    | 1798 | 59.2 / 25.1 / 15.7 | 86.5 /  8.6 / 4.9
  (2, 4)    | 2012 | 69.8 / 18.8 / 11.4 | 91.1 /  6.9 / 2.0
  (5, 25)   | 2273 | 72.3 / 24.7 /  3.0 | 93.4 /  3.8 / 2.8
  (0, 10)   | 2487 | 82.2 / 14.1 /  6.7 | 89.1 / 10.0 / 0.9
  (10, 10)  | 4249 | 81.3 / 13.5 /  5.2 | 95.8 /  2.9 / 1.3
  (10, 100) | 5082 | 85.4 / 14.6 /  0.0 | 96.0 /  2.9 / 0.1

In all cases, the root node is node 128, that is, node 0 within subnet 4. Level-0 links have no bias, while links at the other two levels are given progressively higher biases. The load profile gives the percentage of the total processes that are executed within a subnet. Since subnet 4 contains the root node, it carries the maximum load. The entry points into subnets 0 and 1 are the closest to the root node, so they are the first to catch any spill in computation. The other subnets have hardly any load. Overall, the higher the biases, the more confined the computation, and the slower the execution. With biases of 10 and 100, only nodes in subnets 4 and 1 are ever used, thus making the remaining six subnets available for other computations.

Figure 4.6 shows how the bias on the links affects the load distribution. The time of execution is superimposed in the figure to show the tradeoff between the spread of execution and the total parallel processing time taken. The effect of bias on the communication profile is shown in Fig. 4.7. With a small bias, this profile is similar to the models of Section 3.3. A high bias on the level-2 links drastically reduces the traffic over these links when the bias on the level-1 links is significantly lower.

Due to the data-dependent nature of the problem, the above correlations are not strict. For a given instance of the problem, a slightly higher setting occasionally leads to quicker execution. This is because not all processes take the same computational time. We also see that most of the traffic is limited to internal links, which are on-chip and thus shorter than external links. The traffic pattern can also be controlled by the bias settings. Thus the user has a simple way of choosing the desirable operating region on the tradeoff curve between speedup and the amount of resources utilized.

Figure 4.6: Varying load distribution through bias control

Figure 4.7: Effect of bias on communication profile

4.4 Concluding Remarks

We showed in Fig. 3.4 that hypernets outperform both hypercubes and CBTs in providing a shorter path delay under uniform global traffic. When path delays are estimated for localized traffic, the superiority in performance is even more dramatic. Since the natural communication patterns for hypernets closely follow the two models in Section 3.3, any application domain which generates such patterns can be efficiently supported by hypernets.

For example, consider AI superworkstations that provide a highly interactive environment. They require an extremely large memory with efficient garbage collection, load balancing and procedure-call routines for the frequent creation, manipulation and destruction of objects and dynamic data structures. It may be possible to effectively implement an intelligent object-oriented memory for such systems using hypernets. A class of objects, together with all its subclasses and methods, is allotted to a subnet of appropriate size. Independent object classes are manipulated and managed in a distributed fashion. With a suitable memory allocation policy [Sta84], one expects the hypernet to match the communication demands resulting from this implementation. It may also be possible to map production systems onto fine-grained hypernets for the fast execution of rule-based expert systems, since the data and knowledge bases can often be partitioned into segments that have few dependencies among themselves. However, the mapping of expert systems onto hypernets will be different from that on the Dado [SM86] or the Connection Machine. The hypernet emphasizes distributed MIMD processing, while the Connection Machine and Dado operate in SIMD or multiple-SIMD lock step.
Chapter 5

Mapping Neural Networks onto Multicomputers

Let each thing act according to its nature, and it will eventually come to rest in its own way.
- Lao Tzu, Tao Teh Ching

5.1 Introduction

Connectionist models of computation [FB82] have been in the limelight recently as promising alternatives to traditional approaches to solving complex problems in artificial intelligence [RM86,FH87]. In particular, they have been advocated for inferencing systems with learning capabilities [Bar85], and for the processing of images and speech signals in noisy environments.

A salient feature of connectionist models is that they involve a large number of elementary computations that can be performed in parallel [FH83]. Memory is completely distributed, processing is asynchronous, and the models are inherently tolerant of malfunctioning processing units or connections [HMT86]. Consequently, they are amenable to a highly concurrent hardware implementation using thousands of simple processors, provided adequate support is available for interprocessor communication.

In this and the following chapter, we examine the architecture of highly parallel computers from the viewpoint of their ability to effectively simulate large neural networks. The term "neural network" is used in a broad sense to cover artificial neural systems and other connectionist models, instead of having a restricted biological connotation. We do not concern ourselves with knowledge representation in a neural network. Instead, we explore ways of effectively implementing neural networks on a highly parallel computer.

Our approach is outlined in Fig. 5.1. In Sec. 5.2, we present a model that captures the essential structural features of large neural networks in a few parameters. This model provides a way of mapping these networks onto multiprocessor topologies. We also characterize the functionality of individual cells in asynchronous, value-passing computational models, and use this to develop a distributed processor/memory organization in Sec. 5.4. In the following chapter, we use the structural model to determine the bandwidth requirements in connectionist machines and show the suitability of hypernets as a connectionist architecture.

Figure 5.1: Critical issues in developing a connectionist architecture

5.2 Connectionist Models and Mechanisms

5.2.1 What is Connectionism

A connectionist model is distinguished by the use of interconnections among a large number of elementary computing units as the principal means of storing information, and by the ability to simultaneously operate on this information in a highly parallel and distributed fashion. To quote Feldman and Ballard [FB82],

The fundamental premise of connectionism is that individual neurons do not transmit large amounts of symbolic information. Instead they compute by being appropriately connected to large numbers of similar units.

This viewpoint characterizes most artificial neural systems, as well as massively parallel marker/value passing networks [FH87] and constraint-satisfaction networks [RM86].
These systems have some commonalities with the neural networks of the human neocortex, where knowledge is stored and processed by a large number of interconnected neurons. Neurons have elementary computational capabilities, but they operate in a highly parallel fashion. The concurrent processing by aggregates of neurons in a cooperative and competitive manner is fundamental to the functioning of the brain.

A connectionist system consists of a very large number of interconnected computing units or cells. Each cell is capable of performing basic arithmetic or logic operations, and has little internal memory but a large number of connections to other cells. Every connection has a strength or weight associated with it, and the pattern of weights represents the long-term knowledge of the system.

The power of a neural system comes from its ability to simultaneously apply the entire knowledge base to the problem at hand. All the cells operate concurrently, and computations are directly affected by the knowledge encoded in the network connections. By contrast, a symbolic machine can only bring to bear those representations which can be retrieved from memory. A connectionist machine is not a stand-alone computer. Our view of its place in a cognitive system is shown in Fig. 5.2. A similar viewpoint has been expressed by researchers in the Cognitive Architecture Project [HMT86]. The neural network has two sources of inputs:

(i) Raw sensory data, preprocessed through filters, systolic arrays and/or secondary neural networks, and

(ii) Feedback, training and control signals from the user or the inference engine, sent through a host computer.

These inputs are at the subsymbolic level [TH85]. The inference engine is built on top of the neural network. It uses more conventional symbolic processing to make decisions based on the output of the neural network, and on its own store of high-level knowledge such as rules and meta-rules. This inference engine may run on the host computer or on a dedicated machine. For interactive systems, the host will also control actuators that interact directly with the environment.

Figure 5.2: Architecture of a cognitive system

5.2.2 Physical Structure of Neural Networks

The physical structure of a neural network is given by its underlying network graph. Each vertex of this graph represents a cell, and a directed edge exists between two vertices if and only if there is a non-zero weight associated with the corresponding pair of cells. The connectivity of a set of cells is the ratio of the number of actual connections among these cells to the number of connections if the cells were fully connected.

The network graph has a direct correlation with the function of the different constituents of the system. Groups of cells representing related functions will have a significant number of connections among themselves. Conversely, a set of sparsely connected cells represents functions that have little influence on one another. The connectivity measure is made concrete in the sketch below.
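The following few lines compute the connectivity measure just defined; the toy directed network is an illustrative assumption.

# Connectivity of a set of cells: actual connections among them divided
# by the number of connections if the set were fully connected. Edges
# are directed cell pairs that carry a non-zero weight.

edges = {(0, 1), (1, 0), (1, 2), (2, 0), (3, 0)}

def connectivity(cells, edges):
    cells = list(cells)
    possible = len(cells) * (len(cells) - 1)      # fully connected (directed)
    actual = sum(1 for a in cells for b in cells
                 if a != b and (a, b) in edges)
    return actual / possible

print(connectivity({0, 1, 2}, edges))      # 4/6: a relatively dense group
print(connectivity({0, 1, 2, 3}, edges))   # 5/12: sparser over the larger set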
Table 5.1 outlines the structural characteristics of some common network classes. Several common features become apparent in large networks. Full connectivity may be assumed in order to show mathematically that the system will converge to a desired state corresponding to the local minimum of some global energy [Hop82,CG83]. However, for practical neural networks with thousands of cells, full connectivity is computationally prohibitive. This is because the total amount of computation required is directly proportional to the number of interconnects, and thus increases quadratically with the system size. In this dissertation, we consider the average number of connections per cell to be in the range of a thousand or less.

Secondly, it is possible to partition the network into groups of cells such that the connectivity within a group is much higher than the overall connectivity. Typically, the cells of such groups are used to effect the same function or related functions. For example, a winner-take-all network (WTA) [FB82], used for selecting one out of a set of alternatives, consists of a group of cells that are fully interconnected (conceptually) through inhibitory links. Another commonly occurring configuration is that of a group of cells which form a mutually supportive coalition by having a number of reinforcing connections among themselves.

Table 5.1: Structural Characteristics of Various Connectionist Networks

  Network Type | Connectivity | Special Features
  Crossbar associative networks [Koh84,Hop82] | full | for large networks, a sizeable number of connections may have zero weights
  Semantic networks (local representation) [CL75,SF84] | sparse | hierarchical organization
  Semantic networks (distributed representation) [TH85] | moderate | high connectivity within clusters; few connections among clusters
  Multilayered networks [RHW86,SR86] | moderate | most connections are between adjacent layers or within a layer
  Human neocortex [EM78,KN76] | moderate to high | hierarchical organization; spatial locality of connections

The connections among highly connected groups are not random. For example, groups of cells representing competing hypotheses will have interconnects representing inhibitory influences among them, but few connections with groups signifying independent hypotheses. The notions of competing hypotheses and stable coalitions are a recurrent theme in neural networks that are functionally characterized by the cooperative and competitive behavior of ensembles of neurons [AA82]. Competing coalitions of units at various levels of abstraction form the basis of several neural models, such as those for computer vision [Mar82,AH87].

Both analytical [Gem81] and neurological [EM78] evidence suggests that a comprehensive neural network for cognition will have a hierarchical organization. At the lower levels, primary features are extracted from input data with a high degree of concurrency and redundancy. Information is represented and processed in a distributed fashion. At higher levels, inferences are made on the features extracted from lower levels. Representation is more local, and a single concept or attribute is embodied in a small number of cells. Such a model concurs with our view of a cognitive system as reflected in Fig. 5.2.

To illustrate the observations made above, consider the celebrated approach to the traveling salesman problem (TSP) based on the Hopfield model [Hop86,HT86].
An N = n x n matrix of cells is employed to solve an n-city tour problem. Each cell is connected to all other cells in the same row or column, as well as to all cells in the two columns immediately preceding or succeeding its own. A cell need not be connected to any other cell besides the 4n - 3 cells specified above. Thus the corresponding network graph has degree O(sqrt(N)), even though this implementation is based on a computational model that assumes full connectivity.

Suppose we form groups comprising all the cells in two consecutive columns. The n^2 cells are then partitioned into n/2 disjoint clusters of 2n cells each. For n = 1000, we obtain 500 clusters of 2000 cells each. Consider two cells chosen at random. Cells within each cluster are fully connected. If they belong to adjacent clusters, then the probability of their being connected is about 0.25. The probability of a connection between two cells belonging to non-adjacent clusters is only 0.001. In other words, the connections of a cell are not randomly distributed in the network, but tend to concentrate towards particular groups of cells. These probabilities are easy to check empirically, as sketched below.
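The following Monte Carlo sketch checks the clustering figures quoted above. Cells are indexed (row, column) in an n-by-n grid, and two distinct cells are taken to be connected if they share a row or a column, or if their columns are immediately adjacent; this reading of the column-adjacency rule is an assumption based on the text.

import random

n = 1000

def connected(a, b):
    (r1, c1), (r2, c2) = a, b
    return r1 == r2 or c1 == c2 or abs(c1 - c2) == 1

def sample_prob(cols1, cols2, trials=100_000):
    """Empirical probability that two random cells drawn from the given
    column ranges (clusters) are connected."""
    hits = 0
    for _ in range(trials):
        a = (random.randrange(n), random.choice(cols1))
        b = (random.randrange(n), random.choice(cols2))
        if a != b and connected(a, b):
            hits += 1
    return hits / trials

print(sample_prob(range(0, 2), range(0, 2)))     # same cluster:     ~1.0
print(sample_prob(range(0, 2), range(2, 4)))     # adjacent cluster: ~0.25
print(sample_prob(range(0, 2), range(10, 12)))   # non-adjacent:     ~0.001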
This is because cells achieving a common function or goal tend to have more inter-relationships among themselves th an w ith cells belonging to subnetworks for disparate functions. For more random ly structured networks, simple techniques developed for partitioning of VLSI networks [KL71] prove effective. Let us again consider the network for a 2048-city traveling salesman problem . Suppose we define the core associated w ith a cell to be composed of all the cells in its column. The network has 2048 equal-sized cores which are fully connected internally. We associate an influence region w ith a set of cores whose constituent cells have a substantial num ber of inter-core connec­ tions. For instance, consider a core consisting of cells in a m utually-supportive coalition. The cores representing com peting coalitions would be natural choices for inclusion in the influence region of this core. For simplicity, we require the sym m etric and transitive properties to hold, so th a t, for any three core regions, A, B, and C, 1) B is in the influence region of A , if and only if A is in the influence region of B; 2) If B and G are in the influence region of A , then they are in the influence region of each other. Let us define the influence region of a cell to be identical to the influence region of the core to which the cell belongs. The transitive and sym m etric properties ensure th a t the entire network can be partitioned into disjoint sets of cells such th a t each set is precisely the union of the core and the influence region of an arbitrary cell in th a t set. Moreover, in the universe of a cell, we can now distinguish among three groups of cells, namely, its core, its influence region, and the rest of the cells forming the remote region for this cell, as illustrated in Fig 5.3. In the TSP example, we can partition the cells into disjoint sets by considering (say) all cells in four consecutive columns to be in the same set. T hen the influence region of a cell in column 1 com prises of all cells in cols. 2, 83 remote region I influence region connect cell core 3 core 2 core 1 Figure 5.3: The universe of a processing cell 3, and 4; the influence region of a cell in column 6 is precisely the cells of cols. 5, 7 and 8, and so on. This partitioning is consistent w ith the requirem ents of sym m etry and transitivity. The probability of a connection existing between a cell and another one in its influence region is about 0.5, while the probability of it being connected to one of the rem aining cells is less th an 0.0005. In general, the distribution of sizes of the influence region will be considered as binomial w ith average size = Gi . Moreover, the probability of having a connection to another cell decreases as we go from the core region to the influence region and finally to the rem ote region. A second key observation about the structure of large neural networks is th a t the connections of th a t cell going outside its core group tend not to be random ly distributed, even within the same region. Typically, the chances th a t a cell o is connected to a cell, b increases dram atically if there is another cell c in the core of b th a t is connected to a and b. We denote this conditional probability as pi . In fact, for the traveling salesm an example, this conditional probability is alm ost 1 if b is in the influence region of a. A sim ilar effect is usually found even if b is outside this region, and we capture this effect through another param eter. 84 pr, which is defined analogously. 
The TSP example, however, happens to be a notable exception in which pr is alm ost zero. We characterize the structure of a large neural network (CN) by an eight-tuple, CN = {M ,G c,G f,C c,C7,,Cr, p, , pr), where M is the total num ber of cells in the network; Ge and G ,- are the average num ber of cells in the core and influence regions respectively; and Cc, C,-, and Gr represent the average num ber of connections from a cell to another in the core, influence region and rem ote region respectively. Thus the average num ber of connections per cell is Ce + Ci + Cr. The conditional probabilities, p, and pr indicate the am ount of clustering or lack of random ness in the connection patterns of the network. In Section 6, we show th a t these two param eters have a dram atic effect on the comm unications support required to sim ulate the network. Table 6.1 in Section 6 contains five sets of param eters which character­ ize the networks which axe considered for perform ance analysis in th a t section. Note th a t all param eters in the eight-tuple are average values. They give the m eans of the actual distribution of these param eters over all cells. We assume th a t the variance of these distributions is small. This is reasonable if the cores are of com parable sizes, and the regions of influence are chosen appropriately. 5 .3 .2 C h a ra cterizin g a L ayered N e tw o r k A particular type of connectionist network can be structurally char­ acterized in greater detail using a more specific model then the general one developed in the previous section. A structural description given by this elab­ orate model can be translated to the general model. We illustrate this for layered networks by presenting a tailor-m ade description for such networks and then translating it in a straight-forw ard way to obtain the param eter values of the general model. A layered connectionist model can be visualized as consisting of m layers of cell, lo . . . lm- \. The cells in the bottom -m ost layer, l0 can receive inputs from the external world. Usually the output of the network is determ ined from 85 the outputs of the cells at the top-m ost layer, /m_i, and the num ber of cells in each layer progressively decreases as we go from bottom to top. Consider a cell, A , in layer 0 < * < m — 1. Based upon the location of the cells to which A is connected, we divide its connections into four m ain Categories. @ layeTtii@ <ibovei@ belovj ®lld C * mi sc • Let #Clayert i = average num ber of connections to cells in its own layer, i, 'a b o ve — average num ber of connections to the i + 1th layer, ifCbeiow = average num ber of connections to the i — 1th layer, and i^Cmisc = average num ber of connections to non-adjacent layers. Figure 5.4 shows the four categories of connections for a cell in layer 1 of a network w ith n + 1 layers. Note th a t the cells in each layer axe grouped into cores. If the num ber of cells in adjacent layers is large, then it is practically infeasible to have a full interconnection between them . We superim pose a tree structure on the cores of the network. Each node of this tree denotes a core, and a (parent) node at layer i is connected to a (child) node a t layer i-1 if and only if the corresponding cores have a disproportionately higher num ber of connections between themselves. The average num ber of children in this tree can be varied by making the requirem ent th a t at least a fraction f l of H ^Cabove of a core goes to its parent. 
5.3.2 Characterizing a Layered Network

A particular type of connectionist network can be structurally characterized in greater detail using a model more specific than the general one developed in the previous section. A structural description given by this elaborate model can then be translated into the general model. We illustrate this for layered networks by presenting a tailor-made description for such networks, and then translating it in a straightforward way to obtain the parameter values of the general model.

A layered connectionist model can be visualized as consisting of m layers of cells, l_0 ... l_{m-1}. The cells in the bottom-most layer, l_0, can receive inputs from the external world. Usually the output of the network is determined from the outputs of the cells at the top-most layer, l_{m-1}, and the number of cells in each layer progressively decreases as we go from bottom to top.

Consider a cell A in layer i, 0 < i < m - 1. Based upon the location of the cells to which A is connected, we divide its connections into four main categories: #C_layer,i, #C_above, #C_below, and #C_misc, where

  #C_layer,i = average number of connections to cells in A's own layer, i,
  #C_above = average number of connections to the (i+1)th layer,
  #C_below = average number of connections to the (i-1)th layer, and
  #C_misc = average number of connections to non-adjacent layers.

Figure 5.4 shows the four categories of connections for a cell in layer 1 of a network with n + 1 layers. Note that the cells in each layer are grouped into cores. If the number of cells in adjacent layers is large, then it is practically infeasible to have a full interconnection between them. We superimpose a tree structure on the cores of the network. Each node of this tree denotes a core, and a (parent) node at layer i is connected to a (child) node at layer i-1 if and only if the corresponding cores have a disproportionately higher number of connections between themselves. The average number of children in this tree can be varied by requiring that at least a fraction f1 of the #C_above connections of a core go to its parent. A higher value of f1 will decrease the average degree of the tree. This also ensures that the average number of connections going to a child node is at least f1 x #C_below.

Figure 5.4: Characterization of a multilayered network

The set of cells in the same layer induces a subnetwork that can be described by the general model. We denote the fraction of the connections within a layer that stay within a core by f2. The remaining connections link a cell to another in its region of influence, or go beyond this region. The entire network is characterized by specifying the general model parameters, as well as f1, #C_above, #C_below and #C_misc for each layer.

To transform this elaborated specification into a single general model description, we first make k 'vertical' partitions of the cells of the network. The partitions are such that all the children of a parent core are in the same partition as their parent, except possibly for parents in the topmost layers, which may have fewer than k cores. This is always possible if k is small enough, and the maximum number of partitions possible under this constraint is a function of f1. Each such partition forms a core of the general model. Such cores are highly interconnected if the original cores are. This is because at least min(f1, f2) of all connections involving the cells in each of the k partitions are confined to the partition. Furthermore, if the number of children of a parent at a particular level is almost constant, then the resulting partitions are of similar sizes.

The region of influence of a partition A is formed by the partitions which are largely composed of cores in the influence regions of the cores in partition A. The conditional probabilities p_i and p_r can be calculated from the number of cores of each layer in a partition, and the values of p_i and p_r of each layer.

We see that some detail is lost when the multilayered network specification is reduced to the general model specification. However, a natural way of implementing a layered network on n processors is to first partition the connectionist network into n 'vertical' slices, and then allocate each slice to a separate processor. If k is a multiple of n, then the general model captures the essential features needed to predict interprocessor communication, since the details of the interconnection patterns within each slice become irrelevant. So, for our purposes, the general model is adequate.

5.4 Multiprocessors for Neural Network Simulation

5.4.1 Full versus Virtual Implementation

An artificial neural system can either be fully implemented in hardware or virtually implemented in software [Hec86]. A full implementation provides a dedicated processor for each cell, and a dedicated information channel for each interconnect.
Modest-sized fully implemented analog systems have been realized on VLSI chips. Electro-optical implementations are also being explored [TH88]. Once built, fully implemented systems are difficult to reconfigure. However, they are attractive for certain special-purpose applications such as sensory processing, vision and pattern classification [Hec86].

Virtual implementations simulate the function of a neural network using a smaller number of physical processors, by time-multiplexing several cells on each processor. The simulation of a neural network of M cells on a computer system with N physical processors is a virtual implementation if N < M. In this case, the network is partitioned into N disjoint groups of cells. These groups are then mapped onto a multiprocessor, as illustrated in Fig. 5.5. The activity of all cells in a group is simulated by a single processor. Interconnects among these home groups are multiplexed over the physical interprocessor links. The states of the cells, connection weights and other system parameters are stored in the local memories. Since the implementation is essentially done in software, it is more flexible for general-purpose applications.

Figure 5.5: Mapping of neural nets onto parallel computers

Currently, all large neural systems are virtually simulated. For example, the "neurocomputers" that are being marketed for implementing artificial neural systems are essentially conventional personal computers with attached coprocessors or array processors, and specialized software [Hec86]. Small bus-based multiprocessors take advantage of the much faster speeds of electronic systems as compared to biological systems in order to cope with the complexities of the latter. Specifications of some commercially available neural simulation packages are given in Table 5.2.

Table 5.2: Commercially Available Neural Network Simulator Packages

  Name, Company | Processor Type; Host | CUPS | Capacity (Cells, Connects) | Models
  Mark III, TRW | eight 68020/68881's (12.5 MHz); VAX/VMS | 500K | 65K, 1.3M | most paradigms
  Sigma-1, SAIC | pipelined Harvard architecture; PC/AT | 10M | 1M, 1M | backward propagation, adaptive resonance, Hopfield, etc.
  ANZA, HNC | 68020 + memory; PC/AT | 25K | 30K, 300K | most paradigms
  NDS, Nestor | software on Sun/Apollo workstation or PC/AT | 500K | 150K, 15M | pattern recognition (proprietary)

5.4.2 Processor and Memory Requirements

We are interested in the implementation of neural systems with possibly millions of cells on fine-grained multiprocessor systems whose architectures are tailored to meet the requirements of efficiently supporting artificial neural systems and other value-passing connectionist models. Such computers need adequate software support for specifying, editing and archiving the network, and for the interpretation and presentation of the results. Moreover, a comprehensive interactive user interface is indispensable for these systems. We shall not explore these software design issues, and confine ourselves to hardware design considerations in this dissertation.

To determine the processing and memory requirements of a virtual connectionist machine, we characterize the behavior of each cell in a value-passing neural network as follows. The generic cell, shown in Fig. 5.6, receives inputs that are either global signals or the output signals of neighboring cells modulated by the weight of the connection between them. At any time, cell i has an internal state, q_i(t). The number of possible internal states is small, usually ten or less [FB82].
Each cell also has an output x_i(t), which assumes integer values from a small range, say (-127 to 128), and can thus be represented by one byte. The state of a cell comprises its internal state and its output. Each cell continuously updates its state and passes its new output value to all neighboring cells.

Figure 5.6: A generic processing cell of a neural network

The state of a cell is updated according to the following equations:

    x_i(t+1) = T_{q_i(t)} \Big[ \sum_j w_{i,j}(t)\, x_j(t) \Big]      (5.4.1)

    q_i(t+1) = f[\, q_i(t),\, x_i(t),\, x_0(t) \,]                    (5.4.2)

Here x_j(t) is the output of cell j (as seen by cell i) at time t, and w_{i,j}(t) is the weight of the interconnect between cells i and j. The weighted sum of the outputs of all cells that share an interconnect with cell i is first calculated. This sum is the effective net input to the cell. By assuming each cell to have a connection to itself, so that w_{i,i} is non-zero in general, we also cater to autofeedback. The output is determined from the effective input by a transfer function, T, which may depend on the current internal state. The internal state can signify, for example, whether the cell is disabled, saturated or in normal operation. Typically, T is a nonlinear threshold function. Note that x_0 is the output of a special dummy cell which is assumed to be connected to every cell. It caters to global signals, which can be used for broadcasting thresholds, enable/disable and begin/end signals. The form of the function f can vary, but will generally be restricted to conditionals and simple functions.

Equations (5.4.1) and (5.4.2) capture the cell behavior of most non-adaptive neural networks with static interconnects. These two equations represent an asynchronous mode of computation, since the sequence of t's can be different for different cells. These equations encompass a discrete time domain. Models that use higher-order predicates, such as r-codons [GH87], can be covered by allowing the weight to be a function of some of the neighbors' outputs. Other models [FB82] attribute a potential to each cell, besides its state. In our characterization, this potential is not explicitly given, since it does not provide any additional information that affects the functioning of the system.

In addition to updating their states, the cells of an adaptive system have the added ability to modify the weights w_{i,j} governing their behavior. A system is said to be capable of learning if it can eventually improve its overall performance by self-adaptation at the cell level. Learning can be autonomous [CG87,AA82,Bar85] or through training sessions, as in backward propagation methods [RHW86]. Autonomous learning mechanisms typically use a variation of Hebb's law, where the strength of a connection is changed by an increment proportional to the product of the outputs of the two cells at the ends of that connection. Thus,

    \Delta w_{i,j}(t+1) \propto x_i(t)\, x_j(t).                      (5.4.3)

Trained networks, on the other hand, usually employ a gradient descent algorithm to minimize an error function. For example, each weight may be changed to reduce a local error given by the difference between the present output x_i(t) and a desired output x'_i(t). So

    \Delta w_{i,j}(t+1) \propto x'_i(t) - x_i(t).                     (5.4.4)

In both cases, we note that the weights can be updated locally, as the sketch below illustrates.
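A minimal Python sketch of equations (5.4.1)-(5.4.2), followed by the Hebbian increment (5.4.3), is given here for one cell. The particular transfer function, state-update rule and learning rate are illustrative stand-ins; in the machine, T is realized by table lookup and f by a small finite-state machine.

def clamp(v, lo=-127, hi=128):
    return max(lo, min(hi, v))

def update_cell(q, weights, outputs, x_global):
    """One state update: weights[j] = w_ij, outputs[j] = x_j(t)."""
    net = sum(w * outputs[j] for j, w in weights.items())   # effective input
    x_new = 0 if q == "disabled" else clamp(round(net))     # T depends on q
    q_new = "disabled" if x_global < 0 else q               # f(q, x, x_0)
    return q_new, x_new

def hebb(weights, outputs, x_i, eta=0.01):
    """Autonomous, purely local weight adaptation per (5.4.3)."""
    for j in weights:
        weights[j] += eta * x_i * outputs[j]

weights = {1: 0.5, 2: -0.25, 0: 1.0}    # cell 0 is the global dummy cell
outputs = {1: 40, 2: 16, 0: 1}
q, x = update_cell("normal", weights, outputs, x_global=1)
hebb(weights, outputs, x)
print(q, x, weights)    # net input 17, so x = 17 and the weights grow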
Figure 5.7 shows the design of a virtual connectionist multiprocessor, and Fig. 5.8 shows the detailed organization of each processor. A loosely coupled system has been chosen instead of a tightly coupled one with shared memory. This is because both function and data can be partitioned into many small segments without much data sharing and dependence, and all cells can be processed concurrently. Theoretically, with an infinite-bandwidth interconnection network, one can achieve a speedup of M by having a dedicated processor per cell.

Figure 5.7: A multicomputer architecture for implementing connectionist networks

The internal state of a cell, as well as the list of its connections and their weights, is accessed only by the processor simulating that cell. However, its output value needs to be conveyed to all its neighbors. These outputs, together with global signals or conditions, are the only source of data sharing. Replicated shared data is seen to occupy only a small fraction of the total memory (Sec. 6.5.2). More importantly, full consistency is not critical, as the neural model of computation is quite robust. It is not essential that the new output value of a cell be immediately available to its neighbors. Effective consistency can be ensured if the interconnection network has adequate bandwidth. Therefore, a loosely coupled system is deemed more suitable for neural network simulation.

Figure 5.8: Detailed processor organization

The entire network of M cells is partitioned into N disjoint sets of cells of approximately equal sizes, one for each processor.

Definition 5.1: The home group of a processor is the set of cells that are mapped onto that processor. The processor is the home processor for all cells of its home group.

Definition 5.2: A processor is a virtual neighbor of a cell if its home group contains some neighbor of this cell. Two processors are virtual neighbors if one of them is a virtual neighbor of a cell in the other's home group.

Computing with a neural model involves a 'continuous' updating of the states of all cells. Some global functions of the cell states are also evaluated at regular intervals for determining convergence and system outputs. The computation can be regarded as a sequence of iteration cycles. The virtual time of an iteration cycle corresponds to the mean value of the time difference between successive state updates of a cell. In each cycle, all processors operate concurrently and update the states of their respective home groups. For each processor, an iteration cycle involves about M/N unit update cycles, approximately one for each cell of the home group, plus some overhead for synchronization and the computation of global functions.

In the detailed organization of a processor given in Fig. 5.8, there are three main memories:

1. The home memory stores the current internal state and a list of neighbors for each cell of the home group.

2. The weight table contains the weights of all connections involving cells of the home group.

3. The output table has two sections.
The home section records the current outputs of the home group. The remote section contains a local image of the rest of the network. This image consists of the outputs of those cells that are not in the home group but have a connection with some cell within that group. Global signals that might influence the local computation are also stored. Since communication among the processors is not instantaneous, the local images may not be consistent across the system. However, the local image should reflect outputs that are either current or a few iterations old.

To update the state of a cell, its list of neighbors is read sequentially. This list provides the addresses of successive (w_{i,j}, x_j) pairs, which are fetched from the weight table and output table respectively. The processing unit has a pipelined architecture which accumulates the sum of the successive products w_{i,j} x_j. For adaptive systems, the processing unit also updates the weight coefficients. The transfer function, T, is implemented by table lookup. If the 8 most significant bits of the weighted sum are used to select one of 256 possible outputs, a 1-Kbyte RAM suffices for table lookup of up to four different transfer functions. The internal state of the cell is updated using a finite-state machine. This is also implemented with a table lookup, since a cell can assume only a small number of internal states.

The new output of a cell updates the home section of the output table. It is also sent to the network interface, which conveys this information to the cell's virtual neighbors via the interprocessor communication network. This interface also updates the remote section of the output table when it receives new output values from other processors. Each processor also has a link to the host for transmitting global signals, initialization, diagnostics, and readouts of the system state.

The capacity of a connectionist machine is measured in the number of cells and interconnects that can be accommodated. The speed of the machine is measured in the number of Connection Updates performed Per Second (CUPS). We now determine the amount of memory required for a connectionist system with a given capacity and speed. Other hardware requirements will be assessed in Section 6.5.2. This is helpful in deciding the system sizes that can be economically implemented given state-of-the-art VLSI and packaging technologies. It also provides another metric for determining the suitability of a parallel computer architecture.

Let H = M/N be the average size of a home group. To support networks with up to 2^24 cells and an average of C connections per cell, the home memory of each processor will require at least 3HC bytes to store the neighbor lists. If outputs and weights are 1 byte each, then the weight table requires at least another HC bytes. For C in the range O(10^2 - 10^3), the size of the output table is much less than 4HC bytes, because many cells in the home group have common neighbors.

The network interface of processor k stores the list of virtual neighbors for each cell of the home group. The virtual neighbors of processor k have buffers or bins allocated to them. The updated output of a cell is stored in the bins of its virtual neighbors. The capacity of the bin allocated to processor l is min(packet size, v_{k,l}) messages, where v_{k,l} is defined in the next subsection.
When this bin becomes full, its contents are bundled into a packet and sent to l.

A total of about 6HC bytes is required for the three memories and the neighbor lists and bins in the network interface. Less than 10% of this is due to the remote section of the output table. If a single system-wide memory had been used, then the total size of the home memory and weight table would be about N times that in the local memory of a processor, but the replication of data in the remote output tables would have been avoided. We see that the increase in total memory due to the loosely coupled implementation with distributed memory is not very significant. The sizes of the output table and interface buffers scale almost linearly with the home size H, provided H is not much less than G_c. So the total memory required does not increase significantly with the system size until H << G_c. For N = 2^12 and C = 1000, about 2 Mbytes of memory is required per processor if networks with up to one million cells are to be supported. This is reduced to around 128 Kbytes if the number of processors is increased to 64K. This amount of memory will dominate the real-estate requirements of a processor, since the pipelined arithmetic unit and other logic require much less area on silicon. So the size of a processor is inversely proportional to the network size, as the total memory required is almost constant over the range of network sizes under consideration.

5.4.3 Mapping of Neural Networks

How should we partition the cells of a neural network into home groups? What should be the criteria for allocating these groups to processors? In a fully connected network, these choices are not critical, since all equal-sized partitions are topologically equivalent. But in a partially connected network, proper partitioning and allocation can significantly affect the volume and characteristics of interprocessor communication, and hence the demands on the processor interconnection network.

Let v_{k,l} be the number of cells in the home group of processor k that have processor l as their virtual neighbor. Suppose each pair of processors had a dedicated communication link, i.e., they were at a distance of one. Then, an optimal partitioning of the cells into home groups to minimize the total interprocessor communication bandwidth would be one for which \sum_k \sum_l v_{k,l}, 1 <= k < l <= N, is minimum. This would also be the objective function if a crossbar or a single bus were used to provide pseudo-complete connectivity among all processors. For a direct interconnection network, the distance d_{k,l} between processors k and l is not constant over all processor pairs. In this case, the partitioning and allocation policies are interrelated, and depend both on the structure of the neural network and on the topology onto which it is mapped. The joint optimization function is

    \min \sum_k \sum_l d_{k,l}\, v_{k,l}, \qquad 1 <= k < l <= N.

Finding the optimal partitioning and allocation is an NP-complete problem. In fact, for a network specified by the general model of Section 5.3, an optimal solution is not possible, because the overall network structure is given but the individual connections are not enumerated. However, the model naturally leads to a heuristic approach, embodied in the two principles given below. The quantity being minimized can be evaluated directly, as in the following sketch.
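The sketch below evaluates the joint objective for a given allocation: the sum over processor pairs of d(k,l) x v(k,l). The choice of hop distance on a boolean hypercube, the tiny network, and the one-directional k < l counting are all illustrative assumptions matching the form of the objective above.

def hypercube_distance(k, l):
    return bin(k ^ l).count("1")       # hops between hypercube nodes k and l

def mapping_cost(home, neighbors, n_procs):
    """home[c] = processor of cell c; neighbors[c] = cells connected to c."""
    cost = 0
    for k in range(n_procs):
        for l in range(k + 1, n_procs):
            # v_{k,l}: cells on k with at least one neighbor on l
            v = sum(1 for c, p in home.items() if p == k
                    and any(home[j] == l for j in neighbors[c]))
            cost += hypercube_distance(k, l) * v
    return cost

neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(mapping_cost({0: 0, 1: 0, 2: 1, 3: 3}, neighbors, 4))   # -> 3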
Partitioning Principle: Prefer cells belonging to the same core for inclusion in the same home group.

If a core is larger than a home group, then this policy tends to minimize the number of home groups over which the cells of a core are distributed. Otherwise, it tends to minimize the number of cores mapped onto the same processor.

Mapping Principle: If the home group of processor k consists largely of cells in the core or influence region of the cells mapped onto a processor l, then the distance d_{k,l} should be as small as possible.

More formally, we first observe that interconnection networks can readily be partitioned into equal-sized subnetworks such that the average distance between two processors in a subnetwork is less than the average distance for the entire network. For example, an n-dimensional boolean hypercube can be subdivided into 2^q (n-q)-dimensional hypercubes. Moreover, the process is recursive, in that the subnetworks can be further partitioned into smaller networks having smaller values of the average distance. In the partitioning and mapping scheme given below, care is taken to ensure that all partitions of the interconnection network have this property.

The interconnection network is first divided into subnetworks of approximate size M/(G_i + G_c) each. The cells that are mapped onto the processors of a subnetwork are in more frequently communicating cores or influence regions. If the core size is greater than the home size, let G_c = rM/N for some integer r. A core is divided into r sets of equal or comparable sizes, which form the home groups for the processors in a subnetwork of size r within the larger subnetwork of size M/(G_i + G_c). Otherwise, we allocate all cells of a core to a single processor, so that a home group consists of an integral number of cores. This allocation scheme is consistent with the two principles given above, and yields a good solution without requiring extensive computation.

5.4.4 Choice of Interconnection Network

When the output of a cell is updated, the new value must be conveyed to its neighboring cells through messages to each of its virtual neighbors. These messages are less than 8 bytes long if an optional time-stamp or iteration number is not included. Due to their small size, it is impractical to send messages individually: the overheads involved in message setup, destination decoding, and store-and-forward would easily dominate the actual transmission time. We therefore combine messages from cells of the same home group that are destined for the same processor into message packets. Each packet contains an (internal) cell identity number and an output value per message, besides header information containing the addresses of the source and destination processors as well as error-correcting and control bits. The size of a packet which conveys 16 values will be about 64 bytes, which is acceptable for a highly concurrent, fine-grained system. This batching scheme is sketched below.
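A minimal sketch of the batching scheme follows: output updates bound for the same destination processor accumulate in a bin and are flushed as one packet once 16 (cell-id, output) messages are queued. The exact header layout (field widths, control word) is an illustrative assumption; only the overall packet format follows the text.

import struct

PACKET_MESSAGES = 16

class NetworkInterface:
    def __init__(self, my_id, send):
        self.my_id = my_id
        self.send = send                 # callback taking (dest, packet bytes)
        self.bins = {}                   # dest processor -> queued messages

    def post(self, dest, cell_id, output):
        bin_ = self.bins.setdefault(dest, [])
        bin_.append((cell_id, output))
        if len(bin_) == PACKET_MESSAGES:
            self.flush(dest)

    def flush(self, dest):
        header = struct.pack(">HHI", self.my_id, dest, 0)   # src, dest, control
        body = b"".join(struct.pack(">Hb", cid, out)
                        for cid, out in self.bins.pop(dest, []))
        self.send(dest, header + body)

nif = NetworkInterface(0, lambda dest, pkt: print(dest, len(pkt), "bytes"))
for cell in range(16):
    nif.post(dest=5, cell_id=cell, output=cell - 8)
# The 16th message triggers one 56-byte packet (8-byte header assumed here,
# versus the roughly 64 bytes estimated in the text).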
Due to the small size of the packets and the large num ber of virtual neighbors per processor, packet switching w ith distributed control of message routing is chosen over circuit switching, and the m ode of communica­ tion is asynchronous. This m eans th a t some packets m ay not be able to reach their final destination w ithin an iteration tim e because of contention delays. Experim ents have shown th a t the bandw idth of a large system is reduced by about 50% due to contention or unused wires [Hil85]. Thus, if a contention-free bandw idth of 2B t is provided, then m ost of the packets will reach their desti­ nation w ithin an iteration cycle. This is acceptable because the com putational model is robust enough to prevent a few stragglers from affecting the results. This also indicates th a t system throughput is more critical th an m etrics like average comm unication delays. M ultiprocessor systems can be based on four types of networks: direct interconnection networks w ith a point-to-point (static) topology, m ultistage in­ terconnection networks (M INs), crossbars which have a dynam ic topology, and bus-connected systems. Since we have a large num ber of concurrent processors, distributed control of message routing is desired. D istributed routing can be perform ed by MINs such as the D ata M anipulator and the Omega networks. However, the tim e to com m unicate between any two processors increases log­ arithm ically w ith the system size, even in the absence of resource contention. Since MINs do not provide quick comm unication between particular processors, they do not benefit from the locality of messages exhibited by the neural net­ works being considered. Hence they are not considered suitable building a connectionist sim ulator. Crossbar switches can be used to provide a pseudo-com plete connec­ t i v i t y among the processors. Currently, crossbar switches have fewer th an a 1 0 2 hundred inputs. M ultiple crossbars can be used in a hierarchical fashion to in­ terconnect thousands of processors. This however leads to com plicated control and increased delays. Recent research in the design of large crosspoint switches using m ultiple one-sided chips [VGG87] indicates th a t larger crossbar switches m ay soon be feasible. Still, the full connectivity offered by crossbars is superflu­ ous, since a good m apping strategy ensures th a t each processor comm unicates only w ith a fraction of the other processors. Bus-based systems entail more complex hardw are for bus allocation and arbitration. A single bus cannot support a large num ber of processors. However, m ultiple-bus systems or a system using both buses and direct inter­ connects can be viable for a m edium sized system , particularly if a lot of broad­ casting or partial broadcasting is involved. This issue is further investigated in Sec. 6.2.2.5. Point-to-point topologies are deemed suitable for system s w ith thou­ sands of nodes due to direct com m unication paths among the com puter modules, sim pler com m unication protocols and more efficient support for local commu­ nication [AJP86]. Massively parallel system s have been already constructed based on the hypercube, torus or binary tree topologies [HCG88]. Due to their extremely sparse connectivity, tree-based topologies are not being considered for constructing connectionist machines. Large interconnection networks based on hypercubes, hypernets and the torus [Seq8lj have attractive properties for im plem entation on VLSI chips. 
Efficient algorithm s for distributed routing and broadcasting also exist for them [HJ86]. Therefore, these three topologies, w ith possible enhancem ents, are assessed in Section 6 regarding their suitability for supporting a neural network. 103 Chapter 6 Communication-efficient Connectionist Machines C om m unication an d con trol belon g to th e essen ce o f m an 's inner life, even a s th e y belon g to h is life in society. — Norbert W einer, The Human Use o f Human Beings (1950) 6.1 Introduction T he prim ary factor th a t lim its the speed of a highly parallel simula­ tion of large neural networks is the interprocessor com m unication required to convey the activation level of each cell to its neighbors. In this chapter, we estim ate the m inim um bandw idth required of a m ultiprocessor interconnection network so th a t interprocessor comm unication does not become a bottleneck during the execution of a neural network specified by the general model of Sec­ tion 5.3. F irst, theoretical estim ates of interprocessor bandw idth requirem ents are derived. Experim ental results are then presented and com pared w ith the theoretically obtained values. The influence of the structural characteristics of a neural network on comm unication dem ands is studied in detail. T he effective­ ness of broadcasting messages is examined. Finally, hypercubes, hypernets and toruses are evaluated w ith respect to their ability to handle the desired volume of interprocessor communication.___________________________________________ 104 6.2 Communication Bandwidth Requirements An efficient sim ulation is based on a balanced utilization of all re­ sources: processors, memory, interconnection network and system I/O . For our purposes, the crucial balance is th at between processing and com m unication capabilities. So we need to estim ate the bandw idth required in term s of the average num ber of packets th at a processor needs to handle per iteration cycle. This m etric indicates the complexity and cost of the network interface switch in each processor, and the bandw idth required of the physical com m unication channels to sustain a given com putational rate. Alternatively, given th e speci­ fications of an interconnection network, it yields the num ber of iteration cycles th a t can be reasonably executed per second for a particular problem. A packet- switched network using a store-and-forward routine for packets in tran sit, is considered. 6 .2 .1 T h e o r e tic a l E stim a te s The packets handled by a processor are those for which the processor is either the origin or the final destination, as well as the packets th a t are handled by th a t processor on the way to their final destination. Let $,• be the average num ber of message packets th a t are sent by processor fc per iteration cycle to sim ulate the connections between its home group and those cells th a t are in the influence region of this group. We define 4 > c and 4 > r in a sim ilar fashion to cater to the connections in the core region and the rem ote region. We first estim ate Let pc = CcJ Gc, be the probability th a t two cells w ithin a core are connected, and H = M /N be the size of a home group. Following the m apping policy given in section 5.4.3, we have two cases: Cast 1: Ge > H Here, the home group of processor k consists solely of cells belonging to the same core. Consider a cell a in this home group which is connected to a cell b in its influence region. Let cell b be allocated to processor /. / ^ k. 
If the connections to the influence region were randomly distributed, the average number of connections between cell a and the cells mapped onto processor l would be max(1, C_i H/G_i). However, due to the clustering effect, the probability of a connection between cell a and a cell on l increases to p_c p_i, if p_i is high enough. Thus, the average number of connections between cell a and a virtual neighbor is about max(p_c p_i H, 1, C_i H/G_i). These connections can be simulated by a single message to the neighbor for every output update. Thus, the average number of messages sent per cell, which is the same as the average number of virtual neighbors of a cell, is C_i/(pH), where

$$p = \frac{1}{H}\max\left(p_c p_i H,\; 1,\; \frac{C_i H}{G_i}\right) \qquad (6.2.1)$$

so that pH is the average number of connections between cell a and a virtual neighbor.

To find the number of packets sent by processor k, we first determine the expected value of v_{k,l}. If pH = max(1, C_i H/G_i), then we can assume the connections to be randomly distributed, so that the average value of v_{k,l} is H(1 − e^{−HC_i/G_i}), as deduced later (see Eq. 6.2.6). Otherwise, let E be the set of cells on processor l which are neighbors of cell a, and F be the set of home neighbors of cell a. If e ∈ E and f ∈ F are chosen at random, then the probability that there is no connection between e and f is 1 − p_i. Cell a has approximately p_c H neighbors on its home processor, and pH neighbors on processor l. Thus, the probability that f has a neighbor in set E is 1 − (1 − p_i)^{pH}, which is approximately 1 − e^{−p p_i H} if p_i ≪ 1. Furthermore, if p_c H[1 − e^{−p p_i H}] ≫ 1, then we can assume that every cell in processor k's home group has a home neighbor that is also a neighbor of processor l. Then

$$v_{k,l} = H\left[1 - e^{-p\,p_i H}\right] \qquad (6.2.2)$$

Therefore, the average number of packets sent per processor is

$$\Phi_i = \frac{C_i}{pH\left[1 - e^{-p\,p_i H}\right]}\left\lceil \frac{H\left[1 - e^{-p\,p_i H}\right]}{\beta} \right\rceil \qquad (6.2.3)$$

where p is given by Eq. 6.2.1, and β is the maximum number of messages per packet, as stated earlier.

Case 2: rG_c = H, r > 1

Again, let l be the home processor for a cell b that is connected to cell a and is in its region of influence. Besides the core of b, the home group of processor l also includes r − 1 other cores. All these groups are in the influence region of one another and, by transitivity, in the influence region of cell a. However, the existence of the connection to cell b does not affect the probability of cell a's having a connection with a cell in one of the other cores on processor l. Therefore, the average number of connections from a to the home group of processor l is approximately

$$p_c p_i G_c + (r-1)\frac{C_i G_c}{G_i}$$

The first term is the number of connections to the core of b, while the second is the expected number of connections to the r − 1 other cores that share the same home group. The number of messages sent by a cell is again C_i/(pH), where now

$$p = \frac{1}{H}\max\left(p_c p_i G_c + (r-1)\frac{C_i G_c}{G_i},\; 1,\; \frac{C_i H}{G_i}\right)$$

With an analysis similar to that for Case 1, we can approximate the average number of messages reaching a virtual neighbor l of processor k by

$$v_{k,l} = H\left[1 - e^{-p\,p_i H}\right] \qquad (6.2.4)$$

provided p_c G_c[1 − e^{−p p_i H}] ≫ 1 or (r − 1)C_i G_c/G_i ≫ 1. The average number of packets sent per processor is then

$$\Phi_i = \frac{C_i}{p\, v_{k,l}} \left\lceil \frac{v_{k,l}}{\beta} \right\rceil \qquad (6.2.5)$$

where v_{k,l} is given by Eq. 6.2.4.

Equations (6.2.3) and (6.2.5) give the average number of packets sent per processor to simulate those connections of its home group that go to their regions of influence.
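Since the notation above is dense, a small numerical sketch of Eqs. 6.2.1 through 6.2.3 may help. The function name is our own, and the parameters are chosen to match net A of the simulations that follow, mapped onto a 64K-processor system (H = 2^24/2^16 = 256, p_c = C_c/G_c = 0.25):

    import math

    def phi_influence_case1(C_i, G_i, H, p_c, p_i, beta=16):
        # Eq. 6.2.1: p*H is the average number of connections between
        # a cell and one of its virtual neighbors.
        p = max(p_c * p_i * H, 1.0, C_i * H / G_i) / H
        # Eq. 6.2.2: messages reaching one virtual neighbor of processor k.
        v_kl = H * (1.0 - math.exp(-p * p_i * H))
        # Eq. 6.2.3: C_i/p messages per processor, packed beta to a packet.
        return (C_i / (p * v_kl)) * math.ceil(v_kl / beta)

    print(phi_influence_case1(C_i=128, G_i=65536, H=256, p_c=0.25, p_i=1/8))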
From the mapping scheme, we see that the processors that receive these packets are randomly distributed in a subnetwork of size N G_i/M that includes processor k.

A processor also sends Φ_r message packets to processors outside this subnetwork, to cater to the connections of its home group to the "remote" region. The value of Φ_r can be estimated in the same way as that for Φ_i, by substituting p_r for p_i, G_r for G_i and C_r for C_i.

An estimate for Φ_c is obtained more simply. For Case 2, all core connections are internal to a processor, so no packets need to be sent to other processors; that is, Φ_c = 0. In Case 1, a core is spread over ⌈G_c/H⌉ processors. If p_c H ≫ 1 then, with high probability, all the other ⌈G_c/H⌉ − 1 processors are virtual neighbors for a cell in any one of these processors. Therefore,

$$\Phi_c = \frac{H}{\beta}\left(\left\lceil \frac{G_c}{H} \right\rceil - 1\right)$$

For reference, let us now estimate the number of packets sent to processors in the influence region if p_i were not given and we instead assumed that the connections to this region were distributed randomly. We call this the unconditional case. It gives an upper bound for the number of packets, and also indicates the effect of p_i in reducing interprocessor communication. The influence region of a cell is spread among G_i/H home groups. Since a cell has C_i connections to this region, the probability that it has no connections to a specific home group is (1 − H/G_i)^{C_i}, which can be approximated by e^{−HC_i/G_i} if G_i/H ≫ 1. Thus the average number of processors that are virtual neighbors of a cell because of its connections to the influence region is (G_i/H)(1 − e^{−HC_i/G_i}). Since the connections are random, the number of cells in processor k that have processor l as a virtual neighbor is simply given by

$$v_{k,l} = H\left(1 - e^{-HC_i/G_i}\right) \qquad (6.2.6)$$

Therefore, the average number of packets sent by processor k is

$$\Phi_i = \frac{G_i}{H}\left\lceil \frac{v_{k,l}}{\beta} \right\rceil \approx \frac{G_i}{\beta}\left(1 - e^{-HC_i/G_i}\right) \qquad (6.2.7)$$

where the approximation holds when the packets are nearly full. The value of Φ_r for the unconditional case is calculated in a similar way, by simply substituting G_r for G_i and C_r for C_i in Eq. 6.2.7.

The average number of packets handled by a processor depends on the multiprocessor topology used. Let d_c be the average distance (in terms of the number of links traversed from source to destination) to a virtual neighbor in the core region. Similarly, we define d_i and d_r for virtual neighbors in the influence region and the remote region respectively. Then the total bandwidth required per processor per iteration cycle is

$$\Phi_c(d_c + 1) + \Phi_i(d_i + 1) + \Phi_r(d_r + 1) \qquad (6.2.8)$$

To estimate the average distance to a processor in a particular region, we first observe that the processors in a specific region are essentially confined to a subnetwork of size N G_{region}/M. By assuming the destination to be randomly placed within this subnetwork, we get a conservative value of the average distance. Consider a subnetwork of size 2^q. If this network is a boolean hypercube, the average distance is q/2. In the case of a torus, the average distance is given by

$$0.5 \times 2^{q/2} \;\;\text{if } q \text{ is even}; \qquad 0.75 \times 2^{(q-1)/2} \;\;\text{if } q \text{ is odd}$$

For a (d,h)-hypernet constructed from cubelets, this distance is bounded above by 2^{h−2}(d + 2) − 1. If the hypernet is constructed from buslets, the distance is reduced to 2h − 1, provided there is no bus contention. Thus, given the number of processors in each of the three regions and the multiprocessor used, we can calculate d_c, d_i and d_r.
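These distance estimates can be collected in a single helper. This is a sketch under the assumptions just stated; the hypernet figures are the quoted upper bounds rather than true averages, and the buslet figure assumes contention-free buses:

    def avg_distance(topology, q, d=5, h=2):
        """Average (or bounding) distance in a subnetwork of 2**q nodes."""
        if topology == "hypercube":
            return q / 2
        if topology == "torus":
            if q % 2 == 0:
                return 0.5 * 2 ** (q // 2)
            return 0.75 * 2 ** ((q - 1) // 2)
        if topology == "cubelet-hypernet":   # upper bound for a (d,h)-hypernet
            return 2 ** (h - 2) * (d + 2) - 1
        if topology == "buslet-hypernet":    # assumes contention-free buses
            return 2 * h - 1
        raise ValueError("unknown topology: " + topology)

    # The influence region of net A on a 64K-processor hypercube occupies a
    # subnetwork of N*G_i/M = 2**16 * 2**16 / 2**24 = 2**8 nodes, so d_i = 4.
    print(avg_distance("hypercube", 8))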
Since we already know how to estimate Φ_c, Φ_i and Φ_r, we can now evaluate Eq. 6.2.8.

6.2.2 Simulation Results

Table 6.1 gives the specifications for the network graphs of five hypothetical neural systems that were used for simulation. The descriptions are in accordance with the structural model described in Sec. 5.2, except that p_i and p_r are not specified.

Table 6.1: Specifications of Five Hypothetical Neural Nets

  Name    M      G_c    G_i     C_c    C_i    C_r
  net A   2^24   512    65536   128    128    32
  net B   2^24   256    16384   64     128    128
  net C   2^24   512    8192    256    256    16
  net D   2^24   2048   65536   1024   1024   512
  net E   2^20   128    8192    128    128    32

Nets A to D have about 16 million cells each. The average number of connections per cell ranges from 288 (net A) to 2560 (net D). Net A serves as a reference structure, compared to which the connections of net B are more spread out, while the connections of net C are more concentrated in regions of locality. Net D is characterized by an almost an order of magnitude greater number of connections as compared to the other nets. Finally, net E has a structure similar to net A, but has only 1 million cells.

The networks given in Table 6.1 were mapped onto hypercube, hypernet and torus architectures using the principles given in Sec. 5.4.3. For the purposes of obtaining values of Φ_c, Φ_i and Φ_r through simulation, the values of the model parameters were taken from a normal distribution with standard deviation σ equal to a tenth of the mean values given in Table 6.1. Various values of p_i and p_r were considered. The packet size, given by the maximum number of messages sent in a packet, is 16 unless mentioned otherwise.

As mentioned in Sec. 5.4.4, a packet contains a cell identity number and an output value for each message that it conveys. Besides, it carries header information containing the addresses of the source and destination processors, and error-correcting and control bits. Thus, the size of a full packet is about 50 bytes. Note that some packets will be smaller. For example, if 18 of the home cells of processor k have processor l as a virtual neighbor, then processor k needs to send two packets to l per iteration cycle in a synchronous execution. One of the packets conveys 16 messages while the other conveys only two. However, since most of the overhead is expected to be in the receiving, storing and forwarding of packets rather than in their actual transmission over a physical link, we treat both packets equally in terms of their load on the interconnection network.
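For the experiments below, it is convenient to have Table 6.1 in machine-readable form. The encoding is our own; it also recovers the derived quantities p_c = C_c/G_c and the total connections per cell:

    NETS = {
        "A": dict(M=2**24, G_c=512,  G_i=65536, C_c=128,  C_i=128,  C_r=32),
        "B": dict(M=2**24, G_c=256,  G_i=16384, C_c=64,   C_i=128,  C_r=128),
        "C": dict(M=2**24, G_c=512,  G_i=8192,  C_c=256,  C_i=256,  C_r=16),
        "D": dict(M=2**24, G_c=2048, G_i=65536, C_c=1024, C_i=1024, C_r=512),
        "E": dict(M=2**20, G_c=128,  G_i=8192,  C_c=128,  C_i=128,  C_r=32),
    }
    for name, n in NETS.items():
        conns = n["C_c"] + n["C_i"] + n["C_r"]
        print(f"net {name}: {conns} connections/cell, p_c = {n['C_c']/n['G_c']}")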
6.2.2.1 Effect of clustering

First, net A was mapped onto a 14-dimensional hypercube and the bandwidth required was calculated for various values of p_i and p_r. In the unconditional case, where the connections within each region are distributed randomly, the average number of packets handled per processor is 265,216, which corresponds to over 16 Mbytes/iteration cycle. Figure 6.1 shows the number of packets handled for different values of the conditional probabilities. The dashed lines correspond to the theoretical estimates, while the solid lines are the experimentally observed values.

[Figure 6.1: Effect of clustering on bandwidth requirements. Packets per iteration (thousands) versus 1/p_r, plotted for 1/p_i = 8, 16 and 32.]

We see that for net A, p_i is somewhat more critical than p_r in determining the bandwidth. If both p_i and p_r are greater than 1/16, then the bandwidth required is reduced to well within 1 Mbyte/iteration-cycle, which is less than 5% of the unconditional case. This indicates that prominent clustered features in the structure of a neural network are critical in bringing bandwidth requirements down to a manageable level. We also observe that the theoretical estimates are more conservative: the experimental values obtained are from 2 to 9% greater than the theoretical predictions.

6.2.2.2 Effect of system size

Fig. 6.2.a shows how the bandwidth varies with the number of processors in the system. The curve is based on the mapping of net B, with p_i = 1/8 and p_r = 1/16, onto hypercube sizes ranging from 1024 to 256K processors. The number of packets handled per processor first decreases when the system size is increased from 1024 to 64K processors, despite the fact that in a larger network the average distance between sender and receiver is greater. When 64K processors are used, the home group and core sizes are approximately equal. The increase in bandwidth required when the system size is smaller than 64K is attributable to two factors:

• The size of a home group increases, so a processor has to send more messages to its virtual neighbors. This factor is not so dominant because a cell needs to send at most one message to a processor, irrespective of the number of cells in that processor's home group to which this cell is connected.

• When more than one core is mapped onto a processor, the impact of clustering is diluted, because some external core may have many connections to one core but very few to the other cores in the same home group. This is largely responsible for the 20-fold increase in bandwidth for 1024 processors as compared to the 'optimal' system size of 64K.

[Figure 6.2: Bandwidth demands for different system sizes: (a) packets per iteration (thousands) per processor, and (b) Gbytes per iteration for the overall system, both versus log(no. of processors).]

Interestingly, the bandwidth requirement shoots up dramatically when the system size is increased beyond 64K, despite the presence of smaller home groups. This is mainly due to three factors:

• The cores are now spread over several processors, and so messages to simulate core connections also need to be sent.

• A cell has very few connections going to a neighboring processor. Since the total number of connections per cell is the same, the number of messages sent by a cell increases.

• v_{k,l} is very small due to the small home sizes, so packets are almost empty. This is particularly significant for packets sent to the remote region, where the average number of messages conveyed per packet was found to range from 1.01 for net C to 7.52 for net D when 256K processors were used.

The causes mentioned above again weaken the clustering effect. The message patterns approach the unconditional case as the system size approaches M.

The total number of packets handled in the system per iteration cycle is, of course, expected to increase with the system size. This is verified in Fig. 6.2.b. It means that the total communication demands increase faster than the speedup achieved by adding more processors to the system, provided all processors have the same computational power.
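The compute-bound iteration times quoted in the example that follows can be reproduced in a few lines. This is a sketch with an illustrative helper name; net B is taken as 16 million cells with 320 connections per cell, and overheads are ignored:

    def iteration_time(cells, conns_per_cell, n_procs, conns_per_sec=1e6):
        # Total connection updates divided by the aggregate update rate.
        return cells * conns_per_cell / (n_procs * conns_per_sec)

    print(iteration_time(16e6, 320, 4096))    # -> 1.25 s   (4K processors)
    print(iteration_time(16e6, 320, 65536))   # -> 0.078 s  (64K processors)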
For example, suppose a processor can process 1 million connections per second. Neglecting overheads, one iteration of net B will take about 1.25 secs on a system with 4K processors. This demands a total communication bandwidth of 3.2 Gbytes/sec. If we use 64K processors, then one iteration takes only 0.078 secs, provided the bandwidth is 18.5 Gbytes/sec.

The rapid increase in required total bandwidth with the system size is brought out in Fig. 6.3, where the bandwidth required to achieve a linear speedup is shown as a function of the system size, when net B (with p_i = 1/8, p_r = 1/16) is mapped. It is assumed that each processor has a speed of 1 MCUPS. The actual bandwidth required will be less than what the figure shows, because the iteration cycles will be longer due to overheads.

[Figure 6.3: Interprocessor communication demands for linear speedup. System bandwidth versus log(no. of processors) for net B.]

6.2.2.3 Effect of network structure

Figure 6.4 shows the bandwidths required by the other four nets for the same values of p_i and p_r, when mapped onto hypercubes of various sizes. All these nets need less bandwidth than net B. This is particularly remarkable for net D, which has many more connections than net B. It was experimentally observed that for net D, more packets were sent in the influence region. However, the number of packets sent to the remote region was significantly less than that for net B. This counter-intuitive observation is due to two effects. Firstly, the number of messages sent per cell was actually less, by 46% to 77%, for net D, because the denser cores of D caused the clusterings to be more prominent. Secondly, v_{k,l} was larger for net D because of the denser cores and increased number of connections. Consequently, processors had fewer virtual neighbors on the average when net D was mapped. In general, the sparseness and spread of connections is seen to be more demanding on the system bandwidth than the actual number of connections.

[Figure 6.4: Bandwidth required for different nets. Packets per iteration (thousands) versus log(no. of processors) for nets A, C, D and E.]

Some other observations made from Fig. 6.4 are:

• For all curves, the minima correspond to system sizes for which the home group and the core sizes become comparable.

• Net D has a structure similar to net A but almost 10 times as many connections. However, the bandwidth requirements do not increase proportionally. This reinforces our earlier observation that the distribution of connections can be more critical than the actual number of connections in determining the system bandwidth.

• Net C requires the least amount of interprocessor communication because of the greatly localized nature of its connections.

• Net E has a structure similar to net A, but contains only 1/16th the number of cells. This does not cause a significant reduction in the bandwidth demands, though there is a prominent shift in the optimum system size.

6.2.2.4 Choice of Packet Size

[Figure 6.5: Effect of packet size on bandwidth (p_i = 1/8, p_r = 1/16). Bandwidth in Kbytes/cycle versus packet size, for net A on a hypercube.]

Till now we have assumed that each message packet contains at most 16 messages. Consider a system with 64K processors, used to implement networks of up to 16M cells. The source and destination processor addresses are two bytes each, and the internal address of a cell is one byte. The output of a cell is specified by one byte. Therefore, a packet containing p messages will be at least (4 + 2p) bytes in size. Other information, such as routing control bits and error correction bits, is common to all messages in the packet. The overheads in packet storing and forwarding are also shared by the constituent messages. Thus we would like to have large packet sizes.
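The packet-size arithmetic above is captured by a one-line helper (a sketch; control and error-correction bits are excluded, as in the text):

    def min_packet_bytes(p):
        # 2-byte source + 2-byte destination addresses, then a 1-byte
        # cell id and a 1-byte output value per message.
        return 4 + 2 * p

    print(min_packet_bytes(16))   # -> 36 bytes; about 50 with control/ECC bits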
[Figure 6.6: Effect of packet size on bandwidth (p_i = p_r = 1/32). Bandwidth in Kbytes/cycle versus packet size, for net A on a hypercube.]

If the packet size is fixed, then the packets sent by processors that need to convey only a few messages to particular neighbors may not be full. So the desirable packet size is determined by the distribution of v_{k,l} in the three regions. This in turn depends on the particular network used. Fig. 6.5 shows the bandwidth required in Kbytes/iteration-cycle when net A (with p_i = 1/8, p_r = 1/16) is implemented on a 64K-processor system. In this case, most of the v_{k,l} values are greater than 32, so using a packet size greater than 16 is desirable. However, for the same net with p_i = p_r = 1/32, the values of v_{k,l} fall, so that larger packets are no longer attractive, as shown in Fig. 6.6. We see that v_{k,l} is quite sensitive to the structure of the connectionist network. So an alternative is to use variable-sized packets, or to specify the packet size at the start of the computation, after a compile-time estimation of the distribution of v_{k,l}.

6.2.2.5 Effect of Broadcast Hierarchies

In the previous section, the outputs of a cell were conveyed to its virtual neighbors through personalized packets. An alternative is to broadcast the output to all processors, or to the processors in a predefined neighborhood called the broadcast domain (partial broadcast). The latter approach is taken in [GH87], where three broadcast domains are used. On receiving a broadcast packet, a processor examines the sender's address and records only those messages that are needed by its home group. Other messages are simply ignored.

Optimum distributed broadcast schemes have been proposed for the hypercube, hypernet and the torus [HJ86,HG87a]. For the first two architectures, a broadcast message is received by all processors in O(log N) time. For the torus, O(√N) time is required. In all three schemes, each node is visited only once.

Broadcasting is more efficient than sending personalized messages if, on the average, a sizable number of processors in the broadcast domain are actually virtual neighbors of the sender cell. Since the percentage of virtual neighbors is very low in the remote region, it is not advisable to broadcast over the entire network. Total broadcasting requires a processor to handle about M/β packets per iteration. This is because a processor needs to send H/β packets per iteration, and each packet goes to M/H − 1 processors. For a network with 2^24 nodes and a packet size of 16, this translates to over 1 million packets per processor per iteration cycle. This far exceeds the requirements for personalized communication for all the networks considered. On the other hand, if Case 2 holds, broadcasting to processors implementing the same core region always shows an improvement.
Broadcasting to all processors implementing an influence region is preferable if p_i is less than some threshold, p_b. To estimate p_b, we first observe that the number of packets that a processor needs to handle because of broadcasting within this domain is given by

$$\frac{H}{\beta} \cdot \frac{G_i}{H} = \frac{G_i}{\beta} \qquad (6.2.9)$$

since each of the H/β broadcast packets sent per processor is handled once by every processor in the domain. For personalized messages, if G_c ≥ H (Case 1) then, from Eqs. 6.2.3 and 6.2.8, the number of packets handled per processor is greater than C_i d_i/(β p_c p_i). Thus, if

$$p_i < \frac{C_i\, d_i}{p_c\, G_i}$$

then broadcasting is preferable. Note that C_i/G_i is the probability of connection in the unconditional case. If G_c < H (Case 2) then, from Eqs. 6.2.5 and 6.2.8, the number of packets, composed of personalized messages to the region of influence, that are handled per processor is more than

$$\frac{C_i\, d_i}{\beta\, p_c\, p_i} \qquad (6.2.10)$$

Comparing Eqs. 6.2.10 and 6.2.9, we deduce that broadcasting is surely less demanding when

$$p_i < \frac{C_i\, d_i}{p_c\, G_i} \qquad (6.2.11)$$

which is the same expression as in the first case. Equation 6.2.11 is based on conservative assumptions and gives a lower bound on p_b.

Experimentally observed values are given in Table 6.2. To obtain this table, we first found the bandwidth demands when broadcasting is performed for messages to the core and influence regions, and personalized messages are sent to the remote region; p_b is the value of p_i below which the above scheme for partial broadcast requires lower bandwidth than a scheme without any broadcasting. We see that the trade-off point is quite variable and depends on both the network characteristics and the system size.

Table 6.2: Values of p_b for Various Nets

                    net A   net B   net C   net D   net E
  4K processors     0.14    0.56    0.52    0.15    0.07
  64K processors    0.06    0.13    0.22    0.16    0.21
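The lower bound of Eq. 6.2.11 is easy to evaluate. A sketch, with the function name our own; the value of d_i comes from the subnetwork-distance estimates of Sec. 6.2.1:

    def p_b_lower_bound(C_i, G_i, d_i, p_c):
        # Eq. 6.2.11: partial broadcast over the influence-region domain
        # is surely cheaper than personalized packets below this p_i.
        return (C_i / G_i) * d_i / p_c

    # Net A on 64K processors (d_i = 4 on a hypercube): the bound 0.031
    # is indeed below the experimentally observed 0.06 of Table 6.2.
    print(p_b_lower_bound(C_i=128, G_i=65536, d_i=4, p_c=0.25))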
6.3 Suitability of Existing Parallel Computers

At present, a number of research groups are using commercially available parallel computers for neural network simulation. These include medium-grained systems, such as the transputer-based Computing Surface with 42 processors being used at Edinburgh [FRS*87] and the 128-node BBN Butterfly at Rochester [GFL87]. Besides, fine-grained computers such as the ICL DAP [FRS*87] and the Connection Machine are also being used.

Are any of these systems suitable for simulating very large neural nets? We attempt to answer this question using the architectural requirements determined in the previous sections. Our analysis suffices to detect bottlenecks and imbalances between processing, memory and communication requirements. A more thorough assessment would require examining both machine characteristics and implementation strategies in great detail.

Consider the simulation of a neural net with one million cells and an average of a thousand connections per cell. Table 6.3 presents typical requirements per processor for system sizes of 1K, 4K and 16K processors, given p_i = 1/8 and p_r = 1/16.

Table 6.3: Architectural Requirements for Simulating a Neural Network with 1 Million Cells

  System Size      Memory    Communication Bandwidth     Processor Speed
  (# processors)   (Mbytes)  (Mbytes/iteration)          (MCUP/iteration)
                             hypercube    torus
  1024             8         0.5          4               1
  4096             2         0.5          8               0.25
  16384            0.5       2            32              0.07

On comparing the specifications of existing multiprocessors with this table, we find that medium-grained systems suffer from inadequate communication bandwidth to fully exploit their processing capabilities. Complex-instruction-set-computer (CISC) based systems are an overkill, since processing is largely confined to a fast execution of matrix-vector products, which form the inner loop for neural computations. A pipelined architecture with a limited instruction set or with a hardware accelerator is more useful. Moreover, computers such as the NCUBE and the FPS T-Series [SF88] do not have adequate local memory.

The processors of fine-grained systems such as the CM-2 and the DAP show a better match with the computational requirements. However, lack of local memory is even more critical for these machines. For example, in the CM-2, each processor has only 8K bytes of local memory [Con87]. So we are forced to use secondary storage, which leads to a new bottleneck in I/O transfers. For the 1 million cell network, the processors are capable of completing one iteration per second, but the I/O bottleneck restricts speed to over 14 seconds per iteration. On the other hand, if infinite I/O access speed were available, then the interprocessor communication abilities would have limited speed to about 10 seconds per iteration. Thus the I/O transfer rate is a more critical limitation than interprocessor communication for such systems.

Table 6.4 shows the strengths and weaknesses of some existing parallel systems for simulating very large neural networks.

Table 6.4: Ability of Existing Parallel Computers to Simulate Large Neural Networks

  Parallel Computer      No. of      Processor   Memory      Communication
                         Processors  Functions   Support     Bandwidth
  Connection Machine-2   64K         adequate    poor        inadequate
  FPS T-Series           4096        overkill    inadequate  inadequate
  DAP                    4096        inadequate  poor        poor
  NCUBE/ten              1024        adequate    poor        adequate
  GF-11                  576         overkill    inadequate  adequate
  Butterfly Plus         256         inadequate  inadequate  inadequate
  Computing Surface      130         overkill    poor        poor

Transputer-based systems and the CM-2 seem to offer the best alternatives for neural simulation. The former has the advantages of high-speed concurrent communication links, hardware support for process scheduling, and a large addressable memory space. However, forwarding of messages through intermediate nodes can only be achieved by software routines, which slows down communication between non-adjacent processors. The high-resolution graphics display of the Connection Machine can provide a visual trace of the network execution. However, the shortcomings of these machines are significant enough to motivate the long-term development of specialized computers for highly concurrent simulation of large neural networks.

6.4 Optical Interconnects

A solution to the above problem is to use optical interconnects, which provide higher bandwidths and also circumvent pinout limitations. Optical crossbars can be implemented as matrix-vector multipliers in a variety of ways [SJRV87]. Medium-sized crossbars with over 100 lines and signal bandwidths of 100 MHz per line have been constructed. However, the crossbar switch reconfiguration time is at least 1 μs, which is ten times slower than electronic implementations. Thus they are not so useful in packet-switching environments. Furthermore, they cannot be easily cascaded, due to incompatible input-output formats. For the same reason, optical MINs are not considered suitable either.

Optical buses provide high bandwidth and fan-in/fan-out capabilities. Also, time-division multiplexing can be used to share an optical bus among several processors.
With current technology, the control requirements and power consumption of optical buses are still formidable. Moreover, for large systems, multiple buses still need to be used.

Direct interconnects among processor chips can be made through volume holograms or optical fibers. Optical cables, connectors and routing devices, light sources and detectors, which can be used to form 100 Mbits/sec backplane interconnects, are commercially available, most notably in the AT&T ESS switch network [HHH87]. Moreover, fanouts of up to 10 loads from a single fiber are feasible. Such fibers can be used for interchip connections within the same board to implement partial broadcast schemes. This can lead to lower bandwidth requirements, since cells within the same influence region are typically mapped onto processors on the same board.

If multiple processors are implemented on the same chip, then internal nodes can be accessed through data paths in free space. This can be implemented using hybrid GaAs/Si chips and hologram routing elements [GLKA84]. Interchip communication using the same techniques is not so appropriate, because of increased bulk and alignment problems. Various optical interconnection networks are given in [GH88]. Among these choices, hologram networks are the most flexible for inter-board communications.

By using optical fibers for direct links in a 1024-processor hypercube system, one can perform over 25 iterations/sec. This is considered adequate for most real-time systems, since convergence is typically achieved in a few iterations. To attain the desired operating speed, one can have several pipelined ALUs operating in parallel within the same processor node and sharing the same memory. The removal of the communication bottleneck and the ease of adjusting processing capabilities make it possible to design a balanced system for efficient neural network simulation. The principal penalty paid is the increased cost, power consumption and bulk accompanying a hybrid approach.

6.5 Execution of Connectionist Models on the Hypernet

In this section, we assess the suitability of hypernets as a virtual neurocomputer for supporting distributed connectionist models. Highly interconnected, fine-grain processors are needed to construct the basic modules for these hypernets. Examples of such modules would be high-dimensional cubelets, buslets, and sets of nodes connected as a complete graph or a complete bipartite graph. The modules should have as many links per node as is practically feasible under state-of-the-art technology and economic constraints.

For a 64K system, each processor requires about 128K bytes of local memory, as estimated in Section 5.4.2. The local memory of each node is used to store the states and the weighted connectivity matrix of all the neurons assigned to it. The states are updated using multiplexing and pipelining. A processing speed of 1 MCUPS enables over 50 iterations per second.

We compare hypernets constructed from 5-cubelets and 5-buslets with hypercube and torus architectures of the same size regarding their communication support capabilities and hardware requirements for supporting neural network simulation. We also indicate how a hierarchical view of hypernets can alleviate the problems of knowledge representation and processing, and of perceptual opaqueness [Par87], in the implementation of neural networks.
6.5.1 Communication Support

To examine the effect of the multiprocessor topology on bandwidth demands, net A (with p_i = 1/8, p_r = 1/16) was mapped onto different architectures. In the simulation of a buslet, an Ethernet type of protocol was assumed, with a 25% chance of a packet being accepted. The results are shown in Fig. 6.7.a.

Not surprisingly, hypercubes, which have the highest degree of the four topologies, have to handle the least number of packets. However, hypernets are seen to have a comparable performance while using a smaller number of communication links per processor. The torus suffers from a large diameter (O(√N)), and so performs poorly, particularly when the network size is large.

[Figure 6.7: Bandwidth requirements for different multicomputer topologies (hypercube, (5,h)-hypernet, busnet and torus): (a) bandwidth demands in packets per iteration (thousands), and (b) normalized bandwidth demands (bandwidth times link complexity), both versus log(no. of processors).]

When the bandwidth demanded is normalized by multiplying it by the number of links per processor, hypernets are more effective than hypercubes, as seen from Fig. 6.7.b. This is because hypernets provide more support for local communication while maintaining a constant degree. Thus they are more suited for implementing networks in which most of the connections from a cell are confined to a small subnetwork.

The connections of net C are less spread out than those of net A. Fig. 6.8 shows the normalized bandwidth (for p_i = 1/8 and p_r = 1/16) when net C is simulated. There is a relative improvement in the performance of hypernets, as anticipated.

[Figure 6.8: Bandwidth demands in implementing net C on hypernets and hypercubes. Bandwidth times link complexity versus log(no. of processors).]

We conclude that hypernets are more cost-effective in supporting interprocessor communications when the neural models being implemented have a highly localized or clustered structure.

6.5.2 Packaging and VLSI Considerations

Let us look at packaging considerations for building a system with 64K processors. In view of current technology, the following limits are assumed:

  Max. processors/chip = 32 (area limited)
  Max. I/O pins/chip = 256 (pin limited)
  Max. chips/board = 256 (area limited)
  Max. I/O ports/board = 2400 (connector limited)

Suppose each bidirectional link needs two I/O pins, namely serial-in and serial-out. Under the above limitations, a 4x8 torus or a (5,1)-net (either bus-based or cubelet-based) can be implemented on a chip. For the hypercube, however, only 8 processors can be implemented on a chip, since a 4-cube would require 16x12x2 = 384 I/O pins, which exceeds the limit. At the board level, the area limitation is again more critical than pin/connector limitations for the torus. So a board can accommodate the maximum limit of 8K processors for a torus, and all 64K processors require only 8 boards. In the case of the hypernet, four (5,2)-nets can be implemented on a single board, and 32 boards are required to accommodate all 64K processors. In contrast, only a 7-cube can be accommodated on a board, barely meeting the I/O connector limit.
The total number of boards required is 16 times, and the number of inter-board connections almost 18 times, that for the hypernet. These packaging statistics are summarized in Table 6.5.a, where only the links required to make the connections specified by the topology are considered in determining the I/O pins or off-board connectors used. The low link complexity of the torus is reflected in its hardware demands. However, the hardware savings due to the interconnection network are not commensurate with the much greater bandwidth requirements as compared with the hypercube or hypernet. For the hypercube, pinout limitations become crucial at both chip and board levels, while for the hypernet, the limit on I/O wires becomes significant only at the board level. This results in the hypercube requiring much more hardware, and further underscores the importance of reducing the number of off-chip and off-board connections.

Table 6.5: Packaging Requirements for a Connectionist Machine with 64K Processors

(a) Using bit-serial channels.

  No. of processors        Torus             Hypernet                 Hypercube
    Per chip               32 (8x4 torus)    32 (one (5,1)-net)       8 (one 3-cube)
    Per board              8,192 (128x64)    2,048 (four (5,2)-nets)  128 (one 7-cube)
    Overall system         8 boards          32 boards                512 boards

  I/O pins or off-board connectors used
    Per chip (pins)        48                64                       208
    Per board (connectors) 768               2,164                    2,304
    Overall system         6,144             69,248                   9x2^17

(b) Using 4-bit-wide channels.

  No. of processors        Torus             Hypernet                 Hypercube
    Per chip               32 (8x4 torus)    32 (one (5,1)-net)       2
    Per board              4,096 (64x64)     512 (one (5,2)-net)      16 (one 4-cube)
    Overall system         16 boards         128 boards               4K boards

  I/O pins or off-board connectors used
    Per chip (pins)        192               256                      240
    Per board (connectors) 2,048             2,048                    1,536
    Overall system         2^15              2^18                     3x2^21

It should be noted that, if a smaller number of processors were used with more memory per processor, then the maximum number of processors per chip would be less. In that case, the pin limitations of the hypercube would not be so evident. For example, if we were allowed only 4 processors per chip, then a 12-dimensional hypercube would require 16 boards, as compared to 4 boards for the torus or the hypernet. On the other hand, if a wider bus were used, the pin-out limitations would become even more critical, making high-degree topologies quite untenable for large systems. This is exemplified in Table 6.5.b, which is based on allocating 8 pins per input/output channel pair.
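The hypercube entries in Table 6.5 follow a simple rule: a k-subcube placed on one package of an n-dimensional hypercube leaves n − k links per node off-package. A sketch (the helper name is ours):

    import math

    def hypercube_package_pins(nodes, system_dim, pins_per_link=2):
        # Off-package I/O pins when a subcube of `nodes` processors is
        # placed on one chip or board of a 2**system_dim hypercube.
        k = int(math.log2(nodes))
        return nodes * (system_dim - k) * pins_per_link

    print(hypercube_package_pins(8, 16))       # 3-cube per chip  -> 208 pins
    print(hypercube_package_pins(16, 16))      # 4-cube           -> 384 (> 256 limit)
    print(hypercube_package_pins(128, 16))     # 7-cube per board -> 2304 connectors
    print(hypercube_package_pins(16, 16, 8))   # 4-bit channels: 4-cube -> 1536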
6.5.3 Hierarchical Operation

The interconnection structure and I/O capabilities of hypernets provide hardware support for a hierarchical representation, processing and control of neural nets. A major drawback of distributed connectionist representations is that they are hard for an outside observer to understand or to modify. A scheme for learning or self-organization is therefore desirable. Current learning schemes, such as backward error propagation [RM86] and stochastic methods [FH83,GG84], are extremely slow even for modest examples, and scale poorly. A practical solution is to partition large networks into smaller modules that can learn more or less independently of one another. Hypernets provide subnets of various sizes within which one can constrain error propagation or simulated-annealing search to speed up the learning process. Subnets of appropriate sizes can be reserved for function-specific segments of the neural network. These subnets also form natural broadcasting domains. Since each subnet has dedicated I/O channels, one can selectively monitor and influence the functioning of a particular neural network segment. The I/O nodes are needed for supporting the intensive I/O operations needed for error detection and feedback.

Sophisticated cognition models involve simultaneous and mutually reinforcing recognition of features at multiple levels of abstraction. In visual recognition, for example, high-level and low-level routines cooperate and compete with each other in coming to a global comprehension of a scene [Mar82]. Connectionist networks corresponding to such situations involve layers of nodes with arbitrary interlevel constraints and interactions. In a simple pyramidal structure, nodes in one layer are directly connected only to nodes in the two layers directly below and above them. This is clearly inadequate for complicated multilevel interactions. When a hypernet is viewed as a multilevel hierarchy (Section 4.2.1), some nodes serve as virtual nodes at several levels. These nodes are prime candidates for the mapping of connectionist units which are simultaneously operative at various levels.

6.6 Discussion and Summary

In Chapter 5, we provided a distributed processor/memory organization for highly concurrent simulation of neural networks. The multiprocessor organization involves little duplication of computation and storage. We also reasoned that architectures with direct interconnects or multiple buses for interprocessor communication are preferable to multistage networks. A fine-grained and loosely coupled multicomputer is more suitable than a shared-memory multiprocessor for massively concurrent connectionist processing.

In this chapter, we estimated the interprocessor communication bandwidth required to provide a balance between processing and communication when such a network is simulated on a massively parallel multiprocessor system. The theoretical estimates are observed to be slightly more conservative than the experimentally determined figures. We observe that the interconnect patterns among the cells have a dramatic impact on the bandwidth requirements. A moderate amount of clustering can reduce the bandwidth demands by over 90% when compared to a totally random interconnect pattern. Similarly, networks with highly localized interconnects require less interprocessor communication
The increase in overall system bandw idth require­ m ents for attaining a linear speedup is more th an linear. Net C has the lowest requirem ents of the nets considered For pi— 1/8, pr= 1/16, even net C needs a total bandw idth of over 6 GBytes/sec when m apped onto a 16-dimensional hypercube, if iteration cycle tim e is to be kept w ithin one second. Im plemen­ tation of such large networks motivates the search for new technologies such as optics, th a t can provide very high bandw idths. Issues involved in using optical interconnects for constructing a neural net sim ulator, are explored in Ghosh and Hwang[GH88j. An alternative is to reduce bandw idth dem ands by modifying the com­ putational model. For example, a cell m ay broadcast its updated output only if it differs significantly from its old output. Moreover, it may be allowed to update its state even when it has not received the updated outputs of m any of its neighbors. The effect of these pragm atic policies on execution speed, and indeed on the results themselves, is an area not fully explored yet. 132 Chapter 7 Conclusions T h e w o o d s are lovely, dark and deep, B u t / have p ro m ises to keep. A n d m iles to g o befo re / sleep. — R O B E R T F R O S T , Stopping by Woods on a Snowy Evening (1923) Enough research will ten d to su p p o rt yo u r theory. — Arthur B loch, M urphy’s Law (1979) 7.1 Main Contributions In this dissertation, we studied a host of issues influencing the cost- effective support of interprocessor comm unication in large m ulticom puter sys­ tem s. These issues were elaborated through their application in the design and analysis of hypernets, and further concretized in a detailed case study of parallel architectures for neural network sim ulation. The key original contributions of this thesis are: • The hypernet family of hierarchical networks is introduced in C hapter 2. A construction methodology is provided and schemes for distributed routing and broadcasting of messages are developed for these networks. In subse­ quent chapters, hypernets are shown to exhibit m any properties desired of a communication-efficient architecture for massively parallel processing. A variety of algorithm s are m apped onto hypernets based on their ability 133 to em ulate hypercube connections or hierarchical organizations w ith little overhead. • T he concept of natural comm unication patterns, given in C hapter 3, pro­ vides a new m etric for determ ining the suitability of a given m ulticom puter network for supporting an application program and the associated m ap­ ping scheme. It thus relates the application dom ain to the architectural one. Similarly, the average p ath delay combines graph-theoretic consid­ erations w ith those of the underlying VLSI technology to provide a joint m etric for architecture evaluation. • The viability of dynamic task allocation and load balancing in the concur­ rent execution of data-dependent program s is shown through sim ulation studies presented in C hapter 4. O ur postulates about the localized nature of interprocessor comm unication in such environm ents are validated in the process. • A new model for characterizing the structure of large neural networks is presented in C hapter 5. This model points to an efficient way of m ap­ ping these networks onto m ulticom puters. Critical issues in th e highly concurrent sim ulation of artificial neural networks are examined. 
to emulate hypercube connections or hierarchical organizations with little overhead.

• The concept of natural communication patterns, given in Chapter 3, provides a new metric for determining the suitability of a given multicomputer network for supporting an application program and the associated mapping scheme. It thus relates the application domain to the architectural one. Similarly, the average path delay combines graph-theoretic considerations with those of the underlying VLSI technology to provide a joint metric for architecture evaluation.

• The viability of dynamic task allocation and load balancing in the concurrent execution of data-dependent programs is shown through simulation studies presented in Chapter 4. Our postulates about the localized nature of interprocessor communication in such environments are validated in the process.

• A new model for characterizing the structure of large neural networks is presented in Chapter 5. This model points to an efficient way of mapping these networks onto multicomputers. Critical issues in the highly concurrent simulation of artificial neural networks are examined.

• From the structural characterization of neural networks using the model of Chapter 5, we estimate the communication bandwidth demands in implementing these networks over a wide range of multicomputer sizes and topologies. These estimates show a mismatch between the capabilities of commercially available parallel processors and the projected architectural requirements. This motivates the development of new architectures for cost-effective simulation of large neural networks. We show that hypernets based on the tailor-made processor/memory organization detailed in Chapter 5 are prime candidates for this venture, if adequate communication and system I/O bandwidth is provided.

7.2 Suggestions for Further Research

There are several important related issues in the design and development of highly concurrent architectures that merit further research. These include:

• Extension of hypernets: A recently proposed architectural solution to the Delta-I noise problem [Dav82] has greatly enhanced the viability of fabricating a large crosspoint switching network on a small VLSI chip-set [VGG88]. Consider the use of such switches for connecting subnets at the same level as a complete graph, instead of the point-to-point connections proposed in Chapter 2. This will result in a dynamic network with enhanced communication abilities, at the cost of extra hardware and more elaborate routing control. A study of the tradeoffs involved can lead to generalized hypernets with improved architectural features. Further exploration of the potential of bus-based hypernets is also warranted.

• Routing switch design: Dedicated hardware routing chips such as the hyperswitch [CMP88] and the torus routing chip [DS86] have been designed for multicomputers exploiting medium-grain parallelism. Massively parallel architectures demand simpler routing hardware which can be integrated into the processor chips and can handle smaller messages. The optimal design of such switches is an open problem.

• Buffer management and flow control: The minimum number of buffers needed to prevent deadlock in hypernets can be derived using the concept of virtual circuits. Increasing the buffer size above this minimum can improve the performance of the network. The optimum buffer size needs to be determined for various interprocessor communication patterns. An effective buffer management policy is needed to prevent "buffer hogging" without sacrificing buffer utilization [Irl78]. The policy should also support a virtual cut-through mechanism, since this greatly reduces communication delay when a large number of hops is needed. Furthermore, the effectiveness of flow control [RF87], which aims at relieving message congestion by selectively throttling certain links, needs to be determined.

• Dynamic task allocation and load balancing: The load balancing scheme for the TAK program given in Sec. 4.3.2 does not distinguish among child processes when allocating them to processors, since it assumes that each child process needs the same amount of computation. In practice, processes have varying requirements. A policy that incorporates process complexity estimates in making dynamic allocation decisions should yield better results. Strategies that can cater to heterogeneous nodes and dynamic load pressures [Lin85] in the context of fine-grain concurrent processing also need to be developed.
• Applications of hypernets: The potential of hypernets in supporting object-oriented languages and parallel production systems was outlined in Sec. 4.4. Another promising area is computer vision, where many problems are amenable to massively parallel processing [Mar82]. These areas can serve as detailed case studies for further research.

• Operating system issues: The design of a distributed operating system for highly concurrent multicomputers has not been considered in this dissertation, though it is of vital importance. In particular, facilities for distributed debugging and for asynchronous checkpointing and recovery [Lin85] are needed for system reliability, availability and serviceability. The asynchronous nature of many neural network models greatly reduces the overheads during a highly concurrent execution, as explained in Chapter 5. What other problems can be solved by asynchronous computation models? Synchronization issues in highly concurrent processing need to be investigated.

• Neural network studies: This area is underdeveloped and has great potential for further research. A fundamental goal is to develop a sophisticated brain simulation engine. In our case study, we focussed on a small facet of this problem by investigating the architectural requirements and communication demands in simulating large neural networks. Other important research areas for attaining this goal include effective knowledge representation and techniques for network compilation, fast learning mechanisms which can make run-time modifications to cater to new knowledge, network stability and convergence issues, extraction and interpretation of results from the networks, and the design of a user interface. Significant progress in all these areas is needed to realize the full potential of connectionist computation.

Bibliography

[AA82] S. Amari and M. A. Arbib, editors. Competition and Cooperation in Neural Nets. Springer-Verlag, 1982.

[AG81] J. R. Armstrong and F. G. Gray. Fault diagnosis in a boolean n-cube array of microprocessors. IEEE Trans. Computers, C-30(8):587-590, Aug. 1981.

[AH87] M. A. Arbib and A. R. Hanson. Vision, Brain and Cooperative Computation. MIT Press, Boston, MA, 1987.

[AJP86] D. P. Agrawal, V. K. Janakiram, and G. C. Pathak. Evaluating the performance of multicomputer configurations. IEEE Computer, 19(5):23-37, May 1986.

[AV84] L. Adams and R. Voigt. A methodology for exploiting parallelism in the finite element process. In Proc. of the NATO Workshop on High Speed Computations, pages 373-392, J. Kowalik (ed.), Springer-Verlag, West Germany, 1984.

[Bar85] A. G. Barto. Learning by statistical cooperation of self-interested neuron-like computing elements. Human Neurobiology, 4:229-256, 1985.

[BDQ86] J. C. Bermond, C. Delorme, and J.-J. Quisquater. Strategies for interconnection networks: some methods from graph theory. J. of Parallel and Distributed Computing, 3(4):433-449, Dec. 1986.

[Bok81] S. H. Bokhari. On the mapping problem. IEEE Trans. Computers, C-30:207-214, March 1981.

[CG83] M. A. Cohen and S. Grossberg. Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Trans. on Systems, Man and Cybernetics, SMC-13(5):815-826, Sept./Oct. 1983.

[CG87] G. A. Carpenter and S. Grossberg. A massively parallel architecture for a self-organizing neural pattern recognition machine.
Computer Vision, Graphics, and Image Processing, 37:54-115, Jan. 1987.

[CH88] R. Chowkwanyun and K. Hwang. Multicomputer architectural support and load balancing functions for concurrent lisp execution. In K. Hwang and D. DeGroot, editors, Parallel Processing for Supercomputing and Artificial Intelligence, McGraw Hill, 1988.

[Cho88] R. Chowkwanyun. Dynamic Load Balancing for Concurrent Lisp Execution on a Multicomputer System. PhD thesis, University of Southern California, Los Angeles, 1988.

[CL75] A. M. Collins and E. F. Loftus. A spreading-activation theory of semantic processing. Psychological Review, 82:407-429, Nov. 1975.

[CMP88] E. Chow, H. Madan, and J. Peterson. Hyperswitch network for the hypercube concurrent computer. In Third Conference on Hypercube Concurrent Computers and Applications, 1988.

[Coh79] J. Cohen. Non-deterministic algorithms. ACM Computing Surveys, 11:79-94, 1979.

[Coh81] J. Cohen. Garbage collection of linked data structures. ACM Computing Surveys, 13:341-367, Sept. 1981.

[Con87] Connection Machine Model CM-2 Technical Summary. Thinking Machines Corp., 1987.

[CS86] T. F. Chan and Y. Saad. Multigrid algorithms on the hypercube multiprocessor. IEEE Trans. Computers, C-35:969-977, Nov. 1986.

[Dal86] W. J. Dally. A VLSI Architecture for Concurrent Data Structures. Technical Report 5209:TR:86, Ph.D. Thesis, California Inst. of Tech., Pasadena, CA, 1986.

[Dav82] E. E. Davidson. Electrical design of a high speed computer package. IBM Journal of Research and Development, 26(3):349-361, 1982.

[DNS81] E. Dekel, D. Nassimi, and S. Sahni. Parallel matrix and graph algorithms. SIAM J. of Computing, 4:657-675, Nov. 1981.

[DP78] A. M. Despain and D. A. Patterson. X-tree: a tree structured multiprocessor computer architecture. Proc. 5th Ann. Symp. on Computer Arch., 144-151, Aug. 1978.

[DS86] W. J. Dally and C. L. Seitz. The torus routing chip. Journal of Distributed Computing, 1(3), 1986.

[EM78] G. M. Edelman and B. Mountcastle. The Mindful Brain. MIT Press, Boston, MA, 1978.

[FB82] J. A. Feldman and D. H. Ballard. Connectionist models and their properties. Cognitive Science, 6:205-254, 1982.

[Fel82] J. A. Feldman. Dynamic connections in neural networks. Biological Cybernetics, 46:27-39, 1982.

[Fen81] T.-Y. Feng. A survey of interconnection networks. IEEE Computer, 12-27, December 1981.

[FH83] S. E. Fahlman and G. E. Hinton. Massively parallel architectures for AI: NETL, Thistle and Boltzmann Machines. Proc. National Conf. on Artificial Intelligence, 109-113, 1983.
3rd IE E E Conference on A I Appli­ cations, Feb. 1987. L. Gasser, C. Braganza, and N. Herm an. MACE: A flexible testbed for distributed AI research. In M. Huhns, editor, Distributed A rtifi­ cial Intelligence, pages 119-152, P itm an Publishers, 1987. S. Geman. Notes on a self-organizing machine. In G. E. Hinton and J. A. Anderson, editors, Parallel Models of Associative Memory, pages 237-263, Erlbaum , Hillsdale, N J, 1981. N. H. G oddard, M. A. Fanty, and K. Lynne. The Rochester Connec­ tionist Simulator. Technical R eport 233, Com puter Science D ept., University of Rochester, Rochester, NY, 1987. [GG84] [GGK*83] [GH87] [GH88] [GJS82] [GLKA84] [GR84] [GS81] [HB84] 141 S. Geman and D. Geman. Stochastic relaxation, Gibbs distribu­ tions, and the bayesian restoration of images. IE E E Trans. Pattern Analysis and Machine Intelligence, PAMI-6:721-741, Nov. 1984. A. Gottleib, R. G rishm an, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. The NYU U ltracom puter - designing an MIMD shared memory parallel com puter. IE E E Trans. Computers, C-32(2):175-189, 1983. R. D. Geller and D. W. Ham m erstrom . A VLSI architecture for a neurocom puter using high-order predicates. In Workshop on Com­ puter Architecture for Pattern Analysis and Machine Intelligence, pages 153-161, 1987. J. Ghosh and K. Hwang. Optically connected m ultiprocessors for sim ulating artificial neural networks. In SP IE Proceedings vol. 882, Jan. 1988. E. F. Gehringer, A. K. Jones, and Z. Z. Segall. The Cm* testbed. IE E E Computer, 15(10):38-50, Oct. 1982. J.W . Goodm an, F.I. Leonberger, S.-Y. Kung, and R.A. Athale. O ptical interconnections for VLSI systems. Proc. of the IEEE, 72(7):850-865, 1984. D. B. Gannon and J. V. Rosendale. On the im pact of comm unication complexity on the design of parallel num erical algorithm s. IE E E Trans. Computers, 0-33:1180-94, December 1984. J. R. Goodman and C. H. Sequin. Hypertree: A m ultiprocessor interconnection topology. IE E E Trans. Computers, C-30(12):923- 933, Dec. 1981. K. Hwang and F. A. Briggs. Computer Architecture and Parallel Processing. McGraw-Hill, New York, 1984. [HCG88] [Hec86] [HG87a] [HG87b] [HGC87] [HHH87] [Hil85] [HJ86] [HMT86] [Hoa78] 142 K. Hwang, R. Chowkwanyun, and J. Ghosh. Parallel architectures for implementing AI systems. In K. Hwang and D. DeGroot, editors, Parallel Processing for Supercomputing and Artificial Intelligence, McGraw Hill, 1988. R. Hecht-Nielsen. Performance limits of optical, electro-optical, and electronics neurocom puters. Proc. SP IE , 634:277-306, 1986. K. Hwang and J. Ghosh. Hypernet: a communication-efficient archi­ tecture for constructing massively parallel com puters. IE E E Trans. Computers, 0-36:1450-1466, Dec. 1987. K. Hwang and J. Ghosh. Supercom puters and artificial intelligence machines. In Milutinovic, editor, Computer Architecture: Concepts and System s, pages 307-354, Elsevier Science, New York, 1987. K. Hwang, J. Ghosh, and R. Chowkwanyun. Com puter architec­ tures for artificial intelligence processing. IE E E Computer, 19-29, Jan. 1987. L.D. Hutcheson, P. Haugen, and A. Husain. Optical interconnects replace hardw are. IE E E Spectrum, 30-35, M arch, 1987. W. D. Hillis. The Connection Machine. The M IT Press, Cambridge, M ass., 1985. C. T . Ho and S. Lennart Johnsson. D istributed routing algorithm s for broadcasting and personalized com m unication in hypercubes. In Proc. InVl Conf. on Parallel Processing, pages 640-648, Aug. 1986. D. H am m erstrom , D. Maier, and S. Thakkar. 
The cognitive archi­ tecture project. Computer Architecture News, 14:9-21, 1986. C. A. R. Hoare. Communicating sequential processes. Comm, of ACM , 21(8):666-677, Aug. 1978._________________________________ [Hop82] [Hop86] [HS86] [HT86] [Hwa87] [HZ81] [Irl78] [Kam76] [KL71] [KL84] 143 J. J. Hopfield. Neural networks and physical systems w ith emergent collective com putational abilities. Proceedings National Acadamy of Science, 79:2554-2558, Apr. 1982. J. J. Hopfield. Collective com putation, content-addressable memory, and optim ization problems. In Y. S. Abu-M ostafa, editor, Complex­ ity in Information Theory, Springer-Verlag, New York, 1986. W. D. Hillis and G. L. Steele. D ata parallel algorithm s. Comm, of the ACM , 29:1170-1183, Dec. 1986. J. J. Hopfield and D. W. Tank. Com puting w ith neural circuits: a model. Science, 625-633, Aug. 8, 1986. K. Hwang. Advanced parallel processing w ith supercom puter archi­ tectures. Proceedings of the IEEE, OC-75(10):1348-1379, October 1987. E. Horowitz and A. Zorat. The binary tree as an interconnection network: applications to m ultiprocessor systems and VLSI. IE E E Trans. Computers, C-30(4):247-253, Apr. 1981. M. I. Irland. Buffer m anagem ent in a packet switch. IE E E Trans, on Communications, COM-26(3):328-337, 1978. F. Kam oun. Design Considerations for Large Computer Com m uni­ cations Networks. PhD thesis, University of California, Los Angeles, 1976. B. Kernighan and S. Lin. An efficient heuristic procedure for p arti­ tioning graphs. Bell System s Tech. Jl., 49(2):291-307, 1971. R. M. Keller and F. C. H. Lin. Sim ulated perform ance of a reduction-based multiprocessor. IE E E Computer, 70-82, July 1984. [Kle85] [KLK84] [KN76] [Koh84] [Lan82] [Lei83] [Lei85] [LGFR82] [Lin85] [Mar82] 144 L. Kleinrock. D istributed systems. IE E E Computer, 90-103, November 1985. J. Kowalik, R. Lord, and S. K um ar. Design and perform ance of algorithm s for MIMD parallel com puters. In Proc. of the N A TO Workshop on High Speed Computations, pages 257-276, Springer- Verlag, West Germany, 1984. S. W. Kuffler and J. G. Nicholls. From Neuron to Brain. Sinauer, Sunderland, M ass, 1976. T. Kohonen. Self-Organization and Associative Memory. Springer- Verlag, Berlin, 1984. C. R. Lang. The Extension of Object-Oriented Languages to a Homogeneous Concurrent Architecture. Technical R eport 5014, Ph.D . Thesis, Com puter Science D ept., California Instt. of Tech., Pasadena, May 1982. C. E. Leiserson. Area Efficient V LSI Computation. M IT Press, Cambridge, M ass., 1983. C. E. Leiserson. Fat-Trees: Universal networks for hardw are- efficient supercom puting. IE E E Trans. Computers, C-34(10):892- 901, Oct. 1985. K. A. Lantz, K. D. Gradishnig, J. A. Feldman, and R. F. Rashid. Rochester’s intelligent gateway. IE E E Computer, 54-70, Oct. 1982. F. C. H. Lin. Load Balancing and Fault Tolerance in Applicative Systems. PhD thesis, University of U tah, 1985. D. M arr. Vision. W. H. Freeman Publishing Co., San Francisco, 1982. 145 [MGN79] [Mol83] [M0I86] [MT85] [NS82] [Par87] [Pfi85] [PN85] [PV81] G. M. M asson, G. C. Gingher, and S. Nakam ura. A sam pler of circuit switching networks. IE E E Computer, 12(6):32-48, 1979. D. I. Moldovan. On the design of algorithm s for VLSI systolic arrays. Proc. IE E E , 71:113-120, Jan 1983. D. I. Moldovan. A comparison between parallel processing of nu­ meric and symbolic algorithm s. In M. Cosnard et al., editors, Par­ allel Algorithms and Architectures, pages 325-333, North-Holland, 1986. D. I. Moldovan and Y. W. Tung. 
SNAP: A VLSI architecture for artificial intelligence processing. J. of Parallel and Distributed Com­ puting, 2(2):109-131, May 1985. D. Nassimi and S. Sahni. O ptim al BPC perm utations on a cube connected SIMD com puter. IE E E Trans. Computers, C-31(4):338- 341, 1982. D. Partridge. W hat’s wrong w ith neural architectures. Proc. COM- P C O N Spring, 35-38, Feb. 1987. G. F. Pfister et al. The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture. Proc. In t’ l Conf. on Parallel Processing, 764-771, 1985. G. F. Pfister and V. A. Norton. Hot spot contention and combining in m ultistage interconnection networks. Proc. In t’ l Conf. on Parallel Processing, 790-797, 1985. F. P. P reparata and Vuillemin. T he cube-connected cycles: a ver­ satile network for parallel com putations. Comm, of ACM , 300-309, May 1981. 146 [Ree84] [RF87] [RHW86] [RM86] [RS83] [RV88] [Seq81] [SF84] D. A. Reed. The performance of m ultim icrocom puter networks supporting dynamic workloads. IE E E Trans. Computers, C- 33(11) :1045-1048, 1984. D. A. Reed and R. M. Fujimoto. Multicomputer Networks: Message- Based Parallel Processing. The M IT Press, 1987. D. E. R um elhart, G. E. Hinton, and R. J. W illiams. Learning in­ ternal representations by error propagation. In D. E. R um elhart and J. L. McClelland, editors, Parallel Distributed Processing: Ex­ plorations in the Microstructure of Cognition, Bradford Books/M IT Press, Cambridge, M ass, 1986. D. E. R um elhart and J. L. McClelland, editors. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Brad­ ford B ooks/M IT Press, Cambridge, Mass, 1986. D. A. Reed and H. D. Schwetman. Cost perform ance bounds for m ultim icrocom puter networks. IE E E Trans. Computers, C- 32(l):83-95, Jan. 1983. C. S. Raghavendra and A. Varma. Fault-tolerant multiprocessors w ith redundant-path interconnection networks. IE E E Trans. Com­ puters, C-35(4):307-316, Apr. 1988. C. H. Sequin. Doubly twisted torus networks for VLSI proceessor arrays. In Proc. 8th An. Symp. Computer Architecture, pages 471- 480, 1981. L. Shastri and J. A. Feldman. Semantic Networks and Neural Nets. Technical R eport 133, Dept, of Comp. Science, Univ. of Rochester, Rochester, NY, June 1984. 147 [SF88] [SJRV87] [SK86] [SM86] [Smi80] [SP85] [SR86] [Sta84] [Sto7l] [SW86] Y. Shih and J. Fier. Hypercube system s and key applications. In K. Hwang and D. DeGroot, editors, Parallel Processing for Super­ computing and Artificial Intelligence, M cGraw Hill, 1988. A. A. Sawchuk, B. K. Jenkins, C. S. R aghavendra, and A. Varma. O ptical crossbar networks. IE E E Computer, 20:50-62, June 1987. C. Stanfill and B. Kahle. Parallel free-text search on the connection m achine system. Comm. ACM , 29:1229-1239, Dec. 1986. S. J. Stolfo and D. P. M iranker. The DADO production system machine. J. of Parallel and Distributed Computing, 3(2):269-296, June 1986. R.G. Sm ith. The contract net protocol: high-level comm unication and control in a distributed problem solver. IE E E Trans. Comput­ ers, c-29, No. 12:1104-1113, December 1980. M. R. Sam than and D. K. Pradhan. T he DeBruijn m ultiprocessor network: a versatile sorting network. In 12th International Sympo­ sium on Computer Architecture, pages 360-367, 1985. T. J. Sejnowski and C. R. Rosenberg. N ETtalk: A Parallel Network that Learns to Read Aloud. Technical R eport JH U /EE C S-86/01, Johns Hopkins Univ., Baltimore, Jan. 1986. J. W. Stamos. Static grouping of sm all objects to enhance per­ formance of a paged virtual memory. 
A C M Trans, on Computer System s, 2(2):155-180, May 1984. H. S. Stone. Parallel processing w ith th e perfect shuffle. IE E E Trans. Computers, C-20(2):153-161, 1971. C. Stanfill and D. Waltz. Toward m em ory-based reasoning. Comm. ACM , 29:1213-1228, Dec. 1986. 148 [TBH82] [TH85] [TH88] [TRW86] [Uhr84] [Uhr87] [U1184] [VGG87] [VGG88] P. C. Treleaven, D. R. Brownbridge, and R. P. Hopkins. D ata-driven and dem and-driven com puter architectures. A C M Computing Sur­ veys, 14(l):93-149, M arch 1982. D. S. Touretzky and Geoffrey E. Hinton. Symbols among the neu­ rons: details of a connectionist inference architecture. In Proc. IJ- CAI, pages 238-243, Aug. 1985. S. Toborg and K. Hwang. Exploring neural network and optical com puting technologies. In K. Hwang and D. DeGroot, editors, Parallel Processing for Supercomputing and Artificial Intelligence, McGraw Hill, 1988. T R W Mark III A N S Processor Product Description. TRW Rancho Carmel AI Center, San Diego, April 1986. L. Uhr. Algorithm-Structured Computer Arrays and Networks: A r­ chitectures and Processes for Images, Percepts and Information. Academic Press, New York, 1984. L. Uhr. M ulticomputer Architectures for Artificial Intelligence. Wi­ ley Interscience, New York, 1987. J. D. Ullman. Computational Aspects of VLSI. Com puter Science Press, Rockville, M aryland, 1984. A. Varma, J. Ghosh, and C. Georgiou. Design of Large Crosspoint Switches. Technical Report , Com puter Sc. D ept., IBM T .J. W atson Research Center, Yorktown H ts., NY, Dec. 1987. A. Varma, J. Ghosh, and C. Georgiou. Reliable design of large crosspoint switching networks. In 18th International Symposium on Fault-Tolerant Computing, Tokyo, June 1988. [vW84] [Wah87] [Wal87] [WF84] [Whi85] [Wit81] [WL81] [WL86] [WLY85] 149 A. M. van Tilborg and L. D. W ittie. Wave scheduling - decentral­ ized scheduling of task forces in m ulticom puters. IE E E Trans, on Computers, C-33(9):835-844, 1984. B. Wah. New com puters for artificial intelligence processing. IE E E Computer, 20:10-15, Jan. 1987. D. L. W altz. Applications of the connection m achine. Computer, 20:85-97, Jan. 1987. C.-l. Wu and T.-y Feng, editors. Interconnection Networks for Paral­ lel and Distributed Processing. IEEE Com puter Society Press, 1984. C. W hitby-Stevens. The transputer. Proc. 12th A nn. I n t’ l Symp. on Comput. Arch., 292-300, Aug. 1985. L. D. W ittie. Communication structures for large m ultimicrocom­ puter systems. IE E E Trans. Computers, C-30(4):264-273, April 1981. S. B. Wu and M. T. Liu. A cluster structure as an interconnec­ tion network for large m ultim icrocom puter system s. IE E E Trans. Computers, C-30(4):254-265, April 1981. B. W. W ah and G. J. Li, editors. Computers for Artificial Intelli­ gence Applications. IEEE Com puter Society Press, May 1986. B. W. Wah, G.-J. Li, and C. F. Yu. M ultiprocessing of com binatorial search problems. IE E E Computer, 93-108, June 1985. 150 Appendix A Hypernet Simulations This appendix docum ents the code used to sim ulate the execution of a TAK program on a 256 node hypernet. The program s were w ritten in CCLISP and run on a 16-node Intel iPS C /1 m ulticom puter w ith enhanced memory. A .l Source code listings In this section, the files containing the source code are listed as: filename (contents of file) These files are currently in priam :/usr2/ghosh/hyper/hypcode. 
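For orientation, TAK refers to the Takeuchi benchmark function. The following sequential definition is not one of the simulator files; it is a minimal reference sketch (under the hypothetical name seq-tak) of the function being parallelized:

;; Reference (sequential) TAK; the parallel version in joytak.lsp
;; must reproduce these semantics.
(defun seq-tak (x y z)
  (if (not (< y x))
      z
      (seq-tak (seq-tak (1- x) y z)
               (seq-tak (1- y) z x)
               (seq-tak (1- z) x y))))

;; (seq-tak 18 12 6) => 7, matching the "answer returned 7" line in
;; the execution trace of Appendix A.2.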
A.1.1 Initialization

The file hypnet.lsp is used to load the other files in the correct order:

hypnet.lsp
(load "initial")
(load "neighbors")
(load "loadcalc")
(load "destlist")
(load "display")
(load "awake")
(load "joytak")
(load "gotak")
(load "bye")

The parallel execution of TAK is modeled as a sequence of iteration cycles. In each iteration, the processor nodes of the hypernet which have executable processes are visited one by one. For each such node, one process is executed. This might entail spawning child processes or returning results to a parent process. Furthermore, each simulated hypernet node uses a sender-initiated load balancing scheme [Cho88] to determine the allocation of child processes to its neighbors.

The file initial.lsp initializes the system by setting the loop count to zero and by marking every node as having no executable processes pending. Global variables are named *global-variable-name* in all the programs used. We specify how often we want to record the state of the execution by setting the *display-interval* and *smalldisplay* variables. The bias on the three classes of links in a (3,3)-hypernet is also given. Some initialization is also needed at the beginning of each iteration loop to reset certain loop parameters; this is achieved by the function clean-up.

initial.lsp
(setq *loopcount* 0)
(setq *maxload* 0)
(setq *maxmaxload* 0)
(setq *busynodes* 0)
(setq *display-interval* 500)
(setq *smalldisplay* 100)
(setq *level0-bias* 0)
(setq *level1-bias* 5)
(setq *level2-bias* 10)

;; A function to be used later: the largest multiple of b not exceeding a.
(defun div (a b) (- a (mod a b)))

(defun clean-up ()
  (setq *runList* nil)
  (setq *awakeList* nil)
  (setq *maxload* 0)
  (setq *busynodes* 0))

(defun initialize ()
  (setq *loopcount* 0)
  (setq *runQueue* nil)
  (dotimes (i *numberOfNodes*)
    (setq *runQueue* (cons (list i nil) *runQueue*)))
  (setq *subnetload* nil)
  (dotimes (i 8)
    (setq *subnetload* (cons (list i 0) *subnetload*)))
  (setq *timeprofile* nil)
  (dotimes (i 8)
    (setq *timeprofile* (cons (list i 0) *timeprofile*)))
  (setq *suspendQueue* nil)
  (dotimes (i *numberOfNodes*)
    (setq *suspendQueue* (cons (list i nil) *suspendQueue*)))
  (setq *lastID* nil)
  (dotimes (i *numberOfNodes*)
    (setq *lastID* (cons (list i 0) *lastID*)))
  (setq *traversal* nil)
  (dotimes (i *numberOfNodes*)
    (setq *traversal* (cons (list i '(0 0 0)) *traversal*))))

(initialize)

A.1.2 Hypernet description

The TAK program is simulated on a (3,3)-hypernet constructed from 3-cubelets. Such a hypernet has 256 nodes, numbered 0 to 255. The hypernet is specified by providing, for each node, a list of the nodes that are its direct neighbors. In a (3,3)-hypernet, each node has three "vanilla" neighbors that belong to the same cubelet. These neighbors are connected by level-0 links. The function biased-trouble-neighbor finds out whether a node also has a fourth neighbor and, if so, the level of the link connecting to it.

neighbors.lsp
;; Uses the link biases initialized in initial.lsp.
(setq dim 3)
;; *neighborsBias* lists, for every node of the hypernet, all its
;; neighboring nodes and the biases on the corresponding links.
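;; For concreteness, a hand-evaluated sample entry, assuming the link
;; biases 0/5/10 set in initial.lsp (this walk-through is an added
;; illustration, derived from the functions below): node 2 has the
;; vanilla neighbors 3, 0 and 6 at level 0, plus a level-1 neighbor,
;; node 8, so
;;   (assoc 2 *neighborsBias*) => (2 ((8 5) (3 0) (0 0) (6 0)))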
(setq *numberOfNodes* 256)

(defun biased-neighbor-list ()
  (let ((nname nil))
    (dotimes (i *numberOfNodes*)
      (setq nname (cons (list i (neighbors i)) nname)))
    nname))

(defun neighbors (nodeNo)
  (let ((nlist (biased-vanilla-neighbors nodeNo))
        (tlist (reverse (biased-trouble-neighbor nodeNo))))
    (cond (tlist (cons tlist nlist))
          (t nlist))))

(defun neighbor-list ()
  (mapcar 'list (ilist *numberOfNodes*)
          (mapcar 'neighs (ilist *numberOfNodes*))))

(defun neighs (nodeNo)
  (let ((nlist (vanilla-neighbors nodeNo)))
    (append (cdr (biased-trouble-neighbor nodeNo)) nlist)))

(defun biased-trouble-neighbor (nodeno)
  (let ((possible 0) (ans 0))
    (setq possible
          (cond ((not (logbitp 0 nodeno))
                 (list *level1-bias*
                       (truncate (+ (div nodeno 32)
                                    (* 4 (mod nodeno 8))
                                    (/ (div (mod nodeno 32) 8) 4)))))
                ((and (logbitp 0 nodeno) (not (logbitp 1 nodeno)))
                 (list *level2-bias*
                       (truncate (+ 1
                                    (/ (div nodeno 32) 8)
                                    (* 8 (- (mod nodeno 32) (mod nodeno 4)))))))
                (t nil)))
    (setq ans (cond ((equal nodeno (cadr possible)) nil)
                    (t possible)))
    ans))

(defun biased-vanilla-neighbors (nodeno)
  (let ((vlist (mapcar 'neigh (jlist nodeno 3) (ilist 3))))
    (mapcar 'list vlist (jlist *level0-bias* 3))))

(defun vanilla-neighbors (nodeno)
  (mapcar 'neigh (jlist nodeno 3) (ilist 3)))

;; Returns an association list of node ids and neighbor lists;
;; not used in load balancing.
(defun neighborlist (nodeNo dim)
  (mapcar 'list (neighbors nodeNo dim) (jlist 0 dim)))

;; Returns n with bit position d complemented.
(defun neigh (n d)
  (let ((index (expt 2 d)))
    (if (logbitp d n)
        (abs (- index n))
        (+ index n))))

;; Result of jlist is a list of length l with each element equal to n.
(defun jlist (n l)
  (let ((res nil))
    (dotimes (i l res)
      (setq res (cons n res)))))

;; Result of ilist is (0 ... n-1).
(defun ilist (n)
  (do ((i (abs (- 1 n)) (abs (- 1 i)))
       (res nil (cons i res)))
      ((zerop i) (cons i res))))

;; Build the neighbor tables once the functions above are defined.
(setq *neighbors* (neighbor-list))
(setq *neighborsBias* (biased-neighbor-list))

A.1.3 Process migration

The software organization governing process migration is similar to the approach detailed in [Cho88]. The processes to be executed on a given processor node are listed in its runqueue. The global variable *runQueue* lists the runqueues of the individual nodes. At the start of every loop, each node pulls the first process off its runqueue and executes it. If this process spawns child processes, the children are sent to neighboring nodes. To do so, each node must first know its own load, measured by the size of its runqueue, as well as the loads of all its neighbors. The global variable *neighborsLoad* records the required load information at the beginning of each iteration loop. The code in file loadcalc.lsp also keeps track of the maximum load on a processor node for a given iteration (*maxload*), the maximum load over all iterations (*maxmaxload*), the number of nodes currently active (*busynodes*), and the total load on each subnet for a given iteration cycle.
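The effect of the link biases on neighbor selection can be seen with a small hypothetical example (the queue lengths are made up):

;; With biases 0/5/10, a level-0 neighbor holding 3 queued processes
;; presents a virtual load of 3 + 0 = 3, while a level-1 neighbor
;; holding only 1 queued process presents 1 + 5 = 6; destination-list
;; (in destlist.lsp below) therefore prefers the cubelet-local
;; neighbor, keeping traffic off the scarcer higher-level links.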
loadcalc.lsp
(defun load-calculator ()
  (setq *neighborsLoad* nil)
  (dotimes (i *numberOfNodes*)
    (setq *neighborsLoad* (cons (list i (load-calc i)) *neighborsLoad*)))
  (cond ((< *maxmaxload* *maxload*) (setq *maxmaxload* *maxload*))))

(defun load-calc (i)
  (let* ((runqueue (cdr (assoc i *runQueue*)))
         (len (length runqueue)))
    (cond ((null (car runqueue)) 0)
          (t (setq *busynodes* (+ 1 *busynodes*))
             (cond ((< *maxload* len) (setq *maxload* len)))
             len))))

(defun addlist (listing)
  (let ((sum 0))
    (dolist (elt listing)
      (setq sum (+ sum elt)))
    sum))

(defun subnet-load (netno loadlist)
  (let ((base 0) (subloadlist nil))
    (setq base (* 32 netno))
    (dotimes (i 32 subloadlist)
      (setq subloadlist
            (cons (load-part (assoc (+ i base) loadlist)) subloadlist)))
    (addlist subloadlist)))

The neighboring load seen by a processor is a virtual load: the sum of the actual load of a neighbor and the bias on the link between the two processors. Each processor sends the child processes spawned by the process just executed on that node to the neighbors with the lowest virtual loads. At most one child process is sent to a given neighbor. A child process received by a node is put at the end of its runqueue; thus a first-in-first-out (FIFO) queueing policy is implemented on each node. These procedures are carried out by the code in file destlist.lsp.

destlist.lsp
;; destination-list returns the 3 neighbors with the lowest loads.
(defun destination-list (node)
  (let* ((neighbors (car (cdr (assoc node *neighbors*))))
         (neighborsLoad (neighbors-load node neighbors))
         (externalLoad nil)
         (returnList nil))
    (dotimes (i 3)
      (setq externalLoad (min-load neighborsLoad))
      (setq returnList (cons (car externalLoad) returnList))
      ;; Mark the chosen neighbor with a huge load so it is not picked twice.
      (rplacd (assoc (car externalLoad) neighborsLoad) 31000))
    returnList))

(defun neighbors-load (node neighborlist)
  (let* ((nlist nil))
    (mapcar #'(lambda (x) (setq nlist (cons (load-bias node x) nlist)))
            neighborlist)
    nlist))

(defun load-bias (thisNode x)
  (let ((loadX (load-part (assoc x *neighborsLoad*)))
        (biasX (load-part (assoc x (cadr (assoc thisNode *neighborsBias*))))))
    (list x (+ loadX biasX))))

(defun min-load (loadlist)
  (let ((minLoad (car loadlist)))
    (dolist (eltpair loadlist)
      (if (< (load-part eltpair) (load-part minLoad))
          (setq minLoad eltpair)))
    minLoad))

(defun load-part (twokey)
  (if (listp (cdr twokey))
      (second twokey)
      (cdr twokey)))

(defun migrator ()
  (dotimes (thisNode (length *runQueue*))
    (let ((runQueue (cdr (assoc thisNode *runQueue*))))
      (cond ((not (null (car runQueue)))
             (rplacd (assoc thisNode *runQueue*) (cdr runQueue))
             (let ((next (car runQueue)))
               (tak thisNode next)))))))

A.1.4 Process suspension and awakening

A TAK process cannot be executed until all three of its arguments are available. So when such a process spawns three child processes, it is put in the *suspendQueue*, where it awaits the values to be returned by its children. As these values are determined, they are put in the correct slots of the parent process's arguments.
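The shape of a suspend-queue entry can be read off suspend and failure-spawn in joytak.lsp below; the following sketch is illustrative, with made-up field values:

;; After a spawn, a node's entry in *suspendQueue* looks roughly like
;;   (node (pid . #S(process x-slot 0 y-slot 0 z-slot 0
;;                           parentnode p parentid q position s)) ...)
;; The zeroed x/y/z slots act as "value not yet returned" markers,
;; which is why awake tests them with zerop before requeueing.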
If all three arguments of a process are available, the process is awakened and added to the back of the runqueue of its processor node, as indicated by the following functions:

awake.lsp
;; awake accepts a calculated argument for the process with id pid:
;; the returned data value "no" is put into the slot named by "position".
;; If all data values are present, the process is awakened and added
;; to the run queue via add-run.
(defun awake (process)
  (let* ((pnode (process-ParentNode process))
         (pid (process-ParentID process))
         (position (process-Position process))
         (no (process-z-slot process))
         (found (assoc pid (cdr (assoc pnode *suspendQueue*)))))
    (cond ((equal position 1) (setf (process-x-slot (cdr found)) no))
          ((equal position 2) (setf (process-y-slot (cdr found)) no))
          ((equal position 3) (setf (process-z-slot (cdr found)) no)))
    (let ((x (process-x-slot (cdr found)))
          (y (process-y-slot (cdr found)))
          (z (process-z-slot (cdr found))))
      (cond ((or (zerop x) (zerop y) (zerop z))
             (suspend pnode found))
            (t (add-run (list pnode (load-part found))))))))

(defun add-to-run-list (destination process)
  (setq *runList* (cons (list destination process) *runList*)))

(defun add-run-queue ()
  (mapcar 'add-run *runList*))

(defun add-run (runner)
  (let* ((destination (car runner))
         (process (cdr runner))
         (runQueue (cdr (assoc destination *runQueue*))))
    (cond ((null (car runQueue))
           (rplacd (assoc destination *runQueue*) process))
          (t (rplacd (assoc destination *runQueue*)
                     (append process runQueue))))))

(defun add-to-awake (process)
  (setq *awakeList* (cons process *awakeList*)))

(defun add-awake ()
  (mapcar 'awake *awakeList*))

A.1.5 Template of a TAK process

Each TAK process has six parameters:

(defstruct process
  x-slot y-slot z-slot parentnode parentid position)

The x-, y- and z-slots contain the three arguments, when available. The process also records the node location and the id of its parent process. Finally, it must know the argument slot of its parent to which its computed value should be returned; this slot number is kept in the position field of the process structure (bound as ParentSlot in the code). The other functions in the file joytak.lsp define TAK and specify the spawning and allocation of child processes. Only the root process has a 0 in its position field; hence this serves to detect the end of the computation, as can be seen in the function success-exit.
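To make the return protocol concrete, a small hand-built example (all values are illustrative, not taken from a run): suppose a child has computed the value 4 for argument slot 2 of the parent process with id 17 residing on node 130. success-exit then queues

(make-process :parentnode 130 :parentid 17 :position 2 :z-slot 4)

via add-to-awake, and awake stores the 4 into the y-slot (position 2) of process 17 in node 130's suspend queue, requeueing that parent once none of its three argument slots is still zero.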
joytak.lsp
;; Parallelized breadth-first-search TAK.
;; Each spawned process carries three routing fields: the node where
;; the parent resides, the parent's unique process id number, and the
;; position (argument slot) for the return value.

(defun tak (thisNode parameters)
  (let ((x (process-x-slot parameters))
        (y (process-y-slot parameters)))
    (cond ((not (< y x)) (success-exit thisNode parameters))
          (t (failure-spawn thisNode parameters)))))

(defun success-exit (thisNode parameters)
  (let ((z (process-z-slot parameters))
        (ParentNode (process-ParentNode parameters))
        (ParentProcess (process-ParentID parameters))
        (ParentSlot (process-Position parameters)))
    (next-id thisNode)
    (cond ((equal ParentSlot 0)
           (terpri *tracefile*)
           (format *tracefile* "~%~%answer returned ~d" z)
           (throw 'kill-kernel (format t "~%Return ~d" z)))
          (t (add-to-awake (make-process :parentnode ParentNode
                                         :parentid ParentProcess
                                         :position ParentSlot
                                         :z-slot z))))))

(defun failure-spawn (thisNode parameters)
  (let* ((x (process-x-slot parameters))
         (y (process-y-slot parameters))
         (z (process-z-slot parameters))
         (ParentNode (process-ParentNode parameters))
         (ParentProcess (process-ParentID parameters))
         (ParentSlot (process-Position parameters))
         (nextparent (next-id thisNode))
         (xless (abs (1- x)))
         (zless (abs (1- z)))
         (yless (abs (1- y)))
         (trouble (cdr (biased-trouble-neighbor thisNode)))
         (destinations (destination-list thisNode)))
    ;; Record link traversals by class for the three spawned children.
    (cond ((and (not (null trouble)) (member (car trouble) destinations))
           (cond ((logbitp 0 thisNode)
                  (rplacd (assoc thisNode *traversal*)
                          (list (mapcar #'+ '(2 0 1)
                                        (load-part (assoc thisNode *traversal*))))))
                 (t (rplacd (assoc thisNode *traversal*)
                            (list (mapcar #'+ '(2 1 0)
                                          (load-part (assoc thisNode *traversal*))))))))
          (t (rplacd (assoc thisNode *traversal*)
                     (list (mapcar #'+ '(3 0 0)
                                   (load-part (assoc thisNode *traversal*)))))))
    (suspend thisNode
             (cons nextparent
                   (make-process :x-slot 0 :y-slot 0 :z-slot 0
                                 :parentnode ParentNode
                                 :parentid ParentProcess
                                 :position ParentSlot)))
    (add-to-run-list (first destinations)
                     (make-process :x-slot xless :y-slot y :z-slot z
                                   :parentnode thisNode
                                   :parentid nextparent :position 1))
    (add-to-run-list (second destinations)
                     (make-process :x-slot yless :y-slot z :z-slot x
                                   :parentnode thisNode
                                   :parentid nextparent :position 2))
    (add-to-run-list (third destinations)
                     (make-process :x-slot zless :y-slot x :z-slot y
                                   :parentnode thisNode
                                   :parentid nextparent :position 3))))

;; Returns the next process id number for this node.
(defun next-id (thisNode)
  (let ((nextID (abs (1+ (load-part (assoc thisNode *lastID*))))))
    (rplacd (assoc thisNode *lastID*) nextID)
    nextID))

;; Add Suspended to the queue of suspended processes.
(defun suspend (thisNode Suspended)
  (let ((suspendQueue (cdr (assoc thisNode *suspendQueue*))))
    (cond ((null (car suspendQueue))
           (rplacd (assoc thisNode *suspendQueue*) (list Suspended)))
          (t (rplacd (assoc thisNode *suspendQueue*)
                     (cons Suspended suspendQueue))))))

A.1.6 Execution of TAK

To execute the root TAK process, the user types "(gotak x y z)" at the Lisp prompt. The function gotak initiates three actions:

1. It opens a tracefile to record state parameters of the execution, as periodically dictated by the functions in the file display.lsp.

2. It creates a template for the root process and inserts it in the runqueue of node 128, which is node 0 within subnet 4.

3. It initiates the kernel routine.

The kernel is organized as an "infinite" loop.
A loop iteration starts with the resetting of some loop parameters, followed by the calculation of neighbor loads and the process migration initiated by each active processor node. Each node with a non-empty runqueue then picks the TAK process at the front of its runqueue and executes it. The results are used to update the status of the parent processes and to awaken them if possible. The loop is exited when the root process is resolved and returns an answer.

gotak.lsp
(defun gotak (x y z)
  (setq *tracefile*
        (open (format nil "tr~d" (sys:mynode)) :direction :output))
  (initialize)
  (let ((rootprocess (make-process :x-slot x :y-slot y :z-slot z
                                   :parentid 0 :parentnode 0 :position 0)))
    (rplacd (assoc 128 *runQueue*) (list rootprocess)))
  (kernel)
  (close *tracefile*))

(defun kernel ()
  (catch 'kill-kernel
    (loop
      (clean-up)
      (load-calculator)
      (migrator)            ; calls tak
      (add-run-queue)
      (add-awake)
      (display)))
  (print "exiting kernel")
  (bye))

A.1.7 Display of results

The programs in display.lsp and bye.lsp display the status of the TAK execution periodically and also record it in a tracefile. The current loop count is displayed at the console by the function display. When the loop count is a multiple of *smalldisplay*, the function subload-print is invoked and the load in each of the eight (3,2)-subnets of the hypernet is recorded in the tracefile. In addition, after every *display-interval* iterations a more detailed trace of the system status is recorded, including a list showing the number of processes executed so far at each node. The function filter removes from this list those processors that have never been used. The number of processes sent over the various communication links is also recorded, to indicate the message patterns and intensity. The functions bye and overall-load record the overall network traffic and the load on each subnet at the end of the computation.
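Two small helpers in display.lsp below are worth illustrating with hypothetical data (the pairs are made up):

;; filter drops the nodes whose process count is zero:
;;   (filter '((0 . 3) (1 . 0) (2 . 5)))  =>  ((2 . 5) (0 . 3))
;; nonzero maps a count to a 0/1 usage flag, so
;;   (addlist (mapcar 'nonzero (processed-list)))
;; counts how many nodes have executed at least one process.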
display.lsp
(defun display ()
  (setq *loopcount* (+ 1 *loopcount*))
  (time-profile)
  (print *loopcount*)
  (subload-print)
  (cond ((zerop (mod *loopcount* *display-interval*))
         (display1))))

(defun trace-print (lispObject)
  (let ((lengObj (length lispObject))
        (i 0))
    (mapcar #'(lambda (x)
                (cond ((equal 0 (mod i 8))
                       (format *tracefile* "~%~%")))
                (format *tracefile* "~A" x)
                (setq i (1+ i)))
            lispObject)))

(defun display1 ()
  (print "current loop-count is")
  (print *loopcount*)
  (format *tracefile* "~%~%lastid")
  (trace-print (reverse (filter *lastid*)))
  (format *tracefile* "~%~%no. of processes examined ~d"
          (addlist (processed-list)))
  (format *tracefile* "~%~%no. of nodes used till now ~d"
          (addlist (mapcar 'nonzero (processed-list))))
  (terpri *tracefile*)
  (traversal-count))

(defun filter (longfile)
  (let ((shortfile nil))
    (dotimes (i (length longfile))
      (cond ((null (load-part (assoc i longfile))))
            ((zerop (load-part (assoc i longfile))))
            (t (setq shortfile (cons (assoc i longfile) shortfile)))))
    shortfile))

(defun traversal-count ()
  (let ((level0 nil) (level1 nil) (level2 nil)
        (totl0 0) (totl1 0) (totl2 0) (totltravel 0))
    (dotimes (i *numberofnodes*)
      (setq level0 (cons (car (load-part (assoc i *traversal*))) level0))
      (setq level1 (cons (cadr (load-part (assoc i *traversal*))) level1))
      (setq level2 (cons (second (cdr (load-part (assoc i *traversal*)))) level2)))
    (format *tracefile* "~%~%level 0 traversals ~d" (setq totl0 (addlist level0)))
    (format *tracefile* "~%~%level 1 traversals ~d" (setq totl1 (addlist level1)))
    (format *tracefile* "~%~%level 2 traversals ~d" (setq totl2 (addlist level2)))
    (setq totltravel (+ totl0 totl1 totl2))
    (format *tracefile* "~%~%fraction in l0 is ~d" (/ totl0 totltravel))
    (format *tracefile* "~%~%fraction in l1 is ~d" (/ totl1 totltravel))
    (format *tracefile* "~%~%fraction in l2 is ~d" (/ totl2 totltravel))))

(defun load-spread (loadlist subloadlist)
  (let ((new 0))
    (dotimes (i 8)
      (setq new (subnet-load i loadlist))
      (rplacd (assoc i subloadlist) new))
    subloadlist))

(defun time-profile ()
  (dotimes (i 8)
    (cond ((not (zerop (subnet-load i *neighborsload*)))
           (rplacd (assoc i *timeprofile*)
                   (1+ (load-part (assoc i *timeprofile*))))))))

(defun subload-print ()
  (setq *subnetload* (load-spread *neighborsload* *subnetload*))
  (cond ((zerop (mod *loopcount* *smalldisplay*))
         (format *tracefile* "~%~%at loopcount ~d" *loopcount*)
         (format *tracefile* "~%~%subload list is ~d" *subnetload*))))

(defun nonzero (number)
  (cond ((zerop number) 0)
        (t 1)))

(defun processed-list ()
  (setq processedlist nil)
  (dotimes (i *numberofnodes* processedlist)
    (setq processedlist
          (cons (load-part (assoc i *lastid*)) processedlist))))

bye.lsp
;; The function bye gives a special display when the final result has
;; been returned.
(defun bye ()
  (display1)
  (format *tracefile* "~%~%final subload list is ~d" *subnetload*)
  (format *tracefile* "~%~%overall load is ~d" (overall-load))
  (format *tracefile* "~%~%peak load on a processor ~d" *maxmaxload*)
  (format *tracefile* "~%~%the time profile is ~d" *timeprofile*)
  (format *tracefile* "~%~%level 0 bias ~d" *level0-bias*)
  (format *tracefile* "~%~%level 1 bias ~d" *level1-bias*)
  (format *tracefile* "~%~%level 2 bias ~d" *level2-bias*)
  (format *tracefile* "~%~%final loop-count ~d" *loopcount*))

(defun overall-load ()
  (let ((x 0) (total 0))
    (setq total (addlist (processed-list)))
    (dotimes (i 8)
      (format *tracefile* "~%overall load in subnet ~d" i)
      (format *tracefile* "~%total is ~d" (setq x (subnet-load i *lastid*)))
      (format *tracefile* "~%fraction of load is ~d" (/ x total)))))

A.2 An execution trace for TAK(18,12,6)

The following execution trace was obtained when TAK(18,12,6) was run on a simulated 256-node (3,3)-net with link biases of 0, 2 and 4. A total of 63,609 processes were executed in 2081 iterations. Comments on the trace are given in larger-sized text. The subnet load was monitored every 100 iteration cycles. In addition, at every 500 iteration cycles the following parameters were recorded:

• *lastid*, showing the number of processes executed on each node so far
• The total number of processes examined so far
• The number of nodes that currently have non-empty runqueues
• The cumulative number of message traversals on the three types of links

At the end of the computation, the number of iteration cycles taken and the overall load in each subnet are also reported.

tr0.tex
;; The elements of the list are (subnet number, load) pairs.
at loopcount 100 subload list is ((7 . 35) (6 . 65) (5 . 156) (4 . 1232) (3 . 18) (2 . 18) (1 . 264) (0 . 225))
at loopcount 200 subload list is ((7 . 14) (6 . 26) (5 . 132) (4 . 1746) (3 . 2) (2 . 0) (1 . 230) (0 . 285))
at loopcount 300 subload list is ((7 . 0) (6 . 0) (5 . 102) (4 . 1970) (3 . 1) (2 . 1) (1 . 171) (0 . 279))
at loopcount 400 subload list is ((7 . 1) (6 . 1) (5 . 77) (4 . 1901) (3 . 0) (2 . 0) (1 . 107) (0 . 221))
at loopcount 500 subload list is ((7 . 0) (6 . 1) (5 . 62) (4 . 1605) (3 . 0) (2 . 0) (1 . 71) (0 . 175))
151)(229 . 105)(230 . 53)(231 . 60)(232 . 23)(233 . 15)(234 . 11) (235 . 12)(236 . 39)(237 . 22) (238 . 18) (239 . 10) (240 . 174) (241 . 262) (242 . 141) (243 . 147) (244 . 140) (245 . 153) (246 . 83)(247 . 109)(248 . 11)(249 . 6)(250 . 2)(251 . 2)(252 . 23)(253 . 6)(254 . 2) (255 . 2) no. of processes examined 37656 no. of nodes used till now 242 busynodes 40 current maximum load 94 level 0 traversals 29443 level 2 traversals 1569 level 1 traversals 2117 The last three statistics give the communication trafffic on the three categories of links at loopcount 600 subload list is ((7 . 0) (6 . 0) (5 . 37) (4 . 1322) (3 . 0) (2 . 0) (1 . 36) (0 . 104)) at loopcount 700 subload list is ((7 . 0) (6 . 1) (5 . 13) (4 . 1286) (3 . 0) (2 . 0) (1 . 13) (0 . 83)) at loopcount 800 subload list is ((7 . 0) (6 . 0) (5 . 1) (4 . 1201) (3 . 0) (2 . 0) (1 . 0) (0 . 54)) at loopcount 900 subload list is ((7 . 0) (6 . 0) (5 . 0) (4 . 1077) (3 . 0) (2 . 0) (1 . 0) (0 . 13)) at loopcount 1000 subload list is ((7 . 0) (6 . 0) (5 . 0) (4 . 909) (3 . 0) (2 . 0) (1 . 0) (0 . 0)) ___________________ 166 lastid(0 . 114) (1 . 130)(2 . 68)(3 . 64)(4 . 274)(5 . 208)(6 . 133)(7 . 149)(8 . 43)(9 . 22)(10 . 15) (11 . 13)(12 . 115)(13 . 57)(14 . 51)(15 . 30)(16 . 603)(17 . 945)(18 . 355)(19 . 434)(20 . 713) (21 . 837)(22 . 427)(23 . 567)(24 . 52)(25 . 19)(26 . 26)(27 . 9)(28 . 113)(29 . 20)(30 . 22) (31 . 7)(32 . 252)(33 . 223)(34 . 104)(35 . 94)(36 . 384)(37 . 302)(38 . 229)(39 . 146)(40 . 65) (41 . 34)(42 . 29)(43 . 9)(44 . 13l)(45 . 37)(46 . 37)(47 . 18)(48 . 512)(49 . 747)(50 . 352) (51 . 473)(52 . 490)(53 . 704)(54 . 323)(55 . 446)(56 . 133)(57 . 89)(58 . 70)(59 . 35)(60 . 168) (61 . 95)(62 . 69)(63 . 31)(64 . 3)(65 . 3)(68 . 19)(69 . 6)(70 . 3)(71 . 3)(80 . 96) (81 . 285)(82 . 52)(83 . 95)(84 . 103)(85 . 146)(86 . 47)(87 . 96)(88 . 1)(89 . 1)(92 . 8) (93 . 2) (94 . 1) (95 . 1)(96 . 42) (97 . 23) (98 . 11) (99 . 8) (100 . 73) (101 . 48) (102 . 39) (103 . 15)(104 . 2)(105 . 2)(108 . 11)(109 . 4)(110 . 2)(111 . 2)(112 . 135)(113 . 253)(114 . 70) (115 . 106)(116 . 107)(117 . 148)(118 . 54)(119 . 76)(120 . 11)(121 . 7)(122 . 4)(123 . 1)(124 . 14) (125 . 7)(126 . 4)(127 . 2)(128 . 999)(129 . 998)(130 . 998)(131 . 998)(132 . 998)(133 . 997)(134 . 997) (135 . 991)(136 . 994)(137 . 833)(138 . 992)(139 . 952)(140 . 989)(141 . 708)(142 . 992)(143 . 721)(144 . 946) (145 . 485)(146 . 856)(147 . 482)(148 . 618)(149 . 329)(150 . 788)(151 . 425)(152 . 995)(153 . 92l)(l54 . 994) (155 . 945)(156 . 876)(157 . 656)(158 . 781)(159 . 662)(160 . 333)(161 . 715)(162 . 122)(163 . 265)(164 . 464) (165 . 706)(166 . 165)(167 . 407)(168 . 54)(169 . 106)(170 . 17)(171 . 25)(172 . 82)(173 . 100)(174 . 22) (175 . 44)(176 . 264)(177 . 27l)(l78 . 146)(179 . 150)(180 . 143)(181 . 147)(182 . 63)(183 . 73)(184 . 135) (185 . 162)(186 . 37)(187 . 97)(188 . 64)(189 . 151)(190 . 23)(191 . 55)(192 . 72)(193 . 34)(194 . 21) (195 . 8)(196 . 134)(197 . 97)(198 . 70)(199 . 32)(200 . 17)(201 . 11)(202 . 12)(203 . 5)(204 . 48) (205 . 19)(206 . 21)(207 . 14)(208 . 208)(209 . 374)(210 . 129)(211 . 177)(212 . 184)(213 . 254)(214 . 120) (215 . 149)(216 . 33)(217 . 17)(218 . 17)(219 . 12)(220 . 50)(221 . 23)(222 . 22)(223 . 9)(224 . 52) (225 . 59)(226 . 19)(227 . 16)(228 . 151)(229 . 105)(230 . 53){231 . 60)(232 . 23)(233 . 15)(234 . 11) (235 . 12)(236 . 39)(237 . 22){238 . 18)(239 . 10)(240 . 174)(241 . 263)(242 . 141)(243 . 147)(244 . 140) (245 . 153)(246 . 83)(247 . 109)(248 . 11)(249 . 6)(250 . 2)(251 . 2)(252 . 23)(253 . 6)(254 . 2) (255 . 2) no. 
of processes examined 52564 no. of nodes used till now 242 busynodes 19 current maximum load 105 level 0 traversals 36885 level 2 traversals 2049 level 1 traversals 2940 at loopcount 1100 subload list is ((7 . 0) (6 . 0) (5 . 0) (4 . 792) (3 . 0) (2 . 0) (1 . 0) (0 . 0)) at loopcount 1200 subload list is ((7 . 0) (6 . 0) (5 . 0) (4 . 654) (3 . 0) (2 . 0) (1 . 1) (0 . 1)) at loopcount 1300 subload list is ((7 . 0) (6 . 0) (5 . 0) (4 . 504) (3 . 0) (2 . 3) (1 . 0) (0 . 0)) at loopcount 1400 subload list is ((7 . 0) (6 . 0) (5 . 0) (4 . 362) (3 . 0) (2 . 0) (1 . 0) (0 . 0)) at loopcount 1500 subload list is ((7 . 0) (6 . 0) (5 . 0) (4 . 243) (3 . 0) (2 . 0) (1 . 0) (0 . 0)) lastid(0 . 114) (1 . 130)(2 . 68)(3 . 64)(4 . 274)(5 . 208)(6 . 133)(7 . 149)(8 . 43)(9 . 22)(10 . 15) (11 . 13)(12 . 115)(13 . 57)(14 . 51)(15 . 30)(16 . 624)(17 . 995)(18 . 367)(19 . 448)(20 . 723) (21 . 855)(22 . 428)(23 . 571)(24 . 52)(25 . 19)(26 . 26)(27 . 9)(28 . 113)(29 . 20)(30 . 22) (31 . 7)(32 . 252)(33 . 223)(34 . 104)(35 . 94)(36 . 384)(37 . 302)(38 . 229)(39 . 146)(40 . 65) (41 . 34)(42 . 29)(43 . 9)(44 . 13l)(45 . 37)(46 . 37)(47 . 18)(48 . 512)(49 . 765)(50 . 352) (51 . 473)(52 . 490)(53 . 704)(54 . 323)(55 . 446)(56 . 133)(57 . 89)(58 . 70)(59 . 35)(60 . 168) (61 . 95)(62 . 69)(63 . 3l)(64 . 3)(65 . 3)(68 . 19)(69 . 6)(70 . 3)(71 . 3)(80 . 167 128) (81 . 346)(82 . 67)(83 . 112)(84 . 120)(85 . 165)(86 . 47)(87 . 98)(88 . l)(89 . l)(92 . 8) (93 . 2 ) (94 . 1)(95 . l)(96 . 42)(97 . 23)(98 . ll)(99 . 8)(100 . 73)(101 . 48)(102 . 39) (103 . 15)(104 . 2)(105 . 2)(108 . 11)(109 . 4)(110 . 2)(111 . 2)(112 . 135)(113 . 253)(114 . 70) (115 . 106)(116 . 107)(117 . 148)(118 . 54)(119 . 76)(120 . 11)(121 . 7)(122 . 4)(123 . 1)(124 . 14) (125 . 7)(126 . 4)(127 . 2)(128 . 1499)(129 . 1468)(130 . 1498)(131 . 1479)(132 . 1498)(133 . 1330)(134 . 1497) (135. 1297)(136 . 1494)(137. 1005)(138 . 1402)(139. 1125)(140 . 1221)(141 . 773)(142 . 1303)(143 . 813)(144 . 1152) (145 . 550)(146 . 943)(147 . 526)(148 . 683)(149 . 375)(150 . 845)(151 . 471)(152 . 1308)(153 . 973)(154 . 1235) (155 . 1042)(156 . 923)(157 . 694)(158 . 853)(159 . 684)(160 . 333){161 . 715)(162 . 122)(163 . 265)(164 . 464) (165 . 706)(166 . 165)(167 . 407)(168 . 54)(169 . 106)(170 . 17)(171 . 25)(172 . 82)(173 . 100)(174 . 22) (175 . 44)(176 . 264)(177 . 271)(178 . 146)(179 . 150)(180 . 143)(181 . 147)(182 . 63)(183 . 73)(184 . 135) (185 . 162)(186 . 37)(187 . 97)(188 . 64)(189 . 151)(190 . 23)(191 . 55)(192 . 72)(193 . 34)(194 . 21) (195 . 8)(196 . 134)(197 . 97)(198 . 70)(199 . 32)(200 . 17)(201 . 11)(202 . 12)(203 . 5)(204 . 48) (205 . 19)(206 . 21)(207 . 14)(208 . 216)(209 . 390)(210 . 129)(211 . 185)(212 . 184)(213 . 262)(214 . 120) (215 . 149)(216 . 33)(217 . 17)(218 . 17)(219 . 12)(220 . 50)(221 . 23)(222 . 22)(223 . 9)(224 . 52) (225 . 59)(226 . 19)(227 . 16)(228 . 151)(229 . 105)(230 . 53)(231 . 60)(232 . 23)(233 . 15)(234 . 11) (235 . 12)(236 . 39)(237 . 22)(238 . 18) (239 . 10) (240 . 174) (241 . 263) (242 . 141)(243 . 147) (244 . 140) (245 . 153) (246 . 83) (247 . 109)(248 . 11) (249 . 6) (250 . 2)(251 . 2) (252 . 23)(253 . 6)(254 . 2) (255 . 2) no. of processes examined 59958 no. of nodes used till now 242 busynodes 9 current maximum load 92 level 0 traversals 40411 level 2 traversals 2115 level 1 traversals 3362 at loopcount 1600 subload list is ((7 . 0) (6 . 0) (5 . 0) (4 . 170) (3 . 0) (2 . 0) (1 . 0) (0 . 0)) at loopcount 1700 subload list is ((7 . 0) (6 . 0) (5 . 0) (4 . 109) (3 . 0) (2 . 0) (1 . 0) (0 . 
0)) at loopcount 1800 subload list is ((7 . 0) (6 . 0) (5 . 0) (4 . 64) (3 . 0) (2 . 0) (1 . 0) (0 . 0)) at loopcount 1900 subload list is ((7 . 0) (6 . 0) (5 . 0) (4 . 23) (3 . 0) (2 . 0) (1 . 0) (0 . 0)) at loopcount 2000 subload list is ((7 . 0) (6 . 0) (5 . 0) (4 . 4) (3 . 0) (2 . 0) (1 . 0) (0 . 0)) lastid(0 . 114) (1 . 130)(2 . 68)(3 . 64)(4 . 274)(5 . 208)(6 . 133)(7 . 149)(8 . 43)(9 . 22)(10 . 15) (11 . 13)(12 . 115)(13 . 57)(14 . 51)(15 . 30)(16 . 624)(17 . 995)(18 . 367)(19 . 448) (20 . 723) (21 . 855) (22 . 428) (23 . 571) (24 . 52) (25 . 19) (26 . 26) (27 . 9) (28 . 113) (29 . 20) (30 . 22) (31 . 7) (32 . 252) (33 . 223) (34 . 104) (35 . 94) (36 . 384) (37 . 302) (38 . 229) (39 . 146)(40 . 65) (41 . 34)(42 . 29)(43 . 9)(44 . 13l)(45 . 37)(46 . 37)(47 . 18)(48 . 523)(49 . 788)(50 . 354) (51 . 48l)(52 . 493)(53 . 713)(54 . 324)(55 . 446)(56 . 133)(57 . 89)(58 . 70)(59 . 35)(60 . 168) (61 . 95)(62 . 69)(63 . 31)(64 . 3)(65 . 3)(68 . 19)(69 . 6)(70 . 3)(71 . 3)(80 . 131) (81 . 352)(82 . 68)(83 . 114)(84 . 122)(85 . 168)(86 . 47)(87 . 99)(88 . l)(89 . l)(92 . 8) (93 . 2)(94 . 1)(95 . 1)(96 . 42)(97 . 23)(98 . 11)(99 . 8)(100 . 73)(101 . 48)(102 . 39) (103 . 15)(104 . 2)(105 . 2)(108 . 11)(109 . 4)(110 . 2)(111 . 2)(112 . 135)(113 . 253)(114 . 70) (115 . 106)(116 . 107)(117 . 148)(118 . 54)(119 . 76)(120 . 11)(121 . 7)(122 . 4)(123 . 1)(124 . 14) (125 . 7)(126 . 4)(127 . 2)(128 . 1922)(129 . 1655)(130 . 1998)(131 . 1744)(132 . 1869)(133 . 1478)(134. 1888) (135. 1499)(136. 1603)(137. 1032)(138. 1432)(139. 1142)(140. 1249)(141 . 794)(142 . 1320)(143 . 820)(144 . 1197) (145 . 559)(146 . 956)(147 . 531)(148 . 692)(149 . 168 379)(150 . 851)(151 . 473)(152 . 1428)(153 . 1026)(154 . 1276) (155 . 1062)(156 . 976)(157 . 727)(158 . 872)(159 . 699)(160 . 333)(161 . 715)(162 . 122)(163 . 265)(164 . 464) (165 . 706)(166 . 165)(167 . 407)(168 . 54)(169 . 106)(170 . 17)(171 . 25)(172 . 82)(173 . 100)(174 . 22) (175 . 44)(176 . 264)(177 . 271)(178 . 146)(179 . 150)(180 . 143)(181 . 147)(182 . 63)(183 . 73)(184 . 135) (185 . 162)(186 . 37)(187 . 97)(188 . 64)(189 . 15l)(190 . 23)(191 . 55)(192 . 72)(193 . 34)(194 . 21) (195 . 8)(196 . 134)(197 . 97)(198 . 70)(199 . 32)(200 . 17)(201 . 11)(202 . 12)(203 . 5)(204 . 48) (205 . 19)(206 . 21)(207 . 14)(208 . 216)(209 . 390)(210 . 129)(211 . 185)(212 . 184)(213 . 262)(214 . 120) (215 . 149)(216. 33)(217 . 17)(218 . 17)(219 . 12)(220 . 50) (221 . 23) (222 . 22) (223 . 9) (224 . 52) (225 . 59) (226 . 19) (227 . 16) (228 . 151)(229 . 105)(230 . 53)(231 . 60)(232 . 23)(233 . 15)(234 . 11) (235 . 12)(236 . 39)(237 . 22) (238 . 18) (239 . 10) (240 . 174) (241 . 263) (242 . 141) (243 . 147) (244 . 140) (245 . 153) (246 . 83) (247 . 109)(248 . 11) (249 . 6)(250 . 2) (251 . 2)(252 . 23)(253 . 6) (254 . 2) (255 . 2) no. of processes examined 63223 no. of nodes used till now 242 busynodes 3 current maximum load 2 level 0 traversals 41808 level 2 traversals 2130 level 1 traversals 3495 answer returned 7 lastid(0 . 114) (1 . 130)(2 . 68)(3 . 64)(4 . 274)(5 . 208)(6 . 133)(7 . 149)(8 . 43)(9 . 22)(10 . 15) (11 . 13)(12 . 115)(13 . 57)(14 . 51)(15 . 30)(16 . 624)(17 . 995)(18 . 367)(19 . 448)(20 . 723) (21 . 855)(22 . 428)(23 . 571)(24 . 52)(25 . 19)(26 . 26)(27 . 9)(28 . 113)(29 . 20)(30 . 22) (31 . 7)(32 . 252)(33 . 223)(34 . 104)(35 . 94)(36 . 384)(37 . 302)(38 . 229)(39 . 146)(40 . 65) (41 . 34)(42 . 29)(43 . 9)(44 . 131)(45 . 37)(46 . 37)(47 . 18)(48 . 523)(49 . 788)(50 . 354) (51 . 481)(52 . 493)(53 . 713)(54 . 324)(55 . 446)(56 . 133)(57 . 89)(58 . 70)(59 . 
35)(60 . 168) (61 . 95)(62 . 69)(63 . 31)(64 . 3)(65 . 3)(68 . 19)(69 . 6)(70 . 3)(71 . 3)(80 . 131) (81 . 352)(82 . 68)(83 . 114)(84 . 122)(85 . 168)(86 . 47)(87 . 99)(88 . l)(89 . 1)(92 . 8) (93 . 2)(94 . 1)(95 . 1)(96 . 42)(97 . 23)(98 . 11)(99 . 8)(100 . 73)(101 . 48)(102 . 39) (103 . 15)(104 . 2)(105 . 2)(108 . 11)(109 . 4)(110 . 2)(111 . 2)(112 . 135)(113 . 253)(114 . 70) (115 . 106)(116 . 107)(117 . 148)(118 . 54)(119 . 76)(120 . 11)(121 . 7)(122 . 4)(123 . 1)(124 . 14) (125 . 7)(126 . 4)(127 . 2)(128 . 1972)(129 . 1704)(130 . 2038)(131 . 1793)(132 . 1912)(133 . 1532)(134 . 1931) (135. 1543)(136. 1605)(137. 1032)(138. 1432)(139. 1142)(140. 1249)(141 . 794)(142 . 1320)(143 . 820)(144 . 1202) (145 . 561)(146 . 957)(147 . 532)(148 . 693)(149 . 380)(150 . 851)(151 . 473)(152 . 1429)(153 . 1026)(154 . 1276) (155 . 1062)(156 . 976)(157 . 727)(158 . 872)(159 . 699)(160 . 333)(161 . 715)(162 . 122)(163 . 265)(164 . 464) (165 . 706)(166 . 165)(167 . 407)(168 . 54)(169 . 106)(170 . 17)(171 . 25)(172 . 82)(173 . 100)(174 . 22) (175 . 44)(176 . 264)(177 . 271)(178 . 146)(179 . 150)(180 . 143)(181 . 147)(182 . 63)(183 . 73)(184 . 135) (185 . 162)(186 . 37)(187 . 97)(188 . 64)(189 . 151)(190 . 23)(191 . 55)(192 . 72)(193 . 34)(194 . 21) (195 . 8)(196 . 134)(197 . 97)(198 . 70)(199 . 32)(200 . 17)(201 . 11)(202 . 12)(203 . 5)(204 . 48) (205 . 19)(206 . 21)(207 . 14)(208 . 216)(209 . 390)(210 . 129)(211 . 185)(212 . 184)(213 . 262)(214 . 120) (215 . 149)(216 . 33)(217 . 17)(218 . 17)(219 . 12)(220 . 50)(221 . 23)(222 . 22)(223 . 9)(224 . 52) (225 . 59)(226 . 19)(227 . 16)(228 . 151)(229 . 105)(230 . 53)(231 . 60)(232 . 23)(233 . 15)(234 . 11) (235 . 12)(236 . 39)(237 . 22)(238 . 18)(239 . 10)(240 . 174)(241 . 263)(242 . 141)(243 . 147)(244 . 140) (245 . 153)(246 . 83)(247 . 109)(248 . 11)(249 . 6)(250 . 2)(251 . 2)(252 . 23)(253 . 6)(254 . 2) (255 . 2) no. of processes examined 63609 no. of nodes used till now 242 busynodes 1 current maximum load 1 169 level 0 traversals 42075 level 2 traversals 2130 level 1 traversals 3501 ;;Extra statistics given at the end of the com putation by the functions in bye.lsp. overall load in subnet 0 total is 6765 overall load in subnet 1 total is 6906 overall load in subnet 2 total is 1152 overall load in subnet 3 total is 1281 overall load in subnet 4 total is 37535 overall load in subnet 5 total is 5608 overall load in subnet 6 total is 2433 overall load in subnet 7 total is 1929 peak load on a processor 117 level 0 bias 0 level 1 bias 2 level 2 bias 4 final loop-count 2081 170 Appendix B Neural Network Simulations This section contains some of the source code used to obtain the sim­ ulation results given in C hapter 6.2, and documents representative execution traces. B .l Source Code Listings In this section, the files containing the source code are listed as: filename (contents of file) These files are currently in priam :/usr2/ghosh/cog/lisp B .1 .1 L ogical co m m u n ica tio n traffic for ra n d o m co n n ectio n s When the condition probability Pi is very small, the connections to (~t 4 the influence region show very little clustered characteristics. If pcpi < then these connections are taken to be random ly distributed over the influence region. The code in calc.lsp calculates the communication traffic among pro­ cessors needed to convey the activation level of cells to their neighbors in the influence region per iteration cycle, when such random connectivity is assumed. Eqs. 6.2.6 and 6.2.7 are used. 
The function nocond.total takes in a list of packet sizes (given by the m axim um num ber of states conveyed by a packet) and home sizes (the average num ber of cells in a home group), and returns comm unication statistics for each 171 combination of packet size and home size. A sample trace of the values returned for net A is given in appendix B.2.1. This function is also used to return the statistics for the case of random interconnects to the rest region since the form of the equations for average number of packets sent per processor is the same. One simply needs to specify the average group size and connectivity for the rest region instead of these parameters for the influence region. Note that the statistics returned are independent of the multicomputer topology because it just gives the communication with virtually neighboring processors in the influence/rest region without caring where these neighbors are physically located or how the packets are routed. calcl.lsp (defun nocond (region_size c_region pkt.size home.size total.pes) (setq region.no (/ region.size home_size)) (setq prob.success (- 1 (exp (- (/ c_region region.no))))) (setq msg_from_pe (max 1 (* region_no prob.success))) (setq msg_from_pu (* home_size msg_from_pe)) (setq msg_to_pu (max 1 (/ msg_from_pu region_no))) (terpri) (format t ""8A "9,2F '12,2F "12,2F "12,2F " (/ total.pes home_size) msg_from_pe msg_from_pu region_.no msg_to_pu )) (defun nocond_density (region.size c.region pkt.size home_size total_pes) (setq region_.no (/ region_size home_size)) (setq prob.success (- 1 (exp (- (/ c.region region_no))))) (setq msg_from_pe (max 1 (* region_.no prob.success))) (setq msg_from_pu (* home_size msg_from.pe)) (setq msg_to_pu (/ msg_from_pu region_no)) (cond ((>= msg.to.pu 1) (terpri) (format t ""8A "8,2F "15.2F "15.3E" (/ total.pes home_size) pkt.size (* region_.no (ceiling msg_to_pu pkt.size)) (* (/ total.pes home.size) region_.no (ceiling msg_to_pu pkt_size))) (terpri)) (t (format t ""8A "8,2F "15,2F "15,3E" (/ total_pes home_size) pkt_size (* region_size (- 1 (exp (- (/ c.region region.no))))) (* total.pes region.no (- 1 (exp (- (/ c.region region_no)))))) (terpri)))) (defun nocond_total (region.size c.region total.pes pkt.size.list home.size.list) 172 (print "# of PUs, mags per PE, mags per PU, no. of PUs hit; msg.per.hit") (terpri) (dolist (home.size home.size.list) (nocond region.size c.region (car pkt.size.list) home.size total.pes)) (terpri) (print "# of PUs, packet size , # packets/PU , total # of packets") (terpri) (doliBt (home.size home.size.list) (dolist (pkt.size pkt.size.list) (nocond.density region.Bize c.region pkt.size home.size total.pes)))) B .1 .2 L ogical co m m u n ica tio n traffic for clu ste r e d co n n ectio n s In general, the communication statistics are given by Eqs. 6.2.2 and 6.2.3 if the average core size is at least as large as the home size (Case 1) or by Eqs. 6.2.4 and 6.2.5 if the home size is larger than the core size (Case 2). Using the same display form at as the code for random connections, the function msg- calc-total takes in a list of packet sizes, values for p, and home sizes and returns communication statistics for each combination of packet size, p and home size by successive calls to the function msg-calc. The latter function first detects if the given p is low enough to w arrant an approxim ation by random connects. If this is true, then the message “same as noconditional case” is printed. 
Otherwise, the core size is compared with the home size and the appropriate equations are applied to generate the communication profile th at is independent of the packet size. Msg-calc-total also makes successive calls to msg-profile which takes the packet size into consideration to determine the average num ber of packets needed to be sent to virtual neighbors per iteration. A sam ple trace of the comm unication profile returned for net A with various values of p, and pr is given in appendix B.2.2. calc.lsp (defun msg.calc (region.size core.size c.region pkt.size rho.core home.size rho.region total.pes) (cond ((>= core.size home.size) ;;;;;case 1; home has core only (setq rhol (* rho.region rho.core)) (setq rho (max rhol (/ home.size c.region))) (cond ((> (/ home.size rho) 1) (setq msg.from.pn (* c.region rho))____________________________ 173 (setq msg_to_pu (* home.size (- 1 (exp (- (/ home.size rho rho.region)))))) (cond ((< msg.to.pu rho.core) (setq msg.to.pu (* msg.to.pu (/ (1- rho.core)))))) (terpri) (format t "~8A "4A "9P2F "11,2F "10,2F "9,2F" (/ total.pes home.size) rho.region (/ (* c.region rho) home.size) (* c.region rho) (/ m8g_from_pu msg_to_pu) msg_to_pu )) ( t (print "same as no conditional case") (nocond region.size c.region pkt.size home.size total.pes)))) ;case 2 home has r cores (t (setq no.cores (/ home.size core.size)) (setq rhol (+ (/ 1 no.cores rho.core rho.region) (/ (* (1- no.cores) c_region) (* no.cores region.size)))) ;;;;here rho has been defined straight instead of in inverse form (setq rho (min rhol (/ c.region home.size))) (cond ((> (* home.size rho) 1) (setq msg_from_pu (/ c.region rho)) (setq msg.to.pu (* home.size (- 1 (exp (- (/ (* home.size rho) rho.region)))))) (setq region.no (/ region.size home.size)) (cond ((and (< msg.to.pu rho.core) (< (/ msg.from.pu rho.core) region_no) 1) (setq msg.to.pu (* (/ msg.to.pu no.cores) (+ 1 (* (1- no.cores) (/ msg.from.pu rho.core) region.no))))))) (terpri) (format t ""8A "4A "9,2F "11.2F "10.2F "9.2F" (/ total.pes home.size) rho.region (/ c.region (* rho home.size)) (/ c.region rho) (/ msg.from.pu msg.to.pu) msg.to.pu )) ( t (print "same as no conditional case”) (nocond region.size c.region pkt.size home.size total.pes)))))) (defun msg.calc.total (region.size core.size c.region pkt.size.list rho.core home.size.list rho_region_list total.pes) (print "# of PUs, rho ,# msgs/ PE, per PU, # of PUs hit; msg.per.hit") (terpri) (doliBt (home.size home.size.list) 174 (terpri) (dolist (rho.region rho_region_list) (msg.calc region.size core.size c.region (car pkt.size.list) rho.core home.size rho.region total.pes))) (terpri) (print "# of PUs, rho ; pkt size ; # pkts/PU ; total pktsH) (terpri) (msg_profile region.size core.size c.region pkt.size.list rho.core home.size.list rho_region_list total.pes)) (defun msg.profile ( region.size core.size c.region pkt.size.list rho.core home.size.list rho_region_list total.pes) (dolist (pkt.size pkt.size.list) (print "# of PUs, rho ; pkt size ; # pkts/PU ; total pkts") (terpri) (dolist (home.size home.size.list) (terpri) (dolist (rho.region rho.region.list) (msg.pkt region.size core.size c.region pkt.size rho.core home.size rho.region total.pes))))) (defun msg_pkt (region.size core.size c_region pkt.size rho.core home.size rho.region total.pes) (cond ((>= core.size home.size) ;;;;;case 1; home has core only (setq rhol (* rho.region rho.core)) (setq rho (max rhol (/ home.size c.region))) (cond ((> (/ home.size rho) 1) (setq msg_from_pu (* c.region rho)) (setq msg.to.pu (* home.size (- 1 (exp 
(- (/ home.size rho rho_region)))))) (cond ((< msg.to.pu rho.core) (setq msg.to.pu (* msg.to.pu (/ (1- rho.core)))))) (terpri) (format t "~8A ~4A "6A "10,2F "12.3E" (/ total.pes home.size) rho.region pkt.size (* (/ msg.from.pu msg_to_pu) (ceiling msg.to.pu pkt.size)) (* (/ total.pes home.size) (/ msg.from.pu msg.to.pu) (ceiling msg.to.pu pkt.size)))) ( t (print "same as no conditional case") ( nocond region_size c.region pkt.size home.size total.pes)))) ;;;;;;;;;;;;;case 2 home has r cores (t (setq no.cores (/ home.size core.size)) (setq rhol (+ (/ 1 no.cores rho.core rho_region) (/ (* (1- no.cores) c.region) (* no.cores region.size)))) ;;;;here rho has been defined straight instead of in inverse form 175 (setq rho (min rhol (/ c.region home.size))) (cond ((> (* home.size rho) 1) (setq msg_from_pu (/ c.region rho)) (setq msg.to.pu (* home.size (- 1 (exp (- (/ (* home.size rho) rho.region)))))) (setq region.no (/ region.size home.size)) (cond ((and (< msg.to.pu rho.core) (< (/ msg_from_pu rho.core region.no) 1)) (setq msg.to.pu (* (/ msg.to.pu no.cores) (+ 1 (* (1- no.cores) (/ msg.from.pu rho.core region.no))))))) (terpri) (format t ""8A "4A ~6A "10,2F "12.3E" (/ total.pes home.size) rho.region pkt.size (* (/ msg.from.pu msg_to_pu) (ceiling msg.to.pu pkt.size)) (* (/ total.pes home.size) (/ msg.from.pu msg.to.pu) (ceiling msg.to.pu pkt.size)))) ( t (print "same as no conditional case") ( nocond region.size c.region pkt.size home.size total.pes)))))) B .1 .3 D e sc r ip tio n o f h y p o th e tic a l n eu ral n e ts The file calc.lsp specifies the five artificial neural networks described in Table 6.1. For each network model, the argum ents required to determ ine the comm unication statistics for both random interconnects and clustered connec­ tions are given. So the descriptions also serve as m aster execution functions which, when executed w ith a list of packet sizes, will call the functions given in Appendix B.1.2 and B.1.2. 
calc3.lsp

;;; MODEL A
(defun modela (pkt-list)
  (print "noconditionals, influence, model A***************")
  (terpri)
  (nocond_total 65536 128 (expt 2 24) pkt-list '(4096 256 64))
  (print "noconditionals, rest, model A***************")
  (terpri)
  (nocond_total (expt 2 24) 32 (expt 2 24) pkt-list '(4096 256 64))
  (print "conditionals, influence, model A***************")
  (terpri)
  (msg_calc_total 65536 512 128 pkt-list 4 '(4096 256 64) '(8 32) (expt 2 24))
  (print "conditionals, rest, model A***************")
  (terpri)
  (msg_calc_total (expt 2 24) 512 32 pkt-list 4 '(4096 256 64) '(16 32) (expt 2 24)))

;;; MODEL B
(defun modelb (pkt-list)
  (nocond_total 16384 128 (expt 2 24) pkt-list '(16384 4096 1024 256 64))
  (nocond_total (expt 2 24) 128 (expt 2 24) pkt-list '(16384 4096 1024 256 64))
  (msg_calc_total 16384 256 128 pkt-list 4 '(16384 4096 1024 256 64) '(4 8 16 32 64) (expt 2 24))
  (msg_calc_total (expt 2 24) 256 128 pkt-list 4 '(16384 4096 1024 256 64) '(4 8 16 32 64) (expt 2 24)))

;;; MODEL C
(defun modelc (pkt-list)
  (nocond_total 8192 256 (expt 2 24) pkt-list '(16384 4096 1024 256 64))
  (nocond_total (expt 2 24) 16 (expt 2 24) pkt-list '(16384 4096 1024 256 64))
  (msg_calc_total 8192 512 256 pkt-list 2 '(16384 4096 1024 256 64) '(4 8 16 32 64) (expt 2 24))
  (msg_calc_total (expt 2 24) 512 16 pkt-list 2 '(16384 4096 1024 256 64) '(4 8 16 32 64) (expt 2 24)))

;;; MODEL D (ten times the connectivity)
(defun modelD (pkt-list)
  (print "noconditionals, influence, modelD***************")
  (terpri)
  (nocond_total 65536 1024 (expt 2 24) pkt-list '(16384 4096 1024 256 64))
  (print "noconditionals, rest, modelD***************")
  (terpri)
  (nocond_total (expt 2 24) 512 (expt 2 24) pkt-list '(16384 4096 1024 256 64))
  (print "conditionals, influence, modelD***************")
  (terpri)
  (msg_calc_total 65536 2048 1024 pkt-list 2 '(16384 4096 1024 256 64) '(8 16 32) (expt 2 24))
  (print "conditionals, rest, modelD")
  (terpri)
  (msg_calc_total (expt 2 24) 2048 512 pkt-list 2 '(16384 4096 1024 256 64) '(8 16 32) (expt 2 24)))

;;; MODEL E (smaller number of PEs)
(defun modelE (pkt-list)
  (print "noconditionals, influence, model e***************")
  (terpri)
  (nocond_total (expt 2 15) 128 (expt 2 20) pkt-list '(1024 256 64 16 4))
  (print "noconditionals, rest, model e***************")
  (terpri)
  (nocond_total (expt 2 20) 32 (expt 2 20) pkt-list '(1024 256 64 16 4))
  (print "conditionals, influence, model e***************")
  (terpri)
  (msg_calc_total (expt 2 15) 512 128 pkt-list 4 '(1024 256 64 16 4) '(8) (expt 2 20))
  (print "conditionals, rest, model e***************")
  (msg_calc_total (expt 2 20) 512 32 pkt-list 4 '(1024 256 64 16 4) '(16) (expt 2 20)))

B.1.4 Specification of multicomputer topologies

If the mapping policy given in Section 5.4.3 is followed, and if the number of processors onto which a particular group (core, influence or rest) is mapped is a power of 2, then the addresses of these processors form a subspace of the Boolean vector space of processor addresses. The destination processor of a message sent within a group is taken to be randomly distributed within its subspace. To find the average number of links traversed by such a message, we simply need to specify the expected number of links traversed in the preferred path between a random source-destination pair in the subspace for the given topology. This is specified by the spread function for that topology.
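Summing the entries of a spread vector gives the expected number of links traversed by one random message within its subspace. A minimal bookkeeping sketch for the vector-valued spreads that follow (the function name is ours, not from the original files):

;;; Hedged sketch: for the spread functions below that return a vector
;;; of per-link-class expectations (spread-hc, spread-5h, spread-bush5),
;;; the expected number of links traversed by one random message is the
;;; sum of the entries.
(defun expected-hops (spread)
  (apply #'+ spread))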
For example, for a random message within an x-dimensional hypercube, there is a 50% chance of traversing a link in each of the lower x dimensions, while no higher-dimensional link is traversed. This is indicated by spread-hc, the spread function for hypercubes shown below. Note that links in the same dimension have the same likelihood of being used, but this likelihood can differ from one dimension to another.

spread.lsp

;;; SPREADS FOR HYPERCUBES
(defun spread-hc (x)
  (cond ((equal x 0)  '(0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0))
        ((equal x 1)  '(0.5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0))
        ((equal x 2)  '(0.5 0.5 0 0 0 0 0 0 0 0 0 0 0 0 0 0))
        ((equal x 3)  '(0.5 0.5 0.5 0 0 0 0 0 0 0 0 0 0 0 0 0))
        ((equal x 4)  '(0.5 0.5 0.5 0.5 0 0 0 0 0 0 0 0 0 0 0 0))
        ((equal x 5)  '(0.5 0.5 0.5 0.5 0.5 0 0 0 0 0 0 0 0 0 0 0))
        ((equal x 6)  '(0.5 0.5 0.5 0.5 0.5 0.5 0 0 0 0 0 0 0 0 0 0))
        ((equal x 7)  '(0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 0 0 0 0 0 0 0 0))
        ((equal x 8)  '(0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 0 0 0 0 0 0 0))
        ((equal x 9)  '(0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 0 0 0 0 0 0))
        ((equal x 10) '(0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 0 0 0 0 0))
        ((equal x 11) '(0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 0 0 0 0))
        ((equal x 12) '(0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 0 0 0))
        ((equal x 16) '(0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5))
        (t '(spread for required dimension is missing))))

For the torus, we simply need to know the number of links traversed for a random message within a mesh of 2^dim processors, since each link is as likely to be used. This is in contrast to hypercubes, where links in different dimensions can have different probabilities of being used.

;;; SPREADS FOR MESHES
(defun spread-mesh (dim)
  (cond ((evenp dim) (/ (- (expt 2 (/ dim 2)) 1) 2))
        ((oddp dim)  (* (- (expt 2 (/ (- dim 1) 2)) 1) 0.75))))

Two types of hypernets are specified. In a 5-hypernet built from 5-cubelets, we distinguish among the five dimensions of level-0 links, and between these links and the links at levels 1 and 2. This is reflected in spread-5h. For a 5-hypernet built from buslets, however, all level-0 links are treated identically, so that we have only 3 classes of links, as indicated in spread-bush5. Since only half the nodes have a level-1 link, and only a quarter of them have a level-2 link, the actual load on these links is double and quadruple, respectively, of the load determined on the basis of one link of each class per node. This necessitates the function load-5h.

;;; SPREADS FOR (5,H)-HYPERNETS
(defun spread-5h (x)
  (cond ((equal x 0)  '(0 0 0 0 0 0 0 0))
        ((equal x 1)  '(0.5 0 0 0 0 0 0 0))
        ((equal x 2)  '(0.5 0.5 0 0 0 0 0 0))
        ((equal x 3)  '(0.5 0.5 0.5 0 0 0 0 0))
        ((equal x 4)  '(0.5 0.5 0.5 0.5 0 0 0 0))
        ((equal x 5)  '(0.5 0.5 0.5 0.5 0.5 0 0 0))
        ((equal x 6)  '(0.75 0.516 0.5 0.5 0.5 0.5 0 0))
        ((equal x 7)  '(0.875 0.875 0.875 0.5 0.5 0.75 0 0))
        ((equal x 8)  '(0.9375 0.9375 0.9375 0.9375 0.5 0.875 0 0))
        ((equal x 9)  '(0.96875 0.96875 0.96875 0.96875 0.96875 0.9375 0 0))
        ((equal x 10) '(1.22 1.22 .97 .97 .97 .94 .5 0))
        ((equal x 11) '(1.34 1.34 1.34 1.34 .97 .94 .75 0))
        ((equal x 12) '(1.40 1.40 1.40 1.40 1.40 .94 .875 0))
        ((equal x 13) '(1.67 1.45 1.44 1.44 1.44 1.4 .9375 0))
        ((equal x 14) '(1.82 1.82 1.82 1.46 1.46 1.7 .97 0))
        ((equal x 15) '(1.89 1.89 1.89 1.89 1.46 1.83 .99 0))
        ((equal x 16) '(1.93 1.93 1.93 1.93 1.93 1.87 1 0))
        ((equal x 19) '(2.59 2.38 2.37 2.37 2.37 2.31 1 0.875))
        (t '(spread for required dimension is missing))))

;;; SPREADS FOR BUS-BASED 5-HYPERNETS
(defun spread-bush5 (x)
  (cond ((equal x 0) '(0 0 0))
        ((or (equal x 1) (equal x 2) (equal x 3) (equal x 4) (equal x 5))
         '(1 0 0))
        ((or (equal x 6) (equal x 7) (equal x 8) (equal x 9))
         '(1.5 2 0))
        ((or (equal x 10) (equal x 11) (equal x 12) (equal x 13)
             (equal x 14) (equal x 15) (equal x 16))
         '(2.25 4 4))
        ((or (equal x 17) (equal x 18) (equal x 19) (equal x 20))
         '(3.125 8 8 8))
        (t '(spread for required dimension is missing))))

;;; load/link for 5-hypernets
(defun load-5h (x) (mapcar #'* '(1 1 1 1 1 2 4) x))

B.1.5 Bandwidth requirements calculation

Consider the mapping of a neural net from Table 6.1 onto a multicomputer of known size. The average sizes of the core, influence and rest regions translate into the dimensions of the subspaces onto which these three regions are mapped. These are labeled dim1, dim2 and dim3, respectively. The code in Sections B.1.1 and B.1.2 yields the number of messages sent to each of these regions per iteration for any of the networks described in B.1.3. Let these values be freq1, freq2 and freq3.
The average load on the communication links of a given topology due to messages sent to the influence region is simply freq2 × (spread for dimension dim2). The total average load is obtained by summing the loads for messages to all three regions. This is achieved by the function msg-load, which in turn calls the msg-load functions for the individual topologies, as shown below.

bwidth.lsp

(defun msg-load (busfactor freq1 dim1 freq2 dim2 freq3 dim3)
  ;; busfactor is needed only by the bus-based hypernet model; it is
  ;; passed through here (the flattened original omitted it from this
  ;; wrapper, leaving the call to msg-load-bush5 one argument short).
  (print '(load profile for hypernet-5 is))
  (print (msg-load-5h freq1 dim1 freq2 dim2 freq3 dim3))
  (print '(load profile for bus-based h-5 is))
  (print (msg-load-bush5 busfactor freq1 dim1 freq2 dim2 freq3 dim3))
  (print '(load profile for hypercube is))
  (print (msg-load-hc freq1 dim1 freq2 dim2 freq3 dim3))
  (print '(load profile for mesh is))
  (print (msg-load-mesh freq1 dim1 freq2 dim2 freq3 dim3)))

(defun msg-load-5h (freq1 dim1 freq2 dim2 freq3 dim3)
  (addplus (so freq1 (spread-5h dim1))
           (addplus (so freq2 (spread-5h dim2))
                    (so freq3 (spread-5h dim3)))))

(defun msg-load-bush5 (busfactor freq1 dim1 freq2 dim2 freq3 dim3)
  (setq y (addplus (so freq1 (spread-bush5 dim1))
                   (addplus (so freq2 (spread-bush5 dim2))
                            (so freq3 (spread-bush5 dim3)))))
  (cons (* busfactor (car y)) (cdr y)))

(defun msg-load-hc (freq1 dim1 freq2 dim2 freq3 dim3)
  (addplus (so freq1 (spread-hc dim1))
           (addplus (so freq2 (spread-hc dim2))
                    (so freq3 (spread-hc dim3)))))

(defun msg-load-mesh (freq1 dim1 freq2 dim2 freq3 dim3)
  (+ (* freq1 (spread-mesh dim1))
     (+ (* freq2 (spread-mesh dim2))
        (* freq3 (spread-mesh dim3)))))

The function band-width-table calculates the dimensions of the three subspaces, given the number of cells in the core region, the influence region and the overall neural network. It takes a variable busfactor, which gives the probability of successfully sending a message on a bus in one trial, and is used to account for bus-contention effects in the buslet-based hypernet. The variable avbias (the argument a below) is used to account for the effect of uneven loading on the various links. It yields a weighted average of the bandwidths calculated using the average load and the maximum load on the link classes, as specified in band-width.

(defun av (x) (/ (apply #'+ x) (length x)))

(defun band-width (x a)
  (+ (* a (av x)) (* (- 1 a) (apply #'max x))))

(defun band-width-ho (a y dim)
  (setq x (load-5h y))
  (* 11.5 (+ (* a (av x)) (* (- 1 a) (apply #'max x)))))

(defun band-width-hc (a y dim)
  (* (* 2 dim) (+ (* a (av y)) (* (- 1 a) (apply #'max y)))))

(defun band-widths (busfactor no_pus dim1 dim2 dim3 a freq1 freq2 freq3)
  (format t "~8,1F ~8,1F ~8,1F ~8,1F ~10,2E ~10,2E ~10,2E ~10,2E"
          (band-width-hc a (msg-load-hc freq1 dim1 freq2 dim2 freq3 dim3) dim3)
          (band-width-ho a (msg-load-5h freq1 dim1 freq2 dim2 freq3 dim3) dim3)
          (apply #'+ (msg-load-bush5 busfactor freq1 dim1 freq2 dim2 freq3 dim3))
          (* 8.8 (msg-load-mesh freq1 dim1 freq2 dim2 freq3 dim3))
          (* (expt 2 no_pus)
             (band-width-hc a (msg-load-hc freq1 dim1 freq2 dim2 freq3 dim3) dim3))
          (* (expt 2 no_pus)
             (band-width-ho a (msg-load-5h freq1 dim1 freq2 dim2 freq3 dim3) dim3))
          (* (expt 2 no_pus)
             (apply #'+ (msg-load-bush5 busfactor freq1 dim1 freq2 dim2 freq3 dim3)))
          (* (expt 2 no_pus)
             (* 8.8 (msg-load-mesh freq1 dim1 freq2 dim2 freq3 dim3))))
  (terpri))

(defun header (no_pus)
  (format t "#PUs is ~8A" (expt 2 no_pus))
  (terpri)
  (print "rho1, rho2, hypercube, hypernet, busnet, mesh, (total packets for the same)")
  (terpri))

(defun band-width-table (busfactor core_pes infl_pes total_pes no_pus
                         rho1 rho2 a freq1 freq2 freq3)
  (setq dim1 (max (+ no_pus (- core_pes total_pes)) 0))
  (setq dim2 (max 0 (+ no_pus (- infl_pes total_pes))))
  (setq dim3 no_pus)
  (format t "~4A ~4A" rho1 rho2)
  (band-widths busfactor no_pus dim1 dim2 dim3 a freq1 freq2 freq3)
  (terpri))

B.2 Execution Traces

In this section, we provide sample communication profiles generated for both random and clustered connections.
All execution traces are for net A unless specified otherwise. The default topology is the hypercube of the given size. Comments have been inserted in normal font size to explain or clarify the execution traces, and garbage-collection messages have been deleted from the original traces.

B.2.1 Logical traffic for random connections

random
> (load "init.lsp")
"/usr/priam/ghosh/cog/lisp/init.lsp"
> (modela '(1 2 4 8 16 32 64))

Network A is being used with packet sizes from 1 to 64. First, the communication profile due to random connections is calculated for the influence region. The traffic in the rest region is given after the "noconditionals, rest" message.

"noconditionals, influence, model A***************"
"# of PUs, msgs per PE, msgs per PU, no. of PUs hit; msg_per_hit"
4096 15.99 65514.02 16.00 4094.63
65536 100.73 25786.41 256.00 100.73
262144 120.32 7700.68 1024.00 7.52
"# of PUs, packet size , # packets/PU , total # of packets"
4096 1.00 65520.00 2.684E+8
4096 2.00 32768.00 1.342E+8
4096 4.00 16384.00 6.711E+7
4096 8.00 8192.00 3.355E+7
4096 16.00 4096.00 1.678E+7
4096 32.00 2048.00 8.389E+6
4096 64.00 1024.00 4.194E+6
65536 1.00 25856.00 1.694E+9
65536 2.00 13056.00 8.556E+8
65536 4.00 6656.00 4.362E+8
65536 8.00 3328.00 2.181E+8
65536 16.00 1792.00 1.174E+8
65536 32.00 1024.00 6.711E+7
65536 64.00 512.00 3.355E+7
262144 1.00 8192.00 2.147E+9
262144 2.00 4096.00 1.074E+9
262144 4.00 2048.00 5.369E+8
262144 8.00 1024.00 2.684E+8
262144 16.00 1024.00 2.684E+8
262144 32.00 1024.00 2.684E+8
262144 64.00 1024.00 2.684E+8
"noconditionals, rest, model A***************"
"# of PUs, msgs per PE, msgs per PU, no. of PUs hit; msg_per_hit"
4096 31.88 130561.00 4096.00 31.88
65536 31.99 8190.00 65536.00 1.00
262144 32.00 2048.00 262144.00 1.00
"# of PUs, packet size , # packets/PU , total # of packets"
4096 1.00 131072.00 5.369E+8
4096 2.00 65536.00 2.684E+8
4096 4.00 32768.00 1.342E+8
4096 8.00 16384.00 6.711E+7
4096 16.00 8192.00 3.355E+7
4096 32.00 4096.00 1.678E+7
4096 64.00 4096.00 1.678E+7
65536 1.00 8190.00 5.367E+8
65536 2.00 8190.00 5.367E+8
65536 4.00 8190.00 5.367E+8
65536 8.00 8190.00 5.367E+8
65536 16.00 8190.00 5.367E+8
65536 32.00 8190.00 5.367E+8
65536 64.00 8190.00 5.367E+8
262144 1.00 2048.00 5.369E+8
262144 2.00 2048.00 5.369E+8
262144 4.00 2048.00 5.369E+8
262144 8.00 2048.00 5.369E+8
262144 16.00 2048.00 5.369E+8
262144 32.00 2048.00 5.369E+8
262144 64.00 2048.00 5.369E+8

B.2.2 Logical traffic for localized connections

********************* clustered

First, the communication profile is calculated for the influence region, for a variety of combinations of ρi, multicomputer size (# of PUs) and packet size. All this is achieved by a single call to the msg_calc_total function, demonstrating the power of the dolist construct in Lisp.

The parameter "rho" is actually 1/ρi; this inverted form is used for convenience. Note that for large system sizes and/or for very low values of ρi, the message patterns tend towards the random-connection case of the previous section.
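The positional arguments of the call that follows line up with the parameter list of msg_calc_total in Appendix B.1.2; an annotated reading (our comments, not part of the trace):

;; (msg_calc_total region_size core_size c_region pkt_size_list rho_core
;;                 home_size_list rho_region_list total_pes)
;; Below: the influence region spans 65536 cells with a 512-cell core and
;; c_region = 128; rho_core = 4; home_size_list gives cells per PU, so the
;; machine sizes range from 2^24/16384 = 1024 PUs up to 2^24/64 = 262144
;; PUs; rho_region_list holds the 1/rho_i values 2 through 64; the net
;; has 2^24 cells in all.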
> (msg_calc_total 65536 512 128 '(32 16 8 4) 4 '(16384 4096 1024 256 64) '(2 4 8 16 32 64) (expt 2 24))
"*************** # of PUs = **********" 1024
"****************packet size is****************" 32
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.35 22075.29 1.35 16384.00 691.20 7.078E+5
4 2.03 33288.13 2.03 16384.00 1040.25 1.065E+6
8 2.72 44620.25 2.73 16337.98 1395.58 1.429E+6
16 3.28 53773.13 3.60 14952.38 1683.06 1.723E+6
32 3.66 59918.63 5.50 10896.05 1875.20 1.920E+6
64 3.88 63550.06 9.63 6600.67 1992.96 2.041E+6
"****************packet size is****************" 16
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.35 22075.29 1.35 16384.00 1381.05 1.414E+6
4 2.03 33288.13 2.03 16384.00 2080.51 2.130E+6
8 2.72 44620.25 2.73 16337.98 2791.16 2.858E+6
16 3.28 53773.13 3.60 14952.38 3362.53 3.443E+6
32 3.66 59918.63 5.50 10896.05 3750.40 3.840E+6
64 3.88 63550.06 9.63 6600.67 3976.29 4.072E+6
"****************packet size is****************" 8
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.35 22075.29 1.35 16384.00 2760.76 2.827E+6
4 2.03 33288.13 2.03 16384.00 4161.02 4.261E+6
8 2.72 44620.25 2.73 16337.98 5579.59 5.713E+6
16 3.28 53773.13 3.60 14952.38 6725.07 6.886E+6
32 3.66 59918.63 5.50 10896.05 7495.30 7.675E+6
64 3.88 63550.06 9.63 6600.67 7952.58 8.143E+6
"****************packet size is****************" 4
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.35 22075.29 1.35 16384.00 5520.17 5.653E+6
4 2.03 33288.13 2.03 16384.00 8322.03 8.522E+6
8 2.72 44620.25 2.73 16337.98 11156.44 1.142E+7
16 3.28 53773.13 3.60 14952.38 13446.54 1.377E+7
32 3.66 59918.63 5.50 10896.05 14985.09 1.534E+7
64 3.88 63550.06 9.63 6600.67 15895.54 1.628E+7
"*************** # of PUs = **********" 4096
"****************packet size is****************" 32
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.80 7384.34 1.80 4096.00 232.56 9.526E+5
4 3.28 13443.28 3.28 4095.76 420.13 1.721E+6
8 5.57 22795.13 5.90 3864.92 713.65 2.923E+6
16 8.53 34952.54 14.03 2491.98 1094.03 4.481E+6
32 11.64 47662.55 40.00 1191.50 1520.08 6.226E+6
64 14.22 58254.22 108.41 537.33 1843.03 7.549E+6
"****************packet size is****************" 16
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.80 7384.34 1.80 4096.00 463.32 1.898E+6
4 3.28 13443.28 3.28 4095.76 840.25 3.442E+6
8 5.57 22795.13 5.90 3864.92 1427.31 5.846E+6
16 8.53 34952.54 14.03 2491.98 2188.05 8.962E+6
32 11.64 47662.55 40.00 1191.50 3000.16 1.229E+7
64 14.22 58254.22 108.41 537.33 3686.06 1.510E+7
"****************packet size is****************" 8
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.80 7384.34 1.80 4096.00 924.85 3.788E+6
4 3.28 13443.28 3.28 4095.76 1680.51 6.883E+6
8 5.57 22795.13 5.90 3864.92 2854.61 1.169E+7
16 8.53 34952.54 14.03 2491.98 4376.11 1.792E+7
32 11.64 47662.55 40.00 1191.50 5960.31 2.441E+7
64 14.22 58254.22 108.41 537.33 7372.12 3.020E+7
"****************packet size is****************" 4
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.80 7384.34 1.80 4096.00 1847.89 7.569E+6
4 3.28 13443.28 3.28 4095.76 3361.02 1.377E+7
8 5.57 22795.13 5.90 3864.92 5703.33 2.336E+7
16 8.53 34952.54 14.03 2491.98 8738.19 3.579E+7
32 11.64 47662.55 40.00 1191.50 11920.63 4.883E+7
64 14.22 58254.22 108.41 537.33 14635.83 5.995E+7
"*************** # of PUs = **********" 16384
"****************packet size is****************" 32
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.97 2016.49 1.97 1024.00 64.98 1.065E+6
4 3.88 3971.88 3.88 1023.73 124.15 2.034E+6
8 7.53 7710.12 8.55 901.70 247.97 4.063E+6
16 14.22 14563.56 33.06 440.54 462.82 7.583E+6
32 25.60 26214.40 176.97 148.13 884.87 1.450E+7
64 42.67 43690.67 931.72 46.89 1863.44 3.053E+7
"****************packet size is****************" 16
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.97 2016.49 1.97 1024.00 128.00 2.097E+6
4 3.88 3971.88 3.88 1023.73 248.31 4.068E+6
8 7.53 7710.12 8.55 901.70 487.39 7.985E+6
16 14.22 14563.56 33.06 440.54 925.63 1.517E+7
32 25.60 26214.40 176.97 148.13 1769.73 2.900E+7
64 42.67 43690.67 931.72 46.89 2795.17 4.580E+7
"****************packet size is****************" 8
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.97 2016.49 1.97 1024.00 254.03 4.162E+6
4 3.88 3971.88 3.88 1023.73 496.61 8.137E+6
8 7.53 7710.12 8.55 901.70 966.22 1.583E+7
16 14.22 14563.56 33.06 440.54 1851.26 3.033E+7
32 25.60 26214.40 176.97 148.13 3362.49 5.509E+7
64 42.67 43690.67 931.72 46.89 5590.33 9.159E+7
"****************packet size is****************" 4
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.97 2016.49 1.97 1024.00 506.09 8.292E+6
4 3.88 3971.88 3.88 1023.73 993.23 1.627E+7
8 7.53 7710.12 8.55 901.70 1932.44 3.166E+7
16 14.22 14563.56 33.06 440.54 3669.46 6.012E+7
32 25.60 26214.40 176.97 148.13 6724.98 1.102E+8
64 42.67 43690.67 931.72 46.89 11180.66 1.832E+8
"*************** # of PUs = **********" 65536
"****************packet size is****************" 32
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 4.00 1024.00 4.00 256.00 32.00 2.097E+6
4 8.00 2048.00 8.15 251.31 65.19 4.273E+6
8 16.00 4096.00 25.31 161.82 151.87 9.953E+6
16 32.00 8192.00 144.67 56.63 289.33 1.896E+7
32 64.00 16384.00 1056.33 15.51 1056.33 6.923E+7
"same as no conditional case"
"****************packet size is****************" 16
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 4.00 1024.00 4.00 256.00 64.00 4.194E+6
4 8.00 2048.00 8.15 251.31 130.39 8.545E+6
8 16.00 4096.00 25.31 161.82 278.43 1.825E+7
16 32.00 8192.00 144.67 56.63 578.66 3.792E+7
32 64.00 16384.00 1056.33 15.51 1056.33 6.923E+7
"same as no conditional case"
"****************packet size is****************" 8
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 4.00 1024.00 4.00 256.00 128.00 8.389E+6
4 8.00 2048.00 8.15 251.31 260.78 1.709E+7
8 16.00 4096.00 25.31 161.82 531.54 3.484E+7
16 32.00 8192.00 144.67 56.63 1157.33 7.585E+7
32 64.00 16384.00 1056.33 15.51 2112.67 1.385E+8
"same as no conditional case"
"****************packet size is****************" 4
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 4.00 1024.00 4.00 256.00 256.00 1.678E+7
4 8.00 2048.00 8.15 251.31 513.40 3.365E+7
8 16.00 4096.00 25.31 161.82 1037.78 6.801E+7
16 32.00 8192.00 144.67 56.63 2169.99 1.422E+8
32 64.00 16384.00 1056.33 15.51 4225.33 2.769E+8
"same as no conditional case"
"*************** # of PUs = **********" 262144
"****************packet size is****************" 32
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 16.00 1024.00 16.30 62.83 32.60 8.545E+6
4 32.00 2048.00 50.62 40.46 101.25 2.654E+7
8 64.00 4096.00 289.33 14.16 289.33 7.585E+7
"same as no conditional case"
"same as no conditional case"
"same as no conditional case"
"****************packet size is****************" 16
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 16.00 1024.00 16.30 62.83 65.19 1.709E+7
4 32.00 2048.00 50.62 40.46 151.87 3.981E+7
8 64.00 4096.00 289.33 14.16 289.33 7.585E+7
"same as no conditional case"
"same as no conditional case"
"same as no conditional case"
"****************packet size is****************" 8
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 16.00 1024.00 16.30 62.83 130.39 3.418E+7
4 32.00 2048.00 50.62 40.46 303.74 7.962E+7
8 64.00 4096.00 289.33 14.16 578.66 1.517E+8
"same as no conditional case"
"same as no conditional case"
"same as no conditional case"
"****************packet size is****************" 4
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 16.00 1024.00 16.30 62.83 260.78 6.836E+7
4 32.00 2048.00 50.62 40.46 556.86 1.460E+8
8 64.00 4096.00 289.33 14.16 1157.33 3.034E+8
"same as no conditional case"
"same as no conditional case"
"same as no conditional case"
NIL

Now we calculate the profile due to messages to the rest region in an analogous fashion. Here the values of 1/ρr are being called "rho". Again, for large system sizes, the profile tends towards a random-connection profile if ρr is low enough.

> (msg_calc_total (expt 2 24) 512 32 '(32 16 8 4) 4 '(16384 4096 1024 256 64) '(2 4 8 16 32 64) (expt 2 24))
"*************** # of PUs = **********" 1024
"****************packet size is****************" 32
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.00 16384.00 1.00 16384.00 512.00 5.243E+5
4 1.00 16384.00 1.00 16378.50 512.17 5.245E+5
8 2.00 32706.12 2.31 14175.04 1022.14 1.047E+6
16 3.98 65288.93 10.10 6465.39 2049.94 2.099E+6
32 7.94 130087.45 67.10 1938.84 4092.82 4.191E+6
64 15.76 258235.17 504.77 511.59 8076.27 8.270E+6
"****************packet size is****************" 16
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.00 16384.00 1.00 16384.00 1024.00 1.049E+6
4 1.00 16384.00 1.00 16378.50 1024.34 1.049E+6
8 2.00 32706.12 2.31 14175.04 2044.27 2.093E+6
16 3.98 65288.93 10.10 6465.39 4089.78 4.188E+6
32 7.94 130087.45 67.10 1938.84 8185.64 8.382E+6
64 15.76 258235.17 504.77 511.59 16152.54 1.654E+7
"****************packet size is****************" 8
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.00 16384.00 1.00 16384.00 2048.00 2.097E+6
4 1.00 16384.00 1.00 16378.50 2048.69 2.098E+6
8 2.00 32706.12 2.31 14175.04 4088.54 4.187E+6
16 3.98 65288.93 10.10 6465.39 8169.46 8.366E+6
32 7.94 130087.45 67.10 1938.84 16304.19 1.670E+7
64 15.76 258235.17 504.77 511.59 32305.09 3.308E+7
"****************packet size is****************" 4
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.00 16384.00 1.00 16384.00 4096.00 4.194E+6
4 1.00 16384.00 1.00 16378.50 4096.37 4.195E+6
8 2.00 32706.12 2.31 14175.04 8177.08 8.373E+6
16 3.98 65288.93 10.10 6465.39 16328.83 1.672E+7
32 7.94 130087.45 67.10 1938.84 32541.28 3.332E+7
64 15.76 258235.17 504.77 511.59 64610.18 6.616E+7
"*************** # of PUs = **********" 4096
"****************packet size is****************" 32
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.00 4096.00 1.00 4096.00 128.00 5.243E+5
4 1.00 4096.00 1.00 4094.63 128.04 5.245E+5
8 2.00 8188.50 2.31 3542.14 256.60 1.051E+6
16 4.00 16370.01 10.15 1612.71 517.68 2.120E+6
32 7.99 32712.10 67.86 482.06 1085.73 4.447E+6
64 15.95 65312.76 516.53 126.44 2066.13 8.463E+6
"****************packet size is****************" 16
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.00 4096.00 1.00 4096.00 256.00 1.049E+6
4 1.00 4096.00 1.00 4094.63 256.09 1.049E+6
8 2.00 8188.50 2.31 3542.14 513.21 2.102E+6
16 4.00 16370.01 10.15 1612.71 1025.21 4.199E+6
32 7.99 32712.10 67.86 482.06 2103.61 8.616E+6
64 15.95 65312.76 516.53 126.44 4132.26 1.693E+7
"****************packet size is****************" 8
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.00 4096.00 1.00 4096.00 512.00 2.097E+6
4 1.00 4096.00 1.00 4094.63 512.17 2.098E+6
8 2.00 8188.50 2.31 3542.14 1024.10 4.195E+6
16 4.00 16370.01 10.15 1612.71 2050.42 8.399E+6
32 7.99 32712.10 67.86 482.06 4139.36 1.695E+7
64 15.95 65312.76 516.53 126.44 8264.52 3.385E+7
"****************packet size is****************" 4
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.00 4096.00 1.00 4096.00 1024.00 4.194E+6
4 1.00 4096.00 1.00 4094.63 1024.34 4.196E+6
8 2.00 8188.50 2.31 3542.14 2048.20 8.389E+6
16 4.00 16370.01 10.15 1612.71 4100.85 1.680E+7
32 7.99 32712.10 67.86 482.06 8210.86 3.363E+7
64 15.95 65312.76 516.53 126.44 16529.05 6.770E+7
"*************** # of PUs = **********" 16384
"****************packet size is****************" 32
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.00 1024.00 1.00 1024.00 32.00 5.243E+5
4 1.00 1024.00 1.00 1023.66 32.01 5.245E+5
8 2.00 2047.87 2.31 885.43 64.76 1.061E+6
16 4.00 4095.50 10.16 402.95 132.13 2.165E+6
32 8.00 8190.00 68.05 120.35 272.20 4.460E+6
64 15.99 16376.00 519.54 31.52 519.54 8.512E+6
"****************packet size is****************" 16
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.00 1024.00 1.00 1024.00 64.00 1.049E+6
4 1.00 1024.00 1.00 1023.66 64.02 1.049E+6
8 2.00 2047.87 2.31 885.43 129.52 2.122E+6
16 4.00 4095.50 10.16 402.95 264.26 4.330E+6
32 8.00 8190.00 68.05 120.35 544.41 8.920E+6
64 15.99 16376.00 519.54 31.52 1039.08 1.702E+7
"****************packet size is****************" 8
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.00 1024.00 1.00 1024.00 128.00 2.097E+6
4 1.00 1024.00 1.00 1023.66 128.04 2.098E+6
8 2.00 2047.87 2.31 885.43 256.73 4.206E+6
16 4.00 4095.50 10.16 402.95 518.35 8.493E+6
32 8.00 8190.00 68.05 120.35 1088.82 1.784E+7
64 15.99 16376.00 519.54 31.52 2078.15 3.405E+7
"****************packet size is****************" 4
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.00 1024.00 1.00 1024.00 256.00 4.194E+6
4 1.00 1024.00 1.00 1023.66 256.09 4.196E+6
8 2.00 2047.87 2.31 885.43 513.45 8.412E+6
16 4.00 4095.50 10.16 402.95 1026.54 1.682E+7
32 8.00 8190.00 68.05 120.35 2109.58 3.456E+7
64 15.99 16376.00 519.54 31.52 4156.30 6.810E+7
"*************** # of PUs = **********" 65536
"****************packet size is****************" 32
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.00 256.00 1.00 256.00 8.00 5.243E+5
4 2.00 512.00 2.04 251.31 16.30 1.068E+6
8 4.00 1024.00 6.33 161.82 37.97 2.488E+6
16 8.00 2048.00 36.17 56.63 72.33 4.740E+6
32 16.00 4096.00 264.08 15.51 264.08 1.731E+7
"same as no conditional case"
"****************packet size is****************" 16
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.00 256.00 1.00 256.00 16.00 1.049E+6
4 2.00 512.00 2.04 251.31 32.60 2.136E+6
8 4.00 1024.00 6.33 161.82 69.61 4.562E+6
16 8.00 2048.00 36.17 56.63 144.67 9.481E+6
32 16.00 4096.00 264.08 15.51 264.08 1.731E+7
"same as no conditional case"
"****************packet size is****************" 8
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.00 256.00 1.00 256.00 32.00 2.097E+6
4 2.00 512.00 2.04 251.31 65.19 4.273E+6
8 4.00 1024.00 6.33 161.82 132.89 8.709E+6
16 8.00 2048.00 36.17 56.63 289.33 1.896E+7
32 16.00 4096.00 264.08 15.51 528.17 3.461E+7
"same as no conditional case"
"****************packet size is****************" 4
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 1.00 256.00 1.00 256.00 64.00 4.194E+6
4 2.00 512.00 2.04 251.31 128.35 8.412E+6
8 4.00 1024.00 6.33 161.82 259.44 1.700E+7
16 8.00 2048.00 36.17 56.63 542.50 3.555E+7
32 16.00 4096.00 264.08 15.51 1056.33 6.923E+7
"same as no conditional case"
"*************** # of PUs = **********" 262144
"****************packet size is****************" 32
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 4.00 256.00 4.07 62.83 8.15 2.136E+6
4 8.00 512.00 12.66 40.46 25.31 6.635E+6
8 16.00 1024.00 72.33 14.16 72.33 1.896E+7
"same as no conditional case"
"same as no conditional case"
"same as no conditional case"
"****************packet size is****************" 16
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 4.00 256.00 4.07 62.83 16.30 4.273E+6
4 8.00 512.00 12.66 40.46 37.97 9.953E+6
8 16.00 1024.00 72.33 14.16 72.33 1.896E+7
"same as no conditional case"
"same as no conditional case"
"same as no conditional case"
"****************packet size is****************" 8
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 4.00 256.00 4.07 62.83 32.60 8.545E+6
4 8.00 512.00 12.66 40.46 75.93 1.991E+7
8 16.00 1024.00 72.33 14.16 144.67 3.792E+7
"same as no conditional case"
"same as no conditional case"
"same as no conditional case"
"****************packet size is****************" 4
"rho # msgs/ PE, per PU, # of PUs hit; msg_per_hit; # pkts/PU ; total pkts"
2 4.00 256.00 4.07 62.83 65.19 1.709E+7
4 8.00 512.00 12.66 40.46 139.21 3.649E+7
8 16.00 1024.00 72.33 14.16 289.33 7.585E+7
"same as no conditional case"
"same as no conditional case"
"same as no conditional case"
NIL
> (quit)

B.2.3 Function arguments for bandwidth estimation

The communication bandwidth required for a given system size and network model is estimated by the function band-width-table. The arguments of this function are theoretically approximated by the code in Sections B.1.1 and B.1.2. For net A on 16K processors, the following values were obtained for a wide range of ρi and ρr.
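For orientation, the positional arguments in these calls follow the parameter list of band-width-table in Appendix B.1.5; an annotated copy of the first call of the file below (our comments, not in the original):

;; (band-width-table busfactor core_pes infl_pes total_pes no_pus
;;                   rho1 rho2 a freq1 freq2 freq3)
(band-width-table 4 9 16 24 14 4 4 0.2 0 248 64)
;; busfactor = 4; core, influence and whole net span 2^9, 2^16 and 2^24
;; cells; no_pus = 14 (16K processors); rho1 = rho2 = 4; avbias a = 0.2;
;; freq1 = 0, freq2 = 248 and freq3 = 64 messages per iteration.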
figl.lsp

;; contains both theoretical and observed message patterns for various models
;; MODEL A
;; Theoretical (pkt size = 16)
;; (header 14)
(defun bwtab_theory ()
  (header 14)
  (band-width-table 4 9 16 24 14 4 4 0.2 0 248 64)
  (band-width-table 4 9 16 24 14 4 8 0.5 0 248 129)
  (band-width-table 4 9 16 24 14 4 16 0.5 0 248 264)
  (band-width-table 4 9 16 24 14 4 32 0.5 0 248 544)
  (band-width-table 4 9 16 24 14 4 64 0.5 0 248 1039)
  (band-width-table 4 9 16 24 14 8 4 0.5 0 487 64)
  (band-width-table 4 9 16 24 14 8 8 0.5 0 487 129)
  (band-width-table 4 9 16 24 14 8 16 0.5 0 487 264)
  (band-width-table 4 9 16 24 14 8 32 0.5 0 487 544)
  (band-width-table 4 9 16 24 14 8 64 0.5 0 487 1039)
  (band-width-table 4 9 16 24 14 16 4 0.5 0 926 64)
  (band-width-table 4 9 16 24 14 16 8 0.5 0 926 129)
  (band-width-table 4 9 16 24 14 16 16 0.5 0 926 264)
  (band-width-table 4 9 16 24 14 16 32 0.5 0 926 544)
  (band-width-table 4 9 16 24 14 16 64 0.5 0 926 1039)
  (band-width-table 4 9 16 24 14 32 4 0.5 0 1770 64)
  (band-width-table 4 9 16 24 14 32 8 0.5 0 1770 129)
  (band-width-table 4 9 16 24 14 32 16 0.5 0 1770 264)
  (band-width-table 4 9 16 24 14 32 32 0.5 0 1770 544)
  (band-width-table 4 9 16 24 14 32 64 0.5 0 1770 1039)
  (band-width-table 4 9 16 24 14 64 4 0.5 0 2795 64)
  (band-width-table 4 9 16 24 14 64 8 0.5 0 2795 129)
  (band-width-table 4 9 16 24 14 64 16 0.5 0 2795 264)
  (band-width-table 4 9 16 24 14 64 32 0.5 0 2795 544)
  (band-width-table 4 9 16 24 14 64 64 0.5 0 2795 1039)
  (band-width-table 4 9 16 24 14 100 100 0.5 0 3584 16384))

The values of the network parameters for model A were also obtained from a normal distribution with a standard deviation equal to one-tenth of the mean values given by Table 6.1. The communication profile based on these experimental values is used as the arguments of the function given below:

(defun bwtab_obs ()
  (header 14)
  (band-width-table 4 9 16 24 14 4 4 0.2 0 287 69)
  (band-width-table 4 9 16 24 14 4 8 0.5 0 287 128)
  (band-width-table 4 9 16 24 14 4 16 0.5 0 287 253)
  (band-width-table 4 9 16 24 14 4 32 0.5 0 287 528)
  (band-width-table 4 9 16 24 14 4 64 0.5 0 287 1004)
  (band-width-table 4 9 16 24 14 8 4 0.5 0 499 69)
  (band-width-table 4 9 16 24 14 8 8 0.5 0 499 128)
  (band-width-table 4 9 16 24 14 8 16 0.5 0 499 253)
  (band-width-table 4 9 16 24 14 8 32 0.5 0 499 528)
  (band-width-table 4 9 16 24 14 8 64 0.5 0 499 1004)
  (band-width-table 4 9 16 24 14 16 4 0.5 0 969 69)
  (band-width-table 4 9 16 24 14 16 8 0.5 0 969 128)
  (band-width-table 4 9 16 24 14 16 16 0.5 0 969 253)
  (band-width-table 4 9 16 24 14 16 32 0.5 0 969 528)
  (band-width-table 4 9 16 24 14 16 64 0.5 0 969 1004)
  (band-width-table 4 9 16 24 14 32 4 0.5 0 1963 69)
  (band-width-table 4 9 16 24 14 32 8 0.5 0 1963 128)
  (band-width-table 4 9 16 24 14 32 16 0.5 0 1963 253)
  (band-width-table 4 9 16 24 14 32 32 0.5 0 1963 528)
  (band-width-table 4 9 16 24 14 32 64 0.5 0 1963 1004)
  (band-width-table 4 9 16 24 14 64 4 0.5 0 2996 69)
  (band-width-table 4 9 16 24 14 64 8 0.5 0 2996 128)
  (band-width-table 4 9 16 24 14 64 16 0.5 0 2996 253)
  (band-width-table 4 9 16 24 14 64 32 0.5 0 2996 528)
  (band-width-table 4 9 16 24 14 64 64 0.5 0 2996 1004)
  (band-width-table 4 9 16 24 14 100 100 0.5 0 3401 15834))

The communication profile was also recorded for different system sizes for the five nets of Table 6.1, as shown below. For each of these experiments, we took ρi = 1/8 and ρr = 1/16.

(defun bw_diffpus ()
  (print "order is: model A, B, C, D, E")
  (terpri)
  (print "rho1 rho2 hypercube hypernet busnet mesh (total for same sequence)")
  (terpri)
  (load "spread10.lsp")
  (terpri)
  (print "data for 1K pus")
  (terpri)
  (band-width-table 4 9 16 24 10 8 16 0.5 0 3075 4404)
  (band-width-table 4 8 14 24 10 8 16 0.5 0 1036 32000)
  (band-width-table 4 8 13 24 10 8 16 0.5 0 1135 2074)
  (band-width-table 4 11 16 24 10 8 16 0.5 0 3093 8024)
  (band-width-table 4 12 14 24 10 8 16 0.5 0 1046 4129)
  (load "spread12.lsp")
  (print "data for 4K pus")
  (terpri)
  (band-width-table 4 9 16 24 12 8 16 0.5 0 1444 1054)
  (band-width-table 4 8 14 24 12 8 16 0.5 0 880 8200)
  (band-width-table 4 9 13 24 12 8 16 0.5 0 583 538)
  (band-width-table 4 11 16 24 12 8 16 0.5 0 1658 2057)
  (band-width-table 4 12 14 24 12 8 16 0.5 0 543 1063)
  (load "spread14.lsp")
  (print "data for 16K pus")
  (terpri)
  (band-width-table 4 9 16 24 14 8 16 0.5 0 519 253)
  (band-width-table 4 8 14 24 14 8 16 0.5 0 610 2180)
  (band-width-table 4 8 13 24 14 8 16 0.5 0 460 163)
  (band-width-table 4 11 16 24 14 8 16 0.5 64 1034 1065)
  (band-width-table 4 12 14 24 14 8 16 0.5 192 548 1162)
  (load "spread.lsp")
  (print "data for 64K pus")
  (terpri)
  (band-width-table 4 9 16 24 16 8 16 0.5 16 318 158)
  (band-width-table 4 8 13 24 16 8 16 0.5 0 605 598)
  (band-width-table 4 8 14 24 16 8 16 0.5 0 287 45)
  (band-width-table 4 11 16 24 16 8 16 0.5 112 1043 1163)
  (band-width-table 4 12 14 24 16 8 16 0.5 240 649 3869)
  (load "spread18.lsp")
  (print "data for 256K pus")
  (terpri)
  (band-width-table 4 9 16 24 18 8 16 0.5 28 321 1854)
  (band-width-table 4 8 14 24 18 8 16 0.5 12 301 5603)
  (band-width-table 4 8 13 24 18 8 16 0.5 12 369 78)
  (band-width-table 4 11 16 24 18 8 16 0.5 120 1327 2176)
  (band-width-table 4 12 14 24 18 8 16 0.5 250 232 3867))

B.2.4 Influence of packet size

The communication demand is affected by the maximum allowable packet size. If the size is 1, that is, each packet conveys only the state of a single cell, then a very large number of packets is generated. If the packet size is very large, on the other hand, then many packets will be only half full.

We assume the number of bytes in a packet of size n to be 2n + 4. This is because the addresses of the source and destination processors (2 bytes each) are common to all messages in a packet. Each message contains an internal cell address and the cell state, thus needing two bytes. The maximum number of bytes in a packet is its weight in the following execution trace for net A. For each packet size, two sets of (ρi, ρr) are used: (1/8, 1/16) and (1/32, 1/32). The trace given below is the basis for Figs. 6.5 and 6.6. For brevity, the last four columns of the output, listing the total bandwidth demands, have been deleted.
difpkts.lsp

Script started on Thu Oct 22 20:08:44 1987
> (diffpaks)
"data for 64K pus"
"rhos are r1=8, r2=16 followed by r1=r2=32"
"pkt-size = 1; weight = 6"
8 16 89152.0 141327.4 72061.0 2609442.0
32 32 270480.0 404578.6 207065.0 5855916.0
"pkt-size = 2; weight = 8"
8 16 42472.0 68327.6 34745.0 1312278.0
32 32 135276.0 202301.3 103553.0 2928486.0
"pkt-size = 4; weight = 12"
8 16 21660.0 34970.0 17766.0 676566.0
32 32 67664.0 101134.4 51784.0 1463682.0
"pkt-size = 8; weight = 20"
8 16 11268.0 18316.6 9289.0 359304.0
32 32 33860.0 50561.4 25904.0 731808.0
"pkt-size = 16; weight = 36"
8 16 5792.0 9363.6 4753.0 181038.0
32 32 16964.0 25282.3 12968.0 365904.0
"pkt-size = 32; weight = 68"
8 16 3048.0 4879.6 2481.0 91872.0
32 32 16964.0 25282.3 12968.0 365904.0
"pkt-size = 64; weight = 132"
8 16 1556.0 2431.0 1252.0 45408.0
32 32 16964.0 25282.3 12968.0 365904.0
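As a quick consistency check on the 2n + 4 rule stated above, the packet weights printed in this trace can be regenerated; a minimal sketch (the function name is ours, not from the original files):

;;; Hedged sketch: packet weight in bytes under the 2n + 4 rule (4 bytes
;;; of shared source/destination PU address per packet, plus 2 bytes of
;;; internal cell address and state per message).
(defun pkt-weight (n) (+ (* 2 n) 4))

;; (mapcar #'pkt-weight '(1 2 4 8 16 32 64)) => (6 8 12 20 36 68 132),
;; matching the weights printed in the trace above.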