Close
About
FAQ
Home
Collections
Login
USC Login
Register
0
Selected
Invert selection
Deselect all
Deselect all
Click here to refresh results
Click here to refresh results
USC
/
Digital Library
/
University of Southern California Dissertations and Theses
/
00001.tif
(USC Thesis Other)
00001.tif
PDF
Download
Share
Open document
Flip pages
Contact Us
Contact Us
Copy asset link
Request this asset
Transcript (if available)
Content
LOAD BALANCING METHODS FOR MESSAGE-PASSING M ULTICOM PUTERS by Jian Xu A D issertation Presented to the FACULTY OF TH E GRADUATE SCHOOL UNIVERSITY OF SOUTHERN CALIFORNIA In P artial Fulfillment of the Requirements for the Degree D O CTO R O F PHILOSOPHY (Com puter Science) December 1990 Copyright 1990 Jian Xu UMI Number: D P 22810 All rights reserved INFORMATION TO ALL U SE R S The quality o f this reproduction is dependent upon the quality of the copy submitted. In the unlikely ev en t that the author did not sen d a com plete manuscript and there are m issing p a g es, th e se will be noted. A lso, if material had to be rem oved, a note will indicate the deletion. Dissertation Ptbl stung UMI D P 22810 Published by ProQ uest LLC (2014). Copyright in the Dissertation held by the Author. Microform Edition © ProQ uest LLC. All rights reserved. This work is protected against unauthorized copying under Title 17, United S ta tes C ode ProQ uest LLC. 7 89 East E isenhow er Parkway P.O. Box 1346 Ann Arbor, Ml 4 8 1 0 6 - 1346 UNIVERSITY OF SOUTHERN CALIFORNIA THE GRADUATE SCHOOL P h t> UNIVERSITY PARK LOS ANGELES, CALIFORNIA 90089-4015 j This dissertation, written by ..................JIA N XU.............................................................. under the direction of h&x Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillm ent of re quirem ents for the degree of D O C TO R OF PH ILOSOPH Y Dean of Graduate Studies Date DISSERTATION COMMITTEE Chairperson Dedication To my parents, Shang-Qing and Jie-An Xu To my husband, Yong-Xiang Xu Acknowledgments i j I i My sincerest gratitude to my advisor, Dr. Kai Hwang for his guidance, support! and encouragement. In addition to his vast knowledge of com puter science and) com puter engineering, Dr. Hwang has an extremely vigorous approach to every-' thing he does which makes him a m ost simulating intellectual presence. During1 the course of my dissertation work, he has been a fountain of inspiration and insights. His unflagging enthusiasm for new ideas and adventures were invaluable in shaping this dissertation. I am grateful to him not only for generously sharing his knowledge with me but for providing encouragement to me. I wish to thank Dr. Michel Dubois and Dr. Kim Korner for serving on my dissertation committee. My thanks also to Dr. Less Gasser and Dr. Ming Deh1 Huang for serving on my qualifying exam committee. Dr. Wesley W . Peterson and Dr. Kazuo Sugihara instilled in me the spirit i of scholarly ethics during my early years when I was at the University of Hawaii. Sincere thanks to them. ' t My thanks to Steve Kuo for the coordinating work on the parallel production1 I l systems, and to Yui-Bin Chen for helping me use the graphics facility at the SUN| i workstation. My parents carefully nurtured me with boundless love and affection. My dear I elder sister Run-Lan has been a source of great love. By word and deed, they I have encouraged and supported me during their entire life. ^ ! To my husband, Yong-Xiang Xu, for his faith and support during the four years when I was apart from him while studying in Hawaii, I thank most of all; ! My thanks to him for his patience and understanding on the long nights when I was away from home working in the lab. i I was financially supported by a one-year graduate fellowship from the USC~] G raduate School. The financial support for this dissertation was provided i n \ part by an AT&T grant, “Optical M ulticom puter Network for Balanced Parallel Processing” . I have participated as a Research Assistant under the supervision of Professor Kai Hwang. Contents A cknow ledgm ents List of Figures List of Tables A bstract 1 Introduction 1.1 M ulticom puter S y s te m s .................................................... 1.2 Parallel Program Execution on A M ulticom puter . . 1.3 The M apping P ro b le m ........................................................ 1.4 A Dual-level Load Balancing S ch em e............................ 1.5 Summary of Original C ontributions............................... 1.6 Thesis O rg a n iz a tio n ........................................................... 2 R eview of the Load Balancing R esearch A rea 2.1 Basic Design P olicies........................................................... 2.2 Objectives and M easurem ents.......................................... 2.3 Taxonomy of Various M e th o d s ................. .................... 2.4 Im plem entation P r a c tic e s ................................................. 2.5 Perform ance Analysis M o d e ls .......................................... 3 Static Load Balancing Using Sim ulated A nnealing 3.1 The Simulated Annealing M ethod ............................... 3.2 M apping of Partitioned Program M o dules................. 3.2.1 Formulation of the Mapping Problem . . . ........................ 3.2.2 A New Cost F u n c tio n .................................................................. 3.3 M apping of M ultiple Production R u le s ................................................. 3.3.1 A Parallel Production S y stem .................................................... 3.3.2 M apping Based on Rule Dependency Analysis 3.3.3 Cost Function for Mapping Production Systems .............. 3.4 Im plem entation of the Simulated Annealing M e t h o d ..................... 3.4.1 Generating M e c h a n is m .............................................................. 3.4.2 Cooling Schedule............................................................................ 3.4.3 SIMAL: A Mapping T o o l ........................................................... 3.5 Markov Chain Analysis on C o m p lex ity ................................................. 3.5.1 A Model of Homogeneous Markov C h a in s .............................. 3.5.2 Proof of the Convergence to a Suboptim al Solution . . . . 3.5.3 Complexity A nalysis..................................................................... 3.6 Experim ental Results ............................................................................... 3.6.1 Results on Mapping Program M o d u les................................... 3.6.2 Benchmark Experim ents on Production S y s te m s ................ 3.6.3 Perform ance Analysis of the Simulated Annealing M ethod 4 H euristic P rocess M igration for D ynam ic Load Balancing 4.1 An A daptive D istributed Load Balancing Model ............................ 4.2 Heuristic Process M igration M e th o d s................................................... 4.3 PSIM: A Parallel Discrete Event-Driven S im u la to r ........................ 4.4 Perform ance of Heuristic M e th o d s ....................................................... 4.4.1 Performance Measures and Overhead E s tim a te s................. 4.4.2 Evaluation of the Heuristic M e th o d s ...................................... 5 DLB: A D ynam ic Load Balancer 5.1 O perating System Support . . ........................... . . 5.1.1 Process State and Process Control Block ............................. 99 5.1.2 O perating System P r im itiv e s ................................................ . 102 5.1.3 Control Level Parallelism in Process M ig r a tio n ................... 106 5.2 Im plem entation of the Load Balancer 109 ' 5.2.1 The Software Construction ..................................................... 109 5.2.2 Asynchronous Message P a s s in g .................................................. 113 6 Perform ance R esults of Benchm ark E xperim ents 115 6.1 Benchmark P r o g r a m s ..................................................................... 115: 6.1.1 The Tak P r o g r a m .......................................................................... 116 6.1.2 The Fibonacci P r o g ra m ................................................................ 117 6.1.3 T he Quicksort P r o g r a m ................................................................ 118! 6.2 Experim ents P e rfo rm e d ............................................................................. 120 6.2.1 Experim ents on the Tak and Fibonacci P ro g ra m s............... 120' 6.2.2 Experim ents on the Quicksort P r o g r a m ................................. 122; 6.3 Benchm ark Performance R e s u l t s ........................................................... 123 6.3.1 Speedup P e rfo rm a n c e ................................................................... 124j 6.3.2 Efficiency Analysis . ................................................................... 127i 6.3.3 Load Distribution E v a lu a tio n .................................................... 1291 J i 6.3.4 Comparison of Heuristic M ethods ........................................... 133 7 Sum m ary and Conclusions 138 7.1 Prim ary Results of T h e s is ............................................. 138 7.1.1 Flexibility and Suitability of th e Simulated Annealing M ethod for Static Load B a la n c in g ..................................... ....... 139 7.1.2 High Performance of the A daptive M ethod for Dynamic Load Balancing ............................................ 140^ 7.1.3 Parallelism at the Process Control L e v e l.................. 14lj 7.1.4 Im plem entation of the Dual-Level S ch em e ............... 141| I i vii 7.2 Suggestions for Future Research ........................................................... 142 R eferences 144 A ppendices 156 A The SIM AL Code 156 A .l Header M o d u le ................................................................... ............................ 156 A .2 Graphics Header Module ...................................... ............................ 160 A.3 Main M o d u le .............................................................. ............................ 162 A.4 Input M o d u le .............................................................. j ............................ 163 A.5 System D ata Built-Up M o d u le ............................ ............................ 165* A.6 Cooling Module ....................................................... ............................ 169 A .7 Initiation M o d u le....................................................... ............................ 177 A.8 Hardware Simulator M o d u le................................... ............................ 181 A.9 U pdate M o d u le........................................................... ............................ 182 A .10 Generating M o d u le....................................................... ............................ 188 A .11 U tility and Library M o d u le ...................................... ............................ 191 A .12 Declaration M o d u le .................................................... ............................ 195 A .13 Graphics M o d u le ....................................................... . ............................ 196 A .14 Screen Dump of the E -cu rv e................................... ............................ 204 B Benchm ark Production System s 205 B .l Toru-W altz P ro g ra m .......................................................... 205 B.1.1 Program C o d e ................................................................................ 205 B .l.2 Characteristic M a tr ic e s ............................................................... 220 B.2 Tourney Program ...................................................................................... 222 B.2.1 Program C o d e ................................................................................ 222 B.2.2 Characteristic Matrices ..................................................... 232 viii C The PSIM Code 234 C .l Host Program .............................................................................................. 234 C.1.1 Host Header M o d u le............................................................... 234 C.1.2 Host M o d u le 238 1 i C .l.3 Host Print. M o d u le 249 j C.1.4 Host Declaration M o d u le ............................................................ 251 C.2 Node P r o g r a m .............................................................................................. 2 5 2 1 C.2.1 Node Header M o d u le ...................................... ........................ 252 C.2.2 Node M o d u le........................................................ ........................ 2 5 8 1 C.2.3 Initialization Module ...................................... ........................ 259 C.2.4 Simulation Control M o d u le ............................ ........................ 262 C.2.5 Process Event M o d u le ...................................... ........................ 265 C.2.6 Node Communication Module ..................... ........................ 270 C.2.7 Threshold U pdate Module ............................ ........................ 273 C.2.8 Decision Making M o d u l e ............................... ........................ 276 C.2.9 Recording Statistics M o d u l e ......................... ........................ 2781 C.2.10 M igration M o d u le ............................................. ........................ 280 i 1 C.2.11 Node Library M o d u le ...................................... ........................ 284 C.2.12 Node Declaration M o d u le ............................... ........................ 289! D The DLB Code 2 9 1 1 D .l Host Program ............................................................................................. 291 j D.1.1 Host Header M o d u le...................................................................... 291! D .l.2 Host Main M o d u le .......................................................................... 294; D .l.3 Host-Mailman M o d u le................................................................... 298! D .l.4 Host-load M o d u le .......................................................................... 3011 D .l.5 Host Declaration M o d u le ............................................................ 3031 D.2 Node P r o g r a m ............................................................................................. 304! D.2.1 Node Header M o d u l e ................................................................... 304 D.2.2 Node Main Module ....................................................................... 309 D.2.3 Kernel M o d u le ................................................................................. 313 D.2.4 Run M o d u le ..................................................................................... 320 D.2.5 Suspend M o d u le .............................................................................. 322 D.2.6 Load Transfer M o d u le .................................................................... 324 D.2.7 Queue Handler M o d u le ................................................................ 328 D.2.8 Benchmark Program M o d u le ...................................................... 331 D.2.9 Node Library M o d u le 336 | D .2.10 Node Declaration M o d u le ............................................................. 337 x List of Figures 1.1 The architecture of a m ulticom puter.................................................... 2 1 i 1.2 Distinct features of a m ulticom puter..................................................... 3| I 1.3 Partitioning of a program for distributed execution on a m ulti- | com puter........................................................................................................ 5 1.4 Load imbalance in a m ulticom puter system ........................................ 6 1.5 A dual-level load balancing scheme................. 8 2.1 A taxonom y of various load balancing m ethods................................. 18 3.1 A general simulated annealing procedure............................................. 32 i 1 3.2 M apping of an example program , partitioned into many modules, onto the nodes of a hypercube m ulticom puter system .................... 33! 3.3 The calculation of the cost function for the initial allocation in : Fig. 3.2 ...................................................................................................................... 37i 3.4 The execution model of a parallel production system on a m ulti- ] J com puter........................................................................................................ 39; 3.5 The rule dependencies, parallelism, and communication matrices j of a two-rule production system ............................................................. 42 3.6 The m atrices of the T ourney production system and a mapping corresponding to the allocation m atrix X ........................................... 46 3.7 New allocations generated after applying three different swap functions to the allocation in Fig.3.2.................................................... 49 i 3.8 Components of SIMAL for parallel m apping of production systems 51 3.9 The simulated annealing procedure im plem ented in SIMAL. . . 53 XI J ' O ff 3.11 3.12 3.13 3.14 3.15 3.16 4.1 4.2 4.3 4.4 4.5 4.6 4.7 The final allocation ancfthe convergence curve of the costTunc- tion obtained from an annealing process............................................. Convergence of the cost function between a good and a bad initial a llo c a tio n ..................................................................... .............................. Relative performance of three swap functions used in the simu lated program allocation experim ents. .......................................... Variation of the cost function for three cooling speeds in the simulated annealing program allocation process............................... E-curve obtained from the sim ulated annealing process in two benchm ark experim ents.......................................... .................................. Speedup comparison of different m apping m ethods......................... Comparison of different m apping m ethods on cost function vs. machine size.................................................................................................. An adaptive load balancing model for a m ulticom puter with n processor nodes and a host p ro cesso r................................ ................. An example of the variable tim e window function W t w ith k\ = 0.01 and k2 = 0.1........................................................................................ A queueing model for the dynam ic load balancing at each dis tributed com puter node............................................................................. An example of the open network load balancing model (D: deci sion maker, S: server)................................................................................. Software components of PSIM on an iP S C /2 hypercube system. Mean and standard deviation of the response tim e vs. mean utilization level............................................................................................. Mean and standard deviation of th e response tim e vs. load im balance factor............................................................................................... 4.8 Mean and standard deviation of the response tim e vs. mean service time. . ..................................................................... 95 4.9 Mean and standard deviation of the response tim e vs. m ulticom puter size........................................................................................................ 96 5.1 An im plem entation model describing the operations of distributed load balancer at each processor node.................................................... 99 5.2 Process state transitions among various queues 100 j 5.3 The structure of a Process Control Block (PCB) and the Process ! I Halting Result (P H R )............................................................................... 102: 5.4 The structure of Data Set ( D S ) ............................................................ 103, 5.5 The Semantics of the ru n and suspend OS directives 104 ‘ i 5.6 The process creation tree and the PCBs used in the invocation of the fibonacci function fib (4), at node 0 of a 32-node m ulti com puter system .......................................................................................... 107 5.7 Exploiting parallelism by process m igration...................................... 1081 5.8 The supervision kernel in th e host processor 110 { 5.9 The load balancer program at each node processor.......................... 112 6.1 Speedup obtained from balanced execution of the tak program. 125 6.2 Speedup obtained from balanced execution of th e fibonacci pro gram ................................................................................................................ 126 6.3 Speedup obtained from balanced execution of the quicksort pro- I gram (randomly generated d ata set).................................................... 128! 6.4 Efficiency obtained from balanced execution of the tak program. 130; 6.5 Efficiency obtained from balanced execution of the fibonacci pro- \ i gram ................................................................................. 131- I 6.6 Efficiency obtained from balanced execution of the Q uicksort ' I program (random ly generated d ata set).............................................. 132 T he E-curve generated during th e annealing process iiTrhapping the Toru-waltz production system onto an 8-node hypercube m ul ticom puter ................................................................................................. List of Tables i 1 2.1 The Hierarchy of Load Balancing T a x o n o m y .................................... 19 3.1 Comparison of Three M ethods to Map the T oru-w altz Produc- i tion System on a 16-node Hypercube Com puter . 66 j 3.2 Comparison of Three M ethods to Map the Tourney Production System on a 4-node Hypercube C om puter......................................... 67 3.3 Average Perform ance of SIMAL on the Toru-waltz System . . . 71 j 3.4 Cost Function and CPU Tim e in Using Three Initial Tem perature S e ttin g s ................................................................. 72 3.5 Cost Function and CPU Tim e as a Function of th e Term ination 1 1 Criterion (The choice of the param eter k ) ................................... 73j 3.6 Cost Function and CPU tim e as a Function of the Cooling Con- | stant a 73 j I 4.1 Exam ple Load D istributions Before and After th e Application of j the Local Round-Robin M e th o d ........................................................... 83 4.2 M easured Com m unication Overhead and M igration Delay . . . 90 6.1 The Process Execution D istribution For Program T akQ In Case 1 Experim ents............................................................................................... 134 6.2 The Process Execution D istribution For Program Fib In Case 1 Experim ents.................................................................................................. 135 6.3 The Process Execution D istribution For Program Q ksortQ on Sorting a 2k-size D ata Set........................................................................ 136 Abstract M ulticom puters have been widely used in num erical com puting for solving J scientific and engineering problems. However, with the existing operating sys-, terns, it is difficult to achieve appreciable speedup in using a m ulticom puter t o ( l execute Al-oriented application program s which have irregular structures and un-1 I predictable run-tim e behaviors. This thesis explores the role of load iaZancmgj for parallel execution of both numerical and AI programs on a message-passing | I m ulticom puter. In many applications, it is impossible to gain any speedup w ithout load bal ancing. Two barriers to high performance are identified: (1) Unbalanced initial load distribution and high communication cost caused by a poor static allocation. (2) Some nodes become idle while others are busy at run time. These problems; i are solved in this thesis by a dual-level load balancing scheme, which includes i i both static program allocation and dynamic process m igration. ' I An optim ization m ethod is presented to solve the static mapping problems) using a sim ulated annealing approach. Simulation and benchm ark experiments! I show the effectiveness of the static m apping m ethod. The m inim ization of a cost, I function reflects in a balanced load distribution. Theoretical proofs verify the near j I optim ality of the proposed m ethod. 4 A new adaptive scheme is presented for dynamic load balancing, which is based! on using easy-to-im plement heuristics and a variable threshold in m igrating pro cesses among the m ulticom puter nodes. This adaptive scheme uses a distributed} control overall processor nodes coordinated by a host processor. The adaptive| I nature and less com m unication and m igration overhead of the m ethod appear to be superior to the previously proposed m ethods. A mechanism which explores xvi parallelism of program execution at the process control level is presented in this thesis. Experim ents performed on a 32-node iPSC /2 hypercube m ulticom puter include the development of a parallel discrete event-driven sim ulator and the pro totype im plem entation of a distributed load balancer. Simulation and benchm ark results have shown th at this adaptive scheme can achieve high perform ance at low cost. This thesis proves th at sim ulated annealing is indeed a powerful m ethod to achieve static load balancing. The experim ental load balancing results verify the effectiveness of using the adaptive threshold and simple heuristics in achieving dynamic load balancing. The m ain contribution of this thesis lies in setting up a framework to build a dual-level load balancing system for m ulticom puters which can be applied to both numerical and artificial intelligence applications. Chapter 1 Introduction 1.1 M u ltico m p u ter S y stem s A m ulticom puter is a multiprocessor system w ith distributed-m em ory archi- I tecture as shown in Fig. 1.1, which is also called a message-passing multiprocessor. A m ulticom puter consists of multiple processor nodes connected by some inter connection network. Each processor node has a processor, a local memory and a switch connecting to the network. T he distributed memories are not shared. Interprocessor communication is perform ed via message passing through the in terconnection network. The interconnection network can be as simple as a bus.j i It can be also a point-to-point network such as m e s h , tre e , hypercube1 rin g ^ h yp ern et, etc [98]. Some other m ulticom puters use m ultistage networks such as, the butterfly switch [4] and Omega network [75]. Each processor node is simply] called a node. I The processor nodes in a m ulticom puter operate asynchronously w ith dis tributed program s and data sets, and use message-passing for communicationj I over the interconnection network. Therefore, a m ulticom puter is a m ultiple-' 1 instruction/m ultiple-data stream (MIMD) m achine which can exploit parallelism^ at both control and data levels. This is in contrast w ith tightly coupled com puter' I architectures where the processors com m unicate through a shared memory. The entire ensemble of a m ulticom puter is controlled by a distributed operating system i [107] so all the processor nodes work together to solve one problem. This is anj i i i Memory Memory Processor Switch Switch Processor Switch Memory Processor Interconnection Network Figure 1.1: The architecture of a m ulticom puter. I I im portant distinction from a local network operating system which links com puters j working on different problems. M ulticom puter examples include CDC Cyberplus,| Flexible/32, Intel iPSC, JP L Mark III, Caltech Cosmic, N CU BE/10, FPS-T , etc.; Figure 1.2 characterizes the features of a m ulticom puter by showing a 16-node, m achine connected by a hypercube interconnection network. Com pared to shared-memory m ultiprocessors, the principal driving forces be-1 hind the interest of m ulticom puter architectures include: ! j • Hardware scalability: The num ber of processor nodes can be easily increased! w ithout m ajor modification of the architecture. New design can be im ple m ented on a small num ber of processor nodes and expanded to a larger num ber of nodes in a process called scaling up. | • I I i • Suitability for distributed computing: Program s and d ata sets can be parti-! 1 tioned and executed on distributed nodes. W ithout the data sharing, th e1 memory conflicts can be reduced. This is particularly suitable for artifi-' cial intelligence (AI) applications, which usually require frequent m em ory J Processor Node Memory Processor Switch * Distributed Local Memory * Aysnchronous Message Passing * Distributed Operating System Figure 1.2: D istinct features of a m ulticom puter. I i accesses. ! • Application demands: M ulticom puter systems are deemed indispensable for a number of symbolic applications, in addition to the increasing demands of numerical supercom puting. They can be employed to alleviate the von j Neumann processor/m em ory bottleneck to intense and irregular memory J access patterns of m any symbolic processing applications, and to make large problems tractable. Although hardw are advantages make m ulticom puters appear very attractive, | the software support to existing m ulticom puters is seriously lacking. Conse- j quently, the benefits of this kind of loose-coupled architectures have not been I I realized widely. The increasing application demands make it im portant to inves- : tigate the p attern of parallel program executions on a m ulticom puter and provide ■ software methodologies and tools to achieve the software scalability. This moti-1 3 vated the research work to form ray thesis. 1.2 P arallel P rogra m E x e c u tio n on A M u ltico m p u ter . Parallel processing of a user program on m ulticom puter systems consists of three stages: 1. Program partitioning: An application problem needs to be decomposed into smaller sub-problems. The user program to solve this problem has to b e ; partitioned into multiple modules. A m odule is a set of program code con- 1 sisting of one or more procedures which are intensively connected. A pro-i cedure is an atom ic unit of code which cannot be decomposed any more. It is desirable th at the code of different modules can be executed concurrently! and the interconnections between modules are relatively sparse. Therefore, some modules may be identical, i.e. th e same code can be duplicated to b e( executed on different processor nodes to work on different d a ta sets. 2. Static Allocation: Partitioned program modules need to be allocated onto m ultiple processor nodes at the post compile tim e so th at the code at dis tributed nodes can be executed in parallel at run tim e. j 3. Parallel Execution: Each processor node starts the program execution by ac tivating an initial process. A process is an active procedure at run tim e. Thei parallel execution continues when m ultiple processor nodes execute newlyj created processes, until all the processes halt at all nodes. j l Figure 1.3 shows the structure of a program which can be executed on a . m ulticom puter in parallel. At run tim e each procedure will be activated as one or i more processes. i 4 Figure 1.3: Partitioning of a program for distributed execution on a multicom- j puter. I 1.3 T h e M ap p in g P ro b lem Using m ultiple processor nodes in a m ulticom puter appears to be capable of j speeding up the execution tim e of an application program - it is possible to let , each node perform equal am ount of work concurrently and finish the execution ! i at almost the same time. In this way, a linear speedup of program execution | I can be achieved. However, for some applications, it is possible to gain only very lim ited speedup. In other words, the speedup is not proportional to the num ber of processor nodes used. The m ajor obstacles for gaining speedup is the mapping problem in which partitioned program modules (at post compile tim e) or ready processes (at run tim e) need to be allocated to m ultiple processor nodes. The m apping problem at post compile tim e is done by static allocation where partitioned program modules and d ata sets are distributed to m ultiple nodes. An inappropriate allocation will initially cause high com m unication cost and create 5 initial process O process created at run time Figure 1.4: Load imbalance in a m ulticom puter system. an unbalanced load distribution. If tb e static allocation cannot maximize the parallelism , then only a few nodes can be fully used for problem solving. In addi tion, tremendous overhead will be created by frequent d a ta movements through message-passing if the d ata sets are not allocated properly. i The m apping or so called remapping at run tim e, assigns ready processes to the m ost appropriate nodes for execution. Such rem apping is usually not sup ported by th e available distributed operating system , where each process can only be executed at the node where it is created. However, th e run-tim e rem apping becomes extremely im portant to some program s which have highly variable and unpredictable behaviors, such as AI programs th at are d ata dependent and non-j determ inistic. Figure 1.4 shows an unbalanced situation in process creation on aj 4-node m ulticom puter. Assuming each process has the same execution tim e, the to tal execution tim e of this parallel processing is bounded by the tim e at node N i making the speedup very lim ited. Problems like this need to be solved through the run-tim e rem apping which reallocates processes by a mechanism called process mi- gration. Using process m igration, processes can be transferred from heavily loaded j nodes to the lightly loaded ones, so th a t the system workload can be balanced. ! I 1.4 A D u a l-lev el L oad B alan cin g S ch em e Load balancing is a mechanism by which a distributed system is able to allocate an equal am ount of workload to m ultiple processor nodes in order to achieve high perform ance. As discussed in the previous section, the m ain reasons for the lim ited speedup obtained from parallel processing are: (1) an inappropriate static allocation which is unable to maximize the parallelism and may cause high com m unication overhead, (2) the lack of dynam ic process rem apping technique to balance the system load at run tim e. These two problems can be solved by aj j dual-level load balancing scheme, which consists of the static program allocation! and the dynam ic process m igration. The scheme provides the high speed up by. i maximizing the degree of parallelism. The dual-level load balancing scheme as shown in Fig.1.5 is addressed prim arily to the m apping problem in which m ultiple program modules or processes are allocated onto m ultiple processor nodes. In my approach, th e m apping problem is! I divided into two levels. T he first level concerns how to m ap partitioned program] modules onto m ultiple nodes at post compile-time. This m apping is done by the! host or the global operating system. The second level addresses on how to remap i ready processes onto processor nodes at run-tim e. It is done by a distributed; I i operating system which has identical node OS resident in each processor node. ! Based on two levels of th e mapping problem, a static load balancing is chosen! to solve the first level m apping problem, with the goal to achieve a suboptim al initial load distribution which can further minimize com putation tim e and inter-! I I com m unication cost; a dynam ic load balancing is chosen to solve the second level Static Load Balancing post compile time multiple program modules allocation mapping ■ 1 1 Dynamic Load Balancing run time process migration — ► remapping Figure 1.5: A dual-level load balancing scheme. .8 . m apping problem, i.e. to balance the workload by redistributing processes. The dual-level load balancing scheme appeals to variety of programs mixes. It is suitable not only for numerical applications, but also for symbolic oriented programs. In general, numerical applications are regular and predictable, whereas symbolic applications are irregular and unpredictable. For predictable applica tions, load balancing can be achieved by static allocation. For symbolic applica tion with unpredictable run-tim e characteristics, even if there is an optim al initial placem ent, a balanced workload during the execution cannot be guaranteed. This | I m otivates the development of a dual-level scheme. ! I 1.5 S u m m ary o f O riginal C o n trib u tio n s i This thesis studied the load balancing problem in message-passing m ulticom -! puter systems. The original contributions of this thesis include: | • A d ual— level load balancing scheme is introduced, which consists of the! static program allocation and the dynam ic process m igration. The scheme j 1 can provide the high speedup by maxim izing the degree of parallelism. ItJ suggests a way to make m ulticom puter architecture and operating system appeal to all program mixes. • An efficient static allocation m ethod using sim ulated annealing, is presented to m ap partitioned program modules to m ultiple processor nodes. • A new parallel mapping m ethod is developed to m ap production rules in a! parallel rule-based system to nodes in a m ulticom puter. J • An optim ization m ethod and a software tool using sim ulated annealing are, developed to solve the static m apping problem. New cost functions are de-1 ! fined to reflect the goal of maximizing the degree of parallelism, balancing the load distribution and minimizing the intercom m unication cost. Simula tion and benchm ark experim ents show the effectiveness and efficiency of the mapping m ethod. • Theoretical proofs verify the near optim ality and low com plexity of the s ta tic ! load balancing m ethod, based on a homogeneous M arkov chain model. T h e 1 i complexity is semi-quadratic, which is superior to th e previously proposed m ethod. • A new adaptive distributed scheme for dynamic load balancing is proposed. This scheme cooperates with a central supervisor at the host m achine and adaptively invoke the load balancing activities under decentralized control. Four heuristic m ethods are proposed for process m igration. The overhead is reduced by low inform ation collection and process m igration cost, provided I by the adaptive nature of these heuristics. j i • A parallel discrete event-driven sim ulator is developed on a 32-node iPSC /2 j hypercube m ulticom puter. The parallel sim ulations are performed to ver-1 1 ify the perform ance of the dynamic load balancing scheme, using an open | queueing network model. j i • A prototype distributed load balancer is developed and im plem ented on a I 32-node iP S C /2 hypercube m ulticom puter. Benchm ark experim ents using 1 this load balancer show the feasibility of im plem enting th e proposed lo ad ! balancing m ethods in a real distributed operating system. • The benchm ark experiments exercised parallel execution of program s at thei j process control level. The performance results show th a t the new adaptive! load balancing scheme is superior to the previously proposed ones. ! • The ru n and suspend OS primitives are developed in a U N IX/C program-1 ming environment. They are used to maximize parallelism at the process! I 10 control level by incorporating w ith the dynamic load balancing. • Based on a comprehensive survey of state-to-the-art works in the area of distributed com puting, a new load balancing taxonom y is presented. It ■ attem pts to provide a common terminology and classification mechanism ; i necessary to address this problem. ' i 1.6 T h esis O rgan ization i C hapter 2 provides the background of load balancing for distributed com put ing. The design issues are summarized and the related work is surveyed in the ! up-to-date basis. A new load balancing taxonom y is presented to show the current I state of load balancing m ethods. C hapter 3 presents the static load balancing m ethod using sim ulated annealing. 1 i The sim ulated annealing m ethod is described in Section 3.1. Section 3.2 presents | I a model of m apping partitioned program modules to m ulticom puter nodes. The parallel m apping from production rules in a parallel rule-based system to processor nodes in a m ulticom puter, is presented in Section 3.3. In Section 3.4, the imple m entation issues and the software tools which have been developed are discussed. The theoretical proof of near optim ality and th e com plexity of the sim ulated an- ! nealing m ethod is shown in Section 3.5. In Section 3.6, the perform ance evaiuationj I is presented, which shows the effectiveness of the static load balancing methods ini solving the two m apping problems. The perform ance of th e m ethod is analyzed j to determ ine the effects of th e cooling schedule to the quality and efficiency of the sim ulated annealing m ethod. C hapter 4 introduces a new adaptive dynamic load balancing model under the supervision of a host processor. T he adaptive inform ation updating scheme is described by using an adjustable tim e window. Four heuristic process migration m ethods are presented using a queueing model. These m ethods are distinct in 11 I __ I - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - . - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - . - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - . - - - - - - - - - - - - - - - - - - - - - - - - - - - - j threshold updating, candidate selection and load transferring policy. A parallel t event-driven sim ulator developed on an Intel iP S C /2 hypercube m ulticom puter I is presented, which is used to evaluate the perform ance using an open-network queueing model. C hapter 5 presents a prototype distributed load balancer im plem ented on a 32-node Intel iP S C /2 hypercube m ulticom puter. It first discusses how to exploit parallelism in program execution at the process control level. The process state transitions, process control blocks and th e OS prim itives ru n and suspend are presented. By inserting the OS prim itives to th e sequential program s, parallelism' I can be achieved incorporating w ith the load balancing m ethod. The structure of th e load balancer is presented and the im plem entation issues are discussed. j C hapter 6 presents the benchm ark experim ents to evaluate the perform ance of the heuristic process m igration m ethods for dynam ic load balancing. Two types of the benchm ark experim ents are presented. The perform ance is evaluated inj I term of the speedup and efficiency obtained from benchm ark program executions] i and the corresponding load distributions among m ulticom puter nodes. I i 1 Finally, C hapter 7 summarizes the prim ary results of the thesis and presents directions for future research. I 12 Chapter 2 Review of the Load Balancing Research Area Load balancing has been studied extensively by researchers during the last two decades and becomes m ore and more im portant as parallel and distributed com- j puting grows. In this section, th e design policies and objectives of load balancing j i are presented. A new taxonom y of load balancing is proposed which is based on a ! t comprehensive state-of-the-art survey. Known formal models and im plem entation • i practices of various load balancing m ethods are reported. ^ i t 2.1 B a sic D e sig n P o licies I In distributed com puting, load balancing approaches the problem with the ! j philosophy th a t being fair to the resources of th e system is good to the user of th a t system. In general, a load balancing scheme is im plemented by a set of load : balancing policies [30] [78]. Listed are the policies th at form the design of a load ] balancing m ethod: j I i • D ecision M aking P olicy j The decision making policy determ ines when the load balancing is applied. | I In static load balancing, m ultiple processes or program modules are assigned j to processors at the post compiler time. It is assumed th a t inform ation j i regarding th e to tal mix of processes in the system as well as the independent , subtasks involved in a job or task force, are available by the tim e the program ! object modules are linked into load modules. Static load balancing tries to j ( I 13 I ensure th at each image of a process or m odule is assigned to the m ost suitable processor before the concurrent com putation starts. It is pre-determ ined in the sense th at the distribution of loads is according to some knowledge 1 th a t is set priori. In dynamic load balancing, processes are allocated tot I processors at run time. The load balancing control is interspersed in thej real tim e of actual com putation. The assignment of processes are based on i the estim ate of the system state. A process is assigned to th e best processor! upon its creation or arrival. It is not necessary for a process to be executed1 at where it is created. Therefore, a process may be m igrated to another, other processor for execution. * • C ontrol P olicy j i i Control policy decides the control authority to a load balancing scheme. It j determ ines who makes the load transferring decision. In centralized loadj balancing, there is a central controller which collects the system status in- j I form ation and makes a decision to balance the load. The controller assigns j processes to different processors. In decentralized load balancing, the au-j thority is physically distributed to each node of the entire system. Thej decisions on where processes should be assigned are m ade either indepen dently or cooperatively by individual nodes. The term distributed refers to a decentralized scheme in which the responsibility for making and car rying out a decision is shared among nodes in a distributed system. Since m ost of th e decentralized schemes are distributed, these two term s are used interchangeably. • Inform ation P olicy Inform ation policy dictates the m ethod of exchanging local status among processors. It concerns how inform ation is collected, the type, the am ount, I I the storage and the accuracy of inform ation. In state-dependent or deter- 14 m inistic approaches, load balancing decision is based on the current state of J f the system , which includes CPU utilization, am ount of free memory, average I i response tim e, num ber of the active processes, etc. In a probabilistic or a 1 nondeterm inistic approach, an arriving job is dispatched to the processor ac- i I cording to a set of branching probabilities th at is collected from the previous | experience or is based on system characteristics. The m ethods of informa- ! tion collection include broadcasting, which is a one-to-m any com m unication; or point-to-point message passing, which is a one-to-one com m unication. In general, there are conflicts betw een inform ation gathering and dissem ina tion. In order to reduce overheads, inform ation collection should be lim ited to the point where system status can be reflected adequately. j | • In itiation P olicy j l I Initiation policy determ ines who invokes the load balancing activities. If' the control policy is centralized, then the initiator is simply the central controller. Under a decentralized control, if a sender-initiated policy is used,, then load balancing is initiated by the heavily loaded processor sending o u t j the load; if a receiver-initiated policy is applied, load balancing is initiated i by the lightly loaded processor receiving the load. : • Transfer P olicy j Transfer policy decides w hether a process should be processed locally or! should be transferred to some other processor. In a state-dependent ap-j proach, it is dependent not only on the local processor state, b u t also on the global state. In general, the transfer policy selects a process to be migrated! from the local processor when the local node is heavily loaded and there are! some lightly loaded or idle nodes available. W hen a threshold is used to- 1 determ ine w hether a node is heavily or lightly loaded, the transfer policy is! I called a threshold policy. j 15! • Location P olicy j Location policy determ ines where a process should be m igrated. It selectsj a candidate processor as the destination of process m igration. In dynamic; load balancing, an inadequate location policy will cause some processes to be; i m igrated back to their original processors. This is called process thrashings Location policy is usually based on the com bination of local state and global state. j I • M igration P olicy j i This is a policy to decide how a process is going to be m igrated. In single mi-J gration, a process is allowed to be m igrated only once; whereas in repetitive] approach, a process is allowed to be m igrated m ore th an once. On the other hand, in nonpreem ptive policy, a process can only be m igrated to otherj processor before it starts execution; in preem ptive policy, a process can be \ m igrated to the other processor even if it has started the execution. ! i t 2.2 O b jectiv e s and M ea su rem en ts The goal of load balancing is to improve the system perform ance by assigningi I an equal am ount of workload to each processor. The following objectives andj I m easurem ents are used to justify load balancing m ethods. I i • Scalability Scalability is th e capability to provide increases in performance with corre sponding increases in th e num ber of processors. • E ffectiveness Effectiveness m eans th a t there should be a significant cost decrease resulting from using load balancing, as compared to not using it. 16 I | • F lexib ility Flexibility is the ability to work effectively for variable application size. • S tab ility t 1 Stability is the bounded perform ance at different loads and overheads. W hether I the load balancing m echanism can continue to perform functionally when j j th e system is overloaded, is m easured. • Efficiency Efficiency concerns the cost or penalty paid to the perform ance enhance m ent. High efficiency requires minimizing any cost as a result of running the load balancer. I t • Fairness ! I ! Fairness concerns uniformly acceptable perform ance provided to the pro-j cesses at run tim e. There should be no process suffering indefinite post- j ponem ent by perform ing load balancing mechanism. 2.3 T a x o n o m y o f V ariou s M eth o d s A taxonom y of load balancing shown in Fig.2.1 provides a convenient mean of quickly describing the central aspects of load balancing schemes. It is different from previous proposed taxonom y [16] [123], in classifying load balancing m ethods by a hierarchy of design policies as shown in Table 2.1. Under dynamic load; balancing, some classifications are overlapped w ith more than one policy. One' classification which overlaps transfer and location policies is non-cooperative vs.j cooperative. In the non-cooperative approach, each processor decides when and| where to transfer the load w ithout knowing the state of the rest of the processors. In the cooperative approach, processors coordinate with each other to make a Load Balancing static dynamic Static Load Balancing centralized probabilistic state-dependent math distributed probabilistic g r§ E h _____ theoretic programming heuristic Dynamic Load Balancing centralized distributed state-dependent probabilistic ^ In d e p e n d e n t probabilistic heuristic . . ,..... cpnrlpr..........................receiver- non-cooperative cooperative Figure 2.1: A taxonom y of various load balancing m ethods. 18,’ Table 2.1: The Hierarchy of Load Balancing Taxonomy Level Policy Classification 1 decision making static vs. dynamic 2 control authority centralized vs. distributed 3 inform ation state-dependent vs. probabilistic 4 initiation sender-initiated vs. receiver-initiated 5 transfer and location non-cooperative vs. cooperative load transfer decision. A nother classification under dynam ic load balancing is adaptive vs. non-adaptive. In the adaptive m ethod, the load balancing policies are modified as the system state changes. In th e non-adaptive m ethod, the policies are unchanged. This issue is not included in the taxonom y because any dynam ic m ethod can be either adaptive or non-adaptive. In the following, previously proposed load balancing m ethods are surveyed, using the taxonom y proposed. A . Static Load B alancing A .l C entralized , The state-dependent centralized m ethods include graph theoretic, m ath pro- \ gramming and heuristic approaches, where the first two seek optim al solutions while the heuristics try to find only suboptim al solutions. The probabilistic m eth ods are based on queueing theory. • Graph Theoretic In this approach, modules and processors are represented as nodes in a graph, and the interprocessor communication cost as undirected weighted edges connecting the nodes. The total processing cost is the sum of module i | 19 i - — -------- ■ ■ ■ ■ ■ - - -_____ _ i processing cost and interprocessor com m unication cost. The module allo cation mechanism is to minimize the to tal processing cost by performing a m in-cut algorithm [59]. The famous m in-cut/m ax-flow algorithm was used to show th at optim al assignm ents can be found for two processor systems [113]. The lim itations of the scheme are: (1) the problem of finding a m in im um N-way cut for N > 2 is NP-com plete, (2) it is assumed th at there is no concurrency between modules, and (3) it has inability to consider re source lim itation. Recently, a new family of heuristic algorithm s was added to the graph theoretic model by [80], which improves the characteristics of J the model. In [10], a sum-bottleneck path algorithm is developed for a doubly S weighted graph to solve the static partition problem. j i I • M athematical Programming ! This m ethod is also called integer 0-1 program m ing. It form ulates the mod- ule allocation problem to an objective function of the assignm ent. By adding some constraints such as resource lim itation, load balancing, and real-time! requirem ents, the goal is to minimize th e objective function. The classic j integer 0-1 program m ing is done in [24]. A m echanism of applying m athe-' m atical programming was practiced to module allocation [93]. The recent work for a fully distributed system is proposed in [56], where th e allocation problem is formed as a nonlinear, nonconvex, nonseparable and m inim ax m athem atical program m ing. M athem atics program m ing has a certain po- i tential which allows adding constraints for various applications. However, i t 1 l is expensive in execution tim e to achieve an optim al solution. j i ! • Heuristic \ Heuristic approaches only seek an approxim ate solution for static allocation. The key is to find an allocation to compromise the costs between interprc^ cessor communication cost and load balancing. T he previously proposed! work includes: [9], [32], [50], [79], [80]. The iterative improvement heuristic was used to im plem ent a LAST algorithm for task allocation [8]. Simulated annealing, as a generalization of th e iterative im provement [68], was applied | to static allocation w ith better results [6] [63], [105], [127]. j ! • Queueing-model Probabilistic m ethods often use queueing models. In [19], a standard queue-' ing network is used to analyze load balancing in a heterogeneous m u lti-; i l processor system , where a central dispatcher allocates processes to multiple! hosts, in order to minimize the m ean response tim e in the network. In [22], l execution of a program consisting of a set of modules on a distributed system was modeled by a semi-M arkov chain process w ith rewards. ( A .2 D istributed j i D istributed static load balancing is often used in local networks, where each; host allocates its own arriving tasks or jobs. Each job or task may be executed at the host where it arrives or transferred to another host. Finding optim al; probabilistic rules in a product-form queueing network has been developed in [114], | where the load balancing is form ulated as a nonlinear program m ing problem. In i [34], three heuristic m ethods are presented to minimize a cost function at each! host, using a simple queueing model. i * I B. D ynam ic Load B alancing B .l C entralized • Heuristic In a centralized heuristic m ethod, the central controller uses some heuristic j strategy to estim ate the system state and balance the workload. Simulatedj annealing has been used for dynamic load balancing by performing continu ously during the run tim e [41], [42]. Based on the neural networks approach [60], a bold network model was proposed to minimize th e energy function as the sum m ation of each processor’s workload by a determ inistic set of equa tions [42]. In [45], a two-tie scheduling is proposed to balance the workload by process m igration in a shared-m em ory m ultiprocessor system. • Queueing Theoretic Classic queueing theory assumes a workload of continuously arriving inde pendent processes. In general, a n*(M /M /l) queueing model is used, where n is the num ber of processors in the system. In a single class model, a nondeterm inistic policy is used to exploit several job routing strategies to reduce the average job turnaround tim e [19], [89]. In a multiple class model, an optim al probabilistic algorithm is used to schedule different classes of jobs to minimize the average job response tim e [90]. T he recently proposed m ethods are adaptive in nature. In [23], an adaptive distributed queue ing network was proposed to im plem ent task scheduling and adapt itself to workload fluctuation. An adaptive quasi-static algorithm for allocating jobs in the generic arrival stream was proposed in [11]. T he communication delays and its effects on load sharing is studied in [84], where the queueing theoretical models are form ulated to use a M atrix-G eom etric technique to solve load balancing in types of heterogeneous systems. B .2 D istributed The initiation policy divides distributed dynam ic load balancing into sender- initiated and receiver-initiated m ethods th a t can be found in a producer-consumer fundam ental relationship [123]. A threshold is often used to determ ine the transfer and location policy. If the thresholds can reflect the system state changes, then the m ethods become adaptive. As reported in [15] and [31], the sender-initiated m ethod is unstable at high loads and superior at low loads; whereas th e receiver-' I initiated m ethod provides b etter perform ance when the system is heavily loaded! but has high overhead during the system initiation. By using adaptive thresholds disadvantages can be diminished. • Sender-initiated Methods Previously proposed sender-initiated m ethods includes contract net protocol I [106], bidding [108] [111], above-average [70], gradient [77], greedy [20] and1 flexible [97]. Each is different from each other with respect to transfer and location policy. Sender-initiated m ethod yields perform ance close to opti m al, particularly at light to m oderate system loads. It takes advantage of I im m ediately beginning process m igration as soon as the processor enters a1 heavily loaded state. However, since the am ount of load balancing activ-' 1 ities increases as the system load increases, it reaches a point where the ( cost of such mechanism is greater th an its benefits. An adaptive methodj is proposed in [33], which reduces the overhead caused by inform ation col lecting and load transferring. A nother adaptive m ethod proposed in [94], I uses gradient estim ators em bedded in a sender-initiated scheme to provide; an adaptive threshold. , i • Receiver-initiated Methods j I The well-known receiver-initiated m ethods are pull-when-idle [78] and draft ing [91]. Pull-when-idle and drafting m ethods differ in their location poli cies. The drafting scheme selects a location by broadcasting draft requests i I and comparing the replies, whereas the poll-when-idle m ethod looks for any! I i location th at satisfies the m igration requirem ents. • Hybrid Method Among non-adaptive m ethods, a hybrid m ethod provides sufficient im prove m ent in perform ance [21]. It uses the receiver-initiated drafting when the system is heavily loaded, and the sender-initiated gradient m ethod when the system load is light. • Probabilistic A cooperative m ethod is proposed in [65], where each host estim ates the load level of other hosts and com putes the probability to transfer loads. An adaptive m ethod for distributed real-tim e systems is developed w ith state-change broadcasts [103]. It is receiver-initiated, but embedded w ith a M arkov chain model which calculates the distributed queue length at each node and the probability of meeting task deadlines. | 2.4 Im p le m e n ta tio n P ra c tices I Load balancing schemes proposed by researchers have been im plem ented in various distributed systems. In the following, I briefly discuss several successful! I im plem entations in local networks or m ulticom puter systems. j I | • Load balancing was efficiently im plem ented in P urdue Unix network [61]. j The system is called ECN(Engineering Com puter Network), which is a het-j erogeneous environm ent. The mechanism is state-dependent. The state of th e system is represented by a num ber called load average. The m ajor function was perform ed by a program called rxe, which is a scheduling rou-! tine developed to run a selected set of commands on the most idle machine' I available in a transparent m anner. j • Dynam ic process m igration was im plem ented in D EM O S/M P [92]. D E - M O S/M P is a version of DEMOS operating system with the extension to! operate in a distributed environment. It is a message passing system, w ith1 i i 24' _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ J distributed control. The rem arkable work of this im plem entation is th at a process can be m igrated during th e execution and continued on another processor w ith continuous access to all its resources. I • Load balancing was im plem ented in LOCUS distributed operating system! [122]. It is an im plem entation of n*(M /M /l) model. LOCUS is a Unix! com patible distributed operating system in operational use at UCLA. Load balancing was im plem ented with transparent distributing process execution. • In Roses System for the robotic and m achine intelligence application on NCUBE, sim ulated annealing was im plem ented as a static load balancing of i the framework [6]. • Dynamic load balancing is im plem ented on M OOSE (M ultitasking Object- oriented O perating System) for Caltech M ark II, NCUBE and iPSC. All m ajor objects in M OOSE, semaphores, pipes and tasks are ultim ately can didates for reallocation during load balancing. Optim izing the load balance! ! is expressed as the minimizing of a H am iltonian, which is defined as the! sum of the sum m ation of total load of each processor and the summation! I of com m unication between two processors. Sim ulated annealing or neuralj network m ethods can alternatively be chosen as the load balancing scheme' [66]. • In NEST, a network of w orkstations, preem pty process m igration is imple-j m ented at AT&T Bell Lab [35]. The decentralized control is used, and the dynam ic load balancing is adaptive. The model consists of an inform ation^1 and a control policy module. j I • GOMMON (global allocation from m axim um to m inim um in constant tim e), a load balancing strategy for local com puter systems w ith multiaccess net works, is im plemented on a network of SUN w orkstations [7]. Process mi gration is im plem ented by using the T C /P IP protocol. • In S p rite network operating system [27] [28] [96], transparent process mi gration is im plem ented by using virtual memory transfer. It supports two system calls to m igrate processes: pm ake and m ig. T he p m a ke system call uses process m igration to invoke as m any commands in parallel as there are idle nodes available, it is similar to th e m ake in UNIX. T he m ig takes ar gum ent as shell command and selects an idle node to use process m igration I ! to execute the command on th at node. Sprite allows to m igrate open files J where the virtual memory transfer is im plem ented to flush dirty pages to a : J I file server during the m igration and let the destination host to retrieve pages I from the file server as they are referenced. • Process m igration is im plem ented in Charlotte system [3] in the similar fash ion as th a t in LOCUS. The virtual memory transferring is done by sending the process’s entire memory image to the destination node. • The V operating system developed a t Standford University im plem ented load balancing by a pre-copying process m igration technique [18] [116]. The pre-copying reduces the freeze tim es substantially, which is very useful for th e system like V where process have real-tim e deadlines. • The lazy-copying technique is used to im plem ent process m igration at the| i A ccent operating system [130] at CMU. In this approach, process’s virtual mem ory pages are left on the original node until they are actually referenced on th e destination node. Thus, the cost of process m igration is reduced. 26 2.5 P erfo rm a n ce A n a ly sis M o d els A lthough m any models have been developed to describe the load balancing problem , theoretically speaking, they belong to a few formal models. I review | i these form al models which can be used to specify load balancing design policies J i and required criteria. A. Q ueueing M odel Queueing theory is fundam ental to probabilistic approach [19] [89] and [90]. Therefore, queueing models have been considered as the classical models for prob abilistic mechanism. In queueing modeling, each processor in a m ultiple processor system is mod- I eled by a M / M j l queue, which is an open, single-server system w ith a Poisson arrival process and exponentially distributed service tim es. For n processors, there are n independent queues. An n*(M /M /l) m odel can be employed to represent an n-node m ultiprocessor system. In general, the first-come first-serve (FC FS) queueing discipline is used. ! The system perform ance is usually indicated by the average job response tim e. { I T he objective of load balancing is to find an optim al probabilistic job assignm ent I | to reduce average response tim e by balancing the workload expressed as the num - j | ber of jobs in each queue. j In queueing modeling, m athem atical objects which represent system charac teristics can be form ulated. The attractive features of such modeling are: (l) it leads to a firm theoretical foundation for perform ance evaluation, (2) it is often easy to vary all system param eters and predicate the perform ance of system s be fore actually constructing them , and (3) it provides some equations which can reveal the dependencies among param eters. However, queueing m odeling has its' lim itations: the difficulty to conclude all the system features in the m odel, and1 the analytical intractability. i B . D istrib u ted C ontrol M odel ; • The CFA Model CFA model was proposed by [14] [15], where CFA stands for C om m unicat ing Finite A utom ata. It is a model designed for the purpose of analyzing distributed decision making algorithm s. In this model, each distributed algo rith m is classified into two descriptions: structure and sem antic. The struc- i tu re includes topology of interconnection, decision-making entities, state^ statistic, and static global knowledge at each node. T he semantics are spec-i ified by defining a finite autom ata for each node in the graph. CFA is a framework providing the quantitative evaluation and comparison' among different approaches for a given problem. It has the following bene ficial features: (1) the separation of internal and external state inform ation, (2) th e notion of phases to model discrete message passing and to controlj global inform ation exchange, (3) the flexibility to define non-homogeneousj transition function. The disadvantages of CFA m odel are: (1) it only allows' synchronous com m unication and same transition among processors, and (2) it does not show how local execution of load balancing will lead to global1 optim al. ♦ I I • A n Asynchronous Model I i TsitsikHs’s asynchronous model is a general m odel for distributed compu-l tation proposed by [120]. It allows different processors to update the samej com ponent of a vector, so th a t distributed asynchronous determ inistic andl stochastic com putation can be modeled. Therefore, the spectrum of the com putation is wide. This model is so far the m ost general model for dis-^ I trib u ted com putation. The advantages of this m odel are: (1) the ability to m odel asynchronous com putation, (2) th a t both determ inistic and stochas tic algorithm s can be represented, and (3) each processor has its own locali m em ory and different processors can update same com ponent of a vector. This model has problems w ith relative com plicated representation and a| 1 general assum ption which is not appropriate to the load balancing problem.J Chapter 3 Static Load Balancing Using Simulated Annealing In a m essage-passing m ulticom puter, the initial mapping from partitioned user program code onto various com puting nodes is crucial to the run-tim e perfor mance. This chapter presents a sim ulated annealing approach to achieving static load balancing at post compile tim e, in which the object code of the user pro gram and d ata set are linked and loaded. The purpose is to balance the initial com puting dem and in such a way which will minimize the potential internode com m unication dem and and will maximize th e degree of parallelism at run tim e. A combined com puting and com m unication cost function is defined in th e opti m ization process. Theoretical proofs are presented to verify the suboptim ality of the m apping m ethod. T he com plexity of the m ethod is analyzed as O (m ln m ), where m is th e num ber of partitioned user program code. Experim ental results are presented to show the effectiveness of the m apping m ethod and the perform ance of the sim ulated annealing process. This chapter begins by briefly describing the sim ulated annealing m ethod. It • then defines two static load balancing problems to which the sim ulated annealing m ethod is applied. The first is a general m apping problem where p a rtitio n e d , i program modules are allocated to distributed nodes in a m ulticom puter. T he second is to m ap m ultiple production rules in a parallel rule-based system onto m ultiple processor nodes. Im plem entation issues and theoretical proofs are then presented. The perform ance evaluation concludes this chapter. 30: 3.1 T h e S im u la ted A n n ea lin g M e th o d Sim ulated annealing is a heuristic m ethod for solving com binatorial optim iza tion problems. The idea comes originally from statistical m echanics, where a careful annealing is often performed to study th e properties of a m aterial near the ground state. First pointed out by K irkpatrick e t.a l, th e com binatorial op tim ization problems are analogous to the condensed m atter at low tem perature [68]. Thus the famous M a n to Caro procedure from statistical mechanics for the, annealing process [83], was used as an effective m ethod for determ ining global min-l I im a of com binatorial optim ization problems involving m any degrees of freedom. Sim ulated annealing is able to achieve a good approxim ation of NP-com plete prob-, lem and be considered as a generalization of a heuristic strategy called iterative im provem ent. | Sim ulated annealing is composed of four elements. T he first is a concrete definition of the system configuration, which formalizes the subject need to be! optim ized. T he second is a scalar cost function, which reduces the objectives of; i the optim ization process to a single num ber. The third is a generating mechanism[ \ \ to generate a trial of system rearrangem ents. T he fourth is a cooling schedule controlled by a temperature set. j The purpose of sim ulated annealing is to reach a global m inim um of the cost function by changing the system configuration under th e tem perature control. The process begins w ith an arbitrary configuration w ith an initial value of the cost function at an initial tem perature. A heuristic function is then applied] I which makes trial changes to the configuration and generates new values of the I cost function. The acceptance to a new configuration is based on the B o ltzm a n n distribution e~AE^T, where A E is the increase of the cost function E, and 2] is the temperature. At each tem perature, a certain num ber of trials need to be performed to reach an equilibrium. By decreasing the tem perature, the process < 31 Annealing Procedure 1. set an initial temperature T = To 2. set an initial arbitrary configuration X 3. while ( temperature not freeze ) do I I 3.1. while ( not reach an equilibrium ) do 3.1.1. randomly select a nearby configuration X ' = E{X) 3.1.2. calculate AE = E'(X') - E(X) j 3.1.3. if ( AE < 0 ) then ; accept X = X ’ else -A B accept X ’ with prob. = e t I 3.2. T — decrement (T ) 4. output configuration X Figure 3.1: A general sim ulated annealing procedure. is able to reach a global m inimum. This is due to the fact th a t e~AE^T will allow] occasional hill climbing to escape from th e trapping local m inim a. In theory,1 nearly global m inimum can be achieved as far as T is initially set to sufficientlyj high and gradually converges to zero. In practice, an annealing process can be term inated much earlier when the im provem ent on the cost function becomes very small. f Figure 3.1 gives a general description of the annealing process. The system con figuration X and the cost function E have to be defined to reflect the optim ization objectives. J 7 is a heuristic function which generates a new system configuration. The specification of F is designated as the generating m echanism . T he setting! of the initial tem perature To and the control over the two while-loops are called the cooling schedule. The conditions on the loop term inations are presented in! theoretical sense here, which have to be specified in im plem entation. In the fol- 32 ____i lowing sections, the system configuration and the cost function will be defined for two different mapping problems, and the generating m echanism and cooling schedule will be specified in the software package im plem ented. i l 3.2 M a p p in g o f P a rtitio n ed P ro g ra m M o d u les In order to use a m ulticom puter for parallel processing, an application program is usually partitioned into m ultiple m odules of code which can be executed con currently on distributed processor nodes. This section describes how to achieve J Figure 3.2: M apping of an example program , partitioned into m any m odules, onto ; the nodes of a hypercube m ulticom puter system . an efficient allocation of partitioned program m odules onto a message-passing m ulticom puter. 3.2.1 Form ulation of th e M apping P rob lem A program graph A machine graph 33 i The allocation of a partitioned program modules onto the nodes of a hyper cube m ulticom puter is illustrated in Fig.3.2. It is assum ed th a t the user program is already partitioned into m any program modules. A m odule can be a subroutine, a procedure, a subprogram , a service routine, or a well-defined segment of code th a t can be structured as blocks such as the loops, tasks and functions, etc. Techniques to partition or to restructure programs are beyond the scope of this problem. I concentrate on th e allocation of partitioned program m odules, instead of the way to partition them from th e original program . The interrelationship among program modules can be specified by a program graph after com pilation. T he program graph is an undirected graph with weights j associated w ith its nodes and edges. Each program m odule, M i for i = 1 ,2 ,..., m , j | corresponds to one node in the program graph. The module weight reflects th e ! code segment length and the size of the d ata set used in M i. Therefore, module weight is an approxim ation of the m em ory dem and of a program module. The: module weight also gives an approxim ated estim ate of the CPU cycles needed toi execute the program m odule. It is difficult to estim ate th e exact execution tim e at! i compile tim e due to unknown run-tim e conditions. Therefore, the module weight| is a very rough load index, which indicates more th e initial memory dem and rather than the com puting tim e which is actually required. The edge connections in the program graph correspond to com m unication paths among program modules. For simplification, I assume the edges bidirec tional between any pair of modules Mi and Mj] i.e. either M i can call M j or vice versa. The edge weight, e^-, reflects th e expected calling frequency between the two program modules. Note th at = eji due to the bidirectional assum ption. This calling frequency can be estim ated approxim ately as the num ber of cross references during compilation. This is especially convenient to be obtained in a C-program m ing environm ent using the U N IX /O S. In the experim ents, I simply use the UNIX com m and cxref to count the num ber of cross references between ! I 34,' program modules. From the output file of cxref , all the edge weights can be easily retrieved after compilation. It should be noted th at a program graph may contain cycles due to recursive function calls. Furtherm ore, m any of the program modules in a graph m ay be identical; i.e. the same code m ay be duplicated and distributed to different pro cessor nodes executing on different d ata sets. The data sets needed for each node are fully distributed w ithout any d ata sharing among the nodes. The exchange' t or m igration of program modules and their data sets is allowed during the static! allocation process. My prim ary concern is to produce an efficient allocation ofj program modules among m ultiple processor nodes. T he actual program code mi-j gration and d ata movement at run tim e are beyond th e control of a static loadj balancing scheme, which will be presented in chapter 4 and 5. A m ulticom puter is modeled as a machine graph in Fig.3.2. The distance, d ij, between any two nodes N i and N j in the graph is defined as the num ber of hops between them . For the hypercube w ith 8 nodes, the distances are either 0, 1, 2, or 3. It is assumed da = 0 and d ^ = dji for all i and j . The immediate neighbors of a node Ni are those w ith distance 1 from N i. I also assum e bidirectional links betw een all pairs of processor nodes in the m ulticom puter being used. In Fig.3.2, all module weights are w ritten w ithin the modules and edge weightsl I I are w ritten along the relevant edges in the program graph. The boldface num erals' inside each processor node identify the resident module num bers. Figure 3.2 gives only an arbitrary initial allocation. The problem of m apping the program modules into the processor nodes is NP-complete [43], if an optim al allocation is required. In m y approach, an approxim ation to this NP-com plete problem is sought by using the sim ulated annealing m ethod. 35; 3.2.2 A N ew C ost Function In the static program allocation, the system configuration is defined as an i allocation X which allocates m partitioned modules of a given program graph P j into the n nodes of a machine graph G : ; X = {k | 0 < i < n } (3.1) where local load Z ; assigned to processor node is calculated as li — Ylj wj f°r program modules Mj resident in node N {. Note th a t, as program modules being exchanged or m igrated during th e allocation process, the nearly optim al allocation X should be converged in the process. I For a given allocation X , an imbalance vector A indicates the degree of load ■ j im balance over n processor nodes. ' ■ I * A = (£1: S2, - - • 5 ^n) (3-2), where th e load offset Si for node Ni is defined as Si = | h — I |; and I = S lL i h /a is1 i th e load average among n processor nodes for a given allocation X . The system\ \ load imbalance for a given allocation is defined by: ! e m = E 5 ‘ ( 3 -3 >, i=1 I define a communication m atrix C = (Cij), as a n x n m atrix; where where Cij is the communication cost betw een nodes Ni and N j, i.e. d j = dij (3-4) X y I i where dtj is the distance betw een JV ) and N j, and exy is the edge weight betw een all program modules M x in N { and M y in N j. Using the com m unication m atrix C , th e total communication cost for a given allocation is expressed as: ( \ E E ( 3 -5 )t Z i=lj=n j 36 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _J The cost function E , which is the sum of the total load im balance and all the com m unication costs for a given allocation L , is defined as: n 1 n n e = E imh + E C om = Y1 U t- - ^ i + ~ c^ i = i ^ i = i j = i (3.6) I n itia l load d istr ib u tio n : L = { l 0 , I 3 , 1 4 , 1 $ , Z 0, l 7 } = {8 0 ,1 4 0 ,3 8 ,9 2 ,1 0 2 ,1 1 4 ,9 5 ,9 1 } L oad a v era g e: 1 = ( 8 0 + 1 4 0 + 3 8 + 9 2 + 1 0 2 + 1 1 4 + 9 5 + 9 1 )/8 = 93.75, L oad offset: A = ( S o , S i , S 2 , S 3 , 6 4 , 6 S , S 6 , 6 7 ) = (13.75,46.25,55.75,1.75, 6 .25,20.25,1.25,2.75) C o m m u n ic a tio n m a trix : 10 23, 24. 25. 26 11. 12> 13, 14. . > 5 y f t6 , > 17, 18, . 1 9 / '27. 28. 29. 30, V 31 j f 2 0 , N 21. 22 0 45 1 14 0 0 0 6 45 0 10 17 30 0 0 0 1 10 0 1 20 0 0 0 14 17 1 0 39 0 4 27 0 30 20 39 0 23 8 0 0 0 0 0 23 0 0 26 0 0 0 4 8 0 0 45 6 0 0 27 0 26 45 0 L oad im b a la n ce: E i r n b = 13.75 + 46.25 + 55.75 + 1.75+ 6.25 + 2 0 .2 5 + 1 .2 5 + 2.75 = 148 C o m m u m ic a tio n cost: E c o m = (66 + 102 + 3 2 + 92 +120 + 49 + 57 + 104)/2 = 6 3 2 /2 = 316 C o s t fu n c t io n (I n itia l v a lu e ): E = E i m b + E c o m . — 464 Figure 3.3: T he calculation of the cost function for the initial allocation in Fig.3.2 The term inologies defined above are illustrated in Fig.3.3 for the sam ple alio- j t cation shown in Fig.3.2. In Fig.3.3, the boldface num erals inside the hypercube ! ! nodes are th e program m odule num bers, i.e. i for M i. The sequence of evaluating ; the cost function E is illustrated. Based on the system configuration X and cost ! f function E defined for th e m apping problem, the sim ulated annealing process then can be applied. T he experim ental results will be presented in section 3.6. 371 3.3 M a p p in g o f M u ltip le P r o d u c tio n R u les Production System s are often used in artificial intelligence applications [62] [115]. For a parallel production system im plem ented on a message-passing m ulti- 1 I com puter, m ultiple production rules need to be m apped to distributed processor , nodes for parallel rule firing. The sim ulated annealing m ethod is applied to achieve ! I th e suboptim al partition and allocation for this m apping problem . The goal is to i maximize the degree of parallelism by distributing workload equally and minimiz- , ing com m unication cost in message passing among m ultiple nodes. This section i i form ulates th e m apping problem based on the d ata dependency analysis and de- ; i fine the cost function which needs to be minimized by the sim ulated annealing ! process. I i I 3.3.1 A P arallel P rod u ction System ; i A production system consists of a set of production rules, a global knowledge! base, and an inference engine. The inference cycle of a conventional production j system is divided into three phases: rule matching, conflict-resolution, and rule j execution, w ith the rule m atching consuming 50% of the to tal tim e. ; Parallelism of a production system can be achieved by: 1) m ultiple rule match-! I ing, and 2) m ultiple rule firing [53] [54]. A parallel production system w ith m ul tiple rule firing on a message-passing m ulticom puter is shown in Fig.3.4 [71]. It has inference cycle consisting of four phases: fire-selected-rules, inter-rule- communication, matching and select-multiple-rule. The first three phases are ex ecuted concurrently by m ultiple processor nodes, and the last phase is executed sequentially at the host machine. The operations at th e processor nodes and atj th e host are overlapped, so th a t parallelism can be exploited. The to tal execution1 tim e of the production system can be significantly reduced by parallel m atching j and firing. Thus m ultiple rules need to be allocated to m ultiple processor nodes to facilitate parallel m atching and firing. Processor nodes with distributed memory Host C om puter Fire Fire Comm Fire Comm Comm M atch M atch M atch Select Rules Fire: fire-selected-rule Comm: inter-rule-com m unication N i: com puter node Figure 3.4: The execution model of a parallel production system on a multicom-j puter. j I T he assignm ent of production rules to processor nodes has a great im pact on th e overall perform ance. For example, rules cannot be fired in parallel if they are assigned to the same com puter node. Perform ance m ay be degraded by interdependence among rules. If two rules have to com m unicate w ith each other, assigning them to different nodes will take longer to pass messages. It isj desirable, however, to assign an equal am ount of the initial workload to all nodes .j In general, the num ber of rules in a production system are much greater th an thej num ber of processor nodes in a m ulticom puter. To find an optim al allocation of i the production rules is an NP-com plete problem [43]. j t i I 1 I i t t 39 3.3.2 M apping Based on R ule D ep en d en cy A nalysis The knowledge base of a production system is formed with a set of knowledge elements, denoted as K — {&{}, where each knowledge element ki is represented by a predicate of facts. A production rule in the rule base has the following form: (P (c+A c+A ...) - (cjf A c2 A ...) -----y add(af, a f ,...) remove(a,i , a j ,...) ) j where q is a condition element and a{ is an action element. All the conditions J ! are on the lefthand side and actions are on the righthand side of the arrow. If' l all conditions on the lefthand side are m et, it is eligible to take actions on th e 1 righthand side. Actions implies add and removal of action elements into the: knowledge base. The positive condition elem ent c f is m et if there exists some j | I such kj = c f , and th e negative condition elem ent c~ is m et, if there is no any j\ such th at kj = c j . The positive action elem ent a f will be newly added into the 1 knowledge base, and the negative action elem ent ai will be rem oved from the, knowledge base. The event to add or to remove an action element is called as an action. I denote the sets C + = {c*}, C + = {c J }, A + = {u^}, and A~ = { a ^ }. Let Pi and P 2 be two productions in the rule base, and P 2 be the one to be fired. Based on the definitions defined by Moldovan and his associates [26] [72], several d ata dependencies between Pi and P 2 are defined as follows: • Pi is input dependent on P 2, if ( c 1 + n A 2 -)u (C 'a ~ n A + ) ^ 0 This relationship means th at the condition part of P i will not be satisfied after P 2 is fired. I • Pi is output dependent on P 2, if (A1 + n A 2 - ) u ( ^ n A + ) ^ 0 This m eans th a t the elements added or removed from the knowledge base by firing P\ will be removed or added if P 2 is fired. • Pi is input-output dependent on P 2, if (C1 + n ^ + ) u ( C ' r n A 2 - ) # 0 This m eans th a t the conditions of P x m ay be satisfied, after P 2 is fired. ' | Two rules Pi and P 2 are compatible (or co m m u ta tive), if they are neither input j dependent nor o u tp u t dependent to each other. If rules P x and P2 are com patible, then the firing of P x and P 2 in any order or in parallel produces the same result. The rule base in a production system w ith m rules, is formed w ith th e set R = { P i I 1 < i < m }. The parallelism m atrix, P = for 1 < i , j < m , is defined as: 0 if rule Pi and rule Pj are com patible Pij (3.7) 1 otherw ise I Pij — 0 implies th a t P, and P j can be fired in parallel. A communication matrix, C = {c^-} for 1 < i , j < m , is defined by: 1 if rule Pi is input dependent or input-output dependent on P j. (3-8) 0 otherw ise If Pi is in p u t dependent on P j, firing P j m ay destroy the conditions needed to firei Pi. On the other hand, if Pi and Pj are input-output dependent, then firing any: I one will bring elem ents into knowledge base which will invoke the other to fire.' i The condition Cjj = 1 implies th at a message m ust be sent from Pi to P j, once Pi j is fired. 1 i I I 41' ___________________________________________________________________________________ Rule Base R (Pi (A A B A C) -(DAE) i __ ^ j add(jEJ) ' remove (-4)) 1 C+ = {A, B,C} ' C f ={D,E} i At = {E} 1 AZ = {A} \ I I Pj is input dependent on P2: | (Cf h a z ) u (Cz nA+) = {c} : P i is output dependent on P2: : ( i + n i - ) u ( i - n i + ) = {£} Pi is input-output dependent on P2: (Cl n i + ) u ( C f n 4 ~ ) = {b ,e } {Pl,P2} (P2 c l Cz A+ a 2 (C A D ) -(B) add (B) remove(C, E)) ’ ■ {C,D} m m {C ,E } P 2 is not input dependent on P x: (Cl n 4 j ) u (Cz n At) = 0 P2 is output dependent on P\: (^■2 n 4~) u (4~ n A l ) = {E} P2 is not input-output dependent on Pj: (Cl nA+)u(C2 - n 4~) = 0 p = 0 1 c = 0 1 1 0 0 0 I Figure 3.5: The rule dependencies, parallelism, and com m unication m atrices of a j two-rule production system . \ 42 Figure 3.5 shows a sample production system used to explain the idea of rule dependency and the generation of the P and C m atrices. This exam ple shows th at the parallelism m atrix P is sym m etric, while the com m unication m atrix C \ i is asymm etric. Each production rule Pi is associated w ith a firing frequency w which shows how frequently P, is fired. The value of Wi is estim ated at compile tim e. A firingj frequency vector is defined as: i I W = fw i | 1 < i < m } (3.9)1 A multicomputer is modeled as M = (N , H ), where N — {Nk | 1 < k < n} is| I the set of processor nodes; and H is th e set of all physical links am ong them . Aj distance matrix is defined as: j D = {dij | 1 < i , j < n} (3.10) where the distance dij is the num ber of links between node Ni and Nj. It isj assumed da — 0 and d^ = dji for all i and j . The immediate neighbors of a node Nk are those nodes w ith distance 1 from N k . I assume bidirectional links among all processor nodes. j Given a production system w ith R = {Pi | 1 < i < m }, which is characterized' by P, C and W m atrices, and a m ulticom puter M — (N ,H ) w ith N = {Nk |! 1 < k < n} and the D m atrix, the m apping of the production system is achieved^ by partitioning m rules into n subsets of rules and allocating each subset to aj processor node. T he m apping problem is essentially to generate an allocation! m atrix X = {siy}, where Xij for 1 < i < m and 1 < j < n as defined below: j ( ! I 1 rule P{ is assigned to node N j i Xij — \ (3.11^ I 0 otherwise Each rule is assigned only to a single node. T hat is Y fj-i x ij — 1- Each processor node Nj is assigned up to Cj rules: J27Li xij ^ Cji where Cj is lim ited by th e local m em ory capacity in Nj. 43 i Four m ajor goals of the m apping problem are identified: 1. Maximize parallelism by assigning com patible rules to different processor nodes. i 2. Balance the workload among different processor nodes in a m ulticom puter.1 j 3. Minimize the to ta l com m unication cost by assigning rules to the sam e node] if they m ust com m unicate w ith each other. 4. P artitio n and allocation scheme m ust be efficient in m apping very large production systems. 3.3.3 C ost Function for M apping P rod u ction System s j j In this m apping problem , the system configuration is characterized by a con figuration matrix, X , as defined in (3.11). I i The cost function E to be used in th e optim ization process is defined as: j i I E = EpaT + E imb + E com (3.12) I t i where Epar is th e parallelism loss, Ejmf, is the load imbalance, and E corn is the internode communication defined below. The parallelism loss Epar is attrib u ted to overhead by sequential execution of i com patible rules, which are m istakenly allocated to the same com puter node. If Pij = 0? and Pi and P j are placed in the same node T V * ., then these two rulesj m ust be fired sequentially. T he production match time is denoted by tm, which t is the average m atching tim e per rule in the sequential execution of a production; system. Thus, Epar = Yfk=i Epar, where Epar counts the additional cost to detect^ com patible rules assigned to the same node, and defined as: : I T 7 T T T l | E p a r = 5 1 P i J ' * W i ' (3-13) i = l j j z i j 44' where wt is the firing frequency. For each pair of rules Pi and Pj, = 1 means th at they are com patible, and x ik ■ xik = 1 indicates th at they are allocated to the same node N k. Let the system load L = {lk | 1 < k < n}, where lk is the average tim e to execute rules allocated at node N k. lk = x ik ■ where x ik indicates _ y™ i w hether rule Pi is allocated at Nk. Let I — f c be the load average among n com puter nodes for a given L , the load imbalance Eimbi is defined as: = = < 3 -14) k— l k=l The to tal communication cost among com puter nodes is E corn = Ylk=i ^comi where E^om is defined as th e cost for node N k to pass messages to other nodes through the interconnection network. \ T Y i n m j com = y ' y y ^ ■ ^kl * tcom (3.15) I i=11= 1 j=l : where dki is the distance between node N k and Ni, and tcom is the unit routing time j to pass a message from one node to one or m ore of its im m ediate neighbors. j is the sum of th e cost of each rule Pi allocated at N k ( Xik = 1 ) to pass messages j to rules allocated at other nodes JV j’s. For rule Pj allocated at Ni (Xji = 1), if Pij has to com m unicate w ith Pj ( C y - = 1 ) then there is a cost to pass a message from node N k to Ni. The cost depends on the firing frequency Wi and the distance dk{\ between them . If two rules are allocated to th e same node N k, then th e cost ofl message passing betw een them is 0 because dkk = 0. By these definitions, th e cost function E can also be w ritten as: e = £ ( £ * „ + e L , + E i J k= 1 It is actually characterized by 5 m atrices X, P, C, W and D. For simplicity, I denote the expression as E(-) in the sequel. Figure 3.6 presents the P,C, and W m atrices of a 26-rule Tourney production system , the D m atrix of a 8-node hypercube m ulticom puter, and th e m apping 45 _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ J Q 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0 1 1 1 1 0 1 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 1 1 1 1 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0 I 1 1 1 1 1 0 1 1 0 1 1 0 0 0 0 1 1 1 0 0 0 1 1 1 1 1 0 1 0 1 0 0 0 0 1 1 1 0 0 0 1 1 1 1 1 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 1 1 0 0 0 0 0 2 0 1 0 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 1 1 1 0 0 0 0 1 0 0 0 0 0 0 1 1 0 D 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 t 0 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 1 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 I 1 0 1 1 0 1 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 1 1 1 1 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 1 1 0 0 1 0 0 1 0 0 0 1 0 0 1 0 0 1 0 1 1 1 1 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 1 1 1 1 0 0 0 1 0 0 1 0 0 X 0 0 0 X 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 4 4 4 8 4 5 2 9 3 5 5 7 5 20 5 5 4 4 5 2 5 2 5 2 9 9 9 5 5 5 5 5 5 0 1 1 2 1 2 2 3 1 0 2 1 2 1 3 2 1 2 0 1 2 3 1 2 2 1 1 0 3 2 2 1 1 2 2 3 0 1 1 2 2 1 3 2 1 0 2 1 2 3 1 2 1 2 0 X 3 2 2 I 2 X I 0 5 . 14. 1 5 . 1 6 1 2. 1 3 10 19. 17. 2 2. 2 5 J 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 X 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 X 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 Figure 3.6: T he m atrices of th e Tourney production system and a m apping cor responding to th e allocation m atrix X . 46 configuration m atrix X. The corresponding m apping graph is also shown, where i th e boldface num erals inside the processor node are the allocated rules. T he value of the cost function for this mapping is calculated as E — Epar + Eimb + E com — 805.5 + 347.4 + 4837 = 5989.9. The m apping problem is form ulated as follows: I i t 4 Given: A production system w ith R = {Pi | 1 < i < m } and th e associated P, C and W m atrices, A m ulticom puter system w ith N — {Nk | 1 < k < n} and the D| i m atrix as defined in Eq.(3.10). ! To find: A allocation m atrix X = { a | 1 < i < m , 1 < j < n}, such th a t the j cost function E(-) is minimized under the constraints 1 = lj an<i x ij ^ @ 3' ; j In section 3.6, the results of using sim ulated annealing m ethod to solve this! I m apping problem will be presented, and the perform ance will be analyzed. I 3.4 Im p le m e n ta tio n o f th e S im u la ted A n n e a lin g M ethodj After defining th e system configuration and the cost function for each m apping' problem stated above, the sim ulated annealing m ethod can then be applied to find! a suboptim al solution associated w ith the minimized cost function. This section presents th e im plem entation issues addressed to the sim ulated annealing m ethod used in solving the two m apping problems. The generating m echanism and cooling , j schedule are developed to ensure the quality and efficiency. The software package SIMAL, which has been developed, is described. 47i 3.4.1 G e n e r a tin g M e c h a n ism Since two kinds of the m apping problems were addressed to, a com putationalj elem ent a{ is used as a generic term for program code need to be allocated to| a processor node, i.e. ai can be either a partitioned program m odule Mi or aj production rule Pi. For a defined system configuration X , a heuristic functionl 1 is defined as T : X — » X ' where X is a given configuration and X ' is a new; j configuration. T reassigns an element to generate a new system configuration.! 1 i Suppose th at a; is allocated to node Nk in the old configuration X , and th e new configuration is obtained by applying T on X and az. Three heuristic swap functions are defined below: • L o cal e x ch an g e : Random ly select an elem ent aj from a neighboring node Nh of node Nk and exchange a{ w ith aj j i • G lo b a l e x c h a n g e : Random ly select a node Ni other than node Nk. Ex change ai w ith a random ly selected elem ent aj allocated to node Ni. • L o c a l m ig ra tio n : Compare th e load Z j w ith the load Ih of a randomlyj selected neighboring node Nh of Ni, if li < lh, m igrate a t - to Nh- ■ I In order to show how these swap functions work, for the initial allocation of partitioned modules in Fig.3.2, Fig.3.7 gives th e results of one application of th e three swap functions, respectively. In Fig.3.7 (a), th e local exchange m ethod is used. M odule Mo resides in node No in is exchanged w ith m odule Mg in node N 2- The global exchange m ethod is applied in Fig.3.7 (b), where m odule Af15 is random ly selected. Since Af1 5 is not in No, M 0 and M x5 are exchanged. In Fig.3.7 (c), the local m igration m ethod is applied. The im m ediate neighboring node 1 V 2 is random ly selected, and it is found th a t Z 2 = 38 is less than Iq = 80. Therefore, Mo j is m igrated to 1 V 2. Note th a t the value of the cost function changes w ith different am ounts for different swap functions used. | E = 475 E = 496 E = 488 11. 12 13, 14, 20, 20, 17, 18, 27, 28, 29, 30, 27, 28, 29, 30, 27, 28, 29, 30, 23, 24, .25, 26 23, 24, ,25, 26 '23, 24, v 25, 26 f 16, > 17, 18, f 16, > 17, 18, 11, 12 > 13, 14, . 15 ) f 20, > 21, 22 11, 12' 13, 14, (a) After local exchange (b) After global exchange ^ ^ fter local migration between M 0 and M9 between M0 and M 1 5 0f from C0 to C2 Figure 3.7: New allocations generated after applying three different swap functions; I to the allocation in Fig.3.2. 3.4.2 C ooling Schedule The cooling schedule is controlled by a decreasing sequence of tem peraturesj i T , such th at: ; T = {T i | 0 < i , T i > T i+J (3.16)| I model the sim ulated annealing process as a set of homogeneous M arkov chains' I [74], where the annealing process at tem perature T, corresponds to a single M arkov chain. T he convergence to the global optim ality is, theoretically, constrained by i I infinite transitions at each T{, and lim ^oo T, = 0. In practice, however, the' num ber of trials at each tem perature T[ and the size of T m ust be finite. The specification of the followings forms the cooling schedule used in th e m ap ping m ethod: I (1). Initial temperature Setting: I choose an initial tem perature To, which is sufficiently high so th at virtually all transitions will be accepted. T h at is] 49 - A E e T o ~ 1 for all possible A E . This criterion can be approxim ated by defin ing an initial acceptance ratio xo as the num ber of accepted configurations divided by the num ber of trials, such as choosing Xo — 0.9. I perform a trial over all elements with th e acceptance probability 1 and calculate the m axim um difference in cost function m a x (A E )* . Then To is chosen as rp _ maxiAE'}* 10 ~ Kxo-1) ' (2). Stop Criterion Used: If k consecutive M arkov chains have th e sam e value of E and result in identical configurations, and the difference betw een this cost value and th at of the last (k — l)th chain is less than a small termination fraction e = 10-fc of the m a x(A E )* , then th e annealing process is term inated and the final configuration is accepted. The constant k , is determ ined by the user. (3). Decrement Rule Used: An exponential decrem ent rule is used to m ake small changes in tem perature: T^+1 = a ■ T{, 0 < i and 0 < a < 1, where a is the cooling constant. My experim ents show th a t this simple rule requires m uch less overhead tim e, com pared to th a t of the standard decrem ent rule using = u srfe i 1 461158!- (4). Length of Markov Chain: To simplify th e problem , th e num ber m of com putational elem ents in a user program determ ines the length of the M arkov chain modeled. This length determ ines th e num ber trials perform ed at each tem perature. The cooling schedule defined above, actually controls the quality and the effi ciency of the annealing process. In section 3.6, the effects of cooling constants to the perform ance will be discussed. 50 I n p u t System D ata Control Param Arch. Spec O u tp u t Node Table Config. X E-Curve U pdate Initiali zation Graphics D atabase User Interface G enerate H ardw are Sim ulator Cooling U tility Figure 3.8: Com ponents of SIMAL for parallel m apping of production systems 3.4.3 SIM AL: A M apping Tool The SIMAL is a software package which im plem ents the sim ulated annealing m ethod to achieve static load balancing in m apping user applications onto a mul ticom puter. SIMAL is im plem ented on a SUN-3 w orkstation w ith SunOS UNIX 4.0. The m ajor com ponents in SIMAL is shown in Fig.3.8. The input to SIMAL includes the system data, the control parameters and the target m achine architecture specification. T he system d ata characterize the user application program , which can be a program graph or a set of m atrices to specify a parallel production system. The control param eters include th e cooling con stants & , a and Xo- The architecture specification is determ ined by the hardw are topology of the m ulticom puter, nam ely th e m em ory size and cost of the intern ode com m unication via message passing. SIMAL is able to m ap an application program onto a m ulticom puter using different interconnection networks such as hypercube, ring, mesh, etc. 51 The system configuration is represented by a node-table, where each entry indicates the state of a processor node. There is also a element-table where the state of each elem ent is stored. In order to reduce the overhead tim e of the annealing process, the calculation of each new cost function is simplified by recom puting only the com ponents effected by the m ost recent changes in the configuration. The ou tp u t of SIMAL is th e node-table with a nearly optim al allocation of th e application program . The SIMAL generates a curve showing the variation of the cost function during the entire annealing process. The sim ulated annealing algorithm being im plem ented in SIMAL is shown in Fig.3.9, where procedure or function calls are w ritten in Italics. There are two loops to im plem ent the cooling schedule: th e outer while loop (i-loop) controls the tem perature set T and the inner fo r loop (j-loop) controls the length of the M arkov chain. The term ination of while loop is controlled by a flag stop which indicates w hether the term ination criterion is satisfied. The fo r loop is bounded by the num ber m of com putational elements in an application program , which can be a num ber of program modules or production rules. The random(Q,l) is a random num ber generator which draws a uniform distribution on (0,1). T he acceptance probability, the B oltzm ann function is im plem ented by — A. -E com paring e t w ith the random num ber generated betw een 0 and 1. 52 sim ulated_annealing() { T = 5e<_T0(); X = set S u it-X {)\ E = calculate-E(X); stop = NO; i = 0; while ( ! stop ) { for ( j = 0; j < m ; j + + ) { X ' = F (X )i E' = calculate -E (X '); A E = E ' - E ; if ( e- ^ > random^0,1) ) X = accept(X'); /* set init tem perature */ j* set init configuration * / /* calculate cost value */ /* initialize the stop flag */ /* set out-loop counter * / /* outer-loop * / /* inner-loop * / /* generate new config. * / /* calculate new E * / /* calculate E increase * / /* acceptance prob. */ /* accept new config. */ } if ( ! (stop = stopjok{i) ) ) { /* check stop criterion * j i + + ; T = decrem ent(T ); } /* increm ent counter */ /* decrease tem perature */ } output(X); /* outp u t solution * / } Figure 3.9: T he sim ulated annealing procedure im plem ented in SIMAL. 53 3.5 M arkov C h ain A n a ly sis on C o m p le x ity This section presents the m athem atical model of th e sim ulated annealing m ethod used. The goal is to prove th a t the sim ulated annealing m ethod can j achieve a global optim ality. The suboptim ality of the proposed m apping m ethod i I is revealed, and com putational com plexity of the m apping m ethod is analyzed, j 3.5.1 A M odel of H om ogeneous M arkov Chains A set of homogeneous M arkov chains is used to m odel the proposed annealing m ethod. D EFIN ITIO N 1: The system configuration set S is a set of all possible system configurations which m aps m com putational elements onto a m ulticom puter w ith ! n nodes. The sim ulated annealing m ethod is to minimize the cost function E over' this finite set S. The m axim um cardinality of S, is |5 | = n m. J l D EFIN ITIO N 2: Let s G S be an arb itrary configuration, the derived set N ( s ) ofj i s is defined as: j N (s) = -{Vis' s, s' = E {s)} I Vs € S , N ( s ) C 5 . Each s' G N ( s ) is called a neighbor of s. ; * D EFIN ITIO N 3: For s, s' G 5 , th e generating probability g(s, s') is the probability* I th a t s' is generated from s, such th a t ! f wTT if s> £ A (s) j g{s,s,) = \ N{s) (3.17) ^ 0 otherwise j g(s,s') > 0 iff s' G -N(s). G = {<?(s, s ') |s ,s ' G 5} is th e generating probability I matrix. J I D EFIN ITIO N 4: For s, s' G S , let s' be generated from s, th e acceptance probability ! 54 th a t s' is accepted by the annealing process is defined as: - < E { s ' ) - E ( s ) ) „ a(s,s') = m in( l , e T ) (3.18) A = { a (s ,s ') | s ,s ' G 5} is the acceptance matrix. j ! The annealing process continuously attem p ts to transform the current config uration into its neighbors. This kind of m echanism can be described by the mean of Markov chain: a sequence of trials, where the outcom e of each trial depends t only on the outcom e of the previous one [36]. Let X = (X (0 ),X (1 ),. . . ,X (fc ),...) be a sequence of configurations generated j by the annealing process. The process X = (X ( k ) ,k > 0) is described by a set of conditional probabilities. For each pair of outcom es, (s,s') € S , the conditional probability th a t X ( k ) = s' is generated from X ( k — 1) = s is denoted as: P {X {k ) = s' \ X ( k - 1 ) = s} (3.19) i T H E O R E M 1: T he sequence of configurations at a constant tem perature T ,, X t = (X r { k ) ,k > 0) is a homogeneous M arkov chain. ■ P ro o f: Basing on the sim ulated annealing algorithm , for Vs, s' € S, J I i P { X { k ) = s' | X ( k - 1 ) = s} = p (s ,s ') • a(s, s') (3.20) If the conditional probability P { X ( k ) = s' j X ( k — 1) = s} does not depend on k , the chain X = ( X ( k ) ,k > 0) is hom ogeneous, otherw ise it is heterogeneous [36].J i By (3.20), th e conditional probability a t a fixed tem perature T is independent of k. So, each X = (X ( k ) ,k > 0) at a constant tem perature T is a homogeneous M arkov chain. It is denoted as X t = (X r ( k ) ,k > 0). Q.E.D i I Since th e annealing process a t each tem perature T is a homogeneous Markov! t chain, the entire annealing process controlled by th e tem perature set therefore becomes a set of homogeneous Markov chains. 55 J 3 .5 .2 P r o o f o f th e C o n v e rg e n c e to a S u b o p tim a l S o lu tio n » I Based on the M arkov chain m odel, I prove th a t the sim ulated annealing can achieve a global optim al on the theoretical basis. I then show th at a suboptim al solution can be achieved in my im plem entation. D EFIN ITIO N 5: For all s, s' 6 S and a constant T , the transition probability from i j s to s' is defined as: * ! qT(s,s') ■ ar(s, s') s s' y T \ i ) T \ i f ( 3 2 1 ) 1 - J2s"^s’ 9t (s , 5") • aT(s,s") s = s' PT = {P t (s , s') \ S,S* e S } is the transition matrix. j From (3.21), Vs € 5 , E }'e s ^ r ( s )s0 = 1- Therefore, Pt is a stochastic m atrix 1 and X t is a stochastic process. Each Markov chain is im plem ented by the fo r ' loop in Fig.3.9, and the tem perature is decrem ented between two subsequent 1 M arkov chains. The sim ulated annealing m ethod attem pts to find an optim al 1 1 configuration w ith the m inim um value of th e cost function E . i D EFIN ITIO N 6: Let E (S ) = {E (s) | s € S}, the global minimum cost of E ( S ) is | defined as E* = m in seS(E (S)) D EFIN ITIO N 7: The set of optimal configurations S* is defined as 5* = {s e S | E(s) e E .} T H E O R E M 2: Each M arkov chain X t generated by the sim ulated annealing process is irreducible and aperiodic. P ro o f: X t is irreducible, if Vs*, Sj 6 S, and I > 1, there exists a sequence of j configurations (st -, st+1, . . . , s-,=i+i), such th a t g(si,Si+k) > 0 for 0 < k < I. And 1 X t is aperiodic, if for all T > 0, there exist s,s' € 5 , such th a t I i aT(s i s') < 0 and s ' ) > 0 (3.22) i i 5 6 By the heuristic swap functions defined in th e generating m echanism , the probability of a new configuration s' derived from an arbitrary configuration s is T h a t is, VNk, Ni G N , 3i > 1, such th at there is always a sequence of hops (d fe(jfe+i), d(k+i)(fe+2), • ■ •, d(k+i),i) from N k to N t. By reachability and the generation m echanism , an elem ent at any node can be replaced by any other elem ent at the other nodes. Therefore, for all s, s' G S', s' can be generated from s through a num ber of transitions. Thus, irreducibility is granted. In my im plem entation, the acceptance probability is determ ined by com paring e r~ * ^ w ith a random num ber generated uniformly between 0 and 1. For all T > 0 and s,s' G S , let s G S* and s' £ S then by (3.18) and the generating m echanism , a j(5 ,5 f) < 0 and gr{s i s') are always true. Therefore the aperiodicity defined in (3.22) is satisfied. to achieve th e global optim ality. T H E O R E M 3: The set of homogeneous M arkov chains generated by th e anneal- chain has an infinite length, and th e tem perature converges to zero. P ro o f: Let the uniform distribution of the set 5* be a |5*|-vectoT it and defined as As proved in [81], the essence to the convergence is achieved by two conditions: a uniform distribution over N (s). I assum e th at a m ulticom puter is reachable: Q.E.D. Theorem 2 provides the conditions on Gt and A t such th a t there exists a stationary distribution on each Markov chain X y. This is th e sufficient condition ing process convergence to the global optim al configuration set S , , if each M arkov if s G 5* otherwise lim ( lim P { X T(k) = s}) = t t ( s ) 1 —*0 k — *o o lim (lim P { X T(k) = 5 ,}) = 1 r-*0 k— too 57 These two conditions imply th a t 1) for each M arkov chain X x , there exists a stationary distribution qx, and 2) for decreasing T , qx converges to 7r. The condition on Gx &nd A x for the existence of the stationary distribution of each M arkov chain X x is the irreducibility and aperiodicity as provided by the anneal ing process defined. If the stationary distribution qx exists, then the convergence to global optim al can be achieved. Q .E.D . Theorem 3 proves th a t the sim ulated annealing can achieve global optim al ity. In practice, however, the length of each M arkov chain m ust be finite, and th e process can be stopped before the tem p eratu re approaches zero. Thus, any im plem entation of the m ethod can only achieve a suboptim al solution. D EFIN ITIO N 8: Let E * = m a x seS(E ( S )), and m a x (A E ) = E* - E „ for 0 < e < < 1, the suboptimal configuration set is defined as: = {s 6 S | E(s) — E t < e • m a x (A E )J , (3-23) D E FIN ITIO N 9: The annealing algorithm is said in quasi-equilibrium at a tem p eratu re T , if ax{s,s') is close to qx where s' is the configuration generated from the m th (last) trial at T. C O R O L L A R Y 3: The sim ulated annealing algorithm im plem ented by the ho mogeneous M arkov chain m odel converges to the suboptim al set S[ when the M arkov chain length m and the cooling constant a provide the quasi-equilibrium at each tem perature T . P ro o f: The suboptim ality and quasi-equilibrium are provided by th e cooling policy. First, m ax(A E )* obtained in initial tem perature setting approxim ates the m a x (A E ), which is used to determ ine th e stop of process. W hen k consecutive M arkov chains converges to an identical configurations, ax(s,s') is close to qx for s' generated from the m th trial. Q.E.D. To achieve the quasi-equilibrium at each T , the cooling constant a needs to 58 compromise with the length of M arkov chain m. The smaller the m , the large the a (slower cooling) should be applied, in order to reduce the difference between qx and qx+i- W hen m is large, smaller a (faster cooling) can be used since the M arkov chain is long enough to m ake a r (« ,s ') approaching qx- 3 .5 .3 C o m p le x ity A n a ly s is The com plexity of the proposed m apping m ethod is analyzed in th e following. T H E O R E M 4: The com plexity of the proposed m apping m ethod is ln(e • ln (xo1)) ~ n ) m In a (3.24) where e is th e term ination fraction, Xo is th e initial acceptance ratio, m is the num ber of elem ents in an application program , n is the num ber of nodes in a m ulticom puter, and a is th e cooling constant. P ro o f: In the sim ulated annealing process, th e num ber of iterations in the while- loop i is bounded by: Tf = • T0 (3-25) where a is th e cooling constant, T o and T f are th e initial and final tem perature respectively. In th e im plem entation, the initial tem perature is defined as m ax(A E )* T -o — , / _ n (3.26) J - n (Xo ) -O .B W hen th e stop criterion is satisfied, the acceptance probability e T f becomes very — max(AE) small, and A E — e • m ax(A E )*. Based on (3.21), e T f < j|j, thus e • m ax(A E )* T/ = — M S |— (3'27) S ubstitute (3.26) and (3.27) to (3.25), then: g - M x o 1) _ i In 151 a 59 Therefore: I 'ln“ = ln( ln |S | } . _ ln(e • ln (x o X )) - ln (ln \S\) l ~ In a Since the upperbound of |5 | is n m, ln(e • ln(xo *)) — ln(m ’ n ) * “ In a Since the num ber of iterations in the for-loop is m , therefore the total complexity of the algorithm is proved. Q.E.D. C O R O L L A R Y 4: W hen the application program is large, ( i.e. m t l ), the above com plexity is reduced to 0 ( m ■ ln m ). T he 0 ( m ■ ln m ) com plexity is obtained after a sequence of approxim ations. This result indicates the overhead associated w ith the m apping m ethod is pri m arily decided by th e size of the application system , not by the m achine size. The overhead is sub-quadratic, which is tolerable for m any practical application system sizes. 3.6 E x p e r im e n ta l R e su lts This section presents the perform ance evaluation of the static load balancing m ethod. It shows the effectiveness of the sim ulated annealing m ethod used in two m apping problems and presents the analysis on im plem entation issues of the annealing process. 3.6.1 R esu lts on M apping P rogram M odules A set of program allocation experim ents for solving the m apping problem de fined in section 3.2, were carried out to investigate the effectiveness of the alloca tion m ethod. These experim ents m ap from 32 program modules in a partitioned 20, C , 26, 18, .23, 24 f 14, > 3 1 , 2 7 , v 1 2 J ” 2 8 , 30> 22, 11, 7. 25 700 600 - 500 - 464 400 - 327 300 T 22 6 1000 282 79 1 (a). Final allocation (b). E-curve ' I I Figure 3.10: The final allocation and the convergence curve of the cost function i i obtained from an annealing process. j user program to a hypercube m ulticom puter w ith 8 nodes as shown in Fig.3.2. The experim ental results are analyzed w ith respect to the sensitivity in selecting the initial allocation, the relative perform ance of using different swap functions l and th e cooling speed. C onvergence to the Suboptim ality Using SIMAL as a tool, the partitioned program m odules can be m apped j to a 8-node m ulticom puter in an allocation w ith the suboptim al cost function. Consider the initial allocation shown in Fig.3.2 w ith E0 — 464, the final allocation and th e fluctuating curve of the cost function E obtained from th e sim ulated annealing process is shown in Fig.3.10 (a) and (b) respectively. The term E- ; curve is used for the convergence curve of th e cost function. The swap function ■ used is th e local exchange. The cooling constant a is set to 0.9. T he initial | I tem p eratu re is set to be To = 1000, which is a bit higher than the value obtained ; from the form ula. The higher tem perature setting is used solely for illustrations. ! T he process term inates at a low tem perature T = 1 with which a final value of 61 the cost function E — 327 obtained, which is much lower than the initial cost function E = 464. The allocation shown in Fig.3.10(b) w ith this cost function E = 327, is the solution to the m apping problem. R andom Selection o f the Initial A llocation 15, 23, J L 700 500 400 300 700 1000 282 79 22 600 - 400 - 1000 282 79 22 (a)A good initial allocation (b) A bad initial allocation Figure 3.11: Convergence of the cost function betw een a good and a bad initial alio cation Fig.3.11 shows the effects on the cost function w ith respect to two initial allocations which are different from th e initial allocation as shown in Fig.3.2. Fig.3.11 (a) shows a good initial distribution w ith a low cost function, whereas Fig.3.11 (b) shows a bad initial distribution w ith a high cost function. T he E- 62 >00 282 79 22 6 1 1000 282 79 22 6 1 1000 282 79 2 2 6 1 (a). Local exchange (b). Global exchange (c). Local exchange ! I Figure 3.12: Relative perform ance of three swap functions used in the sim ulated program allocation experim ents. I I curves show th a t initial distribution makes very little difference in achieving the | nearly minimized cost function at the end of th e annealing process. This result 1 is very encouraging, because it shows th a t we can start w ith a random allocation ' and still end up w ith a fairly balanced load. ; i i R elative Perform ance o f U sing D ifferent Sw ap F un ctions | i t T he convergence curves of applying the three different swap functions described are shown in Fig.3.12 (a), (b) and (c). Again, the swap functions used m ake very little difference in obtaining the final allocation w ith alm ost the same cost. The difference m ay He in th e convergence rate and the efficiency in im plem enting th e | anneahng process on real m ulticom puters. This is due to the fact th a t different heuristics may require different am ounts of overhead. Effect of the C ooling Speed The coohng schedule does effects the quality of the solution and the efficienc}r : of im plem entation. I will discuss this in detail when evaluating the sim ulated i I i i 63 | (a). r i+i = 0.97; (b). Ti+1 = 0.8Ti (c). Ti+1 = o .7 rf (slow cooling) (Intermediate cooling) (fast cooling) Figure 3.13: V ariation of th e cost function for th ree cooling speeds in th e sim ulated annealing program allocation process. annealing m ethod. T he effect of decrem enting constant a is shown in F ig.3.13. Fig.3.13 (a), (b) and (c) correspond to 3 cooling speeds from slow, to interm ediate, and fast, setting a to 0.9, 0.8 and 0.7 respectively. The slow cooling produces m ore fluctuation in the £ - curve. However, it eventually produces the lowest final value | (E f = 327 as shown). T he faster cooling produces less fluctuation, b u t higher ! final values (such as E f = 354 and E f = 376 in Fig.3.13 (b) and (c) ) respectively. I 3.6.2 B enchm ark E xperim ents on P rod u ction System s To test th e static load balancing to m ap parallel production systems onto m ul ticom puters, benchm ark experim ents were perform ed. These benchm arks are pro duction systems w ritten in OPS5. Two production systems were chosen: Tourney and T oru-w altz. Tourney is a production system for scheduling bridge to u rn a m ents. Tour-waltz is a production system im plem enting the W altz’s algorithm s for edge labelling. These two production system s are good candidates for parallel processing. M ultiple rule firing can be applied to these production system s and 64 produces the correct results. The original Tourney and Toru-w altz in 0 P S 5 ver sion have 17 and 31 rules respectively. In order to exploit parallelism , the Single- C ontext-M ultiple-Rules m odel is applied to reduce these two benchm arks to 26 and 48 rules respectively, which were used in our benchm ark experim ents [72]. The parallelism m atrix P and com m unication m atrix C are derived based on their definitions at the compile tim e. The firing frequency vector W is obtained from the average run-tim e traces by running the benchm arks sequentially. I input th e parallelism m atrix P, the com m unication m atrix C, and the firing frequency vector W of the benchm arks to SIMAL. W ith hardw are specification and all other input param eters, SIMAL produces a nearly optim al partitioning and allocation of the rules onto a hypercube. T he m atching tim e tm was assum ed to be 1.8 ms. The com m unication tim e tcom between two neighboring nodes was assumed to be 1 ms. The num ber of nodes to be used is determ ined by the m axim um parallelism which can be exploited from the production system . This is the m axim um num ber of different rules which can be fired in an inference cycle, and can be obtained from th e trace of sequential execution. For T o u rn ey , the m axim um parallelism which can be exploited is 4, while for Toru-waltz it is 14. From an economical point of view, I partition and allocate Tourney to a 4-node hypercube and Tour-waltz to a 16-node hypercube. T he perform ance on different m achine sizes, however, has also been observed. The initial configuration is determ ined using a round-robin or equal-divider approach. In the round-robin m ethod, rule Pi is assigned to N j such j — (i modulo n). In the equal-divider m ethod, it is assum ed th a t m = q • n + r. For i < (q ■ n), each rule Pi is assigned to N q such th a t q is the quotient of ijn . For i > (q • n), rule Pi is assigned to node Nr such th a t r = (i m odulo n). The first two columns of Table 3.1 and Table 3.2, show th a t the cost function was not optim ized in either m ethod. The th ird column shows the random results 65 Table 3.1: Com parison of Three M ethods to Map th e Toru-waltz Production System on a 16-node H ypercube C om puter Allocation Round-Robin Equal-Divider Annealing N0 0,16,32 0,1,2 24,20,12 Ax 1,17, 33 3,4,5 35,43,16 n 2 2,18,34 6,7,8 34,6,10 n 3 3,19,35 9,10,11 47,18,5 n 4 4,20,36 12,13,14 7,31,15 As 5,21,37 15,16,17 41,3,30 N6 6,22,38 18,19,20 46,9 21 N 7 7,23,39 21,22,23 0,2,1 N 8 8,24,40 24,25,26 39,19,28 n 9 9,25,41 27,28,29 44,22,40 N io 10,26,42 30,31,32 26,29,14 N u 11,27,43 33,34,35 32,4,37 n 12 12,28,44 36,37,38 27,11,33 n 13 13,29,45 39,40,41 8,42,38 N\4 14,30,46 42,43,44 45,23,17 Nrs 15,31,47 45,46,47 25,13,36 Epar 943.5 1060.5 681.0 Eimb 136.1 206.7 96.5 Ecom 4823.0 4940 3822.0 E 5902.6 6207.2 4599.5 66 Table 3.2: Com parison of Three M ethods to Map th e Tourney Production System on a 4-node H ypercube Com puter. Allocation Round-Robin Equal-Divider Annealing No 0,4,8,12,16,20,25 0,1,2,3,4,5,25 17,8,3,20,7,1,6 N i 1,5,9,13,17,21 6,7,8,9,10,11 4,12,2,15,21,18,5 n 2 2,6,10,14,18,22 12,13,14,15,16,17 10,24,0,23,11,9 N z 3,7,11,15,19,23 18,19,20,21,23,24 22,25,13,16,19,14 Epar 7552.5 6714.0 5625.0 Eimb 156.6 363.6 153.0 E C om 4165 4031.0 2803.0 E 11874.0 11108.6 8581.0 after running th e sim ulated annealing m ethod. T he cost function has been re duced significantly in our m ethod, com pared to th e round-robin and equal-divider m ethods. Not only th e value of E com decreased significantly, but also th e values of Epar and This clearly shows the effectiveness of using the sim ulated an nealing m ethod. In Fig.3.14, variation of the cost function E {•) is obtained from the SIMAL o u tp u t list. N ote th a t as T — * ■ 0, the cost function approaches its m inim um value. Let # C a be the num ber of inference cycles in th e sequential execution of a production system , and # Cp be the num ber of inference cycles of the parallel execution of a production system w ith m ultiple firing. Let # F i be the num ber of rules fired for each inference cycle. In sequential execution, only one rule can be fired in an inference cycle, thus ^ F i is always equal to 1. However, more th an one rules can be fired in the parallel execution, and the num ber which really can be 67i 6500 -i 4500 T 1000 207 25 3 0 12000 * 11000 - 10000- eooo 14 7 (a) Toru-waltz (b) Tourney Figure 3.14: E-curve obtained from the sim ulated annealing process in two bench m ark experim ents. fired is determ ined by the mapping. T h e speedup Sp is defined as th e following: # a (3.28) In the benchm arks, some rules need to be fired m ore th an once w ithin a cycle. Therefore, even if each firable rule is allocated to a different node, m ore th an one firing is still needed for some parallel cycle. If th e m apping cannot guarantee th a t firable rules are to be allocated to different nodes, then m ore firings are needed for some inference cycle. In Fig.3 .1 5 ,1 show th e speedup obtained from th e different I m ethods when compared to th e ID (ID eal) case. It shows th a t sim ulated anneal- j ing m ethod can produce speedup close to the ideal case. These experim ents also j dem onstrate th a t the proposed new cost function effectively reflects the desirable perform ance improvements. 681 4 SA RR ED 3 2 1 0 16 32 2 4 8 n 3 2 4 8 2 16 (a) Toru-w altz (b) Tourney I Figure 3.15: Speedup comparison of different m apping m ethods. j i 3.6.3 P erform ance A nalysis of the Sim ulated A n nealin g M eth od The perform ance of the simulated annealing m ethod is evaluated by the quality of th e solution and th e efficien cy of the process. T he quality of the solution | m easures how the final value of-the cost function obtained is close to th e real J global m inim um . T he efficiency of the process m easures th e actual CPU execution j 1 tim e used. In this section, I use the results obtained from m apping production j rules in a Toru-waltz benchm ark production system to evaluate the perform ance 1 of th e annealing m ethod. The effects of various param eters in th e cooling schedule \ ! are th en analyzed. i i i A n alysis on Q uality and Efficiency In order to verify the quality of the sim ulated annealing process, the cost functions obtained from different m ethods are com pared w ith the optim al value obtained from the quasi-liner-programming. Figure 3.16 shows the resulting cost functions by m apping rules in Toru-waltz benchm ark to different num ber of nodes using different m ethods; where ID stands for “ID eal” to denote the linear pro gram m ing, R R for “Round-Robin” , E D for “Equal-D ivider” , and SA for “Simu- 69 16000 20000 14000 12000 15000 10000 8000 10000 6000 4000 5000 (a) Toru-w altz (b) Tourney Figure 3.16: Com parison of different m apping m ethods on cost function vs. m a chine size. i t lated A nnealing” . Clearly, the cost values of SA are very close to th e optim al of ! ID , which m eans th a t they are nearly optim al. 1 As shown in Section 3.5, the overhead incurred w ith our sim ulated annealing j m ethod is estim ated to be O (m ln m ). Consider m apping 48 rules in Toru-w altz | l to a 16-node hypercube m ulticom puter. T he optim al value of th e cost function of this m apping is E , = 4301 ms obtained from running a linear program m ing program on SUN-3 Unix system. T he CPU tim e to run this linear program m ing is ab o u t 10 m inutes; while the annealing process requires only a few seconds to com plete. In Table 3.3, I show the cost function and C PU tim e obtained from running ' j different sw apping heuristics on two initial configurations I have experim ented I upon, where the value of E for each column is averaged from 20 runs w ith different cooling policy. The difference, E — jE,, leads to the percentage A = g ~£'* -. The . ^ ite ra tio n s is th e total num ber iterations in th e w hile-loop (i-loop). The CPU tim e corresponds to running SIMAL on th e Sun-3 Unix system . T he m axim um 70 i i Table 3.3: Average Perform ance of SIMAL on the T oru -w altz System M ethod R ound-R obin L ocal-E xchange R ound-R obin G lobal-E xchange Equal-D ivider L ocal-E xchange E qual-D ivider G lobal-E xchange E 4485 4375 4663 4599 E - E . 184 74 302 298 A in % 4.3 1.7 7 6.9 ^ iteratio n s 82 89 81 85 CPU (s) 1.25 1.4 1.21 1.37 of (Ef — jE„) is 302, and the m inim um is 74. T he average deviation of final cost function varies from 1.7% to 6.9% from th e optim al value shown. As far as the C PU tim e is concerned, by sacrificing about 7% in quality, SIMAL can save 99% execution tim e, when com paring tim e to obtain an optim al solution. Notice th a t Toru-w altz only has 48 rules. W hen the size increases to thousands of rules in a production system , the saving in overhead will be even m ore impressive. I have run th e annealing m ethod using two swapping heuristics. By definition, the local exchange has sm aller derived set than th a t of the global exchange m ethod. The results shown in Table 3.3 suggest th a t the local exchange m ethod requires less execution tim e b u t produces higher value of the cost function. The larger the size of th e derived set is, the m ore chances a lower cost function will be generated w ith the sim ulated annealing m ethod. The choice of the swapping heuristics should be based on th e actual im plem entation cost. Effects o f C ooling P aram eters J In im plem entation, the quality and efficiency of the sim ulated annealing methoc is controlled by th e cooling schedule used [82]. T he initial tem perature setting, 71 Table 3.4: Cost Function and CPU Time in Using Three Initial T em perature Settings T0 Setting m a x ( A jE ? ) * M X n -1 ) E /q In 2 2E0 To 1507 19608 11805 E f 4375 4554 4548 # Iteration 89 1 1 0 125 Tf 0.019 0.18 0 . 0 2 2 CPU(s) 1.4 2.9 3.2 th e term ination criterion and the decrem enting rule all affect th e convergence to th e suboptim al solution. Let E* be th e upper bound of E for all possible system configurations, and E * be the absolute m inim um of E . Geman and Gem an [46] proved th a t tem p eratu re setting Ti = (i > 1 ) will guarantee the global m inim um of th e annealing process. I have used f° approxim ate the entire span E* — E 0. T he overhead to determ ine a good initial tem perature T0 is caused by an extra iteration over the i-loop. But the gain lies in significantly reducing the num bers of required iterations over the i-loop. Table 3.4 shows the num ber of iterations and th e actual execution times for three T0 settings when a = 0.9 and k — 4. The initial assignm ent is round-robin, and the swapping heuristic is global exchange. T he first colum n suggests the best perform ance among the three cases. T he term ination criterion th a t I have used is determ ined by a constant param eter k, which determ ines the num ber of previous M arkov chains checked before the term ination, and th e term ination fraction e = 10~f c to be used. Since e is a very small, e • m a x(A E )* is also very small to approxim ate the difference between E f and E „. Larger values of k result in better solution as shown in Table 3.5. In 72 Table 3.5: Cost Function and CPU Tim e as a Function of the Term ination C rite rion (The choice of the param eter k) k 1 2 3 4 5 6 7 8 € IO" 1 io - 2 1 0 ~ 3 o 1 1 0 ~ 5 1 0 ~ 6 IO" 7 1 0 " 8 E f 4656 4381 4381 4375 4375 4375 4369 4369 i 59 82 83 89 90 91 107 107 T 0.53 0.26 0 . 2 0 0 . 1 1 0 . 1 1 0 . 1 0 0.019 0.019 CPU 0.9 1 . 2 1 . 2 1.4 1.4 1.5 2.9 2.9 practice, again, the k should be chosen based on th e tradeoff between th e solution quality and CPU tim e used. In Table 3.5, the results are obtained from running th e Toru-waltz production system using the round-robin initiation assignm ent, global swapping function, and a choice of cooling constant a = 0.9. The initial tem p eratu re T0 was 1507. As k increases, th e cost function decreases and the CPU tim e increases. After k > 4, there is no big im provem ents in cost function. W hen k = 4, the CPU tim e is 25% Table 3.6: Cost Function and CPU tim e as a Function of th e Cooling C onstant a a 0.4 0.5 0.6 0.7 0.8 0.9 0.95 E f 4687 4508 4432 4531 4523 4375 4470 i 37 40 62 48 55 89 149 Tf 10~5 10~5 1 0 '5 10~4 n r 3 0.12 0.72 CPU 1.01 1.12 1.33 1.21 1.21 1.42 3.98 73 less th an th e one obtained when k = 5, but only has 1% higher value in the cost function. Thus choosing k = 4 is sufficiently good and economical. For simplicity, I chose a simple decrem ent rule where T'+i = a ■ Ti, where 0 < a < 1. In Table 3.6, the effect of a on perform ance is shown. T he initial configuration chosen was generated from the round-robin m ethod and the global swapping function was used w ith To = 1507 (where # Iterations are abbreviated as “i” ). T he term ination criterion is set by choosing k = 4. Larger ct will result in longer CPU tim e. A tradeoff between the CPU tim e and th e solution quality exists. 74 Chapter 4 Heuristic Process Migration for Dynamic Load Balancing A good dynam ic load balancing m ethod should reduce the overhead in col lecting load indices and in process m igration [30], [33]. T he com m unication costs incurred in these operations depend on the message routing scheme used. In this chapter, I present four easy-to-im plem ent heuristic m ethods for dynam ic process m igration to achieve balanced load among m ulticom puter nodes. These heuristics avoid frequent load indices exchanges among nodes w ith the coordination of a host processor, such as th e cube m anager used in the iPSC system. The term threshold has been used to decide when load balancing operations should be invoked [38]. M ost load balancing m ethods use some fixed threshold values [21], [77], [91]. If th e threshold is too low, th e thrashing caused by excessive load balancing activities m ay degrade the perform ance of the entire system . If the threshold is too high, effective load balancing cannot be achieved at all. Recently, Pulidas et al. [94] proposed a gradient m ethod to optim ize the choice of threshold value. Their m odel requires frequent exchange of load indices among all nodes w ithout a centralized supervisor. I propose a new scheme where each node updates the threshold on a periodic basis under the supervision of a host processor [129]. This idea is inspired by the supervised distributed message routing scheme im plem ented in A parnet [121]. However, the A parnet scheme was m eant to m inim ize the switching p ath delays betw een source and destination. My scheme is designed to balance load among 75 m ulticom puter nodes; which has a different set of optim ization objectives. Efe and Groselj [33] have .also proposed a supervised load sharing model. They use fixed threshold and a controller node, which will execute the extra load transferred from the rem aining nodes or transfer them back to the idle nodes for balanced execution. My scheme differs from their approach in using adaptive thresholds at all nodes and in restricting the host processor to perform only the collection and broadcasting of load indices from all com puter nodes. T he host in my scheme does not participate in the decision of process m igration or actual execution of user jobs, which are entirely distributed to local nodes. My scheme does not require load inform ation exchange among nodes as done in [94]. The distributed decision in m y m ethod is based on sender initiation, b u t using an adaptive threshold which requires no handshaking among nodes as suggested in [3], [70]. My approach improves from the above m ethods and greatly reduces th e im plem entation and control overheads, which leads to high perform ance at lower cost. The perform ance of heuristic load balancing m ethods is evaluated using a parallel discrete event-driven sim ulator P S I M im plem ented on a 32-node Intel iP S C /2 hypercube system . T he sim ulator consists of a host program and dis trib u ted event-driven sim ulators on the hypercube nodes. T he sim ulation results do verify the effectiveness of the load balancing scheme. I evaluated th e m ulticom p u ter perform ance under different utilization levels, load im balance conditions, m ulticom puter sizes, and the overhead caused by com m unication and process m i gration. T he experim ents have shown th a t these heuristic m ethods are insensitive to utilization level and are scalable in perform ance to m ulticom puter size. 4.1 An Adaptive Distributed Load Balancing Model A m ulticom puter is represented by n processor nodes N i, 0 < i < n , inter connected by a network characterized by a distance m atrix D — {dt3}, where d^ 76 shows the num ber of hops between node Ni and Nj. It is assumed th a t da = 0 and d{j — dj± for all i and j . The im m ediate neighborhood of node Ni is defined by the subset Gi = { N j \ dij = 1}. n — 1 n — 1 Host Processor n— X Interconnection N etwork (M essage Passing) Figure 4.1: A n adaptive load balancing m odel for a m ulticom puter w ith n pro cessor nodes and a host processor. An adaptive m odel of distributed load balancing is shown in Fig.4.1. T he host processor is connected to all nodes. The load index li of Ni is passed from each node Ni to th e host and the system load distribution L t = {li | 0 < i < n} at tim e t, is broadcast to all nodes on a periodic basis. All nodes m aintain their own load balancing operations independently and report their load indices to th e host on a regular basis. T he host periodically updates and broadcasts th e system status. T he load balancing in each node is form alized by a queueing model, w ith a Poisson arrival rate and an exponential service rate pi. There are two m igration ports at each node N p th e input port li and th e output port Oj. These ports are connected to an interconnection network for process m igration. Arriving processes at each node can be either executed locally or m igrated to some rem ote nodes for execution. — rrl u The load distribution L t is described by a m ean value lt — '=g and a y — 1 n. _ J) variance <r(Lt) = "-i'- S L n -— ■ Let L ^ and Lti+j ( i > 0 ) be th e load distributions! 77 at two adjacent update tim es ti and £ i + 1 respectively. The tim e window is defined as W ti — £{+ 1 — ti. Let r be the load variation factor defined by: r = \ *(Lti+1) ~ <r(Lt{) \ ( 4 J ) max(cr(Lti),cr(Lti+1)) T he param eter r indicates the increm ental change in two successive system load distributions. Assume the initial tim e windows W to = W tl- Consider two adjacent u p d ate tim es ti and U+i and choose 0 < k\ < k2 < 1 ? the tim e window W ti+l is com puted from the earlier window W ti recursively as follows: ( 1 — r) • W ti if kx < r < k2 ( 1 - k 2) - W u if r > k 2 \ a t. (1 Ajj) • Wt. if r < ki W f.t = < c i + l . w u if wti < k2 ■ wt 0 W hen the system enters a ready state, the system load distribution becomes virtually unchanged. This implies th a t the tim e window becomes m uch longer in a steady state. W hen th e system load changes rapidly, the difference betw een th e load variances becomes significant. T hus th e tim e window will become shorter accordingly. This implies th a t th e system state will be updated m ore frequently. T he param eters kx and k2 are introduced to avoid rapid changes in Wt, especially during the initiation period. In Fig.4.2, I show an example w ith initial condition ki = 0.01, k2 = 0.1 and Wto = 5000 ms. The successive tim e windows Wtt + 1 are calculated from the earlier window W ti using Eq.4.1 recursively. At each node N i, a sender-initiated load balancing m ethod is used, where heavily loaded nodes initiate m igrating processes. T he sender-initiated m ethod has the advantage of faster process m igration, as soon as the load index of a processor node exceeds a certain threshold. I use an adaptive threshold, which is updated periodically according to th e variation of system load distribution. The distributed load balancing for each node Ni is represented by the queueing model shown in Fig.4.3. Processes arrive to each node w ith the Poisson m ean arrival 78 cr(Li) — 4 <r( Z/2 ) = 3.8 cr(Lz) — 3.8..... c r(i4) = 4.8 I 1 1 1 _i_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ I _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _i _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ i '.'.'-J - W o ------------- 5000 ms - W j — 5000 ms - w2 — 4750 ms - W 3 • 5225 ms Figure 4.2: An exam ple of the variable tim e window function W t with ki — 0.01 and k2 = 0 .1 . Node Ni Decision Maker Server b Figure 4.3: A queueing m odel for the dynam ic load balancing at each distributed com puter node. 79 rate of A; per second. Each server at node Ni has the exponential m ean service rate //.{. In other words, process service tim es are exponentially distributed w ith th e m ean value •S 'i = ~ second per process. A m ulticom puter is m odeled as a homogeneous system , where servers at all com puter nodes Ni are identical. Thus Si — S and pi — p for all i. An arriving processes will be put into one of the two queues: the ready queue Ri or the migration queue M i. Let p be a process. Let F(p) = e(p) + m(p), where e(p) as the execution tim e of p and m(p) is the mem ory dem and. T he load index li of node Ni is determ ined by sum m ing up the F(p) of all processes in th e ready queue Ri of N{. h = E f (p) (4-3) PG Ri The threshold 6i to be used at Ni is described in the next section. W hen a process is ready to run a t N i, its load index Z ; is com pared w ith th e threshold If Ii < th e process is p u t into the ready queue and U is increm ented by F(p)-, otherwise it is put into the m igration queue. After entering th e ready queue, the process will be assigned to a processor for execution and never to be m igrated again. This guarantees the single migration policy. Processes in the m igration queue will eventually be m igrated to rem ote nodes. The input port R accepts processes m igrated from other nodes and puts them into the ready queue. The o u tp u t port Oi m igrates processes to other nodes. Each processor follows the First-C om e First-Serve (FC FS) policy to schedule the execution of ready-to-run processes. Each node in the load balancing model is connected by a process m igration interconnection network. This network is defined by the heuristic load balancing m ethods which will be described in the next section. Each in p u t port R and o u t put port Oi can be connected to the neighboring nodes or to th e rest of th e nodes in the system . T he neighborhood is determ ined by the hardw are topology. Com bining Fig.4.1 and Fig.4.3, the proposed load balancing model is actually an open 80 No Nt. Service Q J* -q a r li H i ll I I h D Migration Q Service Q S - M r _ l l i U Migration Q N 3 S ervice Q Jj 3 c m i l — io jp Migration Q Service Q S T T T T T JLLU jj Migration Q Figure 4.4: An example of the open network load balancing m odel (D: decision m aker, S: server). queueing network which can be shown by an example in Fig.4.4. In Fig.4.4, the m ulticom puter is a 4-node hypercube, and the process m igration interconnection work connects the input port and output port of each node to those of its neigh borhood. Note the role of the host processor is only collecting and broadcasting load inform ation, there is no load transferring between the distributed nodes and the host. T hus, the adaptive model is totally decentralized. 4.2 H e u r istic P r o c e ss M ig ra tio n M e th o d s Four heuristic m ethods for m igrating processes are form ally introduced below. These heuristics are used to invoke the m igrating process, to update the 81 threshold, and to choose the destination nodes for process m igration. These m eth ods are based on using the load distribution L t, which varies from tim e to tim e. Two attrib u tes are identified below to distinguish the four process m igration m eth ods. By considering 4 com binations of these 2 attrib u tes, I propose four different heuristic m ethods for process m igration. 1. Decision range: T he load redistribution process can be restricted to those nodes in the neighborhood set G, adjacent to each N t, or involve all the m ulticom puter nodes in the system . T he threshold is u p dated by either using the local average load am ong neighboring nodes or using the global average load among all nodes in the system. 2. H euristics used in process migration: Either a round-robin (RR) m ethod or a m inim um -load (ML) m ethod will be used in selecting the destination node for m igration. The R R m ethod uses a circular list w ith a pointer to indicate th e front end. T he ML m ethod chooses a node w ith m inim um load as th e destination node. A. Localized Round-Robin (LR R ) Method Each node Ni uses th e average load among im m ediate neighboring nodes to u p d ate th e threshold and only m igrates processes to th e im m ediate neighboring nodes from set G i. The round-robin discipline described above is used to select a candidate node for process m igration. T hree aspects of th e load balancing design are stated below: 1 . A fter receiving system load distribution L t from th e host at tim e i, the node Ni resets its threshold 6 * to the average load am ong im m ediate neighboring nodes. T h at is, Si = [(1 + a ) ' 1 ? where 0 < a < 0.2 is a norm alized constant chosen. I 82 ----------------------------------------------------------------------------------------------- — — ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- I Table 4.1: Exam ple Load D istributions Before and A fter the A pplication of the Local Round-Robin M ethod Time Ni No Ax a 2 a 3 a 4 As Ne Nr B k 2 10 8 1 6 3 5 15 E Si 8 5 5 10 5 10 10 7 F a 4 a 3 No a 2 No a 4 a 4 a 3 0 Ci n 2 No a 3 Ax As Ax a 2 As R E A 4 A 5 a 6 a 7 No N7 Nr No A k 5 10 8 4 6 4 6 15 F a 4 No a 3 a 2 As a 4 n 4 As T Ci n 2 A 5 a 6 Ax a 6 Ax n 2 No E R Ax a 3 No N7 A 0 Nr Nr a 3 2. T he input m igration port A is connected to the o u tp u t port 0 / s and the o u tp u t m igration p o rt 0{ is connected to th e in p u t p o rt / j ’s, such th a t Nj € Gi. 3. An ordered candidate list Ci is used to select the destination node. As a set, Ci = G{ and each entry in Gi is a d a ta structure representing a neighboring node of Ni. T he entries in Ci are ordered by th e increasing load indices involved. W ithin each tim e window W t, Ci is u p d ated in a round-robin fashion. In Table 1 , I show an exam ple of using the LR R m ethod in an 8 -node hyper cube. The norm alized constant a is set to be 0.1 in updating each local threshold. T he list Ci is increasingly ordered, and the m igration queue at each node is ini tially empty. After a process arrives, and the local load index is u pdated. B ut the threshold will not be updated until the next L t+ X is broadcast from the host. Assume th a t the F(p) for each arriving process is one unit (for illustration p u r pose). At node N x, since lx > Sx, it puts the ready process in th e m igration queue and keeps th e load index lx unchanged. T he process at the front of th e m igration queue is m igrated to node N 3 . After th a t, is p u t back to the end of C x. At node IV3, since Z 3 < 6 3 , it puts the ready process into th e ready queue. Since N 3 also receives 2 processes m igrated from nodes N x and JV 7, its load index Z 3 is totally increm ented by 3. The candidate list (73 rem ains unchanged because no process needs to be m igrated at this point. B. Global Round-Robin (G R R ) Method Each node Ni uses a globally determ ined threshold and m igrates processes to any appropriate node in th e system. The selection from candidate list for process m igration is based on the round-robin discipline. A fter receiving th e load distribution L t from the host, th e global threshold St is set to th e system average load am ong all the nodes. T h at is 8t — [(1 -f a ) • — for th e tim e window Wt. The input and outp u t m igration ports Ix and Oi are connected to all O j ’s and I f s respectively, for j f i. T he candidate list Ci operates th e same as in the LRR, except Ci = { N j | j f i}- C. Localized M inim um Load (LM L) Method T he way to determ ine th e threshold and to set up m igration ports is the same as th a t in LRR. The difference between LML and LRR is in the policy to select a destination node. At node Ni, there is a load table to store the load index of each node in Gi. LML uses th e node w ith the m inim um load index in the load table as a destination node. After a process is m igrated to the selected node, its load index in the load table is increm ented accordingly. 84 D. Global M inim um Load (GML) Method In this case, the way to setup threshold and m igration ports is the sam e as th a t in G RR. But the destination node is determ ined as the same way as th a t in LML. T h a t is the node w ith a m inim um load index in the global load table will be selected as the destination node. 4.3 P S IM : A P a ra llel D isc r e te E v e n t-D r iv e n S im u la to r To investigate the effectiveness of the proposed heuristic load balancing m eth ods, I have developed a parallel event-driven sim ulator P S I M w ith load balancers distributed over a 32-node Intel iP S C /2 hypercube system . PSIM is w ritten in C w ith iP S C /2 message passing library calls. It consists of a host program and a node program . The host program is im plem ented at the frontend host m achine w ith the U nix OS. The node program is im plem ented at the m ultiple hypercube com puter nodes w ith the N X /2 operating system. PSIM im plem ents an open queueing network w ith load balancing m echanism em bedded. T he sim ulation param eters are in p u t to th e host and broadcast to all nodes. Each node program concurrently perform s discrete event-driven sim ulation [39], balancing the load through message passing. At th e end of sim ulation, each node reports its statistic results to the host program . T he host program sum m a rizes the perform ance evaluated from the parallel sim ulation. The construction of PSIM is shown in Fig.4.5. T he host program plays two m ajor roles: th e host processor in th e load bal ancing m ethod, and the controller of th e parallel sim ulation. It has four parts: the user interface, the controller, the statistics reporter and the com m unication subsystem. The user interface accepts the sim ulation param eters and o u tp u ts the final results. These results will be defined in the next section. It can also dis play statistic messages from distributed nodes during the parallel execution. The 85 Host Program Node Program Contro ller W0, k. Comm. Comm. Output Sybsys. Subsys. display Input Port controller Statistics Recorder Event — e* Threshold Updater | p Driven Cntrl. ; — e* Decision Maker Output Port Controller To remote nodes Figure 4.5: Software com ponents of PSIM on an iP S C /2 hypercube system . controller passes the sim ulation param eters to the node program and collects the system states based on the tim e window W . The statistics reporter calculates th e sim ulation results using th e final messages received from each node. The com m unication subsystem receives and broadcasts all messages fro m /to distributed nodes. The distributed load balancers are identical am ong all nodes. It is controlled by a discrete event-driven controller, which im plem ents the queueing model shown in Fig.4.3. T he service queue and migration queue are m anipulated by th e event- driven controller. The distributed load balancing m ethod is im plem ented by four m odules: th e threshold updater, the decision maker, the input controller and the output controller. T he threshold updater simply updates th e threshold upon re ceiving m essage L t from th e host controller. The decision m aker handles the candidate fist and load table operations to decide th e destination of process m i 86 gration. T he statistics recorder is used to keep track of th e statistics during the j sim ulation and sends it to the host at the end of sim ulation. T he com m unication 1 subsystem is im plem ented by using iP S C /2 message passing system calls. Message j sending and receiving is handled in an asynchronous m anner. The node event- ■ driven sim ulation is never blocked by the message passing. Actions are taken i when messages are received. PSIM allows the user to input desirable perform ance m easurem ent param e ters. A fter receiving the param eters, each node perform s its own sim ulation using : 1 th e load balancing m ethod. The entire system actually contains n individual | i node sim ulators with attached load balancers executed in parallel, w here n is the ' i m ulticom puter size. ! The sim ulation is controlled by a total sim ulation tim e, which sets the tim e . i lim it to generate arrivals at each node. Each node will send a finishing message I s to the host when its own event-list becomes empty. The host program will send j | a shutdow n message to each node after it receives finishing messages from all nodes. The load balancer at each node is active until the shutdow n message is i received. A fter receiving the shutdow n message, each node reports to the host i its local statistic results, which include the local m ean response tim e, num ber of i I arrivals, average load, average w aiting processes, etc. The perform ance m easures j and overhead estim ates will be discussed in the next section. The final results are : produced by the statistics reporter at the host, based on the parallel sim ulation j i results from distributed nodes. 4 .4 P erfo rm a n ce o f H eu ristic M e th o d s In this section, I use the open network queueing model as shown in Fig.4.4 to evaluate th e perform ance of the heuristic process m igration m ethods for dynam ic I load balancing. > 4 .4 .1 P e rfo rm a n c e M e a su re s a n d O v e rh e a d E s tim a te s Using the parallel sim ulator developed at the iP S C /2, I have done experim ents to reveal the perform ance of the propose heuristic m ethods. T he effectiveness of th e heuristic m ethods is basically m easured by the mean value R and th e standard deviation cr(Ri) of the system mean response tim e R{s, where Ri is th e mean response tim e obtained from local sim ulation at A^. R is defined as: n— 1 y where A is th e total arrival rate, is the mean node arrival rate at N{. And cr(Ri) is defined as: w = Jszmu-Ttr V n The R and cr(Ri) are m easured under the following assum ptions: a. M ean utilization U , mean service tim e S and imbalance factor c: T he m ean utilization U is the average value of each node’s utilization U{. In order to verify th e effectiveness of load balancing, the local utilization Ui is m ade as a uniform distribution between U + c and U — c, where the imbalance factor c ( 0 < c < l ) i s a fraction such th a t 0 < U — c and U + c < 1. T he im balance factor c determ ines the inherent degree of variation in system utilization, c — 0 implies th a t the system is inherently balanced, such th a t each node has th e same utilization Ui = U . Since a m ulticom puter is a homogeneous system , I set the local service tim e Si equal to the system m ean service tim e S for all i. T he local arrival rate A * can be determ ined by A; = T he to tal arrival rate A = A ^. b . Initial updating window W 0 and adjustable constant k's: T he tim e window between two consecutive system state collection is deter m ined by W 0 and the adjustable constants. k t and set the boundaries of tim e window adjustm ent. T he purpose is to reduce unnecessary inform ation collection when th e system is in a relatively stable state, and to collect th e system state 88 m ore frequently when it fluctuates. In m y experim ents, I choose 1 < W0 < 20 second, 0 . 0 1 < < 0 .0 2 , and 0 . 1 < k2 < 0 .2 . c. M ulticom puter size: The m ulticom puter size n is the num ber of com puter nodes in the system . In my experim ents, n is chosen from 2 to 32, based on the iP S C /2 facility available. This param eter is used to test the scalability of th e proposed m ethods. The overhead caused by state inform ation collection and process m igration is m easured through the execution of the parallel sim ulation. The iP S C /2 system call m clockQ is inserted in th e program to record th e tim e delays caused by com m unication and process m igration. d. Com m unication overhead Oc: The com m unication overhead Oc is defined as Oc = | , where tc is th e cost of updating system load inform ation. tc counts th e tim e spent to send local load U from node N{ to the host and the tim e to receive m essage L t from th e supervisor. Let ts and tT be the sending and receiving tim e respectivelj7 , tc = t s + tr . In the iP S C /2 environm ent, the message passing tim e is independent from the message length for short messages, b u t is determ ined by the message passing distance. In the parallel sim ulation, t s and tr are m easured from th e execution of message passing. The average tim e to pass a message from a node to it neighboring node through 1 physical link is m easured as 1 ms. T he value of tc is m easured from 2 to 12 ms. W hen n — 16, the average tc = 8 ms. The com m unication overhead Oc — ^ 6- can be varied by changing the value of service tim e S. e. M igration overhead Om The m igration overhead Om is defined as Om — where tm is the tim e to m igrate a process from a node to the destination node. tm includes the tim e of queue m anipulation, candidate list or load table operation, and sending or 89 Table 4.2: M easured Com m unication O verhead and M igration Delay M ethod tc (ms) (ms) LRR 8 25 GRR 8 30 LML 8 28 GML 8 35 receiving a process by message passing. The value of t m depends on th e load balancing m ethod used. Based on the actual m easurem ents, tm ranges from 25 to 35 ms. The m easured com m unication delay and m igration delay from th e experim ents in a 32-node iP S C /2 hypercube m ulticom puter are shown in Table 4.2. 4.4.2 E valuation of th e H euristic M eth od s T he perform ance of the heuristic m ethods is evaluated using various consid erations. The default values of param eters are set as: U = 0.6, c = 0.35, 5 = 1 s/process, W 0 = 5 s, & ! = 0.01, k2 = 0.1 and n — 16. T he m ean system response tim es R obtained from applying four load balancing m ethods are com pared to the reference condition of no load balancing (NLB). A . Effect of th e U tilization Level In Fig.4.6 , the m ean value R and the standard deviation of the response tim e Ri is obtained from applying load balancing m ethods under variable u ti lization level. T he system m ean utilization level U is set from 0.3 to 0.7, and th e im balance factor c is set to 0.2. This means th a t U{ varies from 0.1 to 0.9J W ith c = 0.2, the system load is not inherently balanced, b u t has only m oderate variation in utilization. From Fig.4.6, it is clear th a t th e heuristic m ethods can im prove the system perform ance in R. As the system m ean utilization increases, the im provem ents in R rem ains alm ost the same. This implies th a t the m ethods are not sensitive to the utilization level. T he standard deviation cr(Ri) improves m ore as th e system utilization level increases. T he m ean response tim e Ri at each node Ni becomes m ore close to each other under higher utilization level. This implies th a t the proposed load balancing m ethods is stable when system is heavily loaded. This scalability usually cannot be provided by the previously proposed load balancing m ethods. B . Effect of Load Im balance T he im balance factor c determ ines the inherent degree of system load im bal ance. It is a param eter to test the effectiveness of the load balancing m ethods under a different system load distribution. It is desirable th a t the load balancing will not degrade the system perform ance when system load is inherently balanced, b u t upgrade the perform ance when system load is unbalanced. Figure 4.7 shows th a t the proposed m ethods can achieve th e expected results. In Fig.4.7, The de duction in m ean response R increases as th e im balance factor c increases. This proves th a t the m ore unbalanced the system is, the more effective are th e load bal ancing m ethods. The values of (r(Ri) obtained show th a t th e each node will have very close m ean response tim e Ri after applying the load balancing m ethod to a workload unbalanced system . B ut, if the system workload is balanced ( c < 0.2), then load balancing overhead will produce variations on the m ean response tim e R i. However, this variation does not degrade the overall system perform ance which is m easured by R. C. Effect of C om m unication and M igration O verhead In my experim ents, the com m unication delay tc and the m igration delay fm 91 M ean Response Tim e S tandard D eviation of Response Tim e R 6 .0 -- * : LRR o : G R R * : LML o : GML * : NLB 5.0-- 4 .0 - 3 .0 - 2. 0 -- 1.0 - 0.0 0.3 0.4 0.5 0.6 0.7 System U tilization cr(Ri) 3 .0 " * : LRR o : G R R . : LML o : GML * : NLB 2.5-- 2.0 1.5-- 1. 0 " 0 .5 - 0.0 0.3 0.4 0.5 0.6 0.7 System U tilization Figure 4.6: M ean and stan d ard deviation of the response tim e vs. m ean utilization level. R . k 5 .0 - 4.0 M ean Response Tim e 3 _0 2.0 1.0 __l-------- 1 -------- ] ---------1 -------- 1 ---------1 -------- h - 0.0 0.1 0.2 0.3 Load Im balance Factor 5 .0 - S tandard 4 q Deviation of Response 3 q Tim e * : LRR o : G RR * : LML o : GML * : NLB 0.4 * : LRR o : G RR * : LML o : GML * : NLB 0.0 0.1 0.2 0.3 Load Im balance Factor Figure 4.7: M ean and standard deviation of the response tim e vs. load imbalance factor. 93 are m easured from the actual execution of the parallel sim ulation. The average values are shown in T&ble 4.2. Since t c and tm are fixed by th e real execution tim e in sim ulation, changing S actually changes Oc = ^ and Om = In Fig.4.8, S ranges from 1 to 0.1 second, w ith fixed U = 0.6 and c = 0.35. This implies th a t Oc ranges from 0.008 to 0.08, and Om ranges from 0.025-0.03 to 0.25-0.35. In Fig.4.8, the im provem ents on m ean response tim e R deteriorate in a small portion as S decreases, but still keeps distance w ith the one of no load balancing. This implies th a t the perform ance is not sensitive to com m unication and m igra tion overheads. As S = 0.1, Oc = 0.08 and Om = 0.25-0.35, there is still 10% im provem ents on R. The overhead of the load balancing is shown by the increase in a(Ri) when the service tim e S becomes small. W hen th e inform ation collection and process m igration cost become a large proportion of the service tim e, the variation on the m ean response tim e Ri occurs ( s < 0.4 ). D . Effect o f th e M u lticom p u ter Size The proposed load balancing m ethods have scalability to th e m ulticom puter size n. This is shown in Fig.4.9. W ith th e default values of th e perform ance param eters, th e im provem ents on R and cr(Ri) stays even when n increases. The scalability comes from the decentralized control and th e adaptive n atu re of the load balancing m ethods. A lthough the experim ents are m ade up to only 32 nodes, it is predictable th a t th e perform ance im provem ents will not degrade as n exceeds 32. The four heuristic load balancing m ethods all perform well as com pared with no load balancing. T he relative m erits of these four m ethods are discussed below. The LR R and LML m ethods are based on the locality and have a short m igration distance am ong im m ediate neighbors. The G R R and GML m ethods are based on global states and m ay experience much longer m igration distance. The G R R and GML have b etter perform ance when the m ean service tim e 5 = 1 second, but thej M ean Response Tim e S tandard Deviation of Response Tim e Figure 4.8 tim e. R 4.0 ★ : LRR o : G R R • : LML o : GML * : NLB 3 .5 " 3 . 0 - 2.5-- 0 '' t > i g 0.0 2.0 — f — 0.4 0.2 0.8 0.6 M ean Service T im e (sec/proc) <r(Ri) 3.0 * : LRR o : G R R * : LML o : GML * : NLB 2.5-- 2 . 0 - 1.5 -- 1.0 - 0 . 5 - 0.0 0.0 0.2 0.6 0.4 0.8 Mean Service Tim e (sec/proc) : M ean and standard deviation of th e response tim e vs. m ean service 95 R Mean Response Time 10. 0 -* 8 . 0 - 6 .0 - 4.0-- 2 . 0 ” 0.0 M ulticom puter Size ★ : LRR o : G R R • ; LML o : GML * : NLB n cr(Ri) 10 .0 " S tandard 8 0 " Deviation of Response qq-- Tim e 4 .0 - 2 . 0 " 0.0 M ulticom puter Size ★ I LRR o : G R R • : LML o : GML ★ * NLB Figure 4.9: M ean and standard deviation of the response tim e vs. m ulticom puter size. 96 LRR and LML m ethods are better, when S is small. Since 0 m = tm does not affect the perform ance too much when S is large; it does, however when S is small. Thus, the system which has long m ean service tim e S can use the global m ethods; while the system which has short jobs should choose the local m ethods. My experim ents show th a t the overhead tc and tm are dependent of the phys ical links between nodes. If the interconnection netw ork of a m ulticom puter is m ultiple-bus or m ultistage network, then different perform ance results will be ob tained. The relative m erits of the proposed m ethods can be expected as th a t th e m ethods based on locality (LR R and LML) are suitable to a system where th e t m is interconnection network dependent or Om is sensitive to tm; while the global m ethods (G R R and GML) are b etter choices when tm is independent to interconnection network or Om is insensitive to £m. The LRR and G R R m ethods use th e round-robin discipline on a candidate list to decide the load transfer destination, while th e LML and GML m ethods determ ine the destination of process m igration by th e m inim um load among the local or global range of nodes. The m inim um load strategy has a slightly b etter perform ance th an the round-robin discipline when th e m ulticom puter size n is small, b u t the round-robin discipline on candidate list m anipulation appears b etter when n is large. Basically, th e load table operation can provide m ore accurate inform ation on the current states b u t the cost to search for a good candidate will become higher when th e size of load table increases. It is suggested the use of round-robin heuristics for large systems and using m inim um -load m ethod for small systems. 97 Chapter 5 DLB: A Dynamic Load Balancer In this chapter, I present a prototype dynam ic load balancer D L B im ple m ented on a 32-node iPCS hypercube m ulticom puter. This load balancer is used to verify th e effectiveness of the new adaptive m ethod proposed in C hapter 4 by benchm ark experim ents. Four heuristic process m igration m ethods are im ple m ented in DLB. The nodal load balancing scheme is m odeled as shown in Fig.5.1, being slightly different from the one shown in Fig.4.3. T he parallelism of pro gram execution is exploited at the process control level. T he process m igration technique used is th e com plete copying via a process control block. I describe the operating system support im plem ented in DLB for parallel program execution, present the structure of DLB, and discuss the im plem entation issues. 5.1 O p era tin g S y ste m S u p p o rt The parallelism in user program s are exploited at the process control level, where each process is considered an atom ic execution unit. A process is repre sented by a Process Control Block (P C B ), which is a d a ta stru ctu re containing all the inform ation needed to execute a process. Each process can be executed at the creating node, or can be m igrated to a rem ote node for execution. T he creation and suspension of a process is controlled by two operating system prim itives: ru n and suspend. In this section, I will describe the process state transition, the PC B , and the use of PCBs, and define the ru n and suspend OS prim itives. 98 Node N{ m ut process Ready Q ueue Decision Maker processor ready process M igration Queue j Figure 5.1: An im plem entation m odel describing the operations of distributed load balancer at each processor node. 5.1.1 P rocess S tate and P rocess C ontrol B lock Each process m ay be in one of the 5 states: new , rea d y, ru n n in g , w a itin g and halted. T he state transitions are im plem ented w ith three queues: ready, suspend and m ig ra tio n queues. Figure 5.2 shows th e state transitions among these queues. The process in the ready queue will be dispatched for execution. W hen starting execution, the sta te of a process is changed from ready to running. A running process can be suspended. W hen suspended, a process enters the waiting state and is p u t into th e suspend queue. A suspended process can be awakened and becomes ready again. Processes in the m igration queue will be transferred to rem ote nodes; and processes m igrated from other nodes will be put into the ready queue. There are two techniques th a t can be used for im plem enting process m igration: complete copying and code m igration. In com plete copying, all copies of user codes have to be allocated to each processor node. T he operating system is identically 99; Migration Q Migration Port Ready Q New Suspend Q Figure 5.2: Process state transitions am ong various queues. distributed to each node. Each process generated in a user program is represented by a Process Control Block (PC B ). It is the PCB to be m igrated rath er the actual program code. In a code m igration approach, different user codes are distributed to disjoint nodes. W hen a process is m igrated, its actual executable code has to be passed to a destination. T he com plete copying has an advantage of lower m igration cost, but requires higher m em ory dem ands. It is suitable for a homogeneous m ulticom puter system . T he code m igration has lower m em ory dem ands b u t higher process m igration cost. There is a tradeoff betw een these two m ethods. For parallel processing w ith a m ulticom puter, it is often to duplicate application code and distribute them to all nodes working on different d ata sets. In this work, I use the com plete copying approach. The structure of a PC B is shown in Fig.5.3 (a), where each field represents the following inform ation: • P I P : A process is identified by its Process ID entification (PID ), denoted by (pid,pnode). T he pid is an integer value of the pid-counter at each node, which is increm ented by 1 when a new process is created at th e pnode. The 100 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- — _______________ I pnode is the processor node w here a process is created. The initial process created at each node Ni has P I D = (0 ,i), w ith pid = 0. Thus a PID is a unique identifier of each process in the entire m ulticom puter system ; i.e. (5,3) and (16,3) are representing two different processes even though they have the same pid: the (5,3) is created at node N 5 while the (16,3) is created at node IV16. • P arent ID : The parent PID identifies the parent process which creates this process. W hen a process is executed, it becomes a current process. Each new process is always created by the current process. • P ort ID : The port ID determ ines th e port where the results of this process should retu rn to its parent. It is an index into the argum ent array of the parent process. W hen a process is halted, the retu rn value along w ith its pnode, parent PID and port ID, is stored in a d a ta structure called Pro- cess-Halting„Result (PH R ) as shown in Fig.5.3 (b). By the sem antics of the OS prim itives defined below, each process is created at a node where its parent process is suspended. T hus a PH R is always returned to its pnode by message passing. At the pnode, th e parent process in the suspend queue is identified by the parent PID . T he actual value of P H R is returned to the appropriate port of the parent process. • S tatu s: The status defines the state of a process in one of the three queues. The value can be either ready or w aiting. • N um ber of A rgum ents: The num ber of argum ents defines how m any argu m ents of d ata are required to run this process. • A rgum ent Counter: The argum ent counter of a PC B counts the num ber of returns required from the child processes. A process is ready when its argum ent counter indicates all the required argum ents are available. 101 PCB PED (pid,pnode) Parent PID ________Port ID_________ Status Executable Code Address Number of Arguments Argument Counter Argument Array PHR Pnode Parent PID Port ID Value (a) H ie structure of a PCB < * > ) T*1 ' structure of a PHR Figure 5.3: T he structure of a Process C ontrol Block (PC B ) and th e Process H alting R esult (PH R ). • A rgum ent A rray: The argum ent array stores the actual d ata argum ents required by a process. In our process m igration scheme, th e required d ata set is m igrated w ithin the PC B to the destination node. 5.1.2 O perating S ystem P rim itives Run and suspend OS prim itives were originally suggested by Chowkwanyun [21] in order to exploit parallelism in Lisp program execution on m ulticom puters. The run creates a new process, and the suspend suspends an existing process. In [21], the d ata set for th e run and suspend is as simple as an integer. I use a defined d ata type D S to allow variable d a ta types to be associated w ith a process. The D S has the following fields as shown in F ig.5.4: • num .arg: num ber of argum ents required to run the process; DS (type,length, value) arg2 argl arg array num_arg Figure 5.4: The structure of Data Set (DS) • arg.avail; num ber of argum ents available; • arg_array: an argum ent array w ith each entry as a triple (type,length,value), where type specifies the d a ta type of the argum ent, length is the length of th e argum ent in term of bytes, and value is the actual value of the argum ent data. The reason for using this D S structure, is due to th e im plem entation.on top of the N X /2. If th e run and suspend are im plem ented w ithin the real OS, this overhead can be avoided. W hen a process is dispatched from the ready queue to run, it becomes a current process. By inserting ru n and suspend into a user program , the running current process can create new child processes th en be suspended. A process is always created at a node where its parent process is suspended. The sem antics of the run and suspend prim itives are show in F ig.5.5. I describe the operations of these two OS prim itives below: • T he run Prim itive T he run prim itive creates a new process by assigning it a PC B in the ready or m igration queue as shown in Fig.5.5 (a). T he current process is the parent process of this newly created process. The portSd identifies th e retu rn port 103 run(func, port_id, data) Create a new PCB and put it in the ready queue pid_count, node PID of current process port_id -------------------- ready ---------------------- func ----------------------- data PID (pid,pnode) Parent PID Port ID Status Executable Code Address Number of Arguments Argument Counter Argument Array (a) Semantics of run suspend(func,data) Modify the current PCB and put it in the suspend queue waiting func - data PID (pid,pnode) Parent PID Port ID Status Executable Code Address Number of Arguments Argument Counter Argument Array (b) Semantics of suspend Figure 5.5: T he Semantics of the ru n and suspend OS directives. at fhe parent process. The data specifies th e argum ents to run w ith this process. The actual argum ents are passed w ithin the data. T he status of the PCB is set to ready, which implies th a t the num ber of argum ents available m ust equal to the num ber of argum ents required for th e process. N ote, the run prim itive does not execute th e process im m ediately b u t only creates a ready process. Therefore, no retu rn value is expected from the use of run. There is a dispatcher in the kernel, which dequeues the processes from the ready queue and puts them for parallel execution. T he syntax of th e run prim itive is defined as follows: void run(func.port_id.data) int (* ftmc)(); /* pointer to the code address */ int port_id; /* port where result returns to */ DS data; /* structure contains the data set */ • T he suspend Prim itive T he suspend prim itive stops tem porarily the current process by changing it to a w aiting state. T he semantics of it is shown in Fig.5.5 (b). A pro cess needs to be suspended, when it has to w ait for results from its child processes. The fu n c points to the code address from where th e execution will be resum ed. The num ber of argum ents required to enable this process and th e num ber of argum ents available are specified are specified in the data field. In the case of suspend, the num ber of available argum ents is less th an the num ber of the argum ents actually required. This implies th a t the purpose to use the suspend prim itive is to allow the creation and parallel execution of child processes. The argum ents unavailable at the tim e of using a suspend, are expected to be returned from child processes in the future. T he syntax of th e suspend prim itive is defined as: void suspend(func,port _ id,dat a) int (* func)(); /* pointer to the code address */ DS data; /* structure contains the data set */ 105 In Fig.5.6 , an exam ple shows a process creation tree at node N 0 in executing a fibonacci function call fib(4). W hen the fib(4) program is invoked at node N 0, the kernel starts its execution by calling ru n(fib,0,l,4), where “1,4” simplifies the value of the structure data w ith “ 1” indicates th e num ber of argum ents and “4” as the actual argum ent data. In Fig.5.6 , the sequential and the parallelized program s of fib() are shown. T he process creation trees illustrates the control flow. The actual im plem entation is represented by a set of PCB s shown. T he invocation of run(fib,0,l,4) creates the first process p(0,0) w ith pid = 0 and pnode = 0, and puts it to run. Since 4 is greater th an 2, process p(0,0) is suspended, w ith the next execution function as plus(), port ID as 0, and w aiting for 2 d a ta argum ents. Two child processes p (l,0 ) and p(2,0) are created. W hen these two child processes halt, th e results will become the new argum ents of th e next “plusQ ” execution of p(Q,0). Process p(0,0) will be ready when these two argum ents are available. The sam e situation happened to p (l,0 ). T he pair (pid,pnode) is used to identify the process during th e m igration process. 5.1.3 C ontrol Level P arallelism in P rocess M igration By inserting the run and suspend prim itives in to a user program , each m ajor function call can be treated as a process. W hen a recursive call occurs, th e current execution of a process can be suspended and the new function call will create a new child process. This was illustrated by the exam ple in Fig.5.6 . W ithout process m igration, each process is created, executed and suspended at the local processor node. In a m ulticom puter system , parallelism can only be achieved by static allocation which distributes com putation to each proces sor node. If the run tim e conditions are unpredictable, then the operating sys tem has no control to balance the workload and maximize the parallelism . In Fig.5.7(a), th e process creation trees in a 4-node m ulticom puter are shown. The corresponding ready and suspend queues w ithout applying process m igration are 106 Process Creation Tree p(0,0) Sequential Program fib(x) int x; { if ( x <= 2 ) elseetUm <'X)’ pd,0) fib(3) J ? ]us(2,; p retum(fib(x-l) + fib(x-2)); plus (3,2) } fib(2) p(2,0) p(3,0) fib(2) fib(l) p(4,0) Parallel Program int fib(x) int x; |{ int int int int suspend(); run(); plus(); fib(); Process Control Blocks P(0,0) if ( x <= 2 ) retum(x); else { suspend (plus,0,2); ran(fib,0,l,x-l); run (fib,l,l,x-2); retum(NULL); } (0,0) (0,0) 0 fib suspended ♦ (0,0) (0,0) 0 2 plus P(1,0) (1,0) (0,0) 0 1 fib 3 suspended (1,0) / —s o o 'w' 0 2 plus | p(2,0) (2,0) (0,0) 1 fib P(3,0) Process Invocation (3,0) (1,0) 0 1 fib 2 run(fib,0,l,4); P(4,0) (4,0) (1,0) 1 1 fib 1 Figure 5.6: T he process creation tree and th e PC B s used in the invocation of the fibonacci function fib (4), at node 0 of a 32-node m ulticom puter system. 10 7 i N l N 2 N l d > (a) Process creation trees in a 4-node multicomputer. Illllll K I I I H I I H CPU CPU mu * linn CPU CPU (b) The ready and suspend queues without process migration. M i l l CPU N 2 CPU CPU (c) The ready and suspend queues with process migration. Figure 5.7: Exploiting parallelism by process m igration. shown Fig.5.7 (b). There are 12 processes w aiting for execution at node Ny and only 1 processes in th e ready queue at node N 0. Assume th at each process has the sam e am ount of execution tim e. Clearly, the to tal execution is bounded by the tim e at node N i. Therefore, parallelism cannot be maximized. By applying the process m igration technique, processes can be m igrated to a rem ote node for execution. If a node has created m ost processes in th e ready queue w aiting for execution, these processes can be spread to rem ote nodes which are idle for execution. For the same case in F ig/5.7 (a), the corresponding ready and suspend queues w ith applying process m igration techniques are shown Fig.5.7 (c). By applying dynam ic load balancing, the ready queues at 4 nodes will have an alm ost equal num ber of processes allowing all nodes to be busy during th e entire execution tim e. Note th a t the processes at th e suspend queue can be awakened by th e retu rn results obtained from th e local or rem ote execution of child processes. Details of m axim izing parallel will be discussed in Section 6 when benchm ark experim ents are presented. 5.2 Im p le m e n ta tio n o f th e L oad B alan cer To show the effectiveness of the proposed heuristic load balancing m ethods, I have im plem ented a prototype dynam ic load balancer on a 32-node iP S C /2 hypercube system . I describe the construction of the load balancer and discuss the message passing m echanisms used in th e im plem entation. 5.2.1 T he Softw are C onstruction T he load balancer is w ritten in C using iP S C /2 message passing library calls. It consists of a host program and identical distributed nodal program s. T he host program is executed at the host processor with Unix. The nodal program s are executed at i368 node processor and built on top of the N X /2 operating system . 109 To distributed nodes Window adjuster Load information updater Reporter Loader Asynchronous Mailman Figure 5.8: T he supervision kernel in the host processor. The host program consists of a loader, a window adjuster, a load information updater, an asynchronous mailman and a reporter shown in Fig.5.8. The loader loads in th e load balancers and distributes partitioned user program s into local nodes. T he time window adjuster updates the tim e window w ith inputs from the load inform ation updater. The load information updater periodically collects system load distribution from all nodes, severing the host processor. The reporter displays the necessary messages during the program execution and reports the user program execution result and the perform ance data. W hen all expected results of a user program are received from distributed nodes, the reporter sends a stop message to each node. T he kernel at each node will then stop the execution. All message passing to /fro m distributed nodes are coordinated by an asynchronous m ailm an in the load balancer. The control driven of th e host program is shown in the following pseudo code, where all param eters of procedure calls are not specified: h o stO - c in p u t( ) ; /* in p u t lo ad balancing param eters */ in it_ h o s t( ) ; /* i n i t i a l i z e th e h o st p ro cesso r */ 110 load_node(); /* load nodal program to nodes */ send_parm(); /* broadcast load balancing parameters */ while ( ! done ) f /* not all nodes finished the execution */ if ( time_window() ) ask_load(); /* collect load indices from nodes */ if ( msg_come() ) { /* is any message coming ? */ receive_msg(); /* receive the message */ check_msg(); /* check message type and take action * / y y /* end while */ done_signal(); /* broadcast shutdown message */ while ( i finish ) - [ /* not received all final reports */ if ( final_msg_come() ) /* is any final report coming ? */ receive_final_msg(); /* receive final report */ } /* end while */ final_result() ; /* report final result */ shutdownO ; /* kill node process * / The checkjmsg() routine actually takes various actions upon receiving differ ent type of messages. A fter receiving load indices from all nodes, th e action of broadcasting system load distribution will be taken; after receiving th e com puta tion result from th e last node, the flag done will be set; etc. T he asynchronous control can avoid deadlocks and any process blocking. The distributed load balancers are identical among all nodes. Figure 5.9 shows th e construction of th e load balancer and the process flow under its control. The kernel is the control m odule of the load balancer. It initiates th e execution of a user program by creating an initial process and assigning a processor to it. The kernel then repeatedly calls the dispatcher, the load transfer and the mailman, until receiving a stop message from the host. The mailman sends and receives messages asynchronously. A message can be a request for local load index U from the host, th e local load index h reporting to the host, the system load distribution L t broadcast from th e host, a m igrated processes fro m /to the rem ote nodes, a retu rn result of a halted process from /to rem ote nodes, and display or debug messages to th e host. The execution of a process may end up w ith one of th e following cases: 1) being halted w ith a retu rn result, 2) being suspended, 3) creating some new child 111 T o h o st o r o th e r n o d e s Kernel Suspend Queue Decision maker Ready Queue Migration Ou me -H lllllil 1 1 H Load Jnfor Updater Figure 5.9: The load balancer program at each node processor. processes and being suspended. A newly created or an awakened process is in the ready state. The decision maker uses updated threshold to decide w hether a ready process should be p u t into the ready queue or m igration queue. A process will be dispatched by the dispatcher and get the processor for its execution. The processor scheduling discipline is im plem ented by th e dispatcher. T he threshold updater updates th e threshold Si. The load inform ation u p d ater calculates th e local load index by checking the ready queue, sends it to th e host through the m ailm an and receives the system load distribution L from th e host. T he m igration destination is decided by a load transfer. It im plem ents the load transfer policy and handles the candidate list and th e load table. T he h alt ing result of a process execution is sent to an awaker. T he result can be directly obtained from local process execution or through m ailm an from th e process ex ecution at the rem ote node. The awaker finds the p aren t process identified by the parent PID of the result, and puts the return value to the appropriate p ort. 112 W hen th e argum ent counter of a suspended process researches the required value, it is changed to a ready state. The control of the nodal load balancer is describe by the following pseudo code, where the param eters are not specified: nodeO ■ c init_node(); /* initialize the node program */ make_run(); /* create the initial process */ while ( ! done ) { /* not receiving shutdown msg from host */ get_msg(); /* asyn receive msg and take action */ executeO; /* dispatch a process and execute it */ if (LB != NLB ) /* if load balancing applied */ trans(); /* transfer load if necessary */ > The getjm sgQ is a p art of th e asynchronous m ailm an. It tests w hether there is a message arriving and receives the message in case there is one. T he executeQ belongs to the dispatcher, which dispatches a process in front of the ready queue and p u t it to run. T he tra n s () is in th e load transfer m odule, it m igrates a process if the m igration queue is not em pty. 5.2.2 A synchronous M essage P assin g Asynchronous message passing m echanism is used in the im plem entation of the load balancer. I have developed a generic message p attern which can easily be j passed among nodes. A generic m essage has its own type defined as G M SG -TY PE. It is a d ata structure consisting of a type, a header and a body. The type is a flag to identify the kind of actual message. T he header specifies the source, destination and the length of the message body. T he body can be a PC B , a com putational result, a system load distribution L, a node load index U, etc; each is represented by a dedicated d a ta structure. T he iP S C /2 iprobeQ is invoked within th e control loops at host and node load balancers. It checks to see if there are any generic messages arrivals. Upon th e arrival of a message, the iP S C /2 crecvQ is called to 113 receive the message. By checking the message type, appropriate actions will be taken to respond to th e message. The code of th e getjm sgQ and check jm sg () are listed below, where gjm sg is a pointer to the generic message gmsg. void get_msg() ■ C if ( iprobe(GMSG_TYPE) ) { crecv(GMSG_TYPE,&gmsg,sizeof(gmsg)); check_msg(g_msg); > void check_msg(g_msg) GMSG g_msg; f switch ( g_msg->type ) { case AMSG_TYPE: /* ask load index */ send_load(); break; case LMSG_TYPE: /* load distribution L */ if (LB != NLB ) set_threshold(g_msg); break; case DQNE_TYPE: /* shutdown */ kill_kernel(); break; case MMSG_TYPE: /* process migrated in */ recv_pcb(g_msg); break; case RMSG_TYPE: /* return result from remote node */ recv_result(g_msg); def ault: break; > > Using this asynchronous mechanism, the distributed load balancing at each node acts independently. Each node works on its own w ithout w aiting for re sponses from other nodes. Before a new message comes, the inform ation obtained from the m ost recently received message affects the decision m aking and m igration actions to be taken. 114 Chapter 6 Performance Results of Benchmark Experiments In addition to the parallel sim ulation experim ents, th e perform ance of the dynam ic load balancing scheme w ith four heuristic process m igration m ethods proposed in C hapter 4 was also evaluated by parallel execution of selected bench m ark program s on a 32-node iP S C /2 hypercube system . In this section, I de scribe the com putational and com m unication characteristics of those benchm ark program s used, report th e software experim ents perform ed, and th en analyze the perform ance results th a t have been obtained. 6.1 B en ch m a rk P ro g ra m s T he benchm ark program s chosen have the following com m on characteristics: 1 ) strong data dependency on the argument set used and 2 ) unpredictable program behavior at run-time. These are C program s w ith run and suspend OS prim itives inserted for explicit control of parallel activities. I have im plem ented the ta k , fibonacci and quicksort program s. The program s tak and fibonacci are recursive com putation functions, where tak takes three argum ents and fibonacci uses only one argum ent. Q uicksort is a recursive sorting program on a one-dim ensional d ata array. T he execution tim e of these program s is totally d a ta dependent, and their behaviors are not predictable at compile tim e. The process trees created during the execution of these program s are usually unbalanced. This phenom enon can be 115 seen by exam ining the following sam ple program s w ith different argum ent values. • tak £afc(18,16,9) needs 11842 function calls. tafc(l8,16,15) needs 7 function calls. • fibonacci fibonacci{20) needs 13529 function calls. fibonacci(3) needs 3 function calls. • quicksort quicksort{2000) sorts a 2000-element array. It needs 400-600 calls, depend ing on the distribution of d ata set. quicksort( 1 0 ) sorts a 1 0 -elem ent array and needs 2 calls. These benchm ark program s are parallelized by inserting the run and suspend OS prim itives into th e sequential program s. Therefore, th e process creation trees can be created as the exam ple shown in Fig.5.6 . The sequential and parallelized benchm ark program s are presented below, where the d ata argum ents are defined explicitly for the ru n and suspend for easy understanding: 6 .1 . 1 T h e T a k P r o g ra m T a k {) is a function-call-heavy program which has short code b u t requires ex tensive com putation for some argum ent data. The tak() program takes three argum ents: x, y and z. If y < x, then z is returned, otherwise recursive calls will occur. By inserting the run and suspend prim itives, the original call can be suspended, and three new child processes can be generated. These three child processes m a3r be executed at rem ote nodes in parallel, depending on th e system load distribution. A fter all three child processes halt and the results are returned 116 to the appropriate p o rt, the suspended process is resum ed. The process creation tree of th e ta k {) program is usually bushy and unbalanced. Sequential Program i n t ta k ( x ,y ,z ) i n t x , y , z ; ■ c i f (y >= x ) r e tu x n ( z ) ; e ls e r e t u r n ( t a k ( t a k ( x - l,y ,z ) , t a k ( y - l , z ,x ) , t a k ( z - l , x ,y ) ) ; } Parallel Program int tak(x,y,z) int x,y,z; if (y >= x ) return(z); else { suspend(tak,3,0); run(tak,0,3,x-l,y,z); runftak,l,3,y-i,z,x); run(tak,2,3,z-l,x,y); return(NONE); > > 6 .1 .2 T h e F ib o n a c c i P r o g ra m The fibonacci program generates m any recursive calls and creates large process trees for some argum ent data. FibQ takes only 1 argum ent x. If x < 2 , then th e x itself is returned, otherwise th e addition of fib (x — 1) and fib (x — 2) is returned. In the parallel program , the first call is suspended, and two child process are created for the later case. FibQ differs horn takQ in th a t an addition operation needs to be perform ed when the results from the two child processes are returned, instead /* need 3 arguments, 0 available */ /* port_id = 0, 3 arguments */ /* port_id = 1, 3 arguments */ / * port_id = 2, 3 arguments */ /* dummy return */ 117 of perform ing fib ( ) itself. Therefore the plusQ will be the new place to start the execution. Sequential Program in t fib (x ) in t x ; i f ( x <= 2 ) r e tu r n ( x ) ; e ls e r e tu r n ( f ib ( x - l) + f ib ( x - 2 ) ); > Parallel Program in t fib (x ) in t x; ■ c in t p lu s O ; /* a d d itio n of two valu es */ in t f i b ( ) ; i f ( x <= 2 ) retu x n C x ); e ls e { su sp en d (p lu s, 2 , 0 ); r u n ( f i b , 0 , 2 , x - l ) ; r u n ( f i b , 1 , 2 ,x - 2 ) ; return(NONE); > > 6.1.3 T he Q uicksort P rogram In order to exploit parallelism , the d ata set which needs to be sorted is taken as one of th e argum ents of the parallel qksortQ program . A standard partition using th e “m ean-of-three” technique [69] is perform ed to split th e d ata set into two sub-sets and obtain a pivot. The current call is suspended after the partition, and two child processes are created to work on two different sub-sets. W hen these two child processes h alt, the suspended process is resum ed and the m erging of /* need 2 argum ents, 0 a v a ila b le */ /* p o rt_ id = 0 , 2 arguments */ /* p o rt_ id = 1 ; 2 arguments */ /* dummy re tu rn */ 118 these two sub-sets is perform ed. Being different from th e sequential program , the mergeQ is needed because two child processes m ay be executed on different nodes at different tim es. To exploit parallelism , the d a ta set needs to be stored in the PC B so th a t it can be m igrated along w ith th e process for rem ote execution. T he results containing the d ata array are returned to th e node where the parent process is suspended. T he TH RESH OLD used for th e insertion sort is set to 8 . T he d ata set is random ly generated in a uniform distribution. Sequential Program void qksort(l,r,x) int 1; /* left index of the array */ int r; /* right index of the array */ int x[]; /* array to be sorted */ - C int j, partitionO; /* partition */ void insortO; /* insertion sort */ if ( 1 < r ) - C if ( r - 1 < THRESHOLD ) /* insertion sort for small length */ insort(lfr,x); else { j = partition(l,r,x); /* j is the pivot */ qksort(1,j,x); /* sort first half * / qksort(j+1,r,x); /* sort second half */ > Parallel Program int *qksort(l,r,x) int 1; /* left index of the array */ int r; /* right index of the array */ int xt]; /* array to be sorted */ int j, partitionO ; /* partition */ void insort(); /* insertion sort * / int k; i f ( 1 < r ) i 119 if ( r - 1 < THRESHOLD ) - C /* insertion for small size */ insort(l,r,x); return(x); > else { j = partition(l,r,x); /* j is the split point */ suspend(merge,6,4.1.j,r,x,0,0); /* 6 args only 4 available */ run(qksort,4,3,l,j,x); /* port_id = 4, 3 arguments */ run(qksort ,5,3 , j + 1 ,r ,x) ; /* port_id = 5, 3 arguments */ return(&NONE); /* dummy return for suspend * / y y else return(x); int ♦merged, j ,r,z,x,y) int 1,j,r; int z [] , x □ , y [] ; { void copyO ; /* put two sub-array together */ /♦ j is the joint point */ copyCl,j,x,z); /* copy 1 to j elements from x to z */ copy(j+1,r,y,z); /* copy (j+1) to r elements from y to z */ retuxn(z); /* z is sorted */ 6.2 E x p e r im e n ts P erfo rm ed Based on the characteristics of benchm ark program s, I have constructed two types of experim ents. T he first type is used for executing th e ta k {) and fib ( ) program s. T he second type of experim ents are used for executing th e qksortQ program . These experim ents are explained in th e following: 6 .2 .1 E x p e r im e n ts o n th e T a k a n d F ib o n a c c i P r o g ra m s T he experim ents let both the tak() and fib {) program s calculate th e sum m a tion of a set of functions. T he execution of each function call or each search call is initially distributed to one node in th e m ulticom puter. After each node finishes its com putation, th e result is reported to the host. The host has to receive results 120 from all nodes, before sum m ing them up a t the very end. I intentionally chose special d ata argum ents which cause very unbalanced process creations at various nodes. Two cases of experim ents were carried out. Let n be th e num ber proces sor nodes used in th e m ulticom puter system . My experim ents cover m achine size ranging from 2 to 32 processor nodes Case 1: Extremely Unbalanced • Tak: c a lc u la te d = 2 ” =o2t ak (1 8 ,1 6 ,1 5 )+ ta k (1 8 ,16,9) — Execute tafc(18,16,9) at node iVl 5 ( 11842 calls ) — Execute tafe(18,16,15) node IV,, i ^ 1, ( 7 calls ) • Fibonacci: calculate X = ]C£ro 2 fib(3)-|-fib(20) — Execute fib(20) at node iVj, ( 13529 calls ) — Execute fib{3) at node Ni, i ^ 1 , ( 3 calls ) Case 2: Random Load Distribution • Tak: calculate X = Yfi=o tak( 18,16, j ) , 9 < j < 15 Execute tafc(18,16j) at each node Ni, such th a t j — random (9,15) • Fibonacci calculate X = fib (j), 1 < j < 20 Execute fib(j) at each node IV,, such th a t j = random( 1,20) In Case 1 , special argum ents were chosen to yield extrem ely unbalanced load distributions. Case 2 uses some random ly generated load distributions. W ithout load balancing, the to tal execution tim e required is dom inated by the node cre ating the m axim um num ber of processes. Let Ti be the com pute tim e of at each node N{. Let Sn be the speedup factor of using n processor nodes. Sn is defined as: p Y ^ n — 1 r p n _ _ 1 0 * Pn m a x (T i) 121 where Px is the tim e to execute the program on one processor node and Pn is th e tim e to execute th e same program using n processor nodes. Since the atom ic processes created in these benchm ark program execution are homogeneous w ith th e same executable code and the sam e m em ory dem and, th e execution tim e e(p) and the m em ory dem and m(p) of each process p will be th e same. T hus Ti can be estim ated as Ti = e(p )x (th e num ber of processes executed at N i). First, I analyze the speedup factor for these two cases w ithout any load balancing. For the experim ents in Case 1, node N x will create a large num ber of processes, while the other nodes will create only a few. After all nodes other th a n N x have their com putational results to th e host, th e host has to w ait until N x finishes its execution before sum m ing up th e results. For the exam ple of fib ( ) program , obviously Tx T ^ i, since T x = 13529 - e{p) and T * = 3 • e(p). Thus = Ti + Tl = ( n - l ) - 3 - p ( e ) + 13529-p(e) = (n - 1) • 3 + 13529 ^ Tx 13529 -p(e) 13529 ( 6. 2) This indicates the fact th a t no appreciable speedup can be achieved for Case 1 experim ents if load balancing is not applied. For those experim ents in Case 2 , the speedup Sn depends on th e random ness of d a ta argum ents generated. Since the num ber of processes created at each node Ni is dependent on the argum ent value, th e com puting tim e Ti also depends on th e random ly generated d a ta argum ent. Therefore Sn = be some value proportional to n. 6.2.2 E xp erim en ts on th e Q uicksort P rogram The experim ents for th e qksortQ program could also be constructed in the same way as for th e ta k( ) and fibQ . T h a t is to assign each node to sort on a different size of random ly generated d a ta array. However, lettin g all nodes in a m ulticom puter work together on sorting one d ata array appears to be more interesting. I let th e initial qksortQ call sta rt at node N 0. W ithout load balancing,1 122 ------------------------------------------------------- _ J the entire execution can only be executed at node N 0 and no speedup will be obtained. W ith dynam ic load balancing, the processes in the unbalanced tree can be spread to all th e nodes for execution. Therefore, the parallelism can be exploited at the process control level. T he experim ents provide a new way to do parallel quick sorting. It intends to use dynam ic load balancing for parallel processing, rath er than balancing th e load based on a static allocation, which is like the experim ents on th e ta k() and fibQ program s. For these two types of th e experim ents, I used the following load balancing and perform ance assum ptions. I assum e a unit tim e for executing each atom ic process. The load index at node Ni is thus estim ated as li — |i2,|, the cardinality of the set Ri. The initial tim e window Wo is set to be 2 sec before the first update, and th e constants used are: kx = 0 .0 0 1 , k2 = 0 . 1 and a — 0 .1 . 6.3 B en ch m a rk P er fo rm a n ce R e su lts I evaluated th e perform ance of th e heuristic load balancing m ethods by using th e following m easurem ents: • Speedup T he speedup factor is calculated as «n = j 1, using the actual values of Px and Pn m easured from parallel program executions. The speedup factor m easures the actual perform ance im provem ent achieved. • Efficiency The efficiency is defined by th e ratio It m easures more the effectiveness of the parallel program execution rath er th an th e load balancing. • Total Load Distribution 123 The total load distribution is used, to m easure the effectiveness of load bal ancing. It is defined as L = { # P i | 0 < i < n}, where # P i is th e actual num ber of processes executed at node Ni. The four heuristic m ethods are com pared w ith the NLB (No Load Balancing) and th e GRD (G R aD ient) m ethods, in order to show their relative m erits. By NLB, I m ean th a t processes created at each node m ust be executed locally w ithout process m igration. The GRD m ethod is based on sender-initiation w ith a fixed threshold as originally proposed by Lin and Keller [77]. This m ethod requires frequent load index exchange among com m unicating nodes. T he m ethod forms a gradient in which the loads are always transferred from busier nodes to idle ones. However th e GRD m ethod is ineffective when the system is heavily loaded, since th e fixed threshold will continue swapping th e load practically w ithout stopping. 6.3.1 Speedup Perform ance • Tak and Fibinacci programs The results from Case 1 experim ents are shown in P arts (a) of Figs. 6.1- 6.2. W ithout load balancing, there is no speedup (Sn = 1). As analyzed before, this is due to th e fact th a t the to tal execution tim e is determ ined by th e execution tim e at node N i. The speedup obtained from using four heuristic m ethods increases sublinearly as the m ulticom puter increases its size. Com pared to th e GRD m ethod, my heuristic m ethods are superior. T he GRD m ethod has linear speedup, when the system has fewer th an four processor nodes. W hen the system size becomes larger, its perform ance is worse th an any of th e heuristic m ethods used. T he results from the Case 2 experim ents are shown in P arts (b) of Figs. 6.1 and 6.2. Because Case 2 is different from Case 1 , m oderate speedups are obtained w ithout load balancing. This is due to th e random distribution of1 28.0-- 2 4 .0 - Speedup 20.0 1 6 .0 - 12.0 - 4.00-- 4 . 8 16. 32 M ulticom puter size (a) Case 1 E xperim ents * : LRR o : G RR * : LML O : GML * : GRD O : NLB 2 8 .0 - Speedup 20.0 16.0-- 12. 0 -- 4 8 16 32 M ulticom puter size (b) Case 2 Experim ents * : LRR O : G R R * : LML © : GML * : GRD > : NLB + — *- n Figure 6.1: Speedup obtained from balanced execution of the ta k program . 125 32.0 28.0 -- 24.0 -- 20.0 - - 16.0 -■ 12.0 -- 8.00 -- 4 .0 0 - 1.0 0 * M ulticom puter size ★ I LRR o : GRR • : LML o : GML * t GRD > : NLB n (a) Case 1 Experim ents 32.0 28.0-- 24.0 20.0 -- 16.0 12.0 ” 8. 0 0 -- 4.00-- 1.00 M ulticom puter size * : LRR o : GRR * : LML o: GML * : GRD t> : NLB n (b) Case 2 Experim ents Figure 6.2: Speedup obtained from balanced execution of the fibonacci program 126, the d a ta sets. T he to tal com pute tim e is still determ ined hy th e node which executes the m ost processes. For exam ple, if iafc(l8,16,9) is executed at node JV 4 and tak( 18,16,10) is executed at node iV0, then even if node iV4 has the m ost processes executed, the com pute tim e at node No is not eligible. The speedup is im proved significantly when dynam ic load balancing is applied. T he im provem ent is obviously obtained by reducing the com pute tim e at heavily loaded nodes via the heuristic process m igration. • Quicksort program T he results from parallel executing the qksortQ program are shown in Fig.6.3. Being very different from the results shown in Fig.6.1 and F ig.6.2, the speedup curves grow faster when th e m ulticom puter size is small and ten d to be flat when they reach a certain point. There is a bound for the m axim um num ber of processor nodes for parallel quick sorting [29]. M y ex perim ents verify this fact and agree w ith the analytical results reported in [29], even though their results were based on a shared-m em ory m ultiproces sor. T he bound of the m axim um num ber of processor nodes in th e iP S C /2 hypercube m ulticom puter environm ent is observed as 16. T he results also show th a t the speedup obtained from sorting a 4k-size d a ta is higher th a t obtained from sorting a 2k-size data. This also agrees w ith th e observations in th e previous work on parallel sorting [8 8 ]. Due to th e lim itation of m em ory space at each node processor, experim ents are not able to be perform ed on th e larger size of data. However, it is predictable th a t th e bound of m ax im um num ber of processor nodes will still exist and the curve will becom e flat or even dim inish a little and then becom e flat. 6.3.2 E fficiency A nalysis • Tak and Fibonacci Programs 127| Speedup Speedup 6.0 -- 5.0 -- 3.0 -- 2. 0 0 -- 1.00 0.00 M ulticom puter size ★ : LR R o : G R R • : LML o : GML * : GRD j > ; NLB (a) D ata Set Size = 2k 7.0 -- 6.0 - - 5.0 -- 3 .0 - 2 .0 0 - 1.00 0.00 M ulticom puter size ★ ; LRR o : G R R • i LML o : GML * : GRD > : NLB *- n (b) D ata Set Size = 4k Figure 6.3: Speedup obtained from balanced execution of the quicksort program (random ly generated d a ta set). The efficiency of m y load balancing m ethods varies from 60% to 100% from executing th e tak and fibQ program , as shown in Fig.6.4 and Fig.6.5. They are derived from th e speedup curves associated w ith executing the ta k() and fibQ program s. T he efficiency decreases w ith occasional fluctua tions, as th e m ulticom puter size increases. However, my m ethods m aintain a 60% lower bound. The NLB m ethod shows a very low efficiency as expected. • Quicksort Program As derived from the speedup evaluation, th e efficiency of parallel execution on the qksortQ program is not very great, as shown in Fig 6 .6 . Sorting is always a very interesting non-num erical problem . However, th e synchroniza tion bottleneck is critical in sorting. As observed from th e parallel qksortQ ' program , the m erge() has to be perform ed after two sub-sets are com pletely sorted. O n th e other, th e d ata set has to be m igrated w ith th e P C B , which causes a higher message passing cost. Therefore, the lower efficiency on a large size m ulticom puter is really not an unexpected result. As a m atter of fact, th e efficiency is even b e tte r th an the sim ulation results for parallel, quick sorting on a shared-m em ory m ulticom puter [8 8 ]. .3 L o a d D is tr ib u tio n E v a lu a tio n • The Tak and Fibonacci Program s The load distribution in distributed nodes is shown in Table 6.1 and 6.2 for the balanced execution of th e benchm ark program s. T he entries of this table are based on experim ents in Case 1, using the LR R m ethod. The load index is indicated by the num ber of processes actually executed at each node. Let x, a and g be the mean value, the variance and th e standard deviation of th e load distribution to all nodes respectively. Since I use the , § J L n Efficiency 100 80-- 60- 40- 2 0 - M ulticom puter size GML n (a) Case 1 Experim ents Efficiency 100 8 0 - 6 0 - 4 0 - 2 0 - M ulticom puter size ~ k : LR R o : G R R • • LML o : GML * i GRD > : NLB n (b) Case 2 Experim ents Figure 6.4: Efficiency obtained from balanced execution of th e ta k program . 130 §.u n Efficiency Efficiency 100 8 0 - 6 0 - 4 0 - 2 0 - M ulticom ptiter size I LRR o : G R R • : LML o : GML * : GRD o : NLB n (a) Case 1 Experim ents 100 8 0 - 6 0 - 4 0 - 2 0 - M ulticom puter size : LR R o : G R R • : LML © : GML * : GRD ! > : NLB n (b) Case 2 Experim ents Figure 6.5: Efficiency obtained from balanced execution of th e fib o n a cci program . 131 Efficiency 100 8 0 - 6 0 - 4 0 - 2 0 - M ulticom puter size * : LRR o : G R R . : LML o : GML * : GRD > : NLB n (a) D ata Set Size = 2k S a . n Efficiency 100 8 0 - 6 0 - 4 0 - 2 0 - M ulticom puter size * : LRR o : G R R • : LML o : GML * ; GRD > : NLB n (b) D ata Set Size = 4k Figure 6 .6 : Efficiency obtained from balanced execution of th e Q uicksort program (random ly generated d ata set). 132 num ber of processes resident in the ready queue as th e load index, the system load distribution is relatively balanced which is dem onstrated by the table entries. Note th a t if only one processor node is used, all spaw ned processes have to be executed at th a t node. Therefore the sum m ation of all processes executed at all nodes divided by th e num ber of processes executed at the busiest node, gives a first-order approxim ation of th e expected speedup. However, the actual speedups m easured by the execution tim es are higher than those guessed values. This is due to th e FC FS processor scheduling used, where th e queue m anipulation takes a longer tim e to com plete for a longer queue length. I view this as only a negligible im plem entation issue. T he results I obtained verify th a t these heuristic dynam ic load balancing m ethods do perform b etter w ith low im plem entation overhead. • The Quicksort Program As shown in Table 6.3, th e load distribution for a balanced parallel execution j of qksortQ program is much b etter th a n th e speedup and efficiency. The results show th a t parallelism can be achieved at th e process control level by the m ean of dynam ic load balancing. T he result are obtained from sorting a 2k-size random ly generated d a ta set. Based on my experim ents, as the d ata set size increase, the load distribution will becom e m ore balanced. How ever, the speedup will still be bounded. This result agrees w ith a common understanding in parallel processing, th a t is, “parallelism ^ speedup”. 6 .3 .4 C o m p a ris o n o f H e u ris tic M e th o d s No drastic differences exist in the perform ance am ong the four heuristic m eth ods. Two explanations are given: 1) T he system size 71=32 used in th e experim ents is too small to m ake a difference. 2) T he hypercube netw ork is restricted only to point-to-point com m unications. As far as locality is concerned, the local m igration i I Table 6 . 1 : T he Process Execution D istribution For P rogram T a k Q In Case 1 I Experim ents. ! System Size 1 2 4 8 16 32 X 11842 5924.5 2965.7 1486.4 754.2 376.8 c r 0 263.5 110.3 73.7 139.2 114.1 V 0 16.2 10.5 8.5 1 1 . 8 10.7 No 5661 3094 1420 805 463 iVi 11842 6188 3055 1576 1054 6 8 8 n 2 2875 1513 814 382 N z 2839 1546 904 583 N* 1384 727 388 n 5 1480 643 523 No 1396 628 361 n 7 1576 634 367 No 865 376 n 9 883 496 N 10 802 454 N l t 808 382 N i 2 532 379 N iS 595 448 N u 655 427 N is 598 358 N ie 346 N i 7 601 N is 340 N \g 319 n 20 304 N 2i 349 N n 322 N iz 307 N l 4 262 N is 304 N ie 253 N n 388 N is 262 N ig 307 Nzo 115 N zi 205 Table 6.2: T he Process Execution D istribution For Program F ib In Case 1 Ex perim ents. System Size 1 2 4 8 16 32 X 13529 6631 3334 1694 848 426 a 0 181 195 60 43 48 0 13.5 13.9 7.7 6.6 6.9 No 6450 3185 1705 923 483 Nr 13529 6812 3623 1589 909 495 n 2 3401 1669 855 485 Ns 3129 1747 837 473 n 4 1677 885 511 n 5 1765 769 433 N 6 1765 831 473 Nr 1633 811 389 N8 871 505 N 9 799 451 N 10 893 503 N n 811 405 Nr 2 857 407 Nr 3 813 411 N ia 899 363 N 1 S 811 385 Nie 505 Nrr 401 Nrs 383 Nro 407 n 2 Q 431 n 2 1 359 n 2 2 375 n 23 357 n 24 461 n 2 5 407 n 26 391 N27 375 N 2s 421 n 29 403 n 30 383 n 3 1 391 135 Table 6.3: T he Process Execution D istribution For Program Q ksortQ on Sorting a 2k-size D ata Set. System Size 1 2 4 8 16 32 x 884 442 221.5 108.4 56.125 28.56 < r 0 5 1 1 . 8 6 1 1 . 6 12.3 18.6 M 0 2.23 3.44 < ■ 3 .4 3.5 4.3 N 0 884 437 229 119 71 63 Nr 447 237 123 65 65 n 2 209 1 1 1 71 71 n 3 2 1 1 109 75 43 n 4 131 63 49 N s 87 57 69 N 6 113 61 59 n 7 97 55 29 N 8 61 27 No 53 2 1 Nio 55 13 iVu 57 33 N 1 2 49 1 1 N is 35 23 N u 29 25 n 1 6 41 7 N 16 27 N n 1 1 n 1 8 2 1 n 1 9 9 n 2 0 15 N2i 17 n 2 2 39 n 23 19 n 24 5 n 2 5 31 n 26 17 n 27 2 1 C O 25 N 2o 13 n 30 2 1 n 3 1 15 m ethods (LRR and LML) are better than th e global m igration m ethods (G R R and GML) for point-to-point network as the system size grows. T he global m eth ods dem and a higher cost in process m igration in large point-to-point systems. For a m ultiple-bus netw ork, the com m unication cost between two nodes is inde pendent on the physical distance between them . Therefore, th e global m ethods m ay appear b etter, since th e load is balanced on a global basis. As far as the use of heuristics is concerned, th e Round-Robin m ethods (LR R and GRR) perform b etter when th e system size becomes larger. B oth minimal- load m ethods (LML and GML) are b etter for smaller system sizes. Basically, the m inim um load m ethods use inform ation which is more accurate in selecting the destination node th an th e Round-Robin m ethods. However, the overhead to find a proper destination node w ith the m inim al load becomes higher when the system size becomes very large. There is a tradeoff betw een the accuracy desired and th e im plem entation cost. In sum m ary, th e choice among these m ethods is sensitive to th e size of th e m ulticom puter, th e interconnection topology, and th e im plem entation costs incurred. All of these factors contribute to th e efficiency of th e program m ing environm ent to be used. Chapter 7 Summary and Conclusions In this thesis, I have explored th e use of load balancing m ethods to im prove the perform ance of a message-passing m ulticom puter. This thesis, as described in the previous chapters, is composed of the developm ents of the static and dynam ic load balancing schemes, the design of th e sim ulated annealing and heuristic process mi gration m ethods, also presented th e theoretical proofs on th e near optim ality of th e static m apping m ethod, th e software tool and the sim ulator to im plem ent and evaluate the proposed load balancing m ethods, th e im plem entation of a dis trib u ted load balancer, and th e perform ance evaluation from th e sim ulation and benchm ark results. P art of this thesis work has been published or subm itted for publication in [63], [64], [127], [128] and [129]. This chapter reiterates th e m ost significant results of the thesis and discusses directions for future research. 7.1 P rim a r y R e su lts o f T h e sis This thesis has studied b o th static and dynam ic load balancing m ethods on a message-pas sing m ulticom puter. T he effectiveness of the proposed sim ulated annealing m ethod for static load balancing and the adaptive scheme for dynam ic load balancing has been verified by th e sim ulation and benchm ark results. The prim ary results of this thesis indicate th at: 1 . Sim ulated annealing is a flexible and suitable m ethod to solve the static m apping problem. 138 2. An adaptive and relatively simple dynam ic load balancing m ethod using heuristics, can reduce the overhead and achieve b e tter perform ance. 3. D ynam ic load balancing can exploit parallelism at the process control level w ith operating system support. 4. T he dual-level load balancing scheme is im plem entable on a m essage-passing m ulticom puter, and can be extended to distributed and netw ork operating system s. These conclusions are discussed in greater detail in the following subsections. 7.1.1 F lex ib ility and S u itability of th e Sim ulated A n n ealin g M eth od for S ta tic Load B alancing Sim ulated annealing is a well-known heuristic m ethod to solve com binato rial problem s. T he proposed static m apping m ethod uses sim ulated annealing to achieve a suboptim al allocation which minimizes the to ta l com putational tim e and com m unication cost and balances the load distribution at post com pile tim e. By defining th e cost function, the optim ization objectives can be em phasized in dif ferent ways and easily be changed. My experim ents have shown th a t this m ethod can be used to solve either th e general m apping of partitioned program modules or special m apping of m ultiple rules in a production system . T he m ethod can also be used for static task allocation or job scheduling. The m ethod is flexible enough to be used in different distributed com puting environm ents. In general, sim ulated annealing always appears as a slow process because th e quality of the solution is usually expected to be extrem ely high. However, this thesis shows th a t if only a suboptim al solution is desired, sim ulated annealing is suitable to solve th e m apping problem within a reasonably short tim e. T he theoretical proofs in the near optim ality and sub-quadratic com plexity show th a t this m ethod can 139f achieve a satisfiable solution quality and has much less overhead th an the trad i tional m athem atical 0-1 linear program m ing. Com pared to th e graph theoretical m ethods, it is less complex to be applied to large sized problem s. The quality of the sim ulated m apping m apping m ethods is higher th an th e simpler heuristic m ethods such as greedy and iterative improvement. 7.1.2 H igh Perform ance of th e A d aptive M eth o d for D yn am ic Load B alancing I have presented four simple and easy-to-im plem ent heuristic m ethods which avoid frequent load indices exchanges among nodes w ith the coordination of a host processor. In m y scheme, each node updates the threshold on a periodic basis. The host processor is restricted to perform only th e collection and broadcasting of load indices from ah nodes, and does not p articipate in th e decision of process m igration or actual execution of user jobs. By using th e inform ation provided by th e host, the distributed decision m aking requires no handshaking among nodes. B oth parallel sim ulation and benchm ark program execution experim ents show th a t the proposed new adaptive m ethod for dynam ic load balancing can achieve high perform ance. This leads to a new lesson: the simple and effective m ethods can provide b etter perform ance th an complex and com plicated m ethods. Process m igration is not an easy issue in th e im plem entation of distributed operating system s. Previously proposed adaptive m ethods generally require relatively high overhead, which reduces the feasibility to be im plem ented. T he new adaptive m ethod proposed in this thesis is distinct in its sim plicity and cost-effectiveness. It therefore leads to a higher perform ance with a lower com m unication and control cost. 140 7.1.3 P arallelism at the P rocess C ontrol Level The developm ent of the dynam ic load balancer V L B at a 32-node hypercube m ulticom puter provides a new way to exploit parallelism at th e process control level, supported by the ru n and suspend OS prim itives. T he benchm ark experi m ents show th a t parallelism can be maximized autom atically in a dynam ic load balancing environm ent. This m ethod exploits the chaotic-grain parallelism , which is m uch easier to be achieved in a distributed environm ent and has less lim ita tion th an fine-grain parallelism . Fine grain parallelism usually requires detecting the dependency at the compile tim e and is not suitable for d istributed com put ing. The speedup obtained from the benchm ark program execution verifies the effectiveness of this control level parallelization m echanism . 7.1.4 Im p lem en tation of th e D ual-L evel Schem e T he experim ents carried out in this thesis work include th e developm ent of th e software m apping tool (SIM AL), the parallel discrete event-driven sim ulator (PSIM ), th e prototype dynam ic load balancer (DLB) and a set of benchm ark experim ents. T he perform ance evaluation derived from these experim ents has verified the effectiveness of the dual-level load balancing schem e on a message- passing m ulticom puter. On th e other hand, th e software experim ents themselves have shown the capability of the load balancing m ethods to be im plem ented on a real distributed or netw ork operating system. Recent works on process m igration in th e netw ork operating systems provide variable im plem entation choices [3], [18], [27] [130]. This makes th e im plem entation of the proposed load balancing very feasible. 141 7.2 S u g g e stio n s for F u tu re R ese a rch A lthough m any issues regarding the use of load balancing in parallel and dis trib u ted processing have been addressed in this thesis, m any m ore rem ain to be addressed. In this last section, I will discuss some of these issues. The sim ulated annealing m ethod for static load balancing is addressed to the code allocation at th e post compile tim e. In a distributed system , the static task allocation or job scheduling can also use this m ethod. In this case, th e objectives will be more complex and have m ore constraints, especially in a real-tim e system . T hus, the cost function needs to be defined to reflect the optim ization objectives. The sim ulated annealing m ethod can work on a set of independent ready jobs in coordinating with the dynam ic scheduling. This thesis is prim arily addressed to a m essage-passing m ulticom puter. Due to th e lim itations of an available facility, th e experim ents axe restricted to a 32-node iP S C /2 hypercube m achine. However, the load balancing problem addressed is general to all parallel and d istributed system s. T he proposed m ethods can be ex tended to any loosely-coupled system s, such as local netw ork system s. W hen sys tem size increases, th e proposed dynam ic load balancing m ethod can be clustered. This means th a t the current models can be used as a cluster in th e entire system. T he scalability can be achieved w ith a hierarchical load balancing scheme. Achieving parallelism at th e process control level, w ith th e supports of OS prim itives and dynam ic load balancing m echanism , appears m ore interesting th an expected. The experim ents done on th e parallel quicksort derived some valuable results. Since this thesis concentrates on the load balancing scheme, further ex perim ents on parallel sorting have not been carried out. However, it would be a m eaningful research effort when using th e proposed control level parallelization scheme to do experim ents on various parallel sorting algorithm s, such as column- sort and m ergesort. I expect the results will agree w ith th e previously proposed 142 work either on the analytical models or sim ulations based on shared-m em ory m ul tiprocessors. Process m igration is always a difficult problem in distributed operating system im plem entation. This thesis did only prototype im plem entation on top of the hypercube node operating system , due to th e inability to access the kernel. Real im plem entation issues need to be studied further and exercised as a feedback to im prove th e proposed adaptive model and heuristic m ethods. The v irtual page m igration and cashing are very im portant issues which need to be studied. The integration of the static and dynam ic load balancing needs to be im ple m ented on a m ultitasking m ulticom puter system . T he iP S C / 2 hypercube system doesn’t support m ultitasking at distributed nodes. An integrating scheme can be developed for a m ultitasking network operating system . Careful consideration is needed when designing the interface betw een the static and dynam ic load balanc ing scheme. Clustering the dual-level load balancing for a very large distributed system is predicted as a meaningful research topic for the future. 143 References [1] E.H .L.A arts, P.J.M .Laarhoven, “S tatistical Cooling: A G eneral A pproach to C om binatorial O ptim ization Problem s” , Philips J. o f Research, vol. 40, pp. 193-226, 1985. [ 2 ] A. Achaya and M. Tam be, “Production System on M essage Passing Com puters: Sim ulation Results and A nalysis” , Proc. International Conference on Parallel Processing, pp. 11-246-254, 1989. [3] Y. A rtsy and R. Finkel, “Designing a Process M igration Facility: T he C har lo tte Experience” , IE E E Com puter, vol.22, pp.47-56, Sept. 1989. [4] BBN Advanced Com puters Inc. “B utterfly P ro d u cts Overview” 1986. [5] A. B arak, A. Shiloh, and R. W heeler, “Flood Prevention in the MOSIX Load-balancing Scheme” , IE E E Com puter Society Technical Com mittee on Operating System s Newsletter, vol. 3, p p .23-27, W inter, 1989. [6 ] J. Barhen and J.F . Palm er, “The H ypercube in Robotics and M achine Intel ligence”, Computers in Mechanical Engineering, pp. 30-38, M arch, 1986. [7] K.M . B aum gatner and B.W . W ah, “GAM M ON: A Load Balancing Strategy on Local C om puter System w ith M ultiaccess N etw orks” , IE E E Trans, on Computers, p p .1098-1109, vol. 38, no. 8 , A ugust, 1989. [8 ] J . B axter and J.H . Patel, “The LAST A lgorithm : A H euristic-Based Static Allocation A lgorithm ” , International Conference on Parallel Processing, pp. 11-217-222, 1989. [9] S.H. Bokhari, “On the M apping Problem ” , IE E E Trans, on Computers, pp. 207-214, C-30, 3, M arch 1981. [10] S.H. Bokhari, “P artitioning Problem s in Parallel, Pipelined and D istributed Com puting” , IE E E Trans, on Computers, vol. 37, no. 1, January, 1988. 144 [11] F. Bonomi and A. K um ar, “A daptive O ptim al Load Balancing in a Heteroge neous M ultiserver System with a C entral Job Scheduler” , IE E E International Conference on Distributed Com puting, pp. 500-507, 1988. [12] J.D . Brock, A.R. Omondi and D.A. Plaisted, “A M ultiprocessor A rchitec ture for M edium -G rain Parallelism ” , IEE E. International Conference on Dis tributed Com puting, pp. 167-174, 1986. [13] L. Brownston, R. Farrel, E. K ant, N. M artin, Programming Expert System s in OPS5 : A n Introduction to Rule-Based Programming. Addison-Wesley, 1985. [14] T.L. Casavant and J.G . Kuhl, “A Form al M odel of D istributed Decision M aking and Its A pplication to D istributed Load B alancing” , Proc. o f Inter national Conference on Distributed Com puting, pp. 232-239, A ugust, 1986. [15] T.L. Casavant and J.G . Kuhl “Analysis of Three Dynam ic D istributed Load- Balancing Strategies with Varying Global Inform ation R equirem ents” , Proc. o f International Conference on D istributed Com puting, pp. 185-192, 1987. [16] T.L. Casavant and J.G . Kuhl, “A Taxonomy of Scheduling in G eneral-Purpose D istributed Com puting System s” , IE E E Trans, on Software Engineering, vol. 14, No.2, February 1988. [17] H.Y. Chang and M. Livny, “D istributed Scheduling under D eadline Con straints: a Com parison of Sender-initiated and receiver-initiated A pproaches” , Proc. o f IE E E Real Tim e System Sym posium , pp. 175-180. 1986. [18] D.R. Cheriton, “The V D istributed System ” , C om m unications o f the A C M , vol.31, pp.314-333, M arch, 1988. [19] Y.C. Chow and W .H. Kohler, “M odels for D ynam ic Load Balancing in a Heterogeneous M ultiple Processor System ” , IE E E Trans, on Computers, vol. c-28, No. 5, May, 1979. [20] S. Chowdhury, “T he Greedy Load Sharing A lgorithm ”, Journal o f Parallel and Distributed Com puting, vol. 9, 1990. [21] R. Chowkwanyun, “Dynam ic Load Balancing In C oncurrent LISP Execution on a M ulticom puter System ” , Ph.D . D issertation, D ept, of EE Systems, USC. 1988. 145 [22] T.C . Chou and J.A . A braham , “Load Balancing in D istributed System s” , IE E E Trans, on Software engineering, vol.SE-8 , no. 4, July, 1982. [23] T.C . Chou and J.A . A braham , “D istributed Control of C om puter System s” , IE E E Trans, on Com puters, vol.C-35, no. 6 , June, 1986. [24] W .W . Chu, “O ptim al File Allocation im M ultiple C om puting System ” , IE E E Trans, on com puters, pp. 885-889, vol. c-18, no. 10, 1969. [25] G. Cybenko, “Dynam ic Load Balancing for D istributed M emory M ultipro cessors” , Journal o f Parallel and Distributed Com puting pp. 279-301, vol.7 no. 2, O ctober, 1989. [26] V.V. Dixit and D. I. Moldovan “T he Allocation Problem in Parallel P roduc tion System s” , Journal o f Parallel and D istributed Computing, vol. 8 , pp. 20-29, 1990. [27] F. Douglis and J. O usterhout, “Process M igration in th e Sprite O perating System ” , Proc. 7th International Conference on D istributed Computing Sys tem s, Berlin, W est Germany, pp. 18-25, IE E E , Sept. 1987. [28] F. Douglis, “Experience w ith Process M igration in Sprite” , Workshop on Experiences with Distributed and M ultiprocessor System s, F t. Lauderdale, Florida, O ctober 5-6, 1989. [29] M. D ubois, F. A. Briggs, I. P atil, M. Balkrishnan, “Perform ance Analysis of Parallel Quicksort in Shared-M em ory M ulticom puters” , EE D ept. USC, 1987. [30] D. Eager, E. Lazowksa and J. Zahorjan, “A daptive Load Sharing in Homo geneous D istributed System s” IE E E Trans, on Software Engineering, pp. 662-675, SE-12, 5, May, 1986. [31] D. Eager, E. Lazowsk, and J.Z ahorjan, “A Com parison of R eceiver-initiated and Sender-initiated A daptive Load Sharing” , Perform ance Evaluation, Vol.6 , N orth Holland, 1986. [32] K. Efe, “H euristic Models of Task Assignm ent Scheduling in D istributed Sys tem ” , IE E E Computer, pp. 50-56, June, 1982. [33] K.Efe and B. Groselj, “M inimizing Control O verhead in A daptive Load Shar ing” , Proc. of International Conference on D istributed Com puting, pp. 307- 314, 1989. 146 [34] A.K. E zzat, R.D. Bergeron and J.L . Pokoski, “Task Allocation Heuristics for D istributed Com puter System s” , Proc. o f International Conference on D istributed Com puting, pp. 337-346, 1986. [35] A.K. E zzat, “Load Balancing in N EST: A Network of W orkstations” , Proc. o f Fall Joint Conference, pp. 1138-1149, 1986. [36] W. Fello, A n Introduction to Probability Theory and Its Applications, vol. 1, W iley, New York, 1950. [37] D. Ferguson, Y. Yemini and C. Nikolaou, “M icroeconomic A lgorithm s for Load Balancing in D istributed C om puter System s” , Proc. o f International Conference on Distributed Com puting, pp. 491-499, 1988. [38] R. Ferrari and S. Zhou, “A load Index for Dynam ic Load B alancing” , Proc. o f A C M -IE E E Fall Joint Comp. Conf., pp. 684-690, 1986. [39] D. Ferrari, G. Serazzi and A. Zeigner, M easurem ent and Tuning o f Com puter System s, Prentice-H all, Englewood Cliffs, N J. 1983. [40] C. L. Forgy, “R ete : A fast algorithm for m any p a ttern /m a n y object p atte rn m atch problem .” A rtificial Intelligence, 19:17-37, 1982. | [41] G.Fox, S. O tto and E. Um land, “M onte Carlo Physics on a C oncurrent Pro cessor” , Journal o f Statistical Physics, vol 43, June, 1986. [42] G. Fox, “A Review of A utom atic Load Balancing and Decom position M eth ods for th e H ypercube” , M innesota Institute fo r M athem atics and its Appli cation Workshop November 6 , 1986. [43] M .R. G arey and D.S. Johnson, Com puters and Intractability: A Guide to the Theory o f NP-Com pleteness, Freem an Publisher, San Francisco, 1979. [44] J.L. G audiot and J.I. Pi, “Program G raph Allocation in D istributed Multi- com puters” parallel Com puting, N orth Holland, vol 27, pp227-247. 1988. [45] J. G ait, “Scheduling and Process M igration in P artitioned M ultiprocessors” , Journal o f Parallel and Distributed Com puting, vol. 8 , pp 274-279, 1990. 147 [46] S.Gem an and D. G em an, “Stochastic Relation, Gibss D istribution, and the Bayesian R estoration of Im ages” , IE E E Trans, on P attern Analysis and M achine Intelligence, Vol.6 , p p .721-741, November, 1984. [47] S. Gem an and C.-R Hwang, “Diffusion for Global O ptim ization” , Div. o f Applied Math, Brown University, December 1984. [48] B. Gidas, “N on-stationary M arkov Chains and Convergence of the Annealing A lgorithm ” , J. S tat. Phys.39, pp. 73-131, 1985. [49] G. G rest, C.M. Soukoulis and K. Levin, “Cooling-Rate D ependence for the Spin-Glass G round-State Energy: Im plication for O ptim ized by Sim ulated A nnealing” , Physical Review Letters, vol. 56, no. 11, pp. 1148-1151. M arch 17, 1986. [50] V.B. Gylys and J.A . Edw ards, “O ptim al P artitioning of W orkload for Dis trib u ted System ” , Digest o f papers, CO M PCO N, pp .353-357, Fall, 1976. [51] A. G upta, C. Forgy, D. K alp, A. Newell, M.S. Tam be, “Parallel OPS5 on the Encore M ultim ax.” Proc. o f International Conference on Parallel Processing. A ugust, 1988. [52] A. G upta, M. Tam be, “Suitability of M essage Passing Com puters for Im ple m enting P roduction System s.” Proc. o f N ational Conference & A I, A A A I-88, A ugust 1988. [53] A. G upta, Parallelism in Production System s, M organ K aufm an Publishers, Inc. 1987. [54] T . Ishida, “M ethods and Effectiveness of Parallel Rule Firing” , Proc. o f the sixth IE E E Conference on Artificial Intelligence Applications, pp. 116-122, Santa B arbara, 1990. [55] A. H a’c and X. Jin , “Dynam ic Load Balancing in a D istributed System Using a D ecentralized A lgorithm ” , Proc. of International Conference on Distributed Com puting, pp. 170-177, 1987. [56] E.K. H addad, “P artitioned Load Allocation for M inim um Parallel Processing Execution T im e” , Proc. o f International Conference on Parallel Processing, pp. 11-192-199, 1989. 148 [57] J. Hayes, T.M udge, Q. S tout, S. Colley and J. Palm er, “A rchitecture of a H ypercube Supercom puter”, Proc. o f International Conference on Parallel Processing pp. 653-660, 1986. [58] B. H ajek, “A Tutorial Survey and A pplications of Sim ulated A nnealing” , IE E E Proc. o f 24th Conference on Decision and Control pp. 755, December 1985. [59] F. H arary, Graph Theory, Addison-W esley, New York, N.Y., 1969. [60] J.J. Hopfield and D.W . Tank, “Com puting w ith Neural Circuits: A M odel” , Science, pp. 225-233, 1986. [61] K. Hwang, W. C roft, G. Goble, B. W ah, F. Briggs, W . Sim m ons, and C. Coates, “A UNIX-based Local C om puter N etwork w ith Load Balancing” , IE E E Com puter, vol.10, no.4, p p .55-66, April, 1982. [62] K. Hwang and D. D eG root (E ds), Parallel Processing fo r Supercomputers and Artificial Intelligence, M cGraw Hill, N.Y. M arch 1989. [63] K. Hwang and J. X u, “Efficient Allocation of P artitio n ed Program Modules in a Mess age-Pas sing M ulticom puter” , to appear Proc. IS M M International on Parallel and D istributed Computing and System s, O ctober, 1990. [64] K. Hwang and J. X u, “H euristic Process M igration for Dynam ic Load Bal ancing in A M essage-Passing M ulticom puter” , subm ited to IE E E Trans, on Parallel and D istributed System s, July 12, 1990. [65] C.H. Hsu and J.W . Liu, “Dynam ic Load Balancing A lgorithm s in Homoge neous D istributed System s” , IEE E. International Conference on Distributed Com puting, pp. 216-223, 1986. [ 6 6 ] J. Roller, “A D ynam ic Load Balancer on the Intel H ypercube” , Tech. Report 158-79, Caltech 1988. [67] J. Kim, C.R. Das and W. Lin, “A Processor Allocation Scheme for H ypercube C om puters” , Proc. o f International Conference on Parallel Processing, pp. 11-231-238, 1989. [ 6 8 ] S. K irkpatrick, C.D. G elatt, Jr., and M .P. Vecchi, “O ptim ization by Simu lated A nnealing” , Science, vol. 220, no. 4598, pp. 671-680 May, 1983. 149 [69] D. K nuth, The A rt o f Com puter Programming, Addison-W esley, 1969. [70] P. Krueger and R. Finkel, “An A daptive Load Balancing A lgorithm for a M ulticom puter” , Tech. Rep. 539, University o f Wisconsin, M adison, April 1984. [71] S. Kuo, D. Moldovan and U. Schw uttke, “Parallel Rule Firings in P roduction System Using D ata Dependence Analysis” , EE D ept., University of Southern California, Los Angeles, 1989. [72] S. Kuo and D. Moldovan, “Control in P roduction Systems w ith M ultiple Rule Firings”, Teh Rept. CENG 89-14, EE D ept., University of Southern California, Los Angeles, 1989. [73] S. Kuo and D. M oldovan, “Control in P roduction Systems w ith M ultiple Rule Firings” , Proc. o f International Conference on Parallel Processing, A ugust 1990. [74] P.J.M . van Laahoven and E.H. A arts, Sim ulated Annealing: Theory and A p plications, D. Reidel, T he N etherlands, 1988. [75] D. Lawrie, “Access and Alignment of D ata in an A rray Processor” , IE E E Transaction on Computers, vol c-24, pp. 1145-1155 Decem ber 1975. [76] S.S. Lavenberg, “A Perspective on Queueing Models of C om puter Perfor m ance” , Perform ance Evaluation, North Holland vol. 10, pp. 53-76, 1989. [77] F.C .Lin and R.M .Keller, “G radient model: a D em and-driven Load Balancing Scheme” , IE E E Proc. o f 6th Conference o f D istributed Computing, pp. 329- 336, A ugust, 1986. [78] M. Livny and M. M elman, “Load Balancing in Homogeneous B roadcast Dis trib u ted System s” , Proc. o f A C M Com puter Network Perform ance Sym po sium , April, 1982. [79] V.M . Lo and J.W .S. Liu “Task Assignm ent in D istributed M ultiprocessors System ” , Proc. o f International Conference on Parallel Processing, pp. 358- 360, A ugust, 1981. [80] V.M . Lo, “H euristic Algorithm s for Task Assignment in D istributed Sys tem s” , IE E E Trans, on Computers, vol 37, N o.1 1 , pp. 1384-1397, November, 1988. 150 [81] M. Lundy and A. Mess, “Convergence of an Annealing A lgorithm ” , M ath. Prog, vol 34. pp 111-124, 1986. [82] I. M astsuda, “O ptim al Sim ulated-A nnealing M ethod Based on Stochastic- dynam ic Program m ing” , Physical Review , vol. 39, no.3, pp. 2635-2640, M arch 1989. [83] N. M etropolis, A. Rosenbluth, M. R osenbluth A. Teller and E. Teller, “E qua tion of S tate Calculations by Fast C om puting M achines” , Journal o f Chemical Physics, pp. 1087, vol. 21, 1953. [84] R. M irhcandaney and D. Towsley, “A daptive Load Sharing in Heterogeneous System s” , Proc. o f International Conference on D istributed Com puting, pp. 298-306, 1989. [85] V.M . M ilutinovi’c, J .J . Crnkovic and C.E. H oustis, “A Sim ulation S tudy of Two D istributed Task Allocation Procedures” , IE E E Trans, on Software Engineering, vol. 14, no. 1, January, 1988. [ 8 6 ] D. M itra, F. Romeo, A.L.S.-Vincentelli, “Convergence and F inite T im e Be havior of Sim ulated Annealing” Proc. IE E E Int. Conf. on Com puter Design, P o rt Chester, pp 652-657, November, 1984. [87] D. M oldovan, “RUBIC: A M ultiprocessor for Rule-Based System s” , IE E E Trans, on System s, M an and Cybernetics July, 1989. [8 8 ] A. M untz and J. Xu, “Perform ance Analysis of Three Parallel Sorting Algo rithm s in Shared-M em ory M ultiprocessors” , CS D ept. USC, 1988. [89] L.M. Ni and K. Hwang, “O ptim al Load Balancing Strategies for a M ultiple Processor System ” , Proc. o f International Conference on Parallel Processing, pp 362-367, A ugust, 1981. [90] L.M. Ni and K. Hwang, “O ptim al Load Balancing in M ultiple Processor System w ith M any Job Classes” , IE E E Trans, on Software Engineering, pp. 491-496, vol.SE-1 1 , no. 5, May, 1985. [91] L.M. Ni, C. Xu, and T.B. G endreau, “A D istributed G rafting A lgorithm for Load Balancing” , IE E E transactions on Software Engineering, vol. SE-11, N o.10, p p .1153-1161, O ctober, 1985. 151 [92] M.L. Powell and B.P. Miller, “Process M igration in D em os/M P ” , Proc. o f the ninth Sym posium on Operating system s Principles, pp. 110-119, December, 1983. [93] P.L. Price and D.W . Sm ith, “Analysis and Use of an Integer Program ming Model for O ptim ally Allocating Files in a M ultiple C om puter System ” , D T N S R D C -78/102, Bethesda, M D., November, 1978. [94] S. Pulidas, D. Towsley and J. Stankovic “Em bedding G radient E stim ators in Load Balancing Algorithm s” , Proc. o f International Conference in D is tributed Com puting, pp. 482-490, 1988. [95] K. Oflazer, “P artitioning in Parallel Processing of P roduction System s.” PhD thesis, Carnegie-M ellon University, M arch, 1987. [96] I. O utsterhout, “T he Sprite Network O perating System ” , IE E E Com puter, vol.2 1 , pp.23-36, Feb. 1988. [97] K .R am am ritham , J . Stankovic, “D istributed Scheduling of Tasks w ith D ead lines and Resource Requirem ents” , IE E E Trans, on C om puters, vol. 38, No. 8 , A ugust, 1989. [98] D. Reed and R. Fujim oto, M ulticom puter Networks, M essage-Based Parallel Processing, T he M IT Press, 1985. [99] F. Reif, Fundam ental o f Statistical and Therm al Physics, M cGraw Hill, New York, 1965. [100] F. Romeo, A.L. S.-Vincentelli and C. Sechen, “Research on Sim ulated An nealing a Berkeley” , Proc. o f IE E E Int. Conf. on Com puter Design, P ort Chester, pp 652-657, November, 1984. [101] R. R utenbr, “Sim ulated Annealing Algorithm s: An Overview” , IEEE, Cir cuits and Devices Magazine, pp 19-26, January, 1989. [102] C.Sechen and A.S. Vincentelli, “The Tim berW olf Placem ent and Routing Package” , IE E E Journal o f Solid-State Circuits, vol. sc-20, no. 2 , April, 1985. [103] K.G. Shin and Y.C. Chang, “Load Sharing in D istributed Real-Tim e Sys tem s with State-C hange Broadcasts” , IE E E Trans, on Com puters, vol. 38, No. 8 , p p .1124-1142, A ugust, 1989. 152 __________________________________________________________________________________________ I [104] C. Seitz, “The Cosmic C ube” , Com m unications o f the ACM , vol. 28, no. 1, pp. 22-33. January, 1985. [105] C.S. Steele, “Placem ent of Com m unicating Processes on M ultiprocessors on Networks” , Caltech Technical Report 5184-'TR:85, A pril, 1985. [106] R.G. Sm ith, “Control in a D istributed Problem Solver” , IE E E Trans, on Computers, C-29, Decem ber 1980. [107] J.A . Stankovic, “A Perspective on D istributed C om puter System s” , IE E E Trans, on Computers, c-33(12), pp. 1102-1115, Decem ber, 1984. [108] J.A . Stankovic and I.S. Sidhu, “An A daptive Bidding A lgorithm for P ro cesses, Clusters and D istributed G roup”, Proc. o f International Conference on Parallel Processing, A ugust, 1984. [109] J. Stankovic, “An A pplication of Bayesian Decision Theory to Decentralized Control of Job Scheduling” , IEEE. Trans, on Com puters, vol. c-34, no. 2, February, 1985. [110] J.A . Stankovic, “S tability and D istributed Scheduling A lgorithm s” , IE E E Trans, on Software Engineering, vol. SE-11, no. 10, pp. 1141-1152, O ctober, 1985. [111] J.A . Stankovic K. R am am ritham and S. Cheng, “E valuation of a Flexible Task Scheduling A lgorithm for D istributed H ard R eal-Tim e System ” , IEEE. Trans, on Computers, vol. c-14. no. 12, p p .1130-1142, Decem ber 1985. [112] J.A . Stankovic and D. Towsley, “Dynam ic Reallocation in a Highly In te grated Real-Tim e D istributed System s” , IEEE. International Conference on D istributed Computing, pp. 374-381, 1986. [113] H.S. Stone, “M ultiprocessor Scheduling w ith th e Aid of Network Flow Al gorithm s” , IE E E Trans, on Software Engineering, vol.SE-3, n o .l, January, 1977. [114] Tantaw i, D. Towsley, “O ptim al Static Load Balancing in D istributed Com p u ter System ” , Journal o f the ACM , vol.32, pp. 445-465, A pril, 1985. [115] F.M . Tenorio and D. I. Moldovan, “M apping P roduction Systems into M ul tiprocessors” , Proc. o f International Conference on Parallel Processing, A u gust, 1985. 153 [116] M. Theim er, “P reem ptable Rem ote Execution Facilities for Loosely-Couple D istributed System s” , Ph.D . Thesis, Standford University, 1986. [117] , A. Thom asian, “A Perform ance Study of Dynam ic Load Balancing in Dis tributed System s” , Proc. o f International Conference on D istributing Com- puting, pp. 178-184. [118] A.M. Tilborg and L. W ittie, “W ave Scheduling - Decentralized Scheduling for Task Forces in M ulticom puters” , IE E E Trans. Com puters, Vol. C-33, pp. j 835-844, Septem ber, 1984. * ; [119] J.N . Tsitsiklis, “M arkov Chains w ith Rare Transitions and Sim ulated An- j nealing” , P reprint, M IT L aboratory for Inform ation and Decision System s, j A ugust, 1985. I I ! [120] J.N . Tsitsiklis, D.P. Bertsekas, and M. A thans, “D istributed Synchronous j D eterm inistic and Stochastic G radient O ptim ization A lgorithm s” , IE E E Trans. I on A utom atic Control, vol. AC-31, no. 9, Septem ber, 1986. j [121] A.S. Tanenbaum , Com puter Networks, Prentice Hall, Inc., Englew ood Cliffs, i ‘ N .J. 1989. I | [1 2 2 ] B. W alker, R. English, C. Kline and G. Thiel, “T he LOCUS D istributed O perating System ” , Proc. o f the N inth Sym posium on Operating System s | Principles, O ctober, 1983. i ! [123] Y. W ang and R. M orris, “Load Sharing in D istributed System s” , IE E E 1 Trans, on Com puters, pp. 204-217, vol. c-34, no.3, 1985. I ; [124] S. Zhou and D. Ferrari, “A M easurem ent Study of Load Balancing Perfor- j m ance” , Proc. o f 7th I n t’l Conf. on Distributed Processing, pp. 490-497, 1987. ! [125] B.W . W ah and P. M ehra, “Learning Parallel Search in Load B alancing” , Proc. o f Workshop Parallel Algorithm s M achine Intelligence P attern Recog- i nition , AAAI, M inneapolis, MN. A ugust, 1988. r I [126] B.W . W ah and J.Y . Jan g , “An Efficient Protocol for Load Balancing on C SM A /CD Networks” , Proc. o f 8th Conference on Local Com puter Netw orks, . pp .55-61, O ctober, 1983. [127] J.X u and K. Hwang, “Sim ulated Annealing M ethod for M apping Produc- j tion Systems onto M utlicom puters” , Proc. of the sixth IE E E Conference on j Artificial Intelligence Applications, pp. 130-136, Santa B arbara, M arch 7-9, i 1990. | I [128] J. Xu and K. Hwang, “Parallel M apping of Production Rules onto M ulti- i processor w ith D istributed M em ory” , subm ited to th e Special Issue on E n abling Technology fo r Knowledge-Based System s in the IE E E Transactions ■ on Knowledge and Data Engineering by invitation, February, 1990. \ * ; [129] J.X u and K .H w ang, “H euristic M ethods for Dynam ic Load Balancing in I a M essage-Passing Supercom puter” , to appear in Proc. IE E E and A C M Supercomputing ’90, New York, NY. November 12-16, 1990. i I , [130] E. Aayas, “A ttacking the Process M igration B ottleneck” , Proc. 11th A C M | Sym posium on Operating System Principles, A ustin, TX . p p .13-22, Nov. I 1987. Appendix A The SIMAL Code S I M A L (SIM ulated AnneaLing) is a software m apping tool which im plem ents the static load balancing m ethod using sim ulated annealing. T he code presented here is used to m ap parallel production system s. To m ap partitio n ed program m odules, another version of code will be used w ith slight difference in d a ta struc tures and procedures. A .l H ead er M o d u le This m odule defines th e constants, d ata structures, global variables and the function returns of the SIMAL program . /* simal.h -- header file of SIMAL (Simulated Annealing) */ #include <stdio.h> #include <math.h> #include <ctype.h> #include "graph.h" /* define constant */ #define HYPERCUBE 1 #define MESH 2 #define TREE 3 ftdefine RING 4 /* Hardware topology */ /* generating function */ #define L_EX 1 /* local exchange */ #define G_EX 2 /* global exchange */ 156 # d e fin e MIGRATE 3 /* l o c a l m ig ra te * / #define ERROR -9 #define EMPTY -1 / * cooling method */ #define SIMPLE1 1 #define SIMPLE2 2 #define ELABORATE 3 #define MAX.NODE 32 / * max number of nodes * / #define MAX.RULE 1600 /* max number of rules * / #define MAX.RN 50 / * max rules per node */ #define MAX_T 2000 /* max number of trails */ / * estimated cost * / #define PAR_C0ST 3 #define PSTAR_C0ST 1.5 #define COMM_TIME 1 #define MATCH.TIME 1.8 #define max(x,y) = ( x > y> ? x : y; /* max(x,y) #define min(x,y) = C x < y) ? x : y; /* min(x.y) /*****************************************/ /* data structure */ I if; He * **>(< typedef struct conn { int id; int comm_time; struct conn *next; > CONN, *C; /* connection: logic neighbors */ /* rule number of neighbor */ / * communication cost with it */ /* pointer to next neighbor */ ef struct node ■ { /* node -- processor */ int id; / * node number */ int tag; / * tag for modification */ float load; 1* load of node */ float par.cost; /* parallelism cost */ int num.rule; / * number of rules in node */ float comm; /* communication cost with others int rules[MAX..RN] ; /* rules assigned to node * / > NODE, *ND; typedef struct rule { int id; int tag; int node; float comm; /* rule */ /* rule number */ /* tag to indicate new assignment */ /* node being assigned to */ /* communication cost with others */ struct conn *neighbor; / * list of neighbors * / 157 > MODULE, *MD; /*****************************************/ /* input parameter */ extern int extern int extern int extern double A; extern int init_choice; extern int extern int extern int extern int M; N; cool_choice; gen_fun; topology; DIM; K; / * number of rules */ /* number of nodes */ /* choice of cooling method */ /* cooling constant */ /* choice of initial assignments */ /* choice of generate function */ / * network topology */ /* dimension or range of topology */ /* termination constant */ / a t e * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / /* global variable */ j * * * * * * * * % ate * * * j(c * * * * * * * * * * * * * * * * * * * * * * sfc s i t * * * / extern float T; /* initial temperature */ extern double esp; / * termination epsilon */ extern float load.avg; /* system load average * / extern float old.load.avg; extern float total.load; / * system total load */ extern float old.total.load; extern float E; / * cost function of accepted config */ extern float El; /* cost function of new trial */ extern float delta.E; /* El - E */ extern float init.delta.E; / * max delta E * / extern float E.par; /* total parallelism loss */ extern float old.E.par; extern float E.comm; /* total communication of accepted */ extern float old.E.comm; extern float E.imb; /* total load imbalance */ extern float old.E.imb; extern int init.T.ok; /* flag for init T setting */ extern int num.accept; /* number of trials accepted */ extern int debug; /* high level debug flag */ extern int debug1; /* low level debug flag */ /* static array */ extern int com_matrix[MAX_RULE] [MAX.RULE] ;/’ ''communication matrix */ extern int W.matrix[MAX.RULE] ; /*firing frequency matrix */ extern float par_cost_matrix[MAX_RULE][MAX.RULE]; extern int par.matrix[MAX.RULE][MAX.RULE]; /* parallelism matrix */ extern int alloc.matrix[MAX.RULE] ; /* allocation matrix */ extern int D.matrix [MAX.NODE][MAX.NODE] ;/* distance matrix */ extern float E.tab[MAX.RULE] ; extern ND nodetab[MAX.NODE]; /* cost table for each trial /* node table, new config */ extern ND old_nodetab[MAX.NODE]; /* old node table, saved config */ extern ND init_nodetab[MAX_N0DE];/* init node table, saved confg */ extern MD rule_tab[MAX.RULE]; /* rule table */ extern float E_avg[MAX_T]; /* average cost at T */ /*****************************************/ /* Variables used for graphics */ extern int XO; extern int YO; extern int Xlen; extern int Ylen; extern int y_EO; extern int X_interval; extern int old_x; extern int old_y; extern float EO; extern float rel.EO; I* return-value functions */ /*****************************************/ extern float cal_load_avg() ; * system load average */ extern float cal_E_original(); * initial E */ extern float cal_E(); * new E */ extern float cal_E_avg(); * average E at a temperature */ extern float cal_rule_comm(); * rule communication cost */ extern float cal_comm(); * node communication cost */ extern float cal_para_ cost C); ★ node parallelism cost */ extern float cal_load(); * node’s load */ extern float coolingO ; * temperature */ extern float set_init_T(); * initial temperature * / extern double flt_abs() ; * absolute value of float */ extern int terminate(); * termination flag */ extern int BoltzmannC); * Boltzmann function */ extern int l_select_node(); * node number, local */ extern int g_select_node(); * node number, global * / extern int select_rule(); * rule index in node rule array */ extern ND node.alloc(); * pointer to an allocated node */ extern MD rule_alloc(); * pointer to an allocated rule */ extern C conn_alloc(); * pointer to an connection node */ extern C rel_gen(); * pointer to a relation list */ extern FILE *file_alloc(); * pointer to an allocated file */ extern void input(); * input from terminal*/ extern void input1(); * input parameters */ extern void system_data(); * build up system data */ extern void comm_gen(); * generate communication matrix */ 159 — - ------------- extern void para_gen(); / * generate parallelism matrix */ extern void w_gen(); / * gen. firing frequency matrix *Z extern void rule_tab_gen(); /* generate rule table */ extern void init(); / * generate initial config. */ extern void init_window(); / * initialize graphic window */ extern void save_init(); /* save initial config. */ extern void restore_init() ; /* restore older initial config. */ extern void hardware_topo(); /* generate D matrix */ extern void anneal(); /* annealing process */ extern void solutionO ; / * final config. */ extern void print_init_T(); / * print initial temperature */ extern void print_EO(); z* print EO in E-curve */ extern void print_E(); z* print E in E-curve */ extern void gen_update(); z* generate and update */ extern void save_conf(); / * save a config. */ extern void update_node(); z* update node table */ extern void accept_conf(); z* accept a config. */ extern void rej ect_conf(); z* reject a config. */ extern void get_rule(); / * assign a rule to a node */ extern void exchange(); / * exchange rules * / extern void migrate(); /* migrate rules */ /* terminal print functions */ extern void print_commO ; extern void print_rule_tab(); extern void print_rel(); extern void print_par_cost() ; extern void print_node(); extern void print_nodetab(); A .2 G rap h ics H ea d er M o d u le This is the header code for th e graphics m odule which displays the E-curve during the annealing process. It is SUN-3 w orkstation m achine dependent. /* graph.h -- header file of graphics library */ /**************************************************************/ #include <stdio.h> #include <math.h> #include <ctype.h> 160 ttinclude "graph.h" #include <stdio.h> #include <sys/file.h> #include <sys/types.h> #include <sys/dir.h> #include <fcntl.h> #include <signal.h> #include <errno.h> #include <suntool/sunview,h> #include <suntool/caxivas ,h> #include <suntool/panel.h> / I * * * * * * * * : * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / /* define constant */ #define XMAX 1000 #define XMIN 0 #define YMAX 900 #define YMIN 0 #define XLOW 0 #define YLOW 0 #define MONO 1 #define COLOR 2 #define INLIMITS(a,b,c) (((a)>=b) && ((a) <=c)) #define MAXCMAPSIZE 256 #define DEFAULT_F0NT "/usr/lib/fonts/fixedwidthfonts/cour.r.12" #define serif_r_12 "/usr/lib/fonts/fixedwidthfonts/serif.r.12" #define serif_r_14 "/usr/lib/fonts/fixedwidthfonts/serif.r.14" #define cour_b_12 "/usr/lib/fonts/fixedwidthfonts/cour.b.12" #define cour_b_14 "/usr/lib/fonts/fixedwidthfonts/cour.b.14“ #define cour_r_12 "/usr/lib/fonts/fixedwidthfonts/cour.r.12" #define cour_r_14 "/usr/lib/fonts/fixedwidthfonts/cour.r.14" #define ABNORMAL -1 #define NORMAL 0 #define YES 2 #define NO 1 #define FOREGROUND 1 #define BACKGROUND 0 #define FOUND 1 #define N0T_F0UND -1 #define M0USE_LEFT 1 #define M0USE_MIDDLE 2 #define MOUSE_RIGHT 3 1 6 1 /* types and variables */ /***** * * * * ** * * * ** * * * * * * **** Jk**************/ typedef unsigned char color [MAXCMAPSIZE] Frame frame,frammesg; Canvas canvas; Pixwin *pixwin; Panel Mesgpanel; extern int errno; int width,height; int xlow.ylow; int monitor; int done; int cmapsize; int process_id; Pixfont *default.font; int my_client_object; int *me; /*****************************************/ /* function returns */ extern void signal_handler(); extern void quit_proc(); extern void initgraphicsO ; extern void drawpointQ ; extern void endgraphicsO ; extern void sunexit(); extern void display(); extern int read_pixel(); extern void init_colortable(); extern void erase_window(); extern int mousegetO; extern void erase_box() ; extern void put_text(); extern void paint_circle(); extern void draw_circle(); extern void draw_line(); A .3 M ain M od u le T he m ain m odule is a control driven m odule w hich calls other m ajor m odules to perform the sta tic m apping of rules to m ultiple nodes in a m ulticom puter. 162 /**************************************************************/ / * s imal.c * / /* This file only contains the main of simulated annealing. */ / s ^ * ^ * * ^ * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / #include "simal.h" /************ ******* ******************* *****/ /* main() */ /* Main first inputs parameters from user,*/ /* then generates logic graph, initialize */ /* system configuration, performs simulated*/ /* until a suboptimal configuration is */ /* obtained. */ / * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / main(argc, axgv) int argc; char *argv[]; { if ( argc == 1 ) input(); else input 1(argc,argv); system_data(); /* build system data */ initQ ; anneal(); printf("Done\n"); do • { /* for mouse operation */ i = mouseget(&x,&y); > while (i != MOUSE.LEFT); endgraphicsO ; } A .4 In p u t M od u le / * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / /* input.c */ /* This file only contains the input routine. */ /* There are two possible input methods: */ /* i. input from screen, calling input() */ /* 2. input by passing parameter, calling inputlO */ #include "simal. h" /* input() */ /* input() is a routine to accept the */ /* parameters from the user */ /* They are : */ 163 /* M : number of rules */ /* N : number of nodes (processors)*/ /* cool_choice: cooling choice */ /* A : constant to control cooling */ /* speed */ /* init_choice: init-assignment choice*/ /* gen_fun: generating function */ /* topology : hardware topology */ /* DIM : hardware dimension */ /* K : termination constant */ / * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / v o i d input() - C double explOQ; p r i n t f ( " E n t e r number o f r u l e s : " ) ; s c a n f ( " ‘ /,d" ,&M) ; p r i n t f ("*/td \n " ,M ) ; p r i n t f ( " E n t e r num ber o f n o d e s : " ) ; s c a n f 0 7 .d " ,& N ) ; p r i n t f ("7,d\n" ,N ) ; p r i n t f ( " E n t e r c h o i c e o f i n i t i a l a s s i g n m e n t s : " ); s c a n f ( " ‘ /,d" , & i n i t _ c h o i c e ) ; p r i n t f ( ” */,d\n" , i n i t _ c h o i c e ) ; p r i n t f ( " E n t e r c h o i c e o f g e n e r a t i n g f u n c t i o n : " ); s c a n f ("'/.d" ,& g e n _ fu n ) ; p r i n t f ("7,d\n" ,g e n _ f u n ) ; p r i n t f ( " E n t e r c h o i c e o f c o o l i n g m eth o d : "); s c a n f ("7.d" ,& c o o l _ c h o i c e ) ; p r i n t f ("7.d\n" , c o o l _ c h o i c e ) ; p r i n t f ( " E n t e r c o o l i n g c o n t r o l c o n s t a n t A: "); s c a n f ("7.f",& A ) ; p r i n t f (" 7 .f\n " , A) ; printf("Enter termination constant K: "); s c a n f ( “ 7.f" ,&K) ; p r i n t f (" 7 .f\n " , K) ; e s p = e x p i O ( - K ) ; p r i n t f ( " E n t e r h a rd w a re d im e n s io n : " ); s c a n f ("7.d" ,&DIM) ; p r i n t f ("7.d\n" , DIM) ; printf("Enter hardwaxe topology : "); scanf ("7.d" ,&topology) ; 164 p r i n t f ( "*/,d\n" , t o t p l o g y ) ; > /* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / / * i n p u t l O * / / * i n p u t l O i s a r o u t i n e t o p a s s t h e * / /* p a r a m e t e r s fr o m t h e m ain p rogram * / v o i d i n p u t 1 ( a r g c , a r g v ) i n t a r g c ; c h a r * a r g v [ ] ; -c i n t a t o i O ; d o u b le a t o f ( ) ; d o u b le e x p lO C ) ; i f ( a x g c != 10 ) p r i n t f ( " E r r o r num ber o f i n p u t p a r a m e t e r s \ n " ) ; e l s e { M = a t o i ( a r g v [ l ] ); N = a t o i ( a r g v [2 ] ) ; i n i t _ c h o i c e = a t o i ( a r g v [ 3 ] ) ; g e n _ f u n = a t o i ( a r g v [4 ] ) ; c o o l _ c h o i c e = a t o i ( a r g v [5 ] ) ; A = a t o f ( a r g v ( 6 ] ) ; K = a t o i ( a r g v [ 7 ] ) ; e s p = e x p l O ( - K ) ; DIM = a t o i ( a r g v [ 8 ] ) ; t o p o l o g y = a t o i ( a r g v [ 9 ] ) ; p r i n t f ( " S y s t e m and C o n t r o l P a r a m e te r s \ n " ) ; p r i n t f ( " ---------------------------------------------------------------\n " ) ; p r i n t f ("Number o f p rogram r u l e s = 7 d \ n " ,M ) ; p r i n t f ("Number o f c o m p u ter n o d e s = 7 .d \n " ,N ); p r i n t f ( " C h o ic e o f i n i t i a l a s s ig n m e n t = 7 .d \n " , i n i t _ c h o i c e ) ; p r i n t f ( " C h o ic e o f g e n e r a t i n g f u n c t i o n = 7 .d \n " ,g e n _ f u n ) ; p r i n t f ( " C h o ic e o f c o o l i n g m eth o d = 7 .d \n " , c o o l _ c h o i c e ) ; p r i n t f ( " C o o lin g c o n s t a n t = 7 .f \n " ,A ) ; p r i n t f ( " t e r m i n a t i o n c o n s t a n t = 7 .f \n " ,K ) ; p r i n t f ("H ardw are d im e n s io n = 7.f \n " , DIM) ; p r i n t f ("H ardw are t o p o l o g y = 7.f\n " . t o t p l o g y ) ; > A .5 S y stem D a ta B u ilt-U p M o d u le /it:* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / / * s y s t e m _ d a t a . c * / 165 /* This file contains the functions to generate data to specify a */ /* production system. */ /* It includes: */ /* system_data(), comm_gen(), rule_tab_gen() , */ /* rel_gen(). */ /* The print functions and allocation functions are in simal_lib.c */ # in elude "s imal.h" /it:*****************************************************/ /* system_data() */ / * Generate communication matrix, parallelism matrix,*/ /* firing frequency matrix and rule table. */ void system_data() comm_gen(); para_gen(); w_gen(); if ( debug1 ) print_comm(); rule_tab_gen(); if ( debug1 ) print_rule_tab(); J s | c ^ ^ |/ ^ ^ ^ ^ ^ ^ ^ ^ ^ s |^ ^ ^ ^ ^ ^ j / * com m _gen() * / / * G e n e r a te r u l e i n t e r c o n n e c t i o n - c o m m u n ic a tio n * / / * m a t r ix by in p u t d a t a fro m f i l e c o m .d a t . * / v o i d com m _gen() { i n t i , j ; FILE * f p , * f o p e n ( ) ; f p = f o p e n ( " c o m .d a t " , " r " ) ; f o r ( i = 0; i < M ; i+ + ) - { f o r ( j = 0 ; j < M; j+ + ) { f s c a n f ( f p , "‘ /,d" , & ( c o m _ m a t r i x [ i ] [ j ] ) ) ; i f ( d e b u g 1) p r in tf(" c o m m [7,d] [7,d] = 7,d\n" , i , j , c o m _ m a tr ix [ i ] [ j ] ) ; > > f c l o s e ( f p ) ; > 166 / * p a r a _ g e n ( ) * / / * G e n e r a te p a r a l l e l i s m m a t r i x . * / v o i d para_gen() { i n t i , j I FILE * f p , * f o p e n ( ) ; fp = fo p e n (" p a r.d a t" , " r" ) ; fo r ( i = 0; i < M; i++ ) •{ f o r ( j = 0 ; j < M ; j++ ) { fscan fC fp , "7,d" , & (par_m atrix [i] [j] ) ) ; sw itch (p a r.m a trix [ i]C j] ) { c a s e 0 : p a r . c o s t . m a t r i x [ i ] [ j ] = PAR.COST; b r e a k ; c a s e 9: p a r . c o s t . m a t r i x [ i ] [ j ] = PSTAR.COST; b r e a k ; case 1 : p a r.c o s t.m a trix [i]C j] = 0 ; b rea k ; def a u l t :b rea k ; > i f ( d e b u g l ) p r in tf ("p ar C ‘ /,d] [7,d] = 7.d\n" , i , j .p a r.c o s t.m a trix [i] Cj] ) ; > } f c lo s e ( f p ) ; > /* w.genO */ / * G e n e r a te f i r i n g f r e q u e n c y m a t r i x */ v o i d w.genO { i n t i ; FILE * f p , * f o p e n ( ) ; fp = fo p en ("w eg .d at", " r" ) ; fo r ( i = 0 ; i < M ; i++ ) { f s c a n f ( f p , "7.d" , t (W .m a tr ix [ i ] ) ) ; i f ( d e b u g l ) p r i n t f ( "w [7.d] = 7.d\n" , i , W .m a tr ix [ i ] ) ; > f c lo s e ( f p ) ; > /* ru le _ ta b _ g en () */ /* G e n e r a te r u l e t a b l e by s e t t i n g t h e r u l e i d a n d * / 167 /* g e n e r a t i n g t h e n e ig h b o r r e l a t i o n s . * / /if:* * * * ))!* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / v o i d r u l e _ t a b _ g e n ( ) - c i n t i ; f o r ( i = 0; i <= M-l; i+ + ) { r u l e _ t a b [ i ] = r u l e _ a l l o c ( ) ; r u l e _ t a b [ i ] - > i d = i ; r u l e _ t a b [ i ] - > n e i g h b o r = r e l _ g e n ( i ) ; > > / * r e l _ g e n ( ) */ /* G e n e r a te t h e r e l a t i o n a l c o n n e c t i o n l i s t o f */ /* r u l e i by c h e c k in g t h e c o m m u n ic a tio n m a t r i x . */ /***************************************************/ c r e l _ g e n ( i ) i n t i ; { C p , q ; i n t j ; i n t f i r s t ; f i r s t = 1; f o r ( j = 0; j <= M-l; j++ ) { i f ( c o m _ m a t r i x [ i ] [j] != 0 ) { i f ( f i r s t ) { p = c o n n _ a l l o c ( ) ; q = p; f i r s t = 0; > e l s e { p - > n e x t = c o n n _ a l l o c ( ) ; p = p - > n e x t ; > p - > i d = j ; i f ( c o m _ m a t r i x [ i ] [j] == 1 II c o m _ m a t r i x [ i ] [j] == 9 ) p-> com m _tim e = C0MM_TIME; e l s e p-> com m _tim e = 0; p - > n e x t = NULL; > > i f ( f i r s t ) r e tu r n (N U L L ); e l s e r e t u r n ( q ) ; c 168 > A .6 C ooling M od u le /* cooling.c */ /* This file contains functions to perform simulated */ /* annealing cooling schedule procedure. */ /* It includes: */ /* anneal(), set_init_T(), */ /* print_E0(), cal_E_original(), * / / * print_E(), print_init_T(), */ /* cal_E_avg(), cooling(), */ /* terminateO , solutionC). */ #include “simal.h" I* anneal() */ /* This is the main routine to perform simulated */ /* annealing. */ void anneal() ■ c int ml,m2th,nl,n2; int ok; int i; int stop; FILE *fopen(), *fp; fp = fopen("simal.out","w"); /* output file */ E = cal_E_original(0); print_EO(E); print_nodetab(0); printf("Xn INIT E = */.f\n,,,E); T = set_init_T(E); print_init_T(T); printf(" Init Temperature = 7.f\n",T); i = 0; stop = 0; while ( ! stop ) { for ( ml = 0; ml < M; ml++ ) - [ if ( debug ) printf ("-------------------------------------\n") ; gen_update(ml,T); 169 > E_avg[i] = cal_E_avg(); print_E(E_avg[i]); if ( debug ) { printf (" Temperature ' / . f \n\n",T) ; printf ("E_avg [V.d] = ' / . f \n" ,i ,E_avg[i] ) ; > fprintf (fp,"’ /.f \n" ,E_avg[i] ) ; if ( terminate(i) ) stop = 1; i++; T = coolingCT,i); > fclose(fp); solution(i); } /* cal_E_original() */ /* Calculate the cost function E of original */ / * system configuration. */ I* The global variables total_load and */ / * total_comm are saved for further use. */ float cal_E_original(flag) int flag; • c int i; E_par = 0; E_imb = 0; E_comm = 0; total_load = 0; for ( i = 0; i < N; i++ ) { total_load += nodetab[i]->load; E_par += nodetab[i]->par_cost; if ( flag == 1 ) { nodetab[i]->comm = cal_comm(i); > E_comm += nodetab[i]->comm; > load_avg = total_load / N; for ( i = 0; i < K; i++ ) E_imb += (float ) flt_abs( nodetab[i]->load - load_avg); if ( flag == 0 ) printf (" E_par = '/,f, E_imb = '/,f, E.comm = * / . f \n" , E_par,E_imb,E_comm); returnC E_par + E_imb + E_comm ); } 170 f * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * f /* cal_E_avg */ /* Calculate average cost function E of all */ /* rule trials at a temperature. */ float cal_E avg() { int i; float avg; int choice; choice = 1 ; switch ( choice ) - ( case 1 : avg = 0; for ( i = 0; i < M; i++ ) { avg += E_tab[i]; > avg = avg / M; break; case 2: /* if choice=2, then is the minimum */ avg = E_tab[0] ; for ( i =1 ; i < M; i++ ) { if ( E_tab[i] < avg ) avg = E_tab[i] ; > break; default : break; > return(avg); > /* solutionO */ / * Print out the final accepted configuration. */ / ************************* *************** *******/ void solution(i) int i; { printf("\n The solutions is :\n"); print_nodetab(1); printf ("\n E_par = 7,f, E_imb = 7.f, E_comm = 7,f\n", E_par, E_imb, E_comm) ; printf ("\n E = 7.f\n",E); printf ("\n T = 7.f\n",T); printf ("\n Number of iterations = 7.d\n",i); 171 > /* terminateO */ /* Using termination rule to stop the process */ /*************************************************/ int t erminat e(i) int i; { int ok; if ( i > K ) { if ( ( flt_abs( E_avg[i-1] - E_avg[i] ) / E_avg[i] * init_delta_E) <= eps) ) { „ ok = 1; for ( j = K; j > 0; j— ) if ( E_avg[i-j] != E_avg[i-j-l] ) { ok = 0; break; > return(ok); else return(O); > else return(0); > /* coolingO */ / * Decrement the temperature based on the cool- */ /* choice. */ /I*************************************************/ float cooling(T,i) float T; int i; { float t; double log(); switch ( cool_choice ) ■ { case 1: if ( ! init_T_ok ) { printf ( " num accept = '/,d\n" ,num_accept) ; if ( num_accept / M > 0.8 ) { t = A * T; init_T_ok = 1; > 172 els© { num_accept = 0; t = 2 * T ; > } else t = A * T; break; case 2: t = A * T; break; case 3: t = TO / log(i+l); break; default: printf("Cooling choice error \n“); break; } return( t ) ; > /*************************************************/ /* set_init_T() */ /* Set the initial temperature according to the */ /* cooling method chosen. */ /lie************************************************/ float set_init_T(E) float E; - c double x,t; float delta_E; float Max_E; float Min_E; float init_E; float new_E; double flt_abs(); double loglOQ ; int ml,m2th,nl,n2; int l_select_node(); int g_select_node(); init_T_ok = 0; num_accept = 0; switch ( cool_choice ) { case SIMPLE1: x = 2; t = loglO(x); t = (float) ( E / t ); init_T_ok =1; 173 break; case SIMPLE2: delta_E = 0; save_init(); init_E = E; for ( ml = 0; ml < M; ml++ ) nl = rule_tab[ml]->node; switch ( gen.fun ) { case L_EX : /* local exchange */ n2 = l_select_node(nl); save.conf(nl,n2); m2th = select_rule(n2); exchange(nl,ml,n2,m2th); break; case G_EX : /* global exchange */ n2 = g_select_node(nl); save.conf(nl,n2); m2th = select_rule(n2); exchange(nl,ml,n2,m2th); break; case MIGRATE : /* migrate */ n2 = l_select_node(nl); save.conf(nl,n2); m2th = EMPTY; migrate(nl,ml,n2); break; default: break; > accept.conf(nl,n2); new.E = cal.E.original(l); if ( ml == 0 ) { Max.E = new.E; Min.E = new.E; > else - [ if ( new.E > Max.E ) Max.E = new.E; if ( new.E < Min.E ) Min.E = new.E; > delta.E += ( float ) flt.abs ( init.E - new.E ); > delta.E = Max.E - Min.E; printf ("max delta E * = ’ /.f\n" .delta.E); x = 1 / 0.8; t = log(x); t = (float ) ( ( delta.E ) / t ) ; restore.init(); break; case 3: 174 t = (float ) ( 2 * E ); init_T_ok = 1; break; def au.lt: break; > init_delta_E = delta_E; return(t); > /if:*******************#*******************!*********/ /* save_init() */ /* Save the old configuration and cost values. */ void save_init() { int i; for ( i = 0; i < N; i++ ) init_nodetab[i] = nodetab[i]; old_E_par = E_par; old_total_load = total_load; old_E_comm = E_comm; old_E_imb = E_imb; old_load_avg = load_avg; > /I*************************************************/ /* restore_init() */ /* Restore the initial configuration. */ void restore_init() • c int i; for ( i = 0; i < N; i++ ) nodetab[i] = init_nodetab[i]; E_par = old_E_par; total_load = old_total_load; E_comm = old_E_conun; E_imb = old_E_imb; old_load_avg = load_avg; > /* print_init_T() */ /* Print the initial temperature in E-curve. */ void 175 print_init_T(t) float t; ■ c char s[5]; int T; T = (int) t; sprintf (s , "’ /.d" ,T) ; put_text(X0-10,Y0+20,1,s); old_x = XO; } / * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / /* print_E0() */ / * Print the EO in E-curve. */ j ********* * ***************************************/ void print_EO(E) float E; { char s[5]; int eO; EO = E; rel.EO = EO * 0.75; eO = (int) EO; sprintf (s , "7,d" ,e0) ; y_E0 = (int) ( Ylen * 0.6 ); put_text(X0-30, YO - y_E0fl,s); eO = ( int ) rel_E0; sprintf (s ,"'/,d" ,e0) ; put_text(X0-30,YO,1,s); old_y = YO - y_E0; > /*************************************************/ 1 * print_E() * / /* Print the cost function line in E-curve. */ /*************************************************/ void print_E(E) float E; int y; y = (int ) ( (E-rel_E0) / (E0-rel_E0) * y_E0); draw_line(old_x,old_y,old_x + X_interval,YO-y,1); old_x = old_x + X_interval; old_y = YO-y; } 176 A .7 In itiation M o d u le /* init.c */ /* This file contains the functions to generate the initial */ /* system configuration. */ /* It includes: */ /* initO, get_rule(), cal_comm(), */ /* cal_rule_comm(), cal_load(), init_window(). */ /* The hardware_topo() is in hardware.c */ /* The print functions and allocation functions are in */ /* simal-lib.c */ / * : > j e * > f c > f c * j 4 s > t e s t ; * 3 | c > t c 3 l c i f c j | e j i < j ( e s f c s f : j f c j t c ) f : > f c j f c ) t c > k s f c > j c * a l e 5 | c i f c a f c s f c s f c a f c * > f c } f c > f c > j e } J e s i c j | C 5 ( e s | c j l e j f c j f c s t c s l c j | < > f c s f c j f c s J « j f c s f c > | e j | c > t i : j j e / #include "simal.h" / * init.c */ /* Initialize the system configuration */ /* by arbitrary distributing rules onto */ / * each node . */ void initO { int i,j,k,r; float avg,total; hardware..topoO ; /* build up distance matrix */ init_window(); / * initialize graphics window */ if ( I >= M ) { /* number of nodes > number of rules, */ for ( i = 0; i < M; i++ ) { /* each node gets one rule */ nodetab[i] = node_alloc(); nodetab[i]->tag = 0; nodetab[i]->num_rule = 0; get_rule(i,i); > > else - ( /* number of rules > number of nodes * / j = 0; for ( i = 0; i < K; i++ ) { nodetabCi] = node_alloc(); nodetab[i]->num_rule = 0; } switch ( init_choice ) { /* 1-1: RR: Round-Robin */ case 1: while ( j < M ) • { for ( i = 0; i < N; i++ ) { nodetab [i]->tag = 0; get_rule(i , j); j++; 177 if ( j >= M ) break; > > break; /* 1-2: Equal Divider */ case 2: k = M / N; r = M - k * N; for ( i =0; i < N; i++ ) { for ( j =0; j<k; j++) { nodetab[i]->tag = 0; get_rule(i,( i*k +j)) ; > } j = N * k; for ( i = 0; i < r; i++ ) { nodetab[i]->tag = 0; get_rule(i,j) ; j++; > break; default : break; } /* end switch */ y /* end else */ for ( i = 0; i < N; i++ ) { nodetab[i]->comm = cal_comm(i); nodetab[i]->par_cost = cal_para_cost(i); nodetab[i]->load = cal_load(i); > print_rule_tab(); } /* get_rule */ /* Assigns rule onto node, incrementing */ /* the number of rules on the node. */ /j*****************************************/ void get _rule(node,rule) int node,rule; { MD m; int i; m = rule_tab[rule]; m->node = node; m->tag = 0; m->id = rule; if ( nodetab[node]->num_rule == 0 ) { 178 nodetab[node]->rules[0] = rule; > else { for ( i = 0; i < nodetab[node]->num_rule; i++ ) I nodetab[node]->rules[i] = rule; > nodetab[node]->num_rule++; > /* cal_comm */ / * Calculate the node communication */ /* cost with other nodes by accumulating */ /* each rule's communication cost. */ float cal_comm(node) int node; { float comm; int i; comm = 0;. for ( i = 0; i < nodetab[node]->num_rule; i++ ) { comm += cal_rule_comm(nodetab[node]->rules[i],node); > return(comm); > /* cal_rule_comm() */ / * Calculate rule communication cost */ / * with modules in other nodes. If two */ /* rules are within the same node, then */ / * communication cost between them is 0. * / / * The distance between two nodes is */ /* based on the topology of the network, */ /* which is defined in D_matrix. */ float cal_rule_comm(rule.node) int rule; int node; { C tmp; float comm; MD m; int j; 179 m = rule.tab[rule]; comm = 0; for ( tmp = m->neighbor; tmp != NULL; tmp = tmp->next ) { if ( rule_tab[tmp->id]->node != node ) { j = rule_tab[tmp->id]->node; comm = comm + D[node][j] * tmp->comm_time; > > rule_tab[rule]->comm = comm; return(comm); > /*****************************************/ /* cal_para_cost() */ /* Calculate the additional cost of the */ /* parallelism lost, */ /*****************************************/ float cal_para_cost(node) int node; -[ int i,j,x,y; float cost; for ( i = 0; i < nodetab [node]->num_rule; i++ ) {. for ( j = i+1; j < nodetab[node]->num_rule; j++ ) { x = nodetab[node]->rules [i]; y = nodetab[node]->rules[j]; if ( par.cost.matrix[x][y] != 0 ) cost += W_matrix[x] * par_cost_matrix[x][y]; > > return(cost); > /* cal_load() * / /* Calculate the local load at each node.*/ /jit****************************************/ float cal_load(node) int node; { int i,j,x,y; float cost; cost = 0; for ( i = 0; i < nodetab [node]->num_rule; i++ ) • { x = nodetab[node]->rules[i]; cost += MATCH_TIME * W.matrix[x]; 180 > cost += nodetab[node]->par_cost; return(cost); } /*****************************************/ /* init_window() */ /* Initialize the sun graphics window for*/ /* drawing the E-curve. */ /*****************************************/ void init_window() - c initgraphics(0,0,600,300,»M’,ME - c u r v e " ); X0 = 30; YO = 270; Xlen = 520; Ylen = 250; draw_line(X0,Y0,X0+Xlen,Y0,1); draw_line(X0,Y0,X0,YO-Ylen,1); put_text(15,10,1,"E"); put_text(580,280,1,"T"); X_interval = (int ) ( Xlen / 120 ); > A .8 H ardw are Sim ulator M od u le /* hardware.c */ /* This file contains the functions to generate the distance */ /* matrix according to the hardware topology and dimension. */ /> * ( > fe > |e s te ) 4 e ) f c ) f c ) t:) f c jf c j( c :4 c ) t:s |c 3 f c :f c jie 3 f c s ! < * * * * * * * * * ^ * + *******************************/ /* hardware_topo() */ /* generate distance matrix D according */ /* the architecture topology. */ void hardware(_topo) • c int i,j,k ,d; unsigned x,y; for ( i = 0; i < N; i++ ) for ( j = 0; j < N; j++ ) { switch ( topology ) { case HYPERCUBE : /* Hypercube topology */ 181 ___ X = i j ; d = 0; for ( k = 0; k < DIM; k++ ) { y = x >> (k); if((yfei)-=i) d++; > D[i] [j] = d; break; case MESH: /* mesh topology */ if ( k = abs( i- j) < DIM; D[i][j] = k; else - [ d = int ( abs( i - j ) / DIM ); D[i] [j] = d + k; > break; case RING : /* ring topology */ if ( k = abs( i-j)< =(N /2) ) D[i] [j] = k; else D [i][j] = N - max(i.j) + min(i.j); break; default: break; > > } > A .9 U p d a te M od u le /* update.c */ /* This file contains function to update the system state. */ /* It includes: */ / * gen_update(), update_node(), */ /* reject_conf(), accept.conf(), */ /* save_conf(), cal_E(), */ /* flt_abs(), BoltzmannO . */ / * The functions to generate a new configuration are in */ / * generate.c. */ #include "simal.h" /sic*********************#****************************/ / * gen_update() */ /* This function generates a new configuration */ / * for a given ml and a temperature T, then decides*/ 182 /* if this new configuration can be accepted. */ /j***************************************************/ void gen_update(ml,T) int ml; float T; { int m2th,nl,n2; int ok; int i ; nl = rule_tab[ml]->node; switch ( gen_fun ) { case L_EX : /* local exchange */ n2 = l_select_node(nl); save_conf(nl,n2); m2th = select_rule(n2); exchange(nl,ml,n2,m2th); break; case G_EX : /* global exchange */ n2 = g_select_node(nl); save.conf(nl,n2); m2th = select_rule(n2); exchange(nl,ml,n2,m2th); break; case MIGRATE : /* migrate */ n2 = l_select_node(nl); save.conf(nl,n2); m2th = EMPTY; migrate(nl,ml,n2); break; default: break; > if ( debug1 ) print_nodetab(0); El = cal_E(0); if ( debug ) printf ("\n El = 7,f\n",El); delta_E = El -E; if ( debug ) printf (" delta E = ’ /. f \n\n" ,delta_E) ; if ( delta_E <= 0 ) ok = 1; else ok = Boltzmann(T,delta_E); if ( ok == 1 ) I accept_conf(nl,n2); if ( debug ) { printf("*** Accept ***\n“); print_nodetab(0); 183 > E = El; > els© { if ( debug ) - printf("*** Reject ***\n"); reject_conf(ml,m2th,nl,n2); > E_tab[ml] = E; > /* save_conf() */ /* Save the previous configuration by copying */ /* the nodes which will exchange rules to form a */ /* new configuration to the old node table. */ /***************************************************/ void save_conf(nl,n2) int nl,n2; • c int i,j; nodetab[nl]->tag = 1; nodetab[n2]->tag = 1; for ( i = 0; i < N; i++ ) { if ( nodetab[i]->tag == 1 ) { old_nodetab[i] = node_alloc(); old_nodetab[i]->tag = 0; old_nodetab[i]->load = nodetab[i]->load; old_nodetab[i]->par_cost = nodetab[i]->par_cost; old_nodetab[i]->num_rule = nodetab[i]->num_rule; old_nodetab[i]->comm = nodetab[i]->comm; for ( j =0; j <= old.nodetab[i]->num_rule; j++ ) ■ c old_nodetab[i]->rules[j] = nodetab[i]->rules [j]; > > else • { old_nodetab[i] = nodetab[i]; > > > /* BoltzmannO */ /* Boltzmann function. */ int Boltzmann(T,delta_E) 184 float T; float delta.E; ■ c double drand48 C); double x,y ; double exp(); x = ( - delta_E ) / T; x = exp(x); y = drand48(); if ( y < x ) return(1); else return(O); > /* update_node() */ /* Update node information by only changing data */ /* of previously assigned rule ml to new assigned */ /* rule m2. */ /* EMPTY is used for migrate case. * / / I * * * * * # * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * : * / void update_node(n, ml,m2) int n,ml,m2; - c ND node; int i,j; node = nodetab[n]; if (ml != EMPTY && m2 != EMPTY ) i node->load -= node->par_cost; for ( i = 0; i < node->num_rule; i++ ) { j = node->rules [i]; if ( j != ml && j != m2 && par.cost.matrix[j][ml] != 0 ) node->par_cost=W_matrix[ml] * par_cost_matrix[j][ml]; > for ( i = 0; i < node->num_rule; i++ ) - [ j = node->rules[i]; if ( j != m2 && j != ml && par.cost.matrix[j][m2] != 0 ) node->par_cost += W.matrix[m2] * par.cost.matrix[j][m2]; } node->load +=(-W_matrix[ml] + W.matrix[m2]) * MATCH.TIME; node->load += node->par_cost; } 185 else { if (ml != EMPTY && m2 == EMPTY ) { node->load -= node->par_cost; for ( i = 0; i < node->num_rule; i++ ) { j = node->rules[i]; if ( j != ml && par_cost_matrix[j][ml] != 0 ) node->par_cost-=W_matrix[ml]*par_cost_matrix[j][ml]; > node->load += ( - W_matrix[ml] ) * MATCH_TIME; node->load += node->par_cost; > else { if ( ml == EMPTY M m2 != EMPTY ) { node->load -= node->par_cost; for ( i = 0; i < node->num_rule; i++ ) ■ [ j = node->rules[i]; if ( j != m2 && par_cost_matrix[j][m2] != 0 ) node->par_cost += W_matrix[m2] * par_cost_matrix[j][m2]; > node->load += ( W_matrix[m2] ) * MATCH_TIME; node->load += node->par_cost; > else printf("Error in update\n"); > > /I!***************************************************/ /* accept_conf() */ /* Accept new configuration by simply destroying */ / * the saved old node information. */ void accept_conf(nl,n2) int nl,n2; t free(old_nodetab[nl] ); free(old_nodetab[n2]); old_nodetab[nl] = nodetab[nl]; old_nodetab[n2] = nodetab[n2]; > / * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / / * reject_conf() * / /* Reject the new configuration by destroying */ /* the new node information, restoring the old */ /* node information, and restoring the previous */ 186 /* r u le t a b le in fo r m a tio n . * / void reject_conf(i,jth.nl,n2) int i,jth,nl,n2; { int k; free(nodetab[nl]); free(nodetab[n2]); nodetab[nl] = old_nodetab [nl]; nodetab[n2] = old_nodetab[n2]; rule_tab[i]->node = nl; if ( jth != EMPTY ) { k = nodetab[n2]->rules[jth]; rule_tab[k]->node = n2; > E_par = old_E_par; E_imb = old_E_imb; E_comm = old_E_comm; total_load = old_total_load; load_avg = old_load_avg; J /* flt_abs() */ /* Return the absolute value of a float. */ float flt_abs(i) float i; ■ C returnC ( i >= 0 ) ? i : (-i) ); > /* cal_E() */ /* Calculate the new cost function E for a new */ /* system configuration. The previously calculated*/ /* total_comm and total_load are used, only */ /* updating the information of modified nodes. */ /* The global variables total_comml and */ /* total_loadl are saved for updating total_comm */ /* and total_load if the new configuration is */ /* is accepted. */ float cal_E() { int i; 187 old„E_par = E_par; old_total_load = total_load; old_E_comm = E_comm; old_E_imb = E_imb; old_load_avg = load.avg; E_comm = 0; E_imb = 0; for ( i = 0; i < N; i++ ) if ( nodetab[i]->tag == 1 ) { total_load -= old_nodetab[i]->load + nodetab[i]->load; E_par -= old_nodetab[i]->par_cost + nodetab[i]->par_cost; > nodetab[i]->comm = cal_comm(i); E_comm += nodetab[i]->comm; > load_avg = total_load / N; for ( i = 0; i < N; i++ ) { E imb += ( float )flt_abs(load_avg - nodetab[i]->load); > returnC E_pcir + E_imb + E_comm ) ; > A .10 G en eratin g M od u le /************************************************************/ /* generate.c */ /* This file contains functions to generate a new system */ / * configuration. It includes: */ /* select_rule(), l_select_node(), g_select_node(), */ /* exchangeO , migrateO. */ #include "simal.h" / * l_select_node() */ /* Randomly select a node from local neighborhood */ /* Note : neighborhood is network topology */ /* dependent, defined by D_matrix. */ int l_select_node(node) int node; ■ c int i,x,ok; double drand48(); if ( debug1 ) printf("Local select node\n"); 188 if ( N != 1 ) { x =. ( int ) ( N * drand48() ); while ( D_matrix[x][node] != 1 ) x = ( int ) ( N * drand48() ) ; return(x); } else return(ERROR); > /***************************************************/ /* g_select_node() */ /* Randomly select a node globally. */ int g_select_node(node) int node; { int x; if ( N != 1 ) - [ x = ( int ) ( N * drand48() ); while ( x == node ) - [ x = ( int ) ( N * drand48() ); > return(x); > else return(ERROR); > /* select_rule * / / * Randomly select ith rule in node for new */ /* trial. */ /* Note : it returns the ith rule, not rule i. */ void select_rule(node) int node; { float drand48(); int i; i = ( int ) ( nodetab[node]->num_rule * drand48() ); return(i); > /***************************************************/ /* exchange(nl,ml,n2,m2th) */ /* Exchange rule ml in node nl with m2th rule */ 189 /* in node n2. */ /********* void exchange(nl,ml,n2,m2th) int nl,n2,ml,m2th; { int i,mlth,m2; if ( debug1 ) printf (" rule ' / , d in node 7,d <-> rule */,dth in node '/.d\n", ml,nl,m2th,n2); m2 = nodetab[n2]->rules[m2th]; for ( i = 0; nodetab[nl]->rules[i] != ml; i++ ) t mlth = i; nodetab[nl}->rules[mlth] = m2; nodetab [n2]->rules[m2th] = ml; rule_tab[ml]->node = n2; rule_tab[m2]->node = nl; rule_tab[ml]->tag = 1; rule_tab[m2]->tag = 1; update_node(nl, ml,m2); update_node(n2, m2,ml); > /***************************************************/ /* migrate(nl,ml,n2) */ /* migrate rule ml in node nl to node 2. */ void migrateCnl,ml,n2) int nl,n2,ml; int i,j; for ( i = 0; nodetab[nl]->rules[i] != ml; i++ ) f nodetab[nl]->num_rule--; for ( j = i; j < nodetab[nl]->num_rule; j++ ) { nodetab[nl]->rules[j] = nodetab[nl]->rules[j+l]; > i = nodetab [n2]->nvim_rule; nodetab[n2]->rules[i] = ml; nodetab[n2]->num_rule++; rule_tab[ml]->node = n2; 190 rule_tab[ml]->tag = 1; update_node(nl, ml,EMPTY); update_node(n2, EMPTY,ml); > A .11 U tility and Library M odule /* simal_lib.c */ /* This file contains all print functions and structure */ /* allocation functions, */ /* It includes: */ /* print_comm(), print_nodetab(), print_node(), */ /* print_rule_tab(), print_rel(), print_par_cost() */ /* rule_alloc(), node_alloc(), conn_alloc(), */ /* file_alloc(). */ ^ j f c j t e a f c } f c 3 | c j J « > | c » | c j J e j f c s l c j f c } t e j f c j | c » f c H < > + c s f c 5 l e > f c a f c > | c > J c } f c j J c j + c j f c > J e j t c 3 i e > | « s f c 5 i c > | c > j < j | « s t e j f c > l c ) f c s t e j f c > f : * j f c s i e 3 ( e > f c j l c s l c > ( e a i < > J e 5 t : j f c s f c 3 | < ) f : J t c } f e j f c a | c / #includ© "simal.h" /* print_comm() */ /* Print rule communication matrix. */ void print_comm() ■ c int i , j ; printf(" Rule communication\n\n"); for ( i = 0; i <= M-l; i++ ) { printf ("*/.d ",i); > printf ("\n \n") ; for ( i = 0; i <= M-l; i++) { for ( j = 0; j <= M-l; j++ ) ■ { printf ("*/,d " , com_matrix [i] [j] ) ; } printf("\n"); > > /* print_rule_tab() */ /* Print the rule table, showing the rule id, */ /* number of processes in rule, its connection */ /* with other rules and the communication amount. */ 191 /I***************************************************/ void print_rule_tab() { int i; printf("\n Rule Information\n\n"); for ( i = 0; i <= M-l; i++ ) { printf(" rule [ * / . d ] : \n",i); print_rel(i); print_par_cost(i); printf (" node 7,d \n" ,rule_tab[i] ->node) ; > > /* print_rel() */ / * Print the relationship with other rules and */ /* the communication amount. */ /I******************************************!!!********/ void . print_rel(i) int i; { C p; printf(" intercommunication:\n"); for ( p = rule.tab[i]->neighbor; p != N U LL; p = p->next ) { printf (" with rule * / . d : '/.d\n" , p->id, p->comm_time); > > /*******************************%*******************/ /* print_par_cost() */ f* Print the addition cost for parallelism loss. */ /***************************************************/ void print _par_ cost(i) int i ; { int j ; printf(" If in same node, parallelism cost is:\n"); for ( j = 0; j < M ; j++ ) { i f C par.matrix[i][j] != 0 ) printf (" with rule ‘ /,d : */,d\n" , j ,par_cost_matrix [i] [j] ) ; > > 192 /* conn_alloc() */ /* Allocate a space for connection structure. */ /***************************************************/ c conn.alloc() { char *calloc(); return( (C) callocCl, sizeof(CONN) ) ); } /* file_alloc() */ /* Allocate a space for open file. */ FILE *file_alloc() ■ c char *calloc(); return( (FILE *) calloc(l, 10000) ); > /* rule_alloc() */ /* Allocate a space for rule structure. */ MD rule_alloc() - c char *calloc(); return( (MD) calloc(l, sizeof(MODULE) ) ); > /* node_alloc() * / /* Allocate a space for node structure. */ m node_alloc() { char *calloc(); return( (ND) calloc(l, sizeof(NODE) ) ); > /* print_nodetab() */ /* Print out the node information according to */ 193 / * f l a g , i t f l a g = 0 , p r i n t t h e node t a b l e , e l s e * / / * p r i n t t h e o l d node t a b l e . * / /♦lit***********************************************/ v o id p r i n t _ n o d e t a b ( f l a g ) i n t f l a g ; { i n t i ; i f ( f l a g == 0 ) p r i n t f ( " \ n New Node T able I n f o r m a t i o n \ n \ n " ) ; e l s e p r i n t f ( " \ n Old Node T a b le I n f o r m a t i o n \ n \ n " ) ; f o r ( i = 0; i <= N - l ; i++ ) { p r i n t _ n o d e ( i , f l a g ) ; > } /*************************************************/ / * p r i n t _ n o d e ( ) * / /* P r i n t in f o r m a t i o n o f node i . I f f l a g =0, */ / * p r i n t node i i n node t a b l e , e l s e p r i n t node i * / /* i n o l d node t a b l e . * / v o i d p r i n t _ n o d e ( i , f l a g ) i n t i , f l a g ; { M D tm p; i n t k ; printf (“node C ' / , d ] : rules : " , i) ; if ( flag == 0 ) - ( for ( k = 0 ; k < nodetab[i]->num_rule; k++ ) { printf ("'/,d, " ,nodetab Hi] ->rules [k] ) ; > printf ("\n load : */.f\n" .nodetab [i]->load) ; printf(" comm: 7,f \n“ ,nodetab[i]->comm) ; printf (" par cost: '/.f\n" , nodetab [i]->par_cost) ; printf (" num_rule: 7,d\n“ .nodetab[i]->num_rule) ; > else • { for ( k = 0 ; k < old_nodetab[i]->num_rule; k++ ) { printf (“' / . d , " ,old_nodetab [ij ->rules [k] ) ; > printf ("\n load : 7,f \n" ,old_nodetab[i]->load) ; printf (" comm: ' / . f \n" ,old_nodetab[i]->comm) ; printf (" par cost: '/.f\n“ .nodetab [i]->par_cost) ; printf (" num.rule: '/.d\n" ,old_nodetab[i]->num_rule) ; > 194 A .12 D eclaration M od u le This m odule declares the global variables. It is alw ays nice to declare them in a separate file and define them in the header file. /* simal.decl.c ---- Declaration file of SIMAL */ #include Ms imal.hM /* input parameter */ /*****************************************/ int M ; /* number of rules */ int N; / * number of nodes */ int cool.choice; /* choice of cooling method */ double A; /* cooling constant */ int init_choice; /* choice of initial assignments int gen_fun; /* choice of generate function */ int K; /* termination constant */ int topology; /* network topology */ int DIM; /* dimension or range of topology / * global variable */ float T; /* initial temperature */ double esp; / * termination epsilon */ float load_avg; /* system load average */ float old_load_avg; float total_load; /* system total load */ float old_total_load; float E; /* cost function of accepted config */ float El; /* cost function of new trial * / float delta_E; /* El - E */ float E_par; /* total parallelism loss */ float old_E_par; float E_comm; /* total communication of accepted * / float old_E_comm; float E_imb; /* total load imbalance */ float old_E_imb; int init_T_ok; /* flag for init T setting */ int num_accept; /* number of trials accepted */ int debug; /* high level debug flag * / int debug1; /* low level debug flag * / /* static array */ 195 int com.matrix[MAX.RULE][MAX.RULE] ;/* communication matrix */ int W.matrix[MAX.RULE]; /* firing frequency matrix * / float par.cost.matrix[MAX.RULE][MAX.RULE];/* par cost matrix */ int par.matrix[MAX_RULE][MAX_RULE]; /* parallelism matrix */ int alloc.matrix[MAX.RULE]; /* allocation matrix * / int D_matrix[MAX_NODE][MAX.NODE]; /* distance matrix */ float E.tab[MAX.RULE] ; /* cost table for each trial */ ND nodetab[MAX.NODE]; /* node table,new config */ ND old.nodetab[MAX.NODE]; /* old node table,saved config */ MD rule.tab[MAX.RULE]; /* rule table */ float E_avg[MAX_T]; /* average cost at T */ / * Variables used for graphics * / /Hot'***************************************/ int XO; int YO; int Xlen; int Ylen; int y.EO; int X.interval; int old.x; int old.y; float EO; float rel.EO; A .13 G raphics M odule /* graph.c */ / * This file contains the functions as graphic tools to */ / * generate the E-curve during the annealing process. */ /* It includes: */ / * singal.handler(), quit_proc(), */ /* initgraphics(), draw.pointO , */ /* endgraphics(), sunexitO , */ /* display(), read-pixel(), */ / * init.colortableC), erase.windowO , */ /* mouseset(), erase.boxO , */ / * put.texC), paint.circleC) , */ /* draw-circle(), draw_line(), */ #include "graph.h" #include "simal.h" /*****************************************/ /* signal.handlerO */ 196 /jit******************************** ********/ void signal_handler(who.signal,when) int *who,signal; Notify_signal_mode when; { if(signal==SIGSEGV) fprintf(stderr,"Segmentation violation. Core not dumped \n"); sunexit(ABNORMAL); > /* quit_proc() */ /*****************************************/ void quit_proc() { done=YES; } /* initgraphicsO */ void initgraphics(xp,yp,wid.hei,monitor.type.heading,my_font) int xp.yp,wid.hei; char monitor_type; char * heading; char * my.font; { int accessfl; char ptrxCiO],ptry[lO],ptrwid[lO]; M J ) 'c O CO) { if ((monitor_type != 'm') && (monitor_type && (monitor_type && (monitor_type fprintf(stdout,"Unknown device \n"); sunexit(ABNORMAL); > else switch(monitor_type) - { case ’m’ : case ’M’ : fprintf(stdout, "Device recognized as Black and White Monitor \n“); monitor=M0N0; break; case 1c ’ : case JC’ : fprintf(stdout,"Device recognized as Color Monitor \n"); monitor=CQLOR; break; } if(!(INLIMITS(xp.XMIN.XMAX) && INLIMITS(xp+wid,XMIN,XMAX) 197 && INLIMITS(yp.YMIN.YMAX) && INLIMITS(yp+hei,YMIN,YMAX))) { display("Parameters not in range "."Quit"); sunexit(ABNORMAL); } if((default.font = pf.open(my.font))==NULL) { display("Font does not exist. Default used.","Resume"); default_font = pf_open(DEFAULT_FONT); > xlow=XL0W; ylow=YL0W; width=wid-i; height=hei-i; cmapsize=2 ; me= (int *)&my_client_object; notify_set.signal.func(me,signal.handler.SIGTERM.NOTIFY.ASYNC); notify_set_signal_func(me,signal.handler.SIGSEGV,NOTIFY.ASYNC); notify_set.signal.func(me,signal.handler,SIGINT.NOTIFY.ASYNC); notify_set.signal.func(me,signal.handler,SIGTRAP.NOTIFY.ASYNC); notify_set.signal.func(me.signal.handler,SIGFPE,NOTIFY_ASYNC); frame=window_create(NULL.FRAME, FRAME.LABEL,heading, WIN.X, xp, WIN.Y, yp, o ) ; canvas=window_ereate(frame,CANVAS, WIN.HEIGHT,height+l. WIN.WIDTH,width+1, CANVAS.RETAINED.TRUE, 0); window.fit(frame); pixwin=canvas_pixwin(canvas) ; window.set(frame,WIN.SHOW,TRUE,0); (void)notify.dispatchO ; > / a * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / /* draw.pointO */ /*****************************************/ void drawpoint(x,y,color) int x.y,color; ■ C if(!INLIMITS(color,0,cmapsize-1)) color = 1; if(INLIMITS(x,xlow,width) && INLIMITS(y,ylow.height)) pw.put(pixwin,x,y,color); (void)notify.dispatchO ; > /*****************************************/ /* endgraphicsO */ /*****************************************/ 198 void endgraphics() { sunexit(NORMAL); > /* sunexitO */ void sunexit(i) int i; ■ c window.set(frame,FRAME_N0_C0NFIRM,TRUE,O); window.destroy(frame); (void)notify_dispatch(); if(i==ABNORMAL) exit(l); else return; > j 3fc^e^es{e3fc^;i|c9{c^e^c^c^c>fc3fe]^^e^e% ^c^^esjc^cjfc3^}fe)^9^>|e^c}tc)|e]te}fc^c34e3|c)|e^e34e>^ j /* display() */ void display(mesg.butmesg) char *mesg,*butmesg; { frammesg=window_create(NULL.FRAME,0); Mesgpanel=window_create(frammesg,PANEL,0); panel.create.item(Mesgpanel,PANEL.MESSAGE, PANEL.LABEL.STRING.rnesg, 0); panel_create_item(Mesgpanel,PANEL_BUTTON, PANEL.LABEL.IMAGE, panel.button.image(Mesgpanel.butmesg,0,0), PANEL.N0TIFY.PR0C,quit.proc, 0 ) ; window.fit(Mesgpanel); window.fit(frammesg); window.set(frammesg,WIN.SHOW,TRUE,0); done=N0; while(done != YES) notif y.dispatchO ; window_set(frammesg,FRAME.N0.C0NFIRM,TRUE,0); window.destroy(frammesg); } /* read.pixelO */ 199 int read.pixel(x.y) int x,y; < int i; (void)notify.dispatchO ; if(!(INLIMITS(x.xlow,width) && INLIMITS(y,ylow.height))) return (0); i=pw_get(pixwin, x, y ) ; (void)notify.dispatchO; return(i); } /* init.colortableO */ void init_colortable(mapname.red,green,blue.maxcolor) char *mapname; color red.green,bine; int maxcolor; { pw_setcmsname(pixwin,mapname); pw.putcolormap(pixwin,0,MAXCMAPSIZE,red,green,blue); cmapsize=maxcolor; } /* erase.windowQ */ void erase.windowO - c pw.writebackground(pixwin,0,0,width+1,height+l.BACKGROUND); (void)notify_dispatch(); > /* mousegetO */ int mouseget(x.y) int *x,*y; { Event *event,*event_in_canvas; short id; int fd; window.set(canvas,WIN.CONSUME.PICK.EVENTS, WIN.NO.EVENTS,WIN.MOUSE.BUTTONS,0, 0); event=(Event *)malloc(sizeof(Event)); fd=(int)window_get(canvas.WIN.FD); while(l) • { 200 w in d o w _ r e a d _ e v e n t ( c a n v a s .e v e n t ) ; i d = e v e n t _ i d ( e v e n t ); i f ( (id==MS_LEFT)I I(id==MS_RIGHT)I I (id==MS_MIDDLE)) b r e a k ; > / * e v e n t _ i n _ c a n v a s = c a n v a s _ e v e n t ( e v e n t ) ; * / * y = e v e n t _ y ( e v e n t ) ; * x = e v e n t _ x ( e v e n t ) ; / * P u r i s t s may c o v e r t h e i r e y e s * / w in d o w _ r e a d _ e v e n t ( c a n v a s .e v e n t ) ; / * S a f e t o open e y e s now * / if(id==MS_LEFT)return(MOUSE_LEFT); i f ( id==MS_RIGHT) return(MOUSE.RIGHT); i f ( id==MS_MIDDLE)return(MOUSE_MIDDLE); > / * e r a s e _ b o x ( ) * / v o i d e r a s e _ b o x ( x l , y l , x 2 , y 2 , c o l o r ) i n t x l , y l , x 2 , y 2 , c o l o r ; p w _ r o p (p ix w in , x l , y l , x 2 - x l , y 2 - y l , P I X _ S R C I P IX _ C 0 L 0 R (co lo r), ( P i x r e c t *)N U L L ,0,0); ( v o i d ) n o t i f y _ d i s p a t c h ( ); > / * p u t _ t e x t ( ) * / v o i d p u t _ t e x t ( x , y . c o l o r , s t r ) i n t x , y , c o l o r ; ch ar * s t r ; i f ( m o n i t o r != COLOR) p w _ t e x t ( p i x w i n , x , y , P I X _ S R C , d e f a u l t _ f o n t , s t r ) ; e l s e pw_t e x t ( p i x w i n , x . y , PIX_SRC I P I X _ C O L O R (c o lo r ),d e fa u lt_ fo n t, s t r ) ; } / * p a i n t _ c i r c l e ( ) * / j If. If. if.-if.it • * ( . If. if. if. if. 1 (. If If. if. i f . - i f if.it. " if. If. if / v o i d p a i n t _ c i r c l e ( c x , c y , r , c o l o r ) i n t c x , c y , r , c o l o r ; ■ c i n t x , y , p ; 201 x = 0 ; y = r; p = 3 - 2 * y; while (x <= y) - C pw_vector(pixwin,cx-x,cy-y,cx+x,cy-y, PIX.SRC I PIX_C0L0R(color), color); pw_vector(pixwin,cx-x,cy+y,cx+x,cy+y, PIX_SRC I PIX_C0L0R(color), color); pw_vector(pixwin,cx-y,cy-x,cx+y,cy-x, PIX.SRC I PIX_C0L0R(color), color); pw_vector(pixwin,cx-y,cy+x,cx+y,cy+x, PIX.SRC I PIX_COLOR(color), color); if (p < 0) p += 4 * x + 6 ; else { p += 4 * (x - y) + 1 0; y— ; > x++; } > /* draw_circle() */ void draw_circle(cx,cy,r,color) int cx,cy,r.color; • c int x.y.p; x = 0 ; y = r; p = 3 - 2 * y; while (x <= y) { pw_put(pixwin,cx-x,cy-y.color) ; pw_put(pixwin,cx-x,cy+y,color); pw_put(pixwin,cx+x,cy-y.color); pw_put(pixwin,cx+x,cy+y,color); pw_put(pixwin,cx-y,cy-x,color); pw_put(pixwin,cx-y,cy+x,color); pw_put(pixwin,cx+y,cy-x,color); pw_put(pixwin,cx+y,cy+x,color); if (p < 0) p += 4 * x + 6 ; else ■ { p += 4 * (x - y) + 1 0; y--; } x++; > 202 > /***** * ******** * I******************* *******/ /* draw_line() */ void draw_line(xl,yl,x2 ,y2 ,c) int xl, yl, x2 , y2 , c; { pw_vector(pixwin,xl,yl,x2,y2,PIX_SRC I PIX_C0L0R(c),c); } 203 A .14 Screen Dum p of the E-curve The following is a screen dump example of the E-curve (cost function curve) ! generated during the annealing process created by the graphics facility of SIMAL. 1 ( E -curva 8446 6250 Figure A .l: The E-curve generated during the annealing process in m apping th e ! Toru-w altz production system onto an 8 -node hypercube m ulticom puter | f j I 204 Appendix B Benchmark Production Systems The benchm ark production systems used in the static m apping experim ents are originally w ritten in 0 P S 5 and provide by CMU. Steve Kuo modified them using the Single-Context-M ultiple-Rule model and copy-and-constrained technique to generate more rules which can be fired in parallel. B .l T oru-W altz Program t B .l .l Program Code Torujw altz is a production system im plem enting the W altz’s algorithm for edge labelling. ; INITIAL 0PS5 VERSION OF WALTZ’S ALGORITHM by Toru Ishida »»>*>* I II M M ) M M M M M I M M M M n M n M I) > M M > M M M ) M M M M M M > ; Data & Knowledge Structure for Waltz’s Algorithm ; ) ) ) ) ) I > > > I > > ) ) ) > ) » ) I ) ) ) ) M ) ) I ) ) > M I I ) ) ) M ) ) M I I ) ) I I I ) i I ) > > ) ) > ) I ) > ) ^ I ) ) > (literalize possible-junction-label junction-type line-1 line-2 line-3) 205 (literalize junction junction-type junction-ID line-ID-1 line-ID-2 line-ID-3) (literalize labelling-candidate junction-type junction-ID line-1 line-2 line-3) (literalize stage status) >> j >>»>>>>>>>>»> » )>»»>>>» I I > f »» > >» l »>»*»>>» » Knowledge of Possible Junction Labelling > > > > > > j » > » j » J > I » » > » J ) > I » > > > J > ) » > > ) > I ) > > > l ) » > ;(literalize possible-junction-label junction-type line-1 line-2 line-3) \ / Junction type : L 1 \ / 2 V (p 0 .initialize (stage “status initialize) --> (remove 1) (make (make stage “status make-data) possible-junction-label “junction-type L “line-1 out “line-2 in line-3 nil) (make possible-junction-label “junction-type L “line-1 in "line-2 out “line-3 nil) (make possible-junction-label “junction-type L "line-1 + “line-2 out “line-3 nil) (make possible-junction-label “junction-type L “line-1 in “line-2 + “line-3 nil) (make possible-junction-label “junction-type L “line-1 - "line- 2 in “line-3 nil) (make possible-junction-label “junction-type L “line-1 out “line-2 _ “line-3 nil) 1 \ / 3 Junction type: FORK V 2 1 206 (make possible-junction-label “junction-type FORK “line-1 + “line-2 + “line-3 + ) (make possible-junction-label “junction-type FORK "line-1 - ‘line-2 _ ‘line-3 - ) (make possible-junction-label ‘junction-type FORK “line-1 in “line-2 — “line-3 out) (make possible-junction-label “junction-type FORK “line-1 - “line-2 out ‘line-3 in ) (make possible-junction-label “junction-type FORK “line-1 out “line-2 in ‘line-3 - ) ; l ; Junction type: T 3 1 12 (make possible-junction-label “junction-type T ‘line-1 out “line-2 + ‘line-3 in) (make possible-junction-label “junction-type T “line-1 out ‘line-2 ‘line-3 in) (make possible-junction-label “junction-type T “line-1 out “line-2 in ‘line-3 in) (make possible-junction-label ‘junction-type T ‘line-1 out “line-2 out “line-3 in) ; / i \ ; Junction type: ARROW 1 / 1 \ 3 ; / 12 \ (make possible-junction-label “junction-type ARROW ‘line-1 in ~line-2 + ~line-3 out) (make possible-junction-label "junction-type ARROW "line-1 - ‘line-2 + “line-3 - ) (make possible-junction-label “junction-type ARROW ‘line-1 + “line-2 - ‘line-3 + ) ) Scene to be Analyzed ;(literalize junction junction-type junction-ID line-ID-i line-ID-2 line-ID-3) A B / \ / \ 1 / \ 2 3/ \ 4 207i / C \ / D \ / 5/l\6 \ / /1\ \ E /10/+1 \ \ / / 1 \ \ 1 W 1 \ \ / 7/ 81 \9 \ 1 1 F 1- \ \ / / 1 \ \ 1 112 141 +\ G /+ 1- \ \ 1 1 1 1 + 1 \ / 1 \ \ 1 1 1 L\ / M l +\ \ 1 114/K\ 1 / \ \39\ 0 H 1 1-/ \15 116 17/ \18 N\+/l /-\1/ P \ - 1+ - / Q \ - 1 1 1 1 / 13J /1\ \ 1 / /1\ \ 191 120 1 1 21/ / 1 \ \ 1 / / 1 \ \ + 1 1 1 1 / 22/ 1 \ \ 1 / / 1 \ \ 11 1 1 R/ / I 24\ \1/ / 1 27\ \ 1 1 V 1\30 /+ 231 +\ T /+ 261 +\ \ 1 1 1 \+ / -1 \ /25 -1 \ W\1 1 1 \S/ 1 U \ / 1 V \ /+ 1 291 1 1 1 1 128 1 1 311 /X\ 321+ /Y\ +1 1 1 1+ / \ 1 / \ 331 1 1 1 / \ 1 / \ 1 1 Z \ 1 /40 \ 1 / \ 1 / DD \ 1 / 35\ 1 /36 37\ 1 / 38 34\ 1 / \ 1 / \ 1 / \1/ \1/ \1/ AA BB CC (p l.make-data (stage "status make-data) — > (remove 1) (make stage “status enumerate-possible-candidates) (make junction “ junctio'n-type L “junction-ID A "line-ID-1 2 "line-ID-2 1 “line-ID-3 NIL) (make junction "junction-type L “junction-ID B “line-ID-1 4 “line-ID-2 3 "line-ID-3 NIL) (make junction “junction-type ARROW “junction-ID C “line-ID-1 5 “line-ID-2 41 “line-ID-3 6) (make junction “junction-type ARROW "junction-ID D “line-ID-1 7 “line-ID-2 8 “line-ID-3 9) (make junction “junction-type ARROW “junction-ID E “line-ID-1 11 “line-ID-2 10 “line-ID-3 1) (make junction “junction-type FORK “junction-ID F “line-ID-1 10 "line-ID-2 12 “line-ID-3 5) (make junction “junction-type L “junction-ID G “line-ID-1 2 “line-ID-2 3 “line-ID-3 NIL) (make junction “junction-type FORK "junction-ID H 208 "line-ID-1 11 "line-ID-2 21 "line-ID-3 13) (make junction "junction-type ARROW "junction-ID J "line-ID-1 14 "line-ID-2 12 "line-ID-3 13) (make junction “junction-type FORK "junction-ID K “line-ID-1 41 “line-ID-2 14 “line-ID-3 15) (make junction "junction-type FORK "junction-ID L "line-ID-1 6 “line-ID-2 16 "line-ID-3 7) (make junction "junction-type FORK “junction-ID M “line-ID-1 8 ‘line-ID-2 17 "line-ID-3 18) (make junction “junction-type FORK "junction-ID N "line-ID-1 9 “line-ID-2 19 "line-ID-3 39) (make junction “junction-type ARROW “junction-ID 0 "line-ID-1 4 "line-ID-2 39 "line-ID-3 20) (make junction "junction-type ARROW “junction-ID P "line-ID-1 22 "line-ID-2 23 "line-ID-3 24) (make junction "junction-type ARROW "junction-ID Q "line-ID-1 25 "line-ID-2 26 “line-ID-3 27) (make junction "junction-type ARROW "junction-ID R “line-ID-1 29 "line-ID-2 30 "line-ID-3 21) (make junction "junction-type FORK “junction-ID S "line-ID-1 30 ‘line-ID-2 31 "line-ID-3 22) (make junction "junction-type ARROW "junction-ID T "line-ID-1 17 "line-ID-2 16 “line-ID-3 15) (make junction "junction-type FORK “junction-ID U "line-ID-1 24 "line-ID-2 32 "line-ID-3 25) (make junction “junction-type FORK "junction-ID V “line-ID-1 27 "line-ID-2 33 "line-ID-3 28) (make junction “junction-type ARROW "junction-ID W “line-ID-1 19 “line-ID-2 18 “line-ID-3 28) (make junction "junction-type FORK “junction-ID X "line-ID-1 23 "line-ID-2 40 "line-ID-3 35) (make junction "junction-type FORK “junction-ID Y “line-ID-1 26 "line-ID-2 36 "line-ID-3 37) (make junction “junction-type L “junction-ID Z “line-ID-1 29 "line-ID-2 34 “line-ID-3 NIL) (make junction "junction-type ARROW "junction-ID AA “line-ID-1 40 "line-ID-2 31 "line-ID-3 34) (make junction “junction-type ARROW "junction-ID BB "line-ID-1 36 "line-ID-2 32 "line-ID-3 35) (make junction "junction-type ARROW “junction-ID CC “line-ID-1 38 “line-ID-2 33 “line-ID-3 37) (make ) junction "junction-type L "junction-ID DD "line-ID-1 38 “line-ID-2 20 “line-ID-3 NIL) ) t l )))«>*> l > l » ; Temporal Labelling Candidates ; ;(literalize labelling-candidate junction-ID line-1 line-2 line-3) 209 j I > > » > I » > » » J I J » ) > > > I > » M » M » I » > » » I > M > I > > ) f > » I » > » > » » » > > > » » > » > > I » 1 > I > Production Rules for Waltz's algorithm > > » » » » » > » ; Start ; (p 2.start-Waltz (start) — > (remove l) (make stage “status initialize)) ; Enumerate Possible Candidates ; I ) ) M ) ) > > I M > f 1 M ) M ) > I ) I ) ) ) I ) I > > (p 3.enumerate-possible-candidates-1 .out.in.nil (stage “status enumerate-possible-candidates) (junction "junction-type 1 “junction-ID <j-ID>) (possible-junction-label “junction-type 1 “line-1 out “line-2 int “line-3 nil) -(labelling-candidate “junction-type 1 "junction-ID <j-ID> "line-1 out "line-2 int “line-3 nil) — > (make labelling-candidate "junction-type 1 "junction-ID <j-ID> “line-1 out “line-2 int “line-3 nil)) (p 4.go-to-reduce-candidates (stage “status enumerate-possible-candidates) — > (remove 1) (make stage “status reduce-candidates)) ; Reduce Candidates ; (P 5.one-one-plus (stage “status reduce-candidates) (junction “junction-ID <j-ID-X> “line-ID-1 <1-ID>) (junction “junction-ID {<j-ID-Y> <> <j-ID-X>}- “line-ID-1 <1-ID>) 210 (labelling-candidate “junction-ID <j-ID-X> “line-1 +) -(labelling-candidate “junction-ID <j-ID-Y> “line-1 +) — > ....... (remove 4)) (P 6 .one-one-minus (stage “status reduce-candidates) (junction “junction-ID <j-ID-X> “line-ID-1 <1-ID>) (junction “junction-ID {<j-ID-Y> <> <j-ID-X>} “line-ID-1 <1-ID>) (labelling-candidate “junction-ID <j-ID-X> “line-1 -) -(labelling-candidate “junction-ID <j-ID-Y> “line-1 -) —> (remove 4)) (P 7.one-one-in (stage “status reduce-candidates) (junction “junction-ID <j-ID-X> "line-ID-1 <1-ID>) (junction “junction-ID {<j-ID-Y> <> <j-ID-X>} “line-ID-1 <1-ID>) (labelling-candidate “junction-ID <j-ID-X> "line-1 in) -(labelling-candidate “junction-ID <j-ID-Y> “line-1 out) —> (remove 4)) (P 8 .one-one-out (stage “status reduce-candidates) (junction “junction-ID <j-ID-X> “line-ID-1 <1-ID>) (junction “junction-ID {<j-ID-Y> <> <j-ID-X>> “line-ID-1 <1-ID>) (labelling-candidate “junction-ID <j-ID-X> "line-1 out) -(labelling-candidate “junction-ID <j-ID-Y> “line-1 in) — > (remove 4)) (P 9.one-two-plus (stage “status reduce-candidates) (junction “junction-ID <j-ID-X> “line-ID-1 <1-ID>) (junction "junction-ID {<j-ID-Y> <> <j-ID-X>} “line-ID-2 <1-ID>) (labelling-candidate "junction-ID <j-ID-X> “line-1 +) -(labelling-candidate “junction-ID <j-ID-Y> “line-2 +) — > (remove 4)) (P 10.one-two-minus (stage “status reduce-candidates) (junction “junction-ID <j-ID-X> "line-ID-1 <1-ID>) (junction “junction-ID {<j-ID-Y> <> <j-ID-X>> “line-ID-2 <1-ID>) (labelling-candidate “junction-ID <j-ID-X> “line-1 -) -(labelling-candidate “junction-ID <j-ID-Y> “line-2 -) — > (remove 4)) (P 11.one-two-in 2 1 1 j (stage “status reduce-candidates) (junction “junction-ID <j-ID-X> “line-ID-1 <1-ID>) (junction “junction-ID {<j-ID-Y> <> <j-ID-X>} “line-ID-2 <1-ID>) (labelling-candidate “junction-ID <j-ID-X> “line-1 in) -(labelling-candidate “junction-ID <j-ID-Y> “line-2 out) — > (remove 4)) (P 12.one-two-out (stage “status reduce-candidates) (junction “junction-ID <j-ID-X> “line-ID-1 <1-ID>) (junction “junction-ID {<j-ID-Y> <> <j-ID-X>} "line-ID-2 <1-ID>) (labelling-candidate “junction-ID <j-ID-X> “line-1 out) -(labelling-candidate “junction-ID <j-ID-Y> “line-2 in) — > (remove 4)) (P 13.one-three-plus (stage “status reduce-candidates) (junction “junction-ID <j-ID-X> "line-ID-1 <1-ID>) (junction “junction-ID {<j-ID-Y> <> <j-ID-X>} “line-ID-3 <1-ID>) (labelling-candidate “junction-ID <j~ID~X> “line-1 +) -(labelling-candidate “junction-ID <j-ID-Y> “line-3 +) — > (remove 4)) (P 14.one-three-minus (stage “status reduce-candidates) (junction "junction-ID <j-ID~X> "line-ID-1 <1-ID>) (junction "junction-ID -[<j-ID-Y> <> <j-ID-X>} “line-ID-3 <1-ID>) (labelling-candidate “junction-ID <j-ID-X> “line-1 -) -(labelling-candidate “junction-ID <j-ID-Y> “line-3 -) — > (remove 4)) (P 15.one-three-in (stage “status reduce-candidates) (junction “junction-ID <j-ID-X> “line-ID-1 <1-ID>) (junction “junction-ID {<j-ID-Y> <> <j-ID-X>}- “line-ID-3 <1-ID>) (labelling-candidate “junction-ID <j-ID-X> “line-1 in) -(labelling-candidate “junction-ID <j-ID-Y> “line-3 out) — > (remove 4)) (P 16.one-three-out (stage “status reduce-candidates) (junction “junction-ID <j-ID-X> “line-ID-1 <1-ID>) (junction “junction-ID {<j-ID-Y> <> <j-ID~X>} “line-ID-3 <1-ID>) (labelling-candidate "junction-ID <j-ID-X> “line-1 out) -(labelling-candidate “junction-ID <j-ID-Y> “line-3 in) --> (remove 4)) (P 17.two-two-plus (stage “status reduce-candidates) (junction “junction-ID <j-ID-X> “line-ID-2 <1-ID>) (junction "junction-ID {<j-ID-Y> <> <j-ID-X>} “line-ID-2 <1-ID>) (labelling-candidate “junction-ID <j-ID-X> "line-2 +) -(labelling-candidate “junction-ID <j-ID-Y> “line-2 +) — > (remove 4)) (P 18.two-two-minus (stage “status reduce-candidates) (junction "junction-ID <j-ID-X> “line-ID-2 <1-ID>) (junction “junction-ID -[<j-ID-Y> <> <j-ID-X>} “line-ID-2 <1-ID>) (labelling-candidate “junction-ID <j-ID-X> ' “line-2 -) -(labelling-candidate “junction-ID <j-ID~Y> "line-2 -) — > (remove 4)) (P 19.two-two-in (stage “status reduce-candidates) (junction “junction-ID <j-ID-X> “line-ID-2 <1-ID>) (junction “junction-ID {<j-ID-Y> <> <j-ID-X>} “line-ID-2 <1-ID>) (labelling-candidate "junction-ID <j-ID-X> “line-2 in) -(labelling-candidate “junction-ID <j-ID-Y> “line-2 out) — > (remove 4)) (P 20.two-two-out (stage “status reduce-candidates) (junction “junction-ID <j-ID-X> “line-ID-2 <1-ID>) (junction "junction-ID {<j-ID-Y> <> <j-ID-X>} “line-ID-2 <1-ID>) (labelling-candidate “junction-ID <j-ID-X> "line-2 out) -(labelling-candidate “junction-ID <j-ID-Y> “line-2 in) — > (remove 4)) (P 21.two-three-plus (stage “status reduce-candidates) (junction “junction-ID <j-ID-X> “line-ID-2 <1-ID>) (junction "junction-ID {<j-ID-Y> <> <j-ID-X>} “line-ID-3 <1-ID>) (labelling-candidate “junction-ID <j-ID-X> “line-2 +) -(labelling-candidate “junction-ID <j-ID-Y> “line-3 +) — > (remove 4)) (P 22.two-three-minus (stage “status reduce-candidates) (junction “junction-ID <j-ID-X> “line-ID-2 <1-ID>) (junction “junction-ID -{<j-ID-Y> <> <j-ID-X>} “line-ID-3 <1-ID>) 213 (labelling-candidate “junction-ID <j-ID-X> “line-2 -) -(labelling-candidate “junction-ID <j-ID-Y> “line-3 -) : —> (remove 4)) (P 23.two-three-in (stage “status reduce-candidates) (junction “junction-ID <j-ID-X> “line-ID-2 <1-ID>) (junction “junction-ID {<j-ID-Y> <> <j-ID-X>} “line-ID-3 <1-ID>) (labelling-candidate "junction-ID <j-ID-X> "line-2 in) -(labelling-candidate “junction-ID <j-ID-Y> “line-3 out) --> (remove 4)) (P 24.two-three-out (stage “status reduce-candidates) (junction “junction-ID <j-ID-X> “line-ID-2 <1-ID>) (junction “junction-ID {<j-ID-Y> <> <j-ID-X>> “line-ID-3 <1-ID>) (labelling-candidate “junction-ID <j-ID~X> “line-2 out) -(labelling-candidate “junction-ID <j-ID-Y> “line-3 in) — > (remove 4)) (P 25.three-three-plus (stage “status reduce-candidates) (junction “junction-ID <j-ID-X> “line-ID-3 <1-ID>) (junction “junction-ID -[<j-ID-Y> <> <j-ID-X>} “line-ID-3 <1-ID>) (labelling-candidate "junction-ID <j-ID-X> “line-3 +) -(labelling-candidate “junction-ID <j-ID-Y> “line-3 +) (remove 4)) (P 26.three-three-minus (stage “status reduce-candidates) (junction “junction-ID <j-ID-X> “line-ID-3 <1-ID>) (junction “junction-ID {<j-ID-Y> <> <j-ID-X>} “line-ID-3 <1-ID>) (labelling-candidate “junction-ID <j-ID-X> “line-3 -) -(labelling-candidate "junction-ID <j-ID-Y> “line-3 -) — > (remove 4)) (P 27.three-three-in (stage “status reduce-candidates) (junction “junction-ID <j-ID-X> “line-ID-1 <1-ID>) (junction “junction-ID {<j-ID-Y> <> <j-ID-X>> “line-ID-1 <1-ID>) (labelling-candidate “junction-ID <j-ID-X> “line-1 in) -(labelling-candidate “junction-ID <j-ID-Y> “line-1 out) — > (remove 4)) (P 28.three-three-out 214 (stag© "status reduce-candidates) (junction "junction-ID <j-ID-X> "line-ID-3 <1-ID>) (junction "junction-ID {<j-ID-Y> <> <j-ID~X>} "line-ID-3 <1-ID>) (labelling-candidate "junction-ID <j-ID-X> "line-3 out) -(labelling-candidate "junction-ID <j-ID-Y> “line-3 in) — > (remove 4)) (p 29.go-to-print-out (stage “status reduce-candidates) — > (remove 1) (make stage "status print-out)) ; Print Out ; (p 30.print-out (stage “status print-out) — > (remove 1) (halt)) ;;;;;; run ;(make start) ;(watch 2) ;(run) ;(ppwm) ;;;;;; revision for ioru3.cl (p 31.enumerate-possible-candidates-1.in.out.nil (stage "status enumerate-possible-candidates) (junction "junction-type 1 “junction-ID <j-ID>) (possible-junction-label.. "junction-type 1 "line-1 in “line-2 out “line-3 nil) -(labelling-candidate “junction-type 1 “junction-ID <j-ID> "line-1 in "line-2 out "line-3 nil) — > (make labelling-candidate “junction-type 1 “junction-ID <j-ID> “line-1 in "line-2 out "line-3 nil)) (p 32.enumerate-possible-candidates-1.+.out.nil (stage “status enumerate-possible-candidates) (junction "junction-type 1 "junction-ID <j-ID>) (possible-junction-label “junction-type 1 “line-1 + “line-2 out “line-3 nil) 215 - ( la b e llin g - c a n d id a t e “ju n c tio n -ty p e 1 “ju n c tio n -ID <j-ID> “l i n e - 1 + “l i n e - 2 out “l i n e - 3 n i l ) - - > (make labelling-candidate “junction-type 1 “junction-ID <j-ID> “line-1 + “line-2 out “line-3 nil)) (p 33.enumerate-possible-candidates-1.in.+.nil (stage “status enumerate-possible-candidates) (junction “junction-type 1 “junction-ID <j-ID>) (possible-junction-label “junction-type 1 “line-1 in “line-2 + “line-3 nil) -(labelling-candidate “junction-type 1 “junction-ID <j-ID> “line-1 in “line-2 + “line-3 nil) — > (make labelling-candidate “junction-type 1 “junction-ID <j-ID> “line-1 in “line-2 + “line-3 nil)) (p 34.enumerate-possible-candidates-1.-.in.nil (stage “status enumerate-possible-candidates) (junction “junction-type 1 “junction-ID <j-ID>) (possible-junction-label "junction-type 1 “line-1 - “line-2 in “line-3 nil) -(labelling-candidate “junction-type 1 “junction-ID <j-ID> “line-1 - “line-2 in "line-3 nil) — > (make labelling-candidate “junction-type 1 “junction-ID <j-ID> "line-1 - "line-2 in "line-3 nil)) (p 35.enumerate-possible-candidates-1.out.-.nil (stage “status enumerate-possible-candidates) (junction "junction-type 1 "junction-ID <j-ID>) (possible-junction-label “junction-type 1 “line-1 out “line-2 - “line-3 nil) -(labelling-candidate “junction-type 1 “junction-ID <j-ID> “line-1 out "line-2 - "line-3 nil) — > (make labelling-candidate “junction-type 1 “junction-ID <j-ID> “line-1 out “line-2 - "line-3 nil)) (p 36.enumerate-possible-candidates-fork.+.+.+ (stage “status enumerate-possible-candidates) (junction “junction-type fork “junction-ID <j-ID>) (possible-junction-label “junction-type fork “line-1 + “line-2 + "line-3 +) -(labelling-candidate “junction-type fork “junction-ID <j-ID> "line-1 + "line-2 + "line-3 +) — > (make labelling-candidate "junction-type fork “junction-ID <j-ID> "line-1 + "line-2 + "line-3 +)) (p 37.enumerate-possible-candidates-fork (stage “status enumerate-possible-candidates) (junction “junction-type fork “junction-ID <j-ID>) (possible-junction-label “junction-type fork “line-1 - "line-2 - “line-3 -) -(labelling-candidate “junction-type fork "junction-ID <j-ID> “line-1 - "line-2 - “line-3 -) - - > (make labelling-candidate “junction-type fork "junction-ID <j-ID> “line-1 - "line-2 - "line-3 -)) (p 38.enumerate-possible-candidates-fork.in.-.out (stage “status enumerate-possible-candidates) (junction “junction-type fork “junction-ID <j-ID>) (possible-junction-label “junction-type fork "line-1 in “line-2 - “line-3 out) -(labelling-candidate “junction-type fork “junction-ID <j-ID> “line-1 in “line-2 - “line-3 out) — > (make labelling-candidate “junction-type fork “junction-ID <j-ID> “line-1 in “line-2 - "line-3 out)) (p 39.enumerate-possible-candidates-fork.-.out.in (stage "status enumerate-possible-candidates) (junction “junction-type fork “junction-ID <j-ID>) (possible-junction-label “junction-type fork “line-1 - "line-2 out “line-3 in) -(labelling-candidate “junction-type fork “junction-ID <j-ID> “line-1 - "line-2 out “line-3 in) — > (make labelling-candidate "junction-type fork "junction-ID <j-ID> "line-1 - "line-2 out "line-3 in)) (p 40.enumerate-possible-candidates-fork.out.in. - (stage “status enumerate-possible-candidates) (junction “junction-type fork “junction-ID <j-ID>) (possible-junction-label “junction-type fork "line-1 out "line-2 in “line-3 -) -(labelling-candidate "junction-type fork “junction-ID <j-ID> “line-1 out “line-2 in “line-3 -) — > (make labelling-candidate “junction-type fork "junction-ID <j-ID> “line-1 out “line-2 in “line-3 -)) (p 41.enumerate-possible-candidates-t.out.+.in (stage “status enumerate-possible-candidates) (junction “junction-type t “junction-ID <j-ID>) (possible-junction-label “junction-type t “line-1 out “line-2 + “line-3 in) -(labelling-candidate “junction-type t “junction-ID <j-ID> “line-1 out “line-2 + “line-3 in) — > 217 (make la b e llin g - c a n d id a t e “ju n c tio n -ty p e t *ju n c tio n -ID <j-ID> “l i n e - 1 out “l i n e - 2 + " lin e - 3 i n ) ) (p 42.enumerate-possible-candidates-t.out.-.in (stage "status enumerate-possible-candidates) (junction "junction-type t "junction-ID <j-ID>) (possible-junction-label "junction-type t "line-i out "line-2 - “line-3 in) -(labelling-candidate “junction-type t “junction-ID <j-ID> “line-1 out "line-2 - “line-3 in) — > (make labelling-candidate “junction-type t “junction-ID <j-ID> “line-1 out “line-2 - "line-3 in)) (p 43.enumerate-possible-candidates-t.out.in.in (stage “status enumerate-possible-candidates) (junction "junction-type t “junction-ID <j-ID>) (possible-junction-label “junction-type t “line-1 out “line-2 in “line-3 in) -(labelling-candidate “junction-type t “junction-ID <j-ID> “line-1 out “line-2 in “line-3 in) — > (make labelling-candidate “junction-type t “junction-ID <j-ID> "line-1 out "line-2 in “line-3 in)) (p 44.enumerate-possible-candidates-t.out.out.in (stage “status enumerate-possible-candidates) (junction “junction-type t “junction-ID <j-ID>) (possible-junction-label “junction-type t “line-1 out “line-2 out “line-3 in) -(labelling-candidate “junction-type t “junction-ID <j-ID> “line-1 out “line-2 out “line-3 in) — > (make labelling-candidate “junction-type t “junction-ID <j-ID> “line-1 out “line-2 out “line-3 in)) (p 45.enumerate-possible-candidates-arrow.in.+.out (stage “status enumerate-possible-candidates) (junction “junction-type arrow “junction-ID <j-ID>) (possible-junction-label “junction-type arrow “line-1 in “line-2 + “line-3 out) -(labelling-candidate “junction-type arrow “junction-ID <j-ID> “line-1 in “line-2 + “line-3 out) — > (make labelling-candidate “junction-type arrow “junction-ID <j-ID> “line-1 in “line-2 + “line-3 out)) Cp 46.enumerate-possible-candidates-arrow (stage "status enumerate-possible-candidates) (junction “junction-type arrow “junction-ID <j-ID>) (possible-junction-label “junction-type arrow 218 “line-1 - “line-2 + “line-3 -) -(labelling-candidate “junction-type arrow “junction-ID <j-ID> “line-1 - “line-2 + “line-3 -) - - > (make labelling-candidate “junction-type arrow “junction-ID <j-ID> “line-1 - “line-2 + "line-3 -)) (p 47.enumerate-possible-candidates-arrow.+.-.+ (stage "status enumerate-possible-candidates) (junction “junction-type arrow “junction-ID <j-ID>) (possible-junction-label “junction-type arrow “line-1 + "line-2 - “line-3 +) -(labelling-candidate “junction-type arrow “junction-ID <j-ID> “line-1 + “line-2 - “line-3 +) — > (make labelling-candidate “junction-type arrow “junction-ID <j-ID> "line-1 + “line-2 - “line-3 +)) B .1 .2 C h a ra c te ris tic M a tric e s The characteristic m atrices describe the production system. They include the parallelism m atrix P , the com m unication m atrix C and the firing frequency m atrix W. P a ra lle lis m M a tr ix P 1 1 1 0 0 00 0 00 0 0 0 00 0 0 0 0 0 00 0 0 00 00 000 0 0 0 00 0 00 00 00 0 0 0 00 1 1 00 1 00 0 0 0 0 0 0 00 00 0 00 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 1 0 1 00 0 0 0 00 0 0 0 00 0 0 0 0 000 0 0 0 0 0 0 0 0 0 0 0 0 00 0 00 0 0 0 0 0 0 0 0 0 0 0 0 1 1 00 1 1 0 0 0 I 00 0 X0 0 X X0 0 X000 1 0 0 00 00 000 0 0 0 0 0 0 0 0 0 0 0 01 0 1 1 0 00 00 0 0 00 0 0 0 0 0 00 0 0 0 0 0 0 0 0 X0 1 X I X 1 X2 1 I I X XI XXXX 0 00 00 1 0 0 0X0 0 0 I 0 0 0 X 1 X X 1 X I X 1 X0 XX00 I0 00 1 00 0 00 0 0 0 0 0 1 0 000 0 0 1 0 0 0 X0 00I 0 0 X IX X I 1XX 1X0 XX00 0 0I 0 0 I 0 1 00 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0X 00 0 10 X I X X I X I X 1 XI I X0 I 0 X0 X0 0 I 0 I X I X 1X0 0 0 0 0 1 0 000 X00 0 X0 0 0 1 I I X XI X XX 1 X0 XX0 X0 X0 X0 0 1 0 X1 1 X XX0 0 0 0 000 1 0 0 0I 0 0 0 X0 00 1 XX XXX1 X X X0 1 X0 0 I X00 X00 0 02 0 0 0XX 1 0 0 000 0 1 0 0 0 X0 0 0X0 0 X 1 X I I 1 X X X I 0 XX00 0 0X X0 X 1 X00X0 00 XI 0 000 0 0 0 1 0 00 X 0 0 0 10 XI I I 1 X 1 1 I XX I X0 1XI 0 0 0 0 I X0 0 00 X X00 0 00 1 0 00 0 1 0 0 0 X0 0 0 I XXX 1 XXX X 1 X0 XX00 00 XX0 0 0 0 X I 1 X 1 0 0 0 0 0 0 0 0 1 00 0 X0 0 0X0 0 0 I XX 1I X XX I X0 XI 00 X0001 0 0 0 0 0 0 0 0 0 0 I 0 0 00 0 0 1 0 00 I 0 0 0 X00 X 1 X X XI X1X 1 0 XX00 0 0 X0 0 X0 XX0 0 0 0 0 X0 0 00 00 00 1 0 0 0 1 0 0 0 X0 XX1X XX X 1 1XI XX0 1 0 X00 0 0 X 0 0 0 0 0 0 X00 0 00 1 0 0 00 X00 0 X0 0 0 X XX1 1XX X1 X X0 XX00 00 0 I 0 0 0 I 1XXX X0 00 0 00 00 1 1 1 1 I I 1 X X1 X X X0 00 X0 0 0 1XXI X0 0 0 X00 1 0 0 0 0 X0 0 0XI 0 0 00 00 1 11XI X1XI X 1 X0 I 00 0X0 0 X11 XX00 0 0 0 1 0 1 X 0 0 0 X0 0 0 0 X 0 0 0 1 01 11 X1 I 1 1 1 XX X0 0I 0 0 0 X0 1XX I X0 1 I0X0 0 0 0 I X0 0I X0 00 0 0 0 1 0 1X1 1 X X1 X 1 X 1 X0 0 0 I 0 0 0 1 XX21I0 I 2 0X0 0 0 0 I I 00 I X0 0 0 0 000 011 11 X XXX XI XX X0 0 0 X000 I XX 1 X0 00 X0 01 0 0 0 0 I 0 0 0 XX X 0 000 01 1 1 XX XXX X1 1 X0 1 000X00 XXX1 X000 0 0 X0 X X0X0 X0 0 0 I X 0 00 X0 1X 1 1 X XX 1 XI XX0 01 00 0 1 0 XXI I X0 0 0 01 00 0 X0 X0 0 X 0 X0 0 0 0 0 0 01 I 1 1 I I I X1 1 XX0 0 0 1 0 0 0 1 XXI I X0 X X0 00 0 0 0 I 0 XXX 1 0 0 0 0 000 01 I 1 XX XI X 1 1 I X XXX X1 1 X1 I 0X0 X0 0 00 0 0 1 0 0 0 0 0 0 0 00 0 X 0 00 00 1 11 I 1 XI X1 I I XX IX I I 1 X1 0 XX0 X0 0 00 000 X0 0 I 0 0 0 00 X0 0 0 0 1 0 00 1 0 00 X0 0 0 I 0 X I I I 1I X X X I XX X0 X0 1 0 1 0 0X 0X I XI I I 0 0 0 0 0 00 1 1 XXX XX 1 I I X X 1 XX I XX X1 0 0X XI0 00 0 0 0 0 0 X X0 XXI X X00 0 0 00 1 X X I I X XX 2 XX X I 1 12 X 1 X I 2 X XX 1 2 10 00 00 0 00 0 00 0 0 0 0 0 0 0 0 00 0 000 0 00 0 0 0 0 0 00 0 000 0 0 00 0 0 0 I X 000 00 00 0 0 00 0 0 0 0 0 0 0 0 00 1 00 1 I 00 X0 0 0 I 0 00 1 I 0 0 0X0 0 X0 0 0 X00 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 1I 0 0 0 X0 I 0 X0 00 00 1 I 00 0X0 0 00 0 0 0 1 0 0 00 0 0 0 00 0 0 0 00 0 0 0 0 0 1 0 0 1 1 X0X 0 0 0 X0 I 0 00 X0 0 0 0 0X0 0 0 0 0 1 0 0 0 0 0 0 00 0 0 00 0 0 0 0 00 1 0 1 0 0 0 X0 1 0 1 0 0 00 1 1 0 0 X0 0 0 0 0 0 0 0 0 0 1 00 0 0 0 00 0 0 0 000 0 0 0 0 1 00 1 X0 X0 X0 0 0 1 0 X00 0X0 0 00 X0 0 0 00 0 0 I 0 0 0 0 0 0 0 0 00 00 0 0 0 0 1 1 00 0X0 0 0X0 00 1 0 00 X0 0 0 X0 00 0 0 00 00 0 X0 0 0 0 0 0 0 00 00 0 0 0 0 1 0 1 0 0 0 1 0 0 0 X00 0 X00 0 1 0 0 0 X0 00 0 0 00 000 X0 0 00 0 0 0 0 00 0 0 0 0 1 0 0 1 1 0 X10 0 0 1 00 X00 0I X000 X 1 0 0 0 0 0 00 0 0 X 0 0 0 0 0 00 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0 0 I 0 1 0 0 1 1 0 0 0 1 0 0 0 X0 0 0 0 0 000 0 0 X00 0 D 0 0 00 0 0 0 0 1 0 0 XX0 0 0 X0 I 0 1 0 0 1 1 0 I I 0 0 XX00 000 0 000 00 0 X0 0 0 0 00 0 0 00 0 1 0 0 I XI 0 0 I 0 0 0 1 X0 0 0 X00 X00 I X0 0 0 00 0 00 0 0 0 0 X000 0 0 0 0 0 0 0 1 0 0 XX0 X0 1 0 0 0 2 0 X 0 00 I 0 1 00 X X 6 0 0 0 Q 00 0 0 0 0 0 0X0 0 0 00 0 000 1 0 0 1 X00 0 I 0 0 0 1 00 1 X0 0 1 X00 X X0 0 0 0 00 00 0 0 00 0 0 X0 0 00 0 0 00 1 0 0 1 X00X 1 0 0 0 1 00 1 X 0 0 0 X00 X X0 0 0 0 00 00 0 0 00 0 00 X0 00 0 0 00 X0 0 1 X X0X 0 0 0 X0 1 0 0 0 X0 X0 0 0 X I0 0 00 0 0 0 0 0 0 00 0 00 0X00 0 0 0 0 1 0 X 0 0 1 X0 0 0 X0 0X0 0 0 I 1 0 0 0 1 0 0 0 0 00 0 0 0 0 0 0 00 0 0 0 0 0 X0 0 0 001 1 0 0 0 1 X0 0 X0 0 0 0I0 0 1 I 0 0 I 0 0 0 0 0000 00 0 0 0 0 0 0 00 0 0 0 1 220 C o m m u n ic a tio n M a tr ix C 1 I 0 0 0 0 0 0 0 000 00 000 000 000 0 0 00 0 0 1 1 1 1 X 1 1 X X 1 I 1 XX I X 0 0 0 0 0 0 0 0 000 000 00 0 0 0 0 00 0 0 0 0 000 1 0 0 1 0 0 0 0 0 0 0 0 0 0 00 0 I 000 1 1 XX1 1 1 X X1 1 00 0 0 0 0 001 XX X X X0 00 00 0 0 0 0 1 I XX X 0 00 0 0 0 0 X X XXX X X X 00 0 0 0 0 XX X X XXX X X1 00 0001 0 00 1 X X X1 X0 0 00 00 0 0 0 X I X X X0 0 0 0 00 0 0 1 X X 1 I X 1 000 0 0 0 X 1 X X X 1 00 0 00 1 0 0 0 X X 0 0 0 0 00 0 0 0 I I X X X 1 0 0 0 0 0 0 0 0 X X I X I X 1 0 0 0 0 0 0 1 I 1 XX 1 X 1 0 000 0 1 00 0 1 1 0 00 00 1 1 X I 1 0 0 0 1 1 0 0 0 0 1 0 0 X 1 000 0 1 1 1 1 0 0 X 1 1 0 0 00 0 1 1 0 00 1 X 00 0 0 0 1 1 1 1 X0 0 0 1 1 000 0 1 1 1 i I 0 0 XI 1 X 1 0 0 0 0 0 1 1 0 0 X X1 1 000 00 1 1 1 X 1X X1 0 1 000 0 G 1 1 1 X X X1 X 1 0 001 0 0 0 X1 1 000 0 0 I 1 XX X 1 0 1 0 0 0 00 1 1 1 I I I 1 X X 1 0 000 0 0 0 0 0 00 0 0 0 00 0 0 0 0 0 0 00 0 0000 1 0 0 1 00 00 0 00 0X 0 0 0 0 G 1 0 0 Q0 0 0 G 0 0 0 0 0 0 0 0 0 0 0 0 0 0 00 1 0 0 1 I 0 0 0 0 0 00 0 1 0 0 00 0 1 0 000 1 01 0 0 0 1 0 1 0 1 0 0 0 0X 00 0 0 0 0 0 0 0 01 0 0 1 0 1 0 1 0 0 0 1 0X0 0 0 1 0 0 0 0I 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 00 0 1 00 0 X 0 0 0 00 0 1 0 1 0 0 0 1 00 0 0 00 I 0 0 0I0 0 0 X0 0 00 0 1 0 0 XX0 I X00 0 I 0 0 0 0 0X 0 0 0X 0 00 01 0 1 0 0 0 X o 0 X0 X0 0 0 0 0 X0 0 0 0 00 0 1 0 0 1X000 1 0 0 X001 0I X00 1 1 0 0 0 0 0 0 XXX0 0 X0 0 0 XX0 0 0 X00 X0 0X 0 0 0 0 1 0 0 XX0 X0 X0 0 0 X0 X0 0 0X0 0 0 X 0 0 0 0 1 0 0 XX0 0 0 X0 0 0 X00 I 00 I 00X 0 0 0 0 0 0 1 I 0 0 1 00 0 X0 0 00 0 X 0 0 1 0 0 0 0 0 0 XX 0 X00GX0 0 0 0 I0 I 0 0 0 I 0 0 0 0 0 1 0 0 X0 0 0 1 0 0 0 0 0 I X0 00 X0 000 0 1 0 0 0 X00 1 0 0 00I 0 0 I 0 0 0 0 0 0 1 1 X X 1 X X1 1 X 1 1 1 I 11 X 00 1 X I X 1 1I 1X X 1 1 1 1 I 1 1 0 0 0 0 0 0 000 0 00 0 00 0 0 0 0 00 0 0 0 0000 0 00 0 0 0 0 0 0 0 1 0 1 1 1 X X X I I 1 X 1 X1 X1 1 1 X0 0 1 0 0 0 1 0 0 0 0 0 00 0 0 0 X X0 0 00 I 00 I 0 X0 0 0 0 0 0 X0 X0 1 0 1 0 I00 I 0 1 1 I X 1 I 00 X0 I 01 0 X0 0 I 0 I XXX XX00 1 00 1 X0 0 1 0 0 0 0 X00 0 XX X X00 Q0 X X0 X1 1 0 0 1 0 0GX 1 I 0 1 XX000 0 X 1 0 0 00 XX00 I 0 0 0 0 X I 0 0 0 0 I X1 I I0 0 0 X00 I 0 0 0 1 0 0 0 0 0 0 0 0 0 0 I X00 00 X0 0 1 0 X X0 00 0 0 1 0 X0 X0 X0 0 0 01 0 00 0 0 0 I 00 I 00 000 X00 0 X X XX X X0 0 0 X0 00 X00 X0 0 0 0 I 0 0 0 I 1 0 X0 00 00 X0 11 0 00X0 00 0 X 1 0 X1 0 X0 00 0 1 1 0 0 X X0 0 0 X0 1 1 0 I0 00 0 X 1 0 0 I I 0 00 1 0 0 0 X00 I 0 0 0 0 I 00 0 1 1I X0 0 00 0 10 1 X 0 X0 1 0 001X 1 0 0 0G X00 0 I 0 X0 0 X0 X0 0 1 0 1 I 00 0 0 0 0 X0 X1 1 1 0 0 0 X0 0 00 0 0 X0 0 0 00 00 0 0 0 X X0 0 0 0 0 00 1 0 0 X0 0 0 001 0 1 0 1 0 X0 I 0 0 10 X 1 1 1 1 X00 10 0 00 0 0 00 X X0 I X1 1X0 0 1 X0 00000 0 0 0 0 0 0 0 0 00 0 0 1 00 0000 0 0 0 0 0 0 0 0 0 0 0 0 0 1 00 0 0 00 0 00 0 0 0 0 0 00 0 0 0 1 000 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 X0 0 0 0 0 0 0 0 00 0 0 00 0 0 0 00 1 00 0 0 0 00 0 0 0 0 0 0 0 0 0 00 0 X 00 0 0 0 0 0 0 0 0 00 0 0 0 0 000 X0 0 0 0 0 00 0 0 00 0 0 0000 0 0 X0 0 0 0 0 0 0 0 0 0 000 000 0 0 0I 0 0 0 00 0 0 00 0 0 0 00000 0 0 X0 0 00 0 0 00 0 0 0 00 0 00 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 00 0 0 0 0 0 0 I00 0 0 00 0 0 0 0 0 00 0 0 0 0 0 0 X0 0 0 00 0 0 000 00G 0 0 0 00 0 1 0 0 0 0 0 0 0 00 00 0 0 0 0 0 0 00 X00 0 0 0 0 00 00 0 0 0 0 0 0 0 0 0 X0 0 0 0 0 0000 00 0 000 00 0 0X0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0 I 0 X 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 1 0 0 F irin g F re q u e n c y M a tr ix W 1 1 1 5 1 4 1 0 2 5 3 3 4 0 8 6 2 1 2 3 5 2 1 0 5 1 0 6 6 1 1 5 5 5 5 10 10 10 10 10 10 0 0 0 0 12 12 12 B .2 T ou rn ey P ro g ra m B.2.1 Program Code Tourney is a production system for scheduling bridge tournam ents. ;This is the first benchmark system. No input ;required. -- Milind M I I I I M I n M ) M M M I I M ) M t I I II I M M ) M I ) I M » M I M I » (literalize player 222 number night s-s cheduled) (literalize foursome night group north south east west) (literalize context name) (literalize scheduling night) (literalize already-played playerl player2) (literalize candidate group chosen south east west) (p 0 ..startup (start) — > (make player "number 1 (make player "number 2 (make player "number 3 (make player "number 4 (make player "number 5 (make player "number 6 (make player "number 7 (make player “number 8 (make player "number 9 (make player “number 10 (make player “number 11 (make player “number 12 (make player “number 13 (make player (make player (make player number 14 number 15 number 16 (make foursome “night 1 “north nil “south nil (make foursome “night 1 “group 'nights-scheduled 0 'nights-scheduled 0 'nights-scheduled 0 'nights-scheduled 0 'nights-scheduled 0 'nights-scheduled 0 'nights-scheduled 0 ‘nights-scheduled 0 'nights-scheduled 0 “nights-scheduled 0) “nights-scheduled 0) "nights-scheduled 0) “nights-scheduled 0) “nights-scheduled 0) “nights-scheduled 0) “nights-scheduled 0) “group a east nil “west nil) b 223 "north nil “south nil “east nil “west nil) (make foursome “night 1 “group c "north nil “south nil “east nil “west nil) (make foursome “night 1 “group d "north nil “south nil “east nil “west nil) (make foursome “night 2 “group a “north nil “south nil “east nil “west nil) (make foursome “night 2 “group b “north nil “south nil “east nil “west nil) (make foursome "night 2 “group c "north nil “south nil “east nil “west nil) (make foursome "night 2 “group d "north nil “south nil “east nil “west nil) (make foursome “night 3 "group a “north nil “south nil ‘east nil “west nil) (make foursome "night 3 "group b “north nil "south nil “east nil “west nil) (make foursome “night 3 “group c “north nil “south nil “east nil “west nil) (make foursome “night 3 “group d “north nil “south nil “east nil “west nil) (make foursome “night 4 "group a "north nil “south nil "east nil “west nil) (make foursome “night 4 “group b “north nil “south nil “east nil “west nil) (make foursome "night 4 “group c "north nil “south nil “east nil "west nil) (make foursome “night 4 “group d “north nil “south nil "east nil “west nil) (make foursome "night 5 "group a “north nil “south nil “east nil “west nil) (make foursome “night 5 “group b “north nil “south nil “east nil “west nil) (make foursome "night 5 “group c “north nil “south nil “east nil “west nil) (make foursome "night 5 "group d “north nil "south nil “east nil “west nil) (make context “name north) (make scheduling “night 1) (write (tabto 32) I Tournament Schedule I (crlf) (crlf) (tabto 15) In s E W| (tabto 42) IN s E Wl (tabto 6 8) IN s E WI (crlf) (tabto 15) 1 = = = = 1 (tabto 42) 1 = = = = 1 (tabto 6 8) 1 = = =1 (crlf) (p 1 ..north.pick-one-1 (context “name north) (scheduling “night <n>) {. <THE-F0URS0ME> (foursome “night <n> “north nil) > 224 { <THE-PLAYER> (player "number <p> “nights-scheduled < <n>) } (foursome “night <n> “north { <pl> <> nil }) (foursome “night <n> “north { <p2> <> <pl> <> nil }) (foursome "night <n> “north { <p3> <> <p2> <> <pl> <> nil }) (already-played “playeri <p> “player2 <pl>) (already-played “playeri <p> "player2 <p2>) (already-played “playeri <p> “player2 <p3>) — > (modify <THE-PLAYER> “nights-scheduled <n>) (modify <THE-F0URS0ME> “north <p>) ) j (p 2 ..north.pick-one-2 (context “name north) (scheduling “night <n>) { <THE-F0URS0ME> (foursome "night <n> "north nil) } { <THE-PLAYER> (player "number <p> "nights-scheduled < <n>) } (foursome “night <n> “north {. <pl> <> nil }) (foursome “night <n> "north { <p2> <> <pl> <> nil }) (already-played "playeri <p> “player2 <pl>) (already-played “playeri <p> "player2 <p2>) — > (modify <THE-PLAYER> “nights-scheduled <n>) (modify <THE-F0URS0ME> “north <p>) ) (p 3..north.pick-one-3 (context “name north) (scheduling “night <n>) -[ <THE-F0URS0ME> (foursome “night <n> “north nil) } { <THE-PLAYER> (player “number <p> “nights-scheduled < <n>) } (foursome “night <n> “north ■ ( <pl> <> nil }) (already-played “playeri <p> "player2 <pl>) - - > (modify <THE-PLAYER> “nights-scheduled <n>) (modify <THE-FOURSOME> “north <p>) ) (p 4..north.pick-one-4 (context “name north) (scheduling “night <n>) { <THE-FOURSOME> (foursome “night <n> “north nil) } { <THE-PLAYER> (player “number <p> “nights-scheduled < <n>) } — > (modify <THE-PLAYER> “nights-scheduled <n>) (modify <THE-F0URS0ME> “north <p>) ) (p 5..north.done - [ <THE-COMTEXT> (context “name north) } — > (modify <THE-CONTEXT> “name maie-candidates)) (p 6 ..make-candidates.make-candidate.A (context “name make-candidates) 225 (scheduling "night { <n> < 3 }) (foursome "night <n> "group A “north <yankee>) (player “number <redneck> "nights-scheduled < <n>) (player “number {. <oriental> < <redneck> } "nights-scheduled < <n>) (player “number { <cowboy> < <oriental> } ■ “nights-scheduled < <n>) - (already-played "playeri <yankee> "player2 <redneck>) - (already-played "playeri <yankee> “player2 <oriental>) - (already-played "playeri <yankee> “player2 <cowboy>) - (already-played “playeri <redneck> "player2 <oriental>) - (already-played “playeri <redneck> “player2 <cowboy>) - (already-played “playeri <oriental> “player2 <cowboy>) - (candidate “group A “chosen no “south <redneck> “east <oriental>) - (candidate “group A “chosen no “south <redneck> “west <cowboy>) - (candidate “group A “chosen no "east <oriental> “west <cowboy>) — > (make candidate “group A “chosen no “south <redneck> "east <oriental> "west <cowboy>)) (p 7..make-candidates.make-candidate-late (context “name make-candidates) (scheduling “night {. <n> > 2 }) (foursome “night <n> “group A “north <yankee>) (player “number <redneck> “nights-scheduled < <n>) (player “number { <oriental> < <redneck> } ■ "nights-scheduled < <n>) (player “number { <cowboy> < <oriental> } • “nights-scheduled < <n>) - (already-played “playeri <yankee> “player2 <redneck>) - (already-played “playeri <yankee> “player2 <oriental>) - (already-played “playeri <yankee> “player2 <cowboy>) - (already-played “playeri <redneck> “player2 <oriental>) - (already-played “playeri <redneck> “player2 <cowboy>) - (already-played “playeri <oriental> “player2 <cowboy>) — > (make candidate “group A “chosen no “south <redneck> “east <oriental> “west <cowboy>)) (p 8 ..make-candidates.done { <THE-CONTEXT> (context “name make-candidates) } • — > (modify <THE-CONTEXT> “name make-choice)) (p 9..make-choice.doit - ( <THE-CONTEXT> (context "name make-choice) } { <WINKER-A> (candidate “group a “chosen no “south <sa> “east <ea> “west <wa>) } { <WINNER-B> (candidate “group b “chosen no “south { <sb> <> <sa> <> <ea> <> <wa> } “east { <eb> <> <sa> <> <ea> <> <wa> } “west { <wb> <> <sa> <> <ea> <> <wa> }) } { <WINNER-C> (candidate “group c "chosen no 226 “south • { <sc> <> <sa> <> <ea> <> <wa> <> <sb> <> <eb> <> <wb> > “east { <ec> <> <sa> <> <ea> <> <wa> <> <sb> <> <eb> <> <wb> } “west - [ <wc> <> <sa> <> <ea> <> <wa> <> <sb> <> <eb> <> <wb> }) } ■ C <WINNER-D> (candidate “group d “chosen no “south { <sd> <> <sa> <> <ea> <> <wa> <> <sb> <> <eb> <> <wb> <> <sc> <> <ec> <> <wc> > "east - { <ed> <> <sa> <> <ea> <> <wa> <> <sb> <> <eb> <> <wb> <> <sc> <> <ec> <> <wc> > “west { <wd> <> <sa> <> <ea> <> <wa> <> <sb> <> <eb> <> <wb> <> <sc> <> <ec> <> <wc> » — > (modify <WINNER-A> “chosen yes) (modify <WINNER-B> “chosen yes) (modify <WINNER-C> “chosen yes) (modify <WINNER-D> “chosen yes) (modify <THE-CONTEXT> “name remove-candidates)) (p 1 0..remove-candidates.bye (context “name remove-candidates) • { <THE-CANDIDATE> (candidate “chosen no “group A) } — > (remove <THE-CANDIDATE>)) (p 1 1..remove-candidates.done • { <THE-CONTEXT> (context “name remove-candidates) } — > (modify <THE-C0NTEXT> “name apply-choice)) 'north <yankee>) } • (p 1 2..apply-choice.doit (context “name apply-choice) (scheduling “night <n>) • £ <THE-F0URS0ME> (foursome “night <n> "group <g> <THE-CH0ICE> (candidate “group <g> "chosen yes “south <redneck> “east <oriental> “west <cowboy>) } — > (modify <THE-F0URS0ME> “south <redneck> "east <oriental> “west <cowboy>) (remove <THE-CH0ICE>) (make already-played “playeri <yankee> "player2 <redneck>) 'player2 <yankee> “playeri <redneck>) 'playeri <yankee> “player2 <oriental>) 'player2 <yankee> “playeri <oriental>) 'playeri <yankee> "player2 <cowboy>) 'player2 <yankee> “playeri <cowboy>) 'playeri <redneck> “player2 <oriental>) (make already-played (make already-played (make already-played (make already-played (make already-played (make already-played 227 (make already-played “player2 <redneck> “playeri <oriental>) (make already-played “playeri <redneck> “player2 <cowboy>) (make already-played “player2 <redneck> “playeri <cowboy>) (make already-played “playeri <oriental> "player2 <cowboy>) (make already-played “player2 <oriental> “playeri <cowboy>)) (p 13..apply-choice.done { <THE-CONTEXT> (context “name apply-choice) } - - > (modify <THE-CONTEXT> “name report)) (p 14..report.night-schedule { <THE-CONTEXT> (context “name report) } (scheduling “night <n>) (foursome “night <n> “group a “north <an> “south <as> “east <ae> “west <aw>) (foursome “night <n> “group b “north <bn> “south <bs> "east <be> “west <bw>) (foursome “night <n> “group c “north <cn> “south <cs> “east <ce> “west <cw>) (foursome “night <n> “group d “north <dn> “south <ds> “east <de> “west <dw>) — > (modify <THE-COKTEXT> “name next-night) (bind <n2> (compute <n> +5)) (bind <n3> (compute <n> + 10)) (write (crlf) (rjust 1) |#| (rjust 1) <n> (rjust 1) I:I (tabto 5) I Group A:I (rjust 3) <an> (rjust 3) <as> (rjust 3) <ae> (rjust 3) <aw> (tabto 27) (rjust 1) |#| (rjust 2) <n2> (rjust 1) 1:1 (tabto 32) I Group A:I (rjust 3) <an> (rjust 3) <ae> (rjust 3) <as> (rjust 3) <aw> (tabto 53) (rjust 1) l#| (rjust 2) <n3> (rjust 1) I:I (tabto 58) I Group A:I (rjust 3) <an> (rjust 3) <aw> (rjust 3) <ae> (rjust 3) <as>) (write (crlf) (tabto 5) I Group B: 1 (rjust 3) <bn> (rjust 3) <bs> (rjust 3) <be> (rjust 3) <bw> (tabto 32) I Group B:I (rjust 3) <bn> (rjust 3) <be> (rjust 3) <bs> (rjust 3) <bw> (tabto 58) I Group B:I (rjust 3) <bn> (rjust 3) <bw> (rjust 3) <be> (rjust 3) <bs>) (write (crlf) (tabto 5) I Group C:I (rjust 3) <cn> (rjust 3) <cs> (rjust 3) <ce> (rjust 3) <cw> (tabto 32) I Group C:I 228 (rjust 3) <cn> (rjust 3) <ce> (rjust 3) <cs> (rjust 3) <cw> (tabto 58) I Group C: I - (rjust 3) <cn> (rjust 3) <cw> (rjust 3) <ce> (rjust 3) <cs>) (writ© (crlf) (tabto 5) I Group D:I (rjust 3) <dn> (rjust 3) <ds> (rjust 3) <de> (rjust 3) <dw> (tabto 32) I Group D:I (rjust 3) <dn> (rjust 3) <de> (rjust 3) <ds> (rjust 3) <dw> (tabto 58) I Group D:! (rjust 3) <dn> (rjust 3) <dw> (rjust 3) <de> (rjust 3) <ds> (crlf))) (p 15..next-night.more - [ <THE-CONTEXT> (context “name next-night) } - [ <THE-NIGHT> (scheduling “night { <n> < 5 >) > — > (modify <THE-CONTEXT> “name north) (modify <THE-NIGHT> “night (compute <n> + 1))) (p 16..next-night.done (context “name next-night) — > (write (tabto 32) I End of Scheduling I (crlf) (crlf))) (p 17..make-candidates.make-candidate.B (context “name make-candidates) (scheduling “night {. <n> < 3 }) (foursome “night <n> “group B “north <yankee>) (player "number <redneck> "nights-scheduled < <n>) (player “number { <oriental> < <redneck> } “nights-scheduled < <n>) (player “number { <cowboy> < <oriental> > “nights-scheduled < <n>) - (already-played "playeri <yankee> “player2 <redneck>) - (already-played “playeri <yankee> “player2 <oriental>) - (already-played "playeri <yankee> “player2 <cowboy>) - (already-played "playeri <redneck> “player2 <oriental>) - (already-played “playeri <redneck> “player2 <cowboy>) - (already-played “playeri <oriental> “player2 <cowboy>) - (candidate “group B “chosen no "south <redneck> "east <oriental>) - (candidate “group B “chosen no “south <redneck> “west <cowboy>) - (candidate "group B "chosen no “east <oriental> “west <cowboy>) — > (make candidate “group B “chosen no “south <redneck> "east <oriental> “west <cowboy>)) (p 18..make-candidates.make-candidate.C (context “name make-candidates) (scheduling “night - ( <n> < 3 }) 229 (foursome "night <n> "group C “north <yankee>) (player “number <redneck> "nights-scheduled < <n>) (player “number {. <oriental> < <redneck> } “nights-scheduled < <n>) (player “number i. <cowboy> < <oriental> } “nights-scheduled < <n>) - (already-played “playeri <yankee> "player2 <redneck>) - (already-played “playeri <yankee> “player2 <oriental>) - (already-played “playeri <yankee> “player2 <cowboy>) - (already-played “playeri <redneck> “player2 <oriental>) - (already-played “playeri <redneck> “player2 <cowboy>) - (already-played “playeri <oriental> “player2 <cowboy>) - (candidate “group C “chosen no “south <redneck> “east <oriental>) - (candidate “group C “chosen no “south <redneck> “west <cowboy>) - (candidate “group C "chosen no “east <oriental> “west <cowboy>) — > (make candidate “group C “chosen no “south <redneck> “east <oriental> “west <cowboy>)) (p 19..make-candidates.make-candidate.D (context “name make-candidates) (scheduling “night { <n> < 3 }) (foursome “night <n> “group D "north <yankee>) (player “number <redneck> “nights-scheduled < <n>) (player “number - C <oriental> < <redneck> } • “nights-scheduled < <n>) (player “number { <cowboy> < <oriental> } “nights-scheduled < <n>) - (already-played “playeri <yankee> “player2 <redneck>) - (already-played "playeri <yankee> “player2 <oriental>) - (already-played “playeri <yankee> “player2 <cowboy>) - (already-played “playeri <redneck> “player2 <oriental>) - (already-played “playeri <redneck> “player2 <cowboy>) - (already-played “playeri <oriental> “player2 <cowboy>) - (candidate “group D “chosen no “south <redneck> “east <oriental>) - (candidate "group D “chosen no “south <redneck> “west <cowboy>) - (candidate “group D “chosen no “east <oriental> “west <cowboy>) - - > (make candidate “group D “chosen no “south <redneck> “east <oriental> “west <cowboy>)) (p 2 0..make-candidates.make-candidate-late.B (context “name make-candidates) (scheduling "night { <n> > 2 }) (foursome “night <n> “group B “north <yankee>) (player “number <redneck> “nights-scheduled < <n>) (player “number • { <oriental> < <redneck> } • “nights-scheduled < <n>) <cowboy> < <oriental> } “nights-scheduled < <n>) 'playeri <yankee> “player2 <redneck>) 'playeri <yankee> “player2 <oriental>) 'playeri <yankee> “player2 <cowboy>) 'playeri <redneck> “player2 <oriental>) 'playeri <redneck> “player2 <cowboy>) 'playeri <oriental> “player2 <cowboy>) (player “number { - (already-played - (already-played - (already-played - (already-played - (already-played - (already-played 230 (make candidate "group B "chosen no "south <redneck> “east <oriental> “west <cowboy>)) (p 2 1..make-candidates.make-candidate-late.C (context "name make-candidates) (scheduling "night • { <n> > 2 }) (foursome "night <n> "group C “north <yankee>) (player "number <redneck> "nights-scheduled < <n>) (player “number ■ { <oriental> < <redneck> } - “nights-scheduled < <n>) (player “number ■ { <cowboy> < <oriental> } • "nights-scheduled < <n>) - (already-played "playeri <yankee> “player2 <redneck>) - (already-played “playeri <yankee> “player2 <oriental>) - (already-played “playeri <yankee> “player2 <cowboy>) - (already-played "playeri <redneck> “player2 <oriental>) - (already-played “playeri <redneck> “player2 <cowboy>) - (already-played “playeri <oriental> “player2 <cowboy>) — > (make candidate “group C “chosen no “south <redneck> "east <oriental> "west <cowboy>)) (p 2 2..make-candidates.make-candidate-late.D (context "name make-candidates) (scheduling “night { <n> > 2 }■) (foursome “night <n> “group D "north <yankee>) (player “number <redneck> “nights-scheduled < <n>) (player "number { <oriental> < <redneck> } - "nights-scheduled < <n>) (player “number { <cowboy> < <oriental> } “nights-scheduled < <n>) - (already-played "playeri <yankee> “player2 <redneck>) - (already-played “playeri <yankee> “player2 <oriental>) - (already-played “playeri <yankee> “player2 <cowboy>) - (already-played "playeri <redneck> “player2 <oriental>) - (already-played “playeri <redneck> "player2 <cowboy>) - (already-played "playeri <oriental> “player2 <cowboy>) — > (make candidate "group D “chosen no “south <redneck> “east <oriental> "west <cowboy>)) (p 23..remove-candidates.bye.B (context "name remove-candidates) { <THE-CANDIDATE> (candidate “chosen no “group B) } — > (remove <THE-CANDIDATE>)) (p 24..remove-candidates.bye.C (context “name remove-candidates) { <THE-CANDIDATE> (candidate "chosen no “group C) } — > (remove <THE-CAKDIDATE>)) (p 25..remove-candidates.bye.D (context “name remove-candidates) 2311 { <THE-CANDIDATE> ( c a n d i d a t e " c h o s e n no " grou p D) } - - > (rem o v e <THE-CANDIDATE>) ) B .2 .2 C h a r a c te r is tic M a tric e s The characteristic m atrices describe the production system. They include the parallelism m atrix P , the com m unication m atrix G and the firing frequency m atrix W. P a ra lle lis m M a tr ix P 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 X 1 0 0 0 0 0 X X 0 X X X X 0 0 0 1 1 1 0 0 0 0 0 I X 0 X X X X 1 0 0 0 1 1 1 I 0 0 0 0 0 X X 0 X I X X 0 0 0 1 I 1 0 0 0 0 0 X X 0 1 X I 1 0 0 0 1 0 0 0 0 0 0 0 0 X 0 0 0 0 0 0 0 0 0 0 0 1 0 I 0 X 0 0 X 0 0 0 0 0 0 0 0 0 0 0 X 0 0 1 0 0 0 X 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 X X X X 0 0 0 0 0 0 0 0 0 1 1 I 0 0 0 0 1 X X 1 X 0 0 0 0 0 0 1 X 0 1 1 X 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 X I X 0 0 0 1 1 1 1 0 1 I 1 0 0 X 0 X 0 X X X X 1 1 1 0 1 0 I 0 0 I I X 0 X 1 X 0 0 0 1 0 0 0 0 0 0 0 0 0 0 X X X 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 I X 0 1 0 0 0 0 0 0 1 1 1 1 X 1 0 0 0 0 X 1 1 1 X 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 X X 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 I 0 X 0 X I 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 I 1 0 I X X 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 I I I X 0 0 0 1 1 1 I 1 0 0 0 0 0 0 X X I X X X 0 0 0 1 1 1 I 1 0 X 0 1 I 0 0 0 0 I I X X X 0 0 0 1 1 X 1 0 1 0 1 0 0 0 0 1 1 X X 1 0 0 0 1 1 X 1 1 0 1 0 0 0 0 0 X X X X X 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 X X X 0 232 C o m m u n ic a tio n M a trix C 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0 I 0 0 0 I 1 0 0 0 0 1 0 1 0 0 0 0 1 I 0 0 0 0 L 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 I 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 X X 1 1 0 1 0 0 0 0 1 1 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 1 X X 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 0 0 0 0 1 X X X 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 X 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 X 0 0 0 0 0 0 X 0 0 X 1 1 1 0 0 0 1 1 0 0 0 0 0 0 X 0 0 0 0 0 I 0 X 1 1 1 0 0 0 1 0 0 1 0 0 I 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 F irin g F re q u e n c y M a tr ix W 1 4 4 4 8 4 52 9 3 t, 5 7 5 20 5 5 4 4 52 52 52 9 9 9 55 55 55 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 I 0 0 1 0 0 1 Appendix C The PSIM Code PSIM (Parallel Sim ulation) is a parallel discrete event-driven sim ulator im plem ented on a 32-node iP S C /2 hypercube m ulticom puter. It consists of a host program and a node program . T he load balancing m ethods proposed in C hapter 4 are em bedded in to th e sim ulation. C . l H o s t P r o g r a m C . 1 . 1 H o s t H e a d e r M o d u le /* host.h */ /* Header file of host program for PSIM */ /I************************************************************/ #include <stdio.h> /* define constants */ #define ALL_TYPES -1 /* Symbol for all message types */ #define H0ST_PID 100 /* Process id. -of the host process */ #define N0DE_PID 0 /* Process id of the node processes */ #define ALL.NODES -1 /* Symbol for "all nodes" in loading * #define ALL.PIDS -1 /* Symbol for "all processes * / #define START_TYPE 0 /* Message type */ #define START_LEN 0 /* Length of "start" message */ #define PARM.TYPE 10 / * type of parameter msg */ #define GMSG_TYPE 15 /* type of generic msg */ #define FIWAL_TYPE 20 /* type of final msg */ #define NMSG_TYPE 2 /* node status */ #define EMSG.TYPE 3 /* event */ #define PMSG_TYPE 4 / * proc */ 234 #define RMSG.TYPE 5 /* record */ #define AMSG.TYPE 6 /* ask for load */ #define TMSG.TYPE 7 / * pass threshold * / #define LMSG.TYPE 8 /* get load from node * / #define MMSG.TYPE 9 /* migrate process */ #define STOP 4 /* stop event type */ #define DONE 9 #define FCFS 9 /* first come fist serve * / #define RB 1 / * round-robin */ #define N0_LB 0 /* no load balancing * / #define LRR 1 /* local RR * / #define GRR 2 /* global RR */ #define LML 3 /* local min load */ #define GML 4 /* global min load */ #define MAX_NODE 32 /* max number of nodes * / #define MSGLEN 100 /* message length */ #define YES 1 #define NO 0 /* data structure */ /jititejlolijt:******!!'****************************/ typedef struct parm_msg { long num_nodes; long node; long dim; long lb; double lambda; double mu; long scheduling; long max_time; long w; long debug; > PARMMSG, *PMSG; typedef struct final_msg - f long node; /* simulation parameter */ / * number of nodes */ /* node number */ /* dimension */ /* load balancing method */ /* system arrival rate */ /* system service rate */ /* CPU scheduling method */ /* maximum simulation time */ /* initial time window */ /* debug flag */ long num_task; float total_time; float avg_wait_time; float avg_length; float avg_exe_time; float avg_response_time float avg_wait_proc; /* float avg_load; /* y FINMSG, *FMSG; /* /* /* /* /* / * /* reporting from each node */ node number */ number of tasks */ total simulation time */ average waiting time */ average length of task */ average execution time * / ; /* average response time */ average number of waiting tasks */ average load * / 235 typedef struct node_msg{ /* long node; /* long stop; /* float timer; /* float old_timer; /* long num.tasks; /* > NODEMSG, *NMSG; typedef struct proc ■ { /* long id; /* long type; /* float arrival; /* float depart; /* float length; /* float start; /* float wait; /* float remain; /* > PROC, *P; node status message */ node number */ flag for local done */ current time */ last time */ number of tasks in queue */ structure of a process */ process id */ local or remote */ arrival time */ departure time */ execution length */ start execution time */ waiting time */ remaining unfinished time * / typedef struct event_msg { float time; long type; PROC process; > EVNMSG, *EMSG; /* message for an event */ /* time */ /* event type */ /* pointer to the process */ typedef struct recording { /* structure for recording statistics */ float time; /* time */ long l_in; /* load at the time */ long wait; /* number of processes waiting */ struct recording *next; /* pointer to the next */ > RECORDING, *RE; typedef struct generic_msg { /* generic message */ long type; /* message type */ long node; / * source node */ char body[MSGLEN]; /* message body */ > GENMSG, *GMSG; typedef struct hostload_msg - [ /* host load info, message */ float load.avg; / * system load average */ float load_tab[MAX_N0DE]; /* node load table * / short tag[MAX_N0DE]; /* tag for checking receiving */ > HLOADMSG, *HLMSG; typedef struct nodeload_msg { /* node load message */ long node; /* source node * / float load; /* load of the node * / > NLOADMSG, *NLMSG; /* global variable * / 236 extern PARMMSG host_parm_msg; /* parameter msg from host */ extern PMSG host_parm_ptr; /* pointer to host_parm_msg */ extern HLOADMSG hostloadmsg; /* host load message */ extern HLMSG host_l_msg; /* pointer to hostloadmsg */ extern NLQADMSG nodeloadmsg; /* node load message */ extern NLMSG node_l_msg; /* pointer to nodeloadmsg */ extern long msg_size; /* length of a message */ extern long host_node; /* host node number */ extern long num_nodes; /* total number of nodes */ extern long host_pid; /* host process id */ extern long done; /* done flag */ extern long done_nodes; /* all nodes done * / extern int finish_node[MAX_N0DE]; /* flag for node finish */ extern int finish; /* all finished */ extern long debug; extern double start_time; /* starting time */ extern double clock; /* current time */ extern double old_clock; /* last recorded time */ extern long w; /* time window */ extern long old_w; /* last time window * / extern float old_load_avg; /* last load average */ extern double k; /* time window adjustable */ extern double c; / * load imbalance factor */ extern double U; /* mean utilization */ extern double lambda; /* mean arrival rate */ extern double RS; /* mead response time */ extern double rs[MAX.NGDE] ; /* node mean response time tab */ extern double lam[MAX_NODE]; /* node arrival rate table */ /***************************************/ /* function returns * / /***************************************/ extern void input(); / * input parameters */ extern void init_host(); /* initialize host */ extern void send_parm(); /* send parameters to nodes * / extern void get_node_parm(); /* receive params from nodes */ extern void done_signal(); I* signal all nodes done */ extern void receive_final(); /* receive final report * / extern void final_result(); /* report final result */ extern void get_msg(); /* receive a message */ extern void check_msg() ; /* check message type and act * / extern void ask_load(); /* ask node’s load */ extern void record_load() ; /* record system load */ extern void send_loadinfo(); /* send load info to nodes */ extern void set_time_w(); /* set and adjust time window * / 237 extern void print.event(); /* print an event */ extern void host_print_proc(); /* print a process */ extern void print_node(); /* print node status */ extern void print_record(); /* print record message */ extern void print_load(); /* print system load */ extern long power(); /* power of 2 */ extern int check_final(); /* check if receives all */ extern float ramO ; /* random number */ extern double var(); /* variation of lambda */ extern float flt_abs(); /* absolute value of float */ C .1 .2 H o st M o d u le /* host.c */ /* This is the host program for PSIM. It does the loading,*/ /* performs the role of central information collector, and */ /* displays all messages on the screen. */ #include "host.h" /* main() */ /* Host main program. It does: */ /* 1. input simulation parameters, */ /* 2 . initializes the system, */ /* 3. load node programs, */ /* 4. send parameters to nodes, */ /* 5. periodically collects load info., */ /* 6 . shut down simulation, */ /* 7. report the statistics. */ main(argc,argv) int argc; char argv [] ; ■ c long node; if ( argc < 11 ) - C printf("Inappropriate parameters!\n"); > else { input(argc,argv); init_host(); printf("Loading the cube with the node processes ...\n"); 238 ________ I loadO'node", ALL.NODES, N0DE_PID); num_nodes = numnodesO; . printf("num.nodes = %d\n",num_nodes); clock = mclockO - s ta rt_ tim e ; printf(" clock = */,f\n", clock); send_parm(); get_node_parm (); clock = mclockO - s ta rt_ tim e ; printf(" clock = '/.f\n", clock); while ( ! done ) { clock = mclockO - s ta rt_ tim e ; i f ( (clock - old_clock) >= w ) { i f ( debug ) printf ("clock at host = * / , f old_clock = */.f\n", clock,old_clock); ask_load(); old_clock = clock; > if ( iprobe(GMSG_TYPE) ) ■ { get_msg(); > > done_signal(); while ( ! finish ) { if ( iprobe(GMSG_TYPE ) ) get_msg(); if ( iprobe( FIKAL_TYPE ) ) receive_f inalO ; > final_result(); killcube(ALL_NODES,ALL_PIDS); printf ("clock at host = ' / . f old_clock = */,f\n" , clock, old_clock) ; > > /* input() */ /* Input the parameters from the user. */ void input(argc,argv) int argc; char *argv[] ; • c double a to f O ; host_parm _ptr = &host_parm_msg; 239 host_parm_ptr->dim = atoi( argv[l] ); U = atof( argv[2] ); host_parm_ptr->mu = atofC argv[3] ); host_parm_ptr->scheduling = atoi( argv[4] ); host_parm_ptr->w = atoi( argv[5] ); host_parm_ptr->max_time = atoi( argv[6] ); host_parm_ptr->lb = atoi ( argv[7] ); host_parm_ptr->debug = atoi ( argv[8] ); debug = host_parm_ptr->debug; k = atof ( argv[9] ); c = atof ( argv[10] ); debug = host_parm_ptr->debug; lambda = U / host_parm_ptr->mu; host_parm_ptr->num_nodes = power(host_parra_ptr->dim) ; printf("Simulation of Load Balancing on Multicomputer\n"); printf (" \n") ; printf ("Number of computer nodes = 7,d\n", host_parm_ptr->num_nodes); printf ("DIM = 7.d\n" ,host_parm_ptr->dim) ; printf ("Average system utilization U = 7..2f\n", U) ; printf ("Average process arrival rate lambda = '/,.2f/s\n", lambda); printf ("Average process service rate mu = 7.-2f/s\n", host_parm_ptr->mu); printf ("The variation coefficient for lambda = * / , . 2f \n" , c) ; if ( host_parm_ptr->scheduling =?= FCFS ) printf("CPU scheduling is FCFSXn"); else printf ("CPU scheduling is Round Robin with quantum=*/,ld ms\n", host_parm_ptr->scheduling); printf ("Update threshold window w=*/.d ms\n" ,host_parm_ptr->w) ; printf ("Max time = * / . d ms\n" ,host_parm_ptr->max_time) ; printf("CIC update time window coefficient k = 7,.2f\n",k); printf("Load Sharing Policy is "); switch ( host_parm_ptr->lb ) ■ { case N0_LB: printf("No load sharing\n"); break; case LRR: printf("local RR migration\n"); break; case GRR: printf("global RR migration\n"); break; case LML: printf("local minimal load migration\n"); break; case GML: printf("global minimal load migration\n"); break; default: break; > 240 /* powerO */ /* Power of 2. * / long power(dim) long dim; ■ c long i,n; if ( dim == 0 ) return(l); else { n = 2 ; for ( i = 1; i < dim; i++ ) n = n * 2 ; return(n); > > /* init_host() */ /* Host initialization: setting time */ /* window and initializing hostlodmsg. */ void init_host() int i; start_time = mclockO; clock = mclockO - start_time; setpid(HOST_PID); /* set the host pid */ host_pid = mypidO; /* get process id for this process * / host_node = mynodeO; /* get the number of this node */ printf ("host_node = 7,d, host_pid = '/.d\n" , host_node, host_pid); w = host_parm_ptr->w * 1 0 0 0; old_w = w; old_load_avg = -1; done = NO; done_nodes = 0 ; host_l_msg = fehostloadmsg; host_l_msg->load_avg = 0 ; node_l_msg = &nodeloadmsg; finish = NO; for ( i = 0 ; i < num_nodes; i++) { finish_node[i] =0 ; host_l_msg->load_tab[i] = 0 ; host_l_msg->tag[i] = NO; > 241 /* send_parm() */ /* Send simulation parameters to nodes. */ void send_parm() ■ c int i; for ( i = 0 ; i < num_nodes; i++ ) { host_parm_ptr->node = i; host_parm_ptr->lambda = var(i); csend(PARM_TYPE,&host_parm_msg, sizeof(host_parm_msg),i,N0DE_PID); > > /* varO */ /* Generate variation of arrival rate */ /* based on imbalance factor. */ double var(i) int i; { double r,x; if ( num_nodes == 1 ) return(lambda); else { r = ( 2 * c ) / ( num _nodes-l); x = ( lambda - c ) + i * r ; r e tu r n ( x ) ; > > /****************************************/ /* get_node_parm() */ /* Get simulation parameters from node. */ /* The purpose is to see if the sending */ /* is correct and check the variation */ /♦of arrival rate based on the c. */ void get_node_parm() { int i; float r[MAX_N0DE]; float R.v.s: R = 0; for ( i = 0 ; i < num.nodes; i++ ) - [ crecv(PARM_TYPE,&host_parm_msg,sizeof (host.parm.msg) ) ; if ( debug ) printf ("from node * / , d : lambda = 'hi, mu = ' / , f w = */,ld ms\n host_parm_ptr->node,host_parm_ptr->lambda, host_parm_ptr->mu,host_parm_ptr->w); r[i] = 1 / ( i - host_parm_ptr->lambda ); R += r [i] ; > R = R / num.nodes; v = 0 ; for ( i = 0 ; i < num.nodes; i++ ) • { v += ( R - r[i] ) * ( R - rCi] ); } v = v / num.nodes; s = sqrt(v); printf (" expected R = ' / . f , V = 'hi RS = 'hi \n",R,v,s); y / * receive.f inalO */ /* Receiving the final statistics report*/ /* from each node. */ void receive.final() -t FINMSG host.recv.msg; FMSG host.recv.ptr; long i; host.recv.ptr = fehost.recv.msg; crecv(FINAL_TYPE, fchost.recv.msg,sizeof(host.recv.msg)); if ( check.final(host.recv.ptr)) { printf (" from node */.ld\n" ,host_recv_ptr->node) ; i = host_recv_ptr->node; printf("total num of tasks = %ld\n", host_recv_ptr->num_task); printf ("total time = 'hi\n" ,host_recv_ptr->total_time) printf ("average waiting time = 7,f\n" , host_recv_ptr->avg_wait_time); rs[i] = host_recv_ptr->avg_response_time; printf ("average length = ‘ /.fXn", host_recv_ptr->avg_length); printf ("average exe time= ’ /,f\n" , host_recv_ptr->avg_exe_time); printf ("average response time = 7,f\n", host_recv.ptr->avg.response.time); printf ("average waiting process = '/.f\n", host_recv_ptr->avg_wait_proc); printf ("average load = * / , f \n" ,host_recv_ptr->avg_load) ; > > /* f inal.resultO */ /* Report the final simulation results. */ void final_result() { int i; double stv,stvl; RS = 0; for ( i = 0 ; i < num_nodes; i++ ) - ( RS += rs[i]; > RS = RS / num.nodes; stv = 0 ; stvl = 0 ; for ( i = 0 ; i < num.nodes; i++ ) { stvl += (rs[i] - RS) * ( rs[i] - RS ); > stv = ( stvl / num.nodes ); printf("* *********************** ***\n"); printf ("Mean response = 7,f\n",RS); printf ("variance of mean response time = 7,i\n" ,stv) ; J / * check.f inalO */ /* Check to see if all final messages */ /* are received. If so, return YES, */ /* otherwise return MO. */ int check.final(msg) FMSG msg; - c int i; int ok; int num.finish; num.finish = 0 ; for ( i = 0 ; i < num.nodes; i++ ) { if ( msg->node == i ) { if ( finish.node[i] == 0 ) { finish.node[i] = YES; ok = YES; 244 } else{ ok = NO; > break; > > for ( i = 0 ; i < num_nodes; i++ ) { if ( finish.node[i] == YES ) num_finish++; > if (num.finish == num.nodes ) finish = YES; return(ok); /* get.msgO */ / * Receive a message by calling the * / / * communication primitives. */ void get.msgO GENMSG gmsg; GMSG g_msg; g.msg = fegmsg; crecv(GMSG_TYPE,&gmsg,sizeof(gmsg)); check_msg(g_msg); > /* check.msgO */ /* Check message type to decide how */ /* respond to it. Appropriate actions */ /* are taken. */ void check_msg(g_msg) GMSG g.msg; • c NMSG node.msg; NDDEMSG nodemsg; PROC process; P proc; EVNMSG eventmsg; EMSG e.msg; RECORDING recordmsg; RE r_msg; switch ( g_msg->type ) { case NMSG.TYPE: memcpy((char *)&nodemsg, (char *)g_msg->body.sizeof(nodemsg)); node.msg = fenodemsg; print.node(node.msg); if ( node_msg->stop == YES ) - [ done_nodes++; if ( done.nodes == num.nodes ) done = 1; > break; case PMSG.TYPE: memcpy((char *)&process, (char *)g_msg->body,sizeof(process)); proc = ^process; host_print_proc(proc); break; case EMSG.TYPE: memcpy((char *)&eventmsg, (char *)g.msg->body.sizeof(eventmsg)); e.msg = feeventmsg; print.event(e.msg); break; case RMSG.TYPE: memcpy((chax *)&recordmsg, (char *)g_msg->body,sizeof(recordmsg)); r.msg = ferecordmsg; print.record(r.msg); break; case LMSG.TYPE: memcpy((char *)&nodeloadmsg, (char *)g.msg->body,sizeof(nodeloadmsg)); record.load(node.l.msg); break; default: break; > > / * ram ( ) * / /* Random number generator. */ float ram() { double drand48(); float d; d = drand48(); 246 r e t u r n ( d ) ; > /* done.signalO */ /* Send a signal to all nodes,which says*/ /* each node has done. */ / l i e * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / void done_ s ignalC) { int i; GENMSG gmsg; GMSG g_msg; g.msg = fegmsg; g_msg->type = FINAL.TYPE; for ( i = 0 ; i < num.nodes; i++ ) csend(GMSG.TYPE,fegmsg,sizeof(gmsg),i,NODE.PID); > / H e * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / /* ask.loadO */ /* Send message to ask load from each */ /* node. */ /****** jit**************************** *****/ void ask.loadO { int i; GENMSG gmsg; GMSG g.msg; g.msg = fegmsg; g_msg->node = host.node; g_msg->type = AMSG.TYPE; for ( i = 0 ; i < num.nodes; i++) { csend(GMSG_TYPE,fegmsg,sizeof(gmsg),i,N0DE_PID); > > /* record.loadO */ /* Record each node's load in the table */ /* after receiving the nodeloadmsg. */ / * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / void record.load(node.l.msg) NLMSG node.l.msg; ■ C long i; int done; float load; void send_loadinfo(); i = node_l_msg->node; done = 0 ; load = 0 ; host_l_msg->tag[i] = YES; host_l_msg->load_tab[i] = node_l_msg->load; for ( i=0 ; i< num.nodes; i++) { if ( host_l_msg->tag [i] == YES ) - [ load += host_l_msg->load_tab[i] ; done++; > } if C done == num.nodes ) { host_l_msg->load_avg = load / num.nodes; set_time_w(old_load_avg,host_l_msg->load_avg); old.load.avg = host_l_msg->load_avg; if ( debug ) print_load(); send.loadinfo(host.l.msg); for ( i=0 ; i< num.nodes; i++) • [ host_l_msg->load_tab[i] = 0 ; host_l_msg->tag[i] = MO; > > > /I****************************************/ /* send.loadinfo() */ /* Send system load information to each */ /* node. */ /♦He**************************************/ void send.loadinfo(host.l.msg) HLMSG host.l.msg; GENMSG gmsg; GMSG g.msg; long i; g.msg = &gmsg; g_msg->type = TMSG.TYPE; g_msg->node = host.node; memcpy((chax *)g_msg->body, (char *)host_l_msg, sizeof(HLOADMSG)); for ( i = 0 ; i< num.nodes; i++) { csend(GMSG.TYPE,fegmsg.sizeof(gmsg),i.NODE.PID); 248 > > /****************************************/ /* set_time_w() * / / * Set and adjust the time window. */ /****************************************/ void set_time_w(old_avg,new_avg) float old_avg; float new_avg; - c float x; if C old_load_avg != -1 ) { if ( old_avg != new_avg ) { if ( old_avg > new_avg && old_avg != 0 ) x = flt_abs( old_avg - new_avg ) / old_avg; else x = flt_abs( new_avg - old_avg ) / new_avg; if ( x <= k ) w=(l-x)* old_w; else { if C (1-k) < x && x < k ) w = (1-k) * old_w; else w = x * old_w; > } else w = (l+k) * old_w; old_w = w; > > /I*************************************************/ /* flt_abs() */ /* Return the absolute value of a float. */ /in************************************************/ float flt_abs(i) float i; ■ c return( (i>= 0 )?i : (-i) ); } C.1.3 Host Print M odule 249 /*************************************************************/ /* host.c */ /* This is the host program for PSIM. It does the loading, */ /* performs the role of central information collector, and */ /* displays all messages on the screen. */ /♦I************************************************************/ #include "host.h" /* print_event() */ /* Print an event. */ void print_event(e_msg) EMSG e_msg; ■ c P proc; void host_print_proc(); printf(" ** event ** \n"); printf ("time = ' / . f \n" ,e_msg->time) ; printf ("type = '/,d\n" , e_msg->type) ; if ( e_msg->type != STOP ) - [ proc = &(e_msg->process); host_print_proc(proc); > / * host_print_proc() */ /* Print a process at host. */ void host_print_proc(p) p p; ■ C if ( p != NULL ) i printf(" -- proc -~\n"); printf ("id — */.d\n" ,p->id) ; printf ("length -- * / , f \n" ,p->length) ; printf ("arrival -- 7,f\n" ,p->arrival) ; printf ("depart -- ' / . f \n" ,p->depart) ; printf ("start -- * / , f \n" ,p->start) ; printf ("wait — * / , f \n" ,p->wait) ; printf ("remain -- * / . f \n" ,p->remain) ; > else printf("Proc is NULL\n"); > 250 /* print_node() */ /* Print node status. */ void print_node(node.msg) NMSG node.msg; printf ("Node '/,ld:timer = */,f, num tasks = '/.Id, stop=*/,ld\n", node_msg->node, node_msg->timer, node_msg->num_tasks, node_msg->stop); } /a****************************************/ /* print_record() */ / * Print the node recording message. */ void print.record(r.msg) RE r.msg; { printf ("time = * / . f , l.in = 7,ld, wait=*/,ld\n" , r_msg->time,r.msg->1.in,r.msg->wait); > /* print.loadO */ /* Print the system load information. */ void print.loadO ■ c int i; printf ("load avg = ' / , f \n" ,host_l.msg->load_avg) ; for(i= 0;i< num.nodes; i++ ) { printf ("load [’ /.d] =’ / , f , " , i ,host_l_msg->load_tab [i] ) ; > printf("\n"); > C .1 .4 H o s t D e c la ra tio n M o d u le /* host.decl.c */ /* This file contains the declarations of global */ /* variables of the host program. */ /alt***************************************************/ ftinclude <stdio.h> 251 #include "host.h" PARMMSG host_parm_msg; /* parameter msg from host */ PMSG host.parm.ptr; /* pointer to host.parm.msg */ HLOADMSG hostloadmsg; /* host load message */ HLMSG host_l_msg; /* pointer to hostloadmsg */ NLOADMSG nodeloadmsg; /* node load message */ NLMSG node_l_msg; / * pointer to nodeloadmsg */ long msg_size; 1* length of a message */ long host.node; / * host node number */ long num.nodes; 1* total number of nodes * / long host.pid; /* host process id */ long done; I* done flag */ long done.nodes; / * all nodes done */ int finish.node[MAX..NODE]; /* flag for node finish */ int finish; /* all finished * / long debug; double start.time; /* starting time */ double clock; / * current time */ double old.clock; / * last recorded time */ long w; /* time window */ long old.w; /* last time window */ float old.load.avg; /* last load average * / double k; /* time window adjustable */ double c; /* load imbalance factor */ double U; /* mean utilization */ double lambda; /* mean arrival rate * / double RS; /* mead response time */ double rs[MAX.NODE] ; /* node mean response time C.2 N o d e P rogram C .2 .1 N o d e H e a d e r M o d u le / * n o d e .h * / / * H ead er f i l e o f n o d e p rogram s f o r PSIM. * / # i n c l u d e < s t d i o . h > j j ( < 3 ^ ) t ; j f c s ( e : f c ) f : 3 f c j | e s ^ s ) e s i c j f c 3 + e j f c > t c 5 f c s f c H c ) f c s | < * J t : j f c J t c : ( c s f c 5 f c H c j ( c j t c j $ c j & < ! t < ) $ e j | c J t c i f c H c : < e / / * d e f i n e c o n s t a n t s */ 252 / Jit******************************** * * * * * * * / #define H0ST_PID 100 /* Process id of the host process */ #define NODE.PID 0 / * Process id of the node processes */ #define START.TYPE 0 / * Message type * / #define START.LEN 0 /* Length of "start" message * / #define PARM.TYPE 10 /* type of parameter msg */ #define GMSG.TYPE 15 /* type of generic msg * / #define FINAL.TYPE 20 /* type of final msg * / #define NMSG.TYPE 2 /* node status * / #define EMSG.TYPE 3 /* event */ #define PMSG.TYPE 4 /* proc */ #define RMSG.TYPE 5 /* record */ #define AMSG.TYPE 6 / * ask for load */ #define TMSG.TYPE 7 /* pass threshold */ #define LMSG.TYPE 8 /* get load from node */ #define MMSG.TYPE 9 /* migrate process */ #define BAD.NODE - 1 0 0 /* flag for a bad node */ #define FCFS 9 /* first come fist serve * / #define RB 1 /* round-robin */ #define YES 1 #define NO 0 #define NO.LB 0 /* no load balancing */ #define LRR 1 /* local Round-Robin */ #define GRR 2 /* global Round-Robin */ #define LML 3 / * local min load * / #define GML 4 /* global min load */ #define MAX.TIME 999999999 /* event type */ #define ARRV 1 /* arrival */ #define DEPT 2 /* departure */ #define EXEC 3 /* execution */ #define STOP 4 /* each node finish */ / * process type */ #define LOCAL 1 #define REMOTE 2 #define MAX.NODE 32 /* max number of nodes */ #define MSGLEN 100 /* message length */ / * d a t a s t r u c t u r e * / t y p e d e f s t r u c t parm_msg { I* s i m u l a t i o n p a r a m e te r * / l o n g n u m .n o d e s; / * number o f n o d e s * / l o n g n o d e ; / * n o d e number * / l o n g dim ; / * d im e n s io n * / 253 long lb; /* load balancing method * / double lambda; /* system arrival rate */ double mu; /* system service rate * / long scheduling; / * CPU scheduling method */ long max _time; /* maximum simulation time */ long w; /* initial time window */ long debug; /* debug flag */ > PARMMSG, *PMSG; typedef struct final_msg ■ { /* reporting from each node */ long node; /* node number */ long num_task; /* number of tasks */ float total_time; /* total simulation time * / float avg_wait_time ; /* average waiting time */ float avg_length; /* average length of task */ float avg_exe_time; /* average execution time * / float avg_response_time /* average response time * / float avg_wait_proc ; /* average number of waiting tasks */ float avg_load; /* average load */ > FINMSG, *FMSG; typedef struct node_msg{ /* node status message */ long node; /* node number */ long stop; /* flag for local done */ float t imer; / * current time */ float old_timer; /* last time */ long num_tasks; /* number of tasks in queue */ > NODEMSG, *NMSG; typedef struct proc { /* structure of a process */ long id; / * process id */ long type; /* local or remote */ float arrival; /* arrival time */ float depart; /* departure time */ float length; /* execution length */ float start; /* start execution time */ float wait; /* waiting time * / float remain; /* remaining unfinished time * / } PROC, *P; typedef struct event { /* event */ float time; /* event time */ long type; /* event type *1 P process; /* process associated. */ struct event *next,*last; /* pointers in list */ > EVENT, *E; typedef struct event_msg { /* message for an event */ float time; /* time */ long type; /* event type */ PROC process; /* pointer to the process */ 254 > EVNMSG, *EMSG; typedef struct recording { I* structure for recording statistics * / float time; /* time */ long l_in; /* load at the time * / long wait; /* number of processes waiting */ struct recording *next; /* pointer to the next */ > RECORDING, *RE; typedef struct generic_msg { /* generic message */ long type; /* message type */ long node; /* source node */ char body[MSGLEN]; /* message body * / } GENMSG, *GMSG; typedef struct hostload_msg { /* host load info, message */ float load_avg; /* system load average */ float load_tab[MAX_N0DE]; /* node load table */ short tag[MAX_NODE]; / * tag for checking receiving */ > HLOADMSG, *HLMSG; typedef struct nodeload_msg { /* node load message */ long node; / * source node * / float load; / * load of the node */ > NLOADMSG, *NLMSG; typedef struct candidate { /* entry of candidate list * / long node; /* node number */ float load; / * node of the node * / struct candidate *next,*last; /* pointers */ > CAND, *C; typedef struct queue { / * entry of the queue */ P process; /* pointer to a process * / struct queue *next,*last; /* pointers * / > QUEUE, *Q; typedef struct proc_record { /* entry of process record list */ P process; / * pointer to a process */ struct proc.record *next,*last; } PR0C_RE, *PR; / * global variables */ extern PARMMSG host_parm_msg; /* parameter msg from host */ extern PMSG host_parm_ptr; /* pointer to host_parm_msg */ extern HLOADMSG hostloadmsg; /* host load message */ extern HLMSG host_l_msg; I* pointer to hostloadmsg */ extern NLOADMSG nodeloadmsg; / * node load message */ extern NLMSG node_l_msg; /* pointer to nodeloadmsg */ 255 extern GENMSG gmsg; / * generic message */ extern GMSG g.msg; /* pointer to gmsg * / extern long num.nodes; j * total number of nodes */ extern long cpu_quantum; /* CPU quantum */ extern long scheduling; /* CPU scheduling method * / extern long max_time; /* maximum time */ extern long w; /* time window */ extern long stop; / * node done flag */ extern long num_proc; /* number of process */ extern long status; /* node status * / extern long ok_to_recv; /* ready to receive a msg */ extern long my_node; /* my node number * / extern long host.node; / * host node number */ extern long debug; / * high level debug flag */ extern long debug1; / * low level debug flag */ extern long done; /* system done flag */ extern long ready.length; /* length of ready queue */ extern long trans.length; / * length of transfer queue */ extern long threshold; /* threshold value */ extern long neighbors[MAX.NODE] ; / * neighborhood flag array */ extern long DIM; /* hypercube dimension */ extern long LB; / * load balancing method */ extern long flag; extern unsigned long start_time; /* start time */ extern double lambda; /* node arrival rate */ extern double mu; / * node service time */ extern float timer; / * time clock * / extern float old.timer; /* last recorded time * / extern float cpu.clock; /* CPU clock */ extern float cpu.idle; /* CPU idle time */ extern float last.arrival; /* last arrival time */ extern float l_in; /* local load */ extern E event.list; /* event list */ extern Q ready.q; /* ready queue */ extern Q trans.q; / * transfer queue */ extern PR proc.list; /* process list */ extern RE record.list; / * recording list */ extern C cand.list; / * candidate list */ extern C new.list; /* new list * / /* function returns */ extern E event_alloc(); /* allocate event space*/ extern P proc_alloc() ; /* allocate process space */ extern C C_alloc(); /* allocate a candidate */ extern PR pr_alloc(); /* allocate process list entry space */ extern RE re_alloc(); /* allocate record space */ 256 extern void node_init(); extern void report(); extern void simO ; extern void receive_host_parm(); extern void initO ; extern void create_cand_list(); extern void create_e_list(); extern void get_arrival(); extern void proc.event(); extern void add_e_list(); extern void add_proc_list(); extern void update..ready_q(); extern E get_event(); extern void release.eventC); extern P get_proc(); extern float quantum_left; extern float get_cpu(); extern float get_next_time(); extern float check_next_time(); extern float expoO ; extern void check.next(); extern void get_msg(); extern void check_msg(); extern. void send_nodestat(); extern void send_load(); extern void send_proc(); extern void send_record(); extern void send_event(); extern void report(); extern void record_data(); extern int check_wait(); extern void print_e_list(); extern void print_cand_list(); extern void print_proc(); extern void print_event(); extern void print_proc_list(); extern void print_q(); extern void print_entry(); extern long get_cand(); extern long get_RR(); extern long get_ML(); extern void set.threshold(); extern void add_queue(); 257 extern void order_cand_list(); extern void sort_ceuid_list C) ; extern void add_cand_list(); extern void proc_trans(); extern void migrate(); extern void recv.procO ; extern void out_proc_list(); extern void enqueue_ready(); extern void enqueue.transQ ; extern Q dequeue_ready(); extern P dequeue_trans(); C.2.2 Node Module /ale***********************************************************/ /* node.c */ /* This files contains the functions as major driven and */ /* the beginning parts of the node program. * / / * It includes: */ /* mainQ, node_init(), receive_parm() . */ /a************************************************************/ #include "node.h" / a ( < a ( < a * : a ( c a t c a ( t a ( ( a | t a | c a t ; 3 ( < a ) e a ( : a ( < j ( < a ( c a ) c a t c > ( c a t ; a ( c a ( c a t ! a t c a | < a ( c a ( t a ( c a ) t a f ; a ( t a ( t a ( c a ) c a ( c a f ; a t c a ( ( 3 ^ a ( a / /* main() */ /* The main initializes the node, then */ /* performs node simulation and reports */ /* the results to host. */ main() ■ c node_init() ; sim(); /* in sim.c */ report(); /* in record.c */ } /****************************************/ /* node_init() */ /* Initialize the node variables. */ /if:******************************:*********/ void node_init() -t my_node = mynode(); host_node = myhostO; receive_host_parm(); host_l_msg = &hostloadmsg; node_l_msg = fenodeloadmsg; g_msg = &gmsg; 258 threshold = 5; ok_to_recv = YES; > /* receive_host_parm() */ /* Receive the simulation parameters */ /* from the host. */ void receive_host_parm() { PARMMSG host_parm_msg; PMSG host_parm_ptr; double drand48(); crecv(PARM_TYPE,&host_parm_msg,sizeof(host_parm_msg)); host_parm_ptr = &host_parra_msg; num.nodes = host_parm_ptr->num_nodes; lambda = host_parm_ptr->lambda; mu = host_parm_ptr->mu; scheduling = host_parm_ptr->scheduling; if ( scheduling != FCFS M cpu_quatum = scheduling; scheduling = RB; > max_time = host_parm_ptr->max_time; w = host_parm_ptr->w; DIM = host_parm_ptr->dim; debug = host_parm_ptr->debug; LB = host_parm_ptr->lb; csend(PARM_TYPE,&host_parm_msg, sizeof(host_parm_msg),host_node,H0ST_PID); C.2.3 Initialization M odule /* init.c */ /* This file contains functions to initialize the data */ /* structures and global variables for node simulation. */ /* It includes: */ /* initO, create_e_list() , create_cand_list() */ #include "node.h" 259 /* init() */ /* Initialize the global variables and */ /* various lists, including creating the*/ /* event list and candidate list. */ /* a * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / void initO • c void srand48(); srand48(2); start_time = mclock(); last_arrival = 0 ; debugl = NO; old_timer = 0; cpu_clock = 0; cpu.idle = 0 ; create_e_list(); ready_q = NULL; trans_q = NULL; num.proc = 0 ; proc_list = NULL; record_list = NULL; l_in = 0 ; ready_length = 0 ; trans_length = 0 ; stop = NO; done = NO; cand_list = NULL; new_list = NULL; create_cand_list(); > /* create_e_list() */ /* Create event list. */ /****************************************/ void create_e_list() { E event; event = event_alloc(); event->time = MAX_TIME; event->type = STOP; event->process = NULL; event->next = NULL; event->last = NULL; event_list = event; > 260 /****************************************/ /* create_cand_list() */ /* Create the candidate list based on */ /* load balancing method chosen and the */ /* number of hops between nodes. */ /a****************************************/ void create_cand_list() { int i,j,x,y,hop,node; C cand; if ( LB == N0_LB ) { cand_list = NULL; > else { node = my_node; for (j = 0 ; j < num_nodes; j++ ) { hop = 0 ; x = j ~ node; for ( i = 0; i< DIM; i++ ) { y = x >> (i); if ( ( y & 1 ) == 1 ) hop++; > neighbors[j] = hop; } for ( j = 0 ; j< num_nodes; j++ ){ if ( LB == LRR II LB == LML ) { if (neighbors[j] == 1 ) { cand = C_alloc(); cand~>node - j; cand->load = 0 ; cand->next = NULL; add_cand_list(cand); > > if ( LB == GRR II LB == GML ) { if ( j != my_node ) { cand = C_alloc(); cand->node = j; cand->load = 0 ; cand->naxt = NULL; add_cand_list(cand); > > > > > 261 C .2.4 S im u latio n C o n tro l M o d u le / * s im.c * / / * This files contains the functions of node simulation. */ /* It includes: */ /* simO , get_arrival() , get_proc() */ /* add_e_list(), add_proc_list(), expo() */ #include "node.h" / * sim() */ /* The controller of node simulation. */ / * The termination is controlled by the */ / * done flag which is set after getting */ / * the signal from the host. */ void sim() { initO; /* in init.c */ while ( ! done ) - { if ( debug1 ) print_e_list(); if ( last_arrival < max_time ) { get.arrival(); > proc.eventO ; /* in proc.c */ get_msg(); /* in comm.c * / > timer = mclockO - start.time; record_data(timer); /* in record.c * / > /* get_arrival() */ / * Get arriving process, put it to the */ /* process list and create an event. */ /****************************************/ void get arrivalO { P proc; num_proc++; proc = get_proc(); add_e_list(proc->arrival,proc,ARRV); add_proc_list(proc); > 262 /* get_proc() */ /* Generate a process based on the sim- */ /* ulation parameters. */ P get_proc() ■ c P proc; proc = proc_alloc(); proc->id = num_proc; proc->type = LOCAL; proc->length = ( expoO / mu) * 1 0 0 0; /* in ms */ proc->arrival = last_arrival + ( expoO / lambda ) * 1 0 0 0; /* in ms */ last_arrival = proc->arrival; proc->start = 0 ; proc->wait = 0 ; proc->remain = proc->length; if C debugl ) send_proc(proc); return(proc); > / * addle_list() * / / * Create a new event and add it to the */ / * event list. Event list is ordered by */ /* by the event time. */ void add_e_list(clock,proc.type) float clock; P proc; int type; { E event; E tmp; event = event_alloc(); event->time = clock; event->type = type; event->process = proc; if ( debugl ) send_event(event); for ( tmp = event_list; ( ( tmp->time <= event->time) && ( tmp->next != NULL) ); tmp = tmp->next ) if ( tmp->time > event->time ) { if ( tmp == event_list ) { /* add to head */ event->next = tmp; event->last = NULL; event_list = event; tmp->last = event; > else { / * insert * / tmp->last->next = event; event->last = tmp->last; event->next = tmp; tmp->last = event; > > else { /* tmp is tail */ tmp->next = event; event->last = tmp; event->next = NULL; > > /* add_proc_list() */ /* Add a process to the process list. */ /* The process list is used to record */ /* the statistics of each process. * / void add_proc_list(proc) P proc; • c PR prd, tmp; prd = pr_alloc(); prd->process = proc; prd->next = NULL; if ( proc_list == NULL ) i proc_list = prd; proc_list->last = prd; > else { tmp = proc_list->last; tmp->next = prd; prd->last = tmp; proc_list->last = prd; > j i f : * * * * * * * * * * * * * * * * fc* * * 264 /* expoO */ /* Exponential function. */ float expoC) - c double drand48() ,logO ; float d; d = - (log (drand48())); return(d); > C .2 .5 P ro c e ss E v e n t M o d u le /* proc.c */ /* This file contains function which process an event. It */ /* is a part of the event driven control. * / /* It includes: */ /* proc_event(), update_ready_q(), get_cpu(), */ /* check_next(), release_event(), get_event(), */ /* get_next_time(), dequeue_ready(). */ /***********************************************************/ #include "node.h" /* proc.eventO */ /* Process an event. */ /I********************************* *******/ void proc_event() { E event; float clock; if ( debugl ) print_e_list(); if ( event_list != N U LL ) { timer = mclockO - start_time; if ( event_list->type == STOP 11 event_list->time <= timer ) { if ( event_list->type != STOP ) event = get_event(); else event = event_list; clock = event->time; switch ( event->type ) { 265 case ARRV: add_queue(0,event->process); l_in++; if ( cpu.clock <= clock ) - [ /* cpn idle */ update_ready_q(clock); check_next(); } break; case EXEC: add_queue(l,event->process); if ( cpu_clock <= clock ) /* cpu idle */ update_ready_q(clock); break; case DEPT: release_event(event); l_in--; if ( cpu_clock <= clock ) /* cpu idle */ update_ready_q(clock); break; case STOP: if ( stop == NO ) { stop = YES; send_nodestat(); > break; } if ( event->type != STOP ) { check_next(clock); record_data(clock); > > > if (LB != NO_LB ) proc_trans(); > /* update_ready_q() * / / * Dispatch the ready queue.If there is */ /* CPU time left, assign it to the 1st */ /* process in the ready queue and dis- */ /* patch the process. */ /lie***************************************/ void update_ready_q(clock) float clock; { Q q; P proc; double x; float quantum_left; if ( debugl ) { print_q(ready q); > 266 q = dequeue.readyO; if ( q != NULL ) { switch ( scheduling ) - [ case RB: quantum.left = get_cpu(clock,q); if ( quantum_left < 0 ) { ^ add_e_list(clock+cpu_quantum,q->process,EXEC); free(q); > break; case FCFS: x = mclockO - start.time; clock = x; cpu.idle += clock - cpu.clock; proc = q->process; proc->start = clock; proc->wait = clock - proc->arrival; cpu.clock = clock + proc->length; proc->depart = cpu_clock; proc->remain = 0; add_e_list(proc->depart,proc,DEPT); free(q); break; default: break; > > > /* get_cpuO */ /* Assign the cpu time to a process. */ float get_cpu(clock,q) float clock; Q q; - C P proc; float quantum.left; Q q; double x; cpu.idle += clock - cpu.clock; proc = q->process; if ( proc->start == 0 ) { proc->start = clock; proc->wait = clock - proc->arrival; > cpu.clock = clock + cpu.quantum; quantum.left = cpu.quantum - proc->remain; 267 if ( quantum.left >= 0 ) - [ proc->depart = clock + proc->remain; proc->remain = 0 ; add_e_list(proc->depart,proc,DEPT); free(q); if ( quantum.left > 0 ) /* quantum not fully used */ cpu_idle += quantum.left; } else { proc->remain = proc->remain - cpu_quantum; > return(quantum.left); > /* check_next() */ /* Check next event in the event list. */ void check.next() { float now.clock; float next.clock; float quantum.left; float period; Q q; if C scheduling == RB ) { now.clock = cpu.clock; next.clock = get_next_time(); quantum.left = 0 ; period = next.clock - now.clock; if ( next.clock != MAX.TIME ) { while C period > cpu.quantum ) { q = dequeue.readyO ; if ( q == NULL ) { printf(“ ** ready q NullW); break; > quantum.left = get_cpu(now_clock,q); if C quantum.left < 0 ) add_e_list(now_clock+cpu_quantum,q->process,EXEC); period = period - cpu.quantum; now.clock = now.clock + cpu.quantum; > > } > /if:***************************************/ /* release.eventO */ 268 /* Free th e space of an e v e n t. * / void release.event(event) E event; - c cfree(event); > Z* get_event() */ Z* Get an event from the head of the */ /* event list. */ E get.eventO - c E event; if ( event.list != NULL ) { event = event.list; event.list = event->next; if ( event.list != NULL ) event_list->last = NULL; event->last = NULL; event->next = NULL; return(event); > else return(NULL); > /He*************’ !'**’ ! ' ’ ! ' ’ )'*******************1 ) 1 / /* get.next.timeO */ /* Get the time of the next event. */ float get.next.timeO - C float t ; if ( event.list != NULL ) t = event_list->time; else t = 0 ; return(t); > Z* dequeue_ready() */ / * Dispatch the first entry from the */ 269 /* ready queue. */ /****************************************/ Q dequeue_ready() { Q q; if ( ready_q != NULL ) { ready.length- -; q = ready.q; ready.q = q->next; if ( ready_q != NULL ) ready_q->last = q->last; q->last = NULL; q->next = NULL; return(q); } else return(NULL); > C .2 .6 N o d e C o m m u n ic a tio n M o d u le / * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / /* node.comm.c */ /* This files contains the functions for node communication.*/ /* subsystem, which message passing functions. */ /* It includes: */ /* get_msg(), check.msgC), send_load(), */ /* send_nodestat(), send_proc(), send_record(), */ /* send_event(). */ /*************************************************************/ #include "node.h" /****************************************/ /* send.nodestat() */ /* Send node status to the host. */ /He***************************************/ void send.nodestat() ■ c NODEMSG nodemsg; NMSG node.msg; GENMSG gmsg; GMSG g_msg; node.msg = fenodemsg; node_msg->node = my.node; 270 node_msg->timer = timer; node_msg->old_timer = old_timer; node_msg->stop = stop; node_msg->num_tasks = num_proc; g_msg = fegmsg; g_msg->type = NMSG_TYPE; g_msg->node = my_node; memcpy((char *)g_msg->body, (chax *)&nodemsg, sizeof(nodemsg)); csend(GMSG_TYPE,&gmsg,sizeof(gmsg),host_node,H0ST_PID); > /* send_load() */ /* Send local load to the host. */ /if:***************************************/ void send_load() { if ( debug ) printf("send load at node '/.ld,l_in='/.f ready'/,ld, trans ’ /,ld\n", my_node,l_in,ready_length,trans_length) ; node_l_msg->node = my_node; node_l_msg->load = l_in; g_msg->type = LMSG.TYPE; g_msg->node = my_node; memcpy((char *)g_msg->body, (char *)&nodeloadmsg, sizeof(nodeloadmsg)); csend(GMSG_TYPE,&gmsg,sizeof(gmsg),host_node,H0ST_PID); } /* get_msg() */ /* Receive generic message from the host*/ /* or other nodes. The message can be */ /* the information message or migrated */ /* processes from remote node. */ void get_msg() ■ c GENMSG gmsg; GMSG g_msg; g_msg = &gmsg; if ( iprobe(GMSG_TYPE) ) { erecv(GMSG_TYPE,fegmsg.sizeof(gmsg)); check_msg(g_msg); } > 271 /****************************************/ /* check_msg() */ /* Check the message type and take the */ /* appropriate actions as response. */ void check_msg(g_msg) GMSG g_msg; { switch ( g_msg->type ) { case AMSG.TYPE: send_load(); break; case TMSG.TYPE: set_threshold(g_msg); break; case FINAL_TYPE: printf ("receive final type at '/,d\n" ,my_node) ; done = YES; break; case MMSG_TYPE: recv_proc(g_msg); break; default: break; } > /I****************************************/ /* send_event() */ /* Send an event to the host. */ void send.event(event) E event; GENMSG gmsg; GMSG g-msg; g_msg = fegmsg; g_msg->type = EMSG_TYPE; g_msg->node = my_node; memcpy((char *)g_msg->body, (char *)event, 2*sizeof(long)); memcpy((char *)(g_msg->body)+(2*sizeof(long)), (char *)event->process,sizeof(PROC)); csend(GMSG_TYPE,&gmsg,sizeof(gmsg),host_node,H0ST_PID); } /****************************************/ 272 /* send_record() */ /* Send a statistic record to host. */ /*****************#**********************/ void send_record(record) RE record; { GENMSG gmsg; GMSG g_msg; g_msg = &gmsg; g_msg->type = RMSG_TYPE; g_msg->node = my_node; memcpy((char *)g_msg->body, (char *)record, sizeof(RECORDING)); csend(GMSG_TYPE,fegmsg,sizeof(gmsg),host .node,H0ST_PID); > /****************************************/ /* send_proc() */ /* Send a process to the host. */ / I * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / void send_proc(proc) P proc; • c GENMSG gmsg; GMSG g_msg; g_msg = fegmsg; g_msg->type = PMSG_TYPE; g_msg->node = my_node; memcpy( ( char * ) g_msg->body, ( char *)proc, sizeof(PROC)); csend(GMSG.TYPE,&gmsg,sizeof(gmsg),host.node,H0ST_PID); } C .2 .7 T h re s h o ld U p d a te M o d u le /* thresh.up.c */ /* This files contains the functions for updating threshold */ / * and candidate list. */ /* It includes: */ /* set_threshold(), order.cand.list(), */ /* sort.cand.listO , add_cand_list() , */ /I*************************************************************/ #include “node.h" 273 / * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / /* set_threshold() */ /* Update the threshold according to the*/ /* most recently received hostloadmsg. */ void set.threshold(g_msg) GMSG g_msg; - c HLOADMSG hostloadmsg; HLMSG host_l_msg; int i; float avg; C tmp; host.l.msg = fehostloadmsg; memcpy((char *)host_l_msg,(char *)g_msg->body,sizeof(HLOADMSG)); if ( LB == GRR Ii LB == GML ) threshold = ( int) host_l_msg->load_avg; if ( LB == LRR II LB == LML ) { avg = 0 ; for ( tmp = cand.list; tmp != NULL; tmp = tmp->next ) avg += tmp->load; avg += host_l_msg->load_tab[my.node]; threshold = avg / ( DIM + 1); } if ( threshold == 0 ) threshold = 1; order_cand_list(host_l_msg); > / * order.cand_list() */ /* Order the candidate list as */ /* increasing ordered by nodes’ load. */ void order.cand.list(host.l.msg) HLMSG host.l.msg; C tmp; int i; if ( debug ) • { printf("before sorting\n"); print.cand.listO ; > for ( tmp = cand.list; tmp != NULL; tmp = tmp->next ) { i = tmp->node; tmp->load = host_l.msg-Moad.tab [i] ; > 274 sort_cand_list(); if ( debug ) { printf("after sorting\n"); print_cand_list(); > > /****************************************/ /* sort_cand_listO */ / * Sort the candidate list. */ /****************************************/ void sort_cand_list() { C tmp,tmpl; long node; float load; if ( cand_list->next != NULL ) - ( for ( tmp = cand_list; tmp != NULL; tmp = tmp->next) - ( for ( tmpl = tmp->next; tmpl != NULL; tmpl = tmpl->next ) • { if ( tmpl != NULL && tmp != tmpl ) { if ( tmp->load > tmpl->load ) { node = tmp->node; load = tmp->load; tmp->node = tmpl->node; tmp->load = tmpl->load; tmpl->node = node; tmpl->load = load; > } } > > > /* add_cand_list() */ /* Add a candidate to the end of the */ /* candidate list. */ void add_cand_list(cand) C cand; { C tmp; if ( cand.list == NULL ) { cand.list = cand; cand_list->last = cand; 275 } else { tmp = cand_list->last; tmp->next = cand; cand->last = tmp; cand_list->last = cand; } C .2 .8 D ecisio n M a k in g M o d u le / x t * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / /* decision.c */ /* This file contains functions for decision making*/ /* for load balancing. */ /* It includes: */ /* add_queue(), enqueue_ready(), */ /* enqueue_trans() */ /xt***************************************************/ #include "node.h" / X > * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / /* add_queue() */ /* Add a process to the ready queue or */ /* migration queue,according to the tag.*/ /**********>(<*****************************/ void add_queue(tag,proc) int tag; P proc; - c Q q,q_alloc() ; q = q_alloc(); q->process = proc; q->next = NULL; q->last = NULL; if ( proc->type == LOCAL ) { if ( tag == 0 && LB != N0_LB ) { if ( ready_length <= threshold + 1 ) enqueue_ready(q); else enqueue_trans(q); } else enqueue_ready(q); > else 276 e n q u eu e_read y(q ); > /* enqueue_ready() */ /* Put a queue entry to the end of the */ /* ready queue. */ /****************************************/ void enqueue_ready(q) Q q; • C Q tmp; if ( ready_q == NULL ) { ready_q = q; ready_q->last = q; > else - { tmp = ready_q->last; tmp->next = q; q->last = tmp; ready_q->last = q; > ready_length++; > /* enqueue.transO */ /* Put a queue entry at the end of the */ /* transfer queue. */ void enqueue_trans(q) Q q; { Q tmp; if C trans_q == NULL ) { trans_q = q; trans_q->last = q; > else { tmp = trans_q->last; tmp->next = q; q->last = tmp; trans_q->last = q; > trans_length++; > 277 C .2.9 R e c o rd in g S ta tistic s M o d u le j j * : * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / /* record.c */ /* This file contains functions for statistics recording */ /* and reporting results to the host. */ /***********************************************************/ # include "node.h" /* record_data() */ / * Record the statistics at the clock. * / / j * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / void record_data(clock) float clock; { RE record,tmp; record = re_alloc(); record->time = clock; record->l_in = l_in; record->wait = check_wait(); if ( debug ) send_record(record); if ( record_list == NULL ) record_list = record; else { for ( tmp = record_list; tmp->next != NULL; tmp =tmp->next ) t tmp->next = record; } > /* check_wait() */ /* Check to see how many processes in */ /* the ready have not started. */ int check_wait() • c int n; Q tmp; n = 0 ; if ( ready_q != NULL ) { for ( tmp = ready_q; tmp != NULL; tmp = tmp->next ) { if ( tmp->process->start == 0 ) n++; 278 > > if ( trans_q != NULL ) { for ( tmp = trans_q; tmp != NULL; tmp = tmp->next ) { if ( tmp->process->start == 0 ) n++; > } return(n); > /jit***************************************/ /* report() */ /* Get the statistics from the process */ /* list, calculate the results and */ /* report the final results to host. */ void report() FINMSG final_msg; FMSG final_msg_ptr; float wait_time; float response_time; float exe_time; float length; float t; float wait_proc; float load; PR tmp; RE tmpl; int n; n = 0; wait_time = 0 ; exe_time = 0 ; length = 0 ; response_time = 0 ; for ( tmp = proc_list; tmp != NULL; tmp = tmp->next ) { if ( debug ) print_proc(tmp->process); wait_time += tmp->process->wait; length += tmp->process->length; exe_time += tmp->process->depart - tmp->process->start; response_time += tmp->process->depart - tmp->process->arrival; n++; > final_msg_ptr = &final_msg; final_msg_ptr->node = my_node; final_msg_ptr->avg_wait_time = (wait_time / n ); 279 final_msg_ptr->avg_length = ( length / n ); final_msg_ptr->avg_exe_time = (exe_time / n ); final_msg_ptr->avg_response_time = (response.time / n ); final_msg_ptr->num_task = n; t = 0 ; wait_proc = 0; load = 0; for ( tmpl = record_list; tmpl != NULL; tmpl = tmpl->next ) { wait_proc += tmpl->wait * ( tmpl->time - t ); load += tmpl~>l_in * ( tmpl->time - t ); t = tmpl->time; > final_msg_ptr->total_time = t; final_msg_ptr->avg_wait_proc = (wait_proc / t ); final_msg_ptr->avg_load = ( load / t ); csend(FINAL_TYPE,&final_msg,sizeof(final_msg),host_node, H0ST_PID); C.2.10 Migration M odule /* migrate.c */ , /* This files contains the functions for migrating in/out */ /* load by processing transfer queue. */ /* It includes: */ /* proc_trans(), get_cand(), get_RR(), */ /* get_ML(), dequeue_trans(), migrateO , * / /* out_proc_list(). */ #include "node.hM /* proc.transO * / /* Process transfer queue, migrate a */ / * process to the node in the output */ / * migration port. */ void proc.transO { P proc; long trans.node; if ( ( trans.q != NULL ) && ( cand.list != NULL ) ) { if ( debug ) printf(" begin trans at 7,ld\n" , (mclock()-start.time) ) ; if ( debug ) 280 printf ("Node '/.Id: ready_length=*/.ld, trans _length='/.ld\n" , my_node, ready_length,trans_length); trans_node = get_cand(); if ( trans_node != BAD_NODE ) { proc = dequeue_trans(); if ( proc != NULL ) migrate(proc,trans_node); > if ( debug ) printf("end trans Node '/.Id:ready_len=*/,ld,trans_len=‘ /,ld\n" , my_node, ready_length,trans_length); if ( debug ) printf(" end trans at */.ld\n" , CmclockC) - start_time)) ; > } /* get_cand() */ /* Select a candidate as the destination*/ /* of process migration. */ long get_cand() { long node;- if ( LB == LRR 11 LB == GRR ) node = get_RR(); if ( LB == LML II LB == GML > node = get_ML(); return(node) ; > / * get_RR() */ / * Select a candidate based on the RR */ /* heuristics. */ long get_RR() { long node; C cand; if ( cand_list != NULL ) { node = cand_list->node; if ( cand_list->next != NULL ) { cand = cand_list; cand_list = cand->next; cand_list->last = NULL; cand->next = NULL; 281 cand->last = NULL; add_cand_list(cand); if ( debug ) print_cand_list(); } > else node = BAD_NQDE; return(node); > /jit***************************************/ /* get_ML() */ /* Select a candidate based on the min */ /* load heuristics. */ long get_ML() • c long n.node; float load; C tmp,tmp1; if ( cand_list != NULL ) { n = cand_list->node; cand_list->load++; for ( tmp=cand_list; tmp->next != NULL; tmp=tmp->next ) { tmpl = tmp->next; if ( tmp->load > tmpl->load ) { node = tmp->node; load = tmp->load; tmp->node = tmpl->node; tmp->load = tmpl->load; tmpl->node = node; tmpl->load = load; > } > else n = BAD.NODE; return(n); > /* migrate() */ / * Migrate a process by message passing */ /sic**************************************:*/ void migrate(proc,node) P proc; long node; 282 { GENMSG gmsg; GMSG g_msg; l_in--; g_msg = fegmsg; g_msg->typ© = MMSG_TYPE; g_msg->node = my_node; memcpy((char *)g_msg->body,(char *)proc, sizeof(PRQC)); csend(GMSG_TYPE,&gmsg,sizeof(gmsg).node,N0DE_PID); > /****************************************/ /* dequeue_trans() */ /* Dispatch an entry from transfer queue*/ /* and delete the process from the */ /* process list. */ P dequeue transQ { Q q; P proc; if ( trans_q != NULL ) { trans_length--; q = trans_q; trans_q = q->next; if ( trans_q != NULL ) trans_q->last = q->last; proc = q->process; proc->type = REMOTE; cfree(q); out_proc_list(proc); return(proc); > else return(NULL); > / * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / /* recv_proc() */ /* Receive a process from remote node, */ /* create an event for it and add it to */ /* process list. */ /****************************************/ void recv.proc(g_msg) GMSG g_msg; - c P proc; 283 if ( debug ) printfC" at node ‘ /.Id receive trans from node ‘ /.ld\n", my_node,g_msg->node); proc = proc_alloc(); memcpy((char *)proc,(char *)g_msg->body,sizeof(PROC)) ; timer = mclockO - start.time; add_e_list(timer,proc,ARRV); add_proc_list(proc); } /* out_proc_list() */ /* Delete a process from the process */ /* list. */ void out_proc_list(proc) P proc; PR tmp; for ( tmp = proc.list; tmp->next != MULL; tmp = tmp->next )-( if ( tmp->process == proc ) break; > if ( tmp == proc.list ) - ( proc.list = tmp->next; if ( proc.list != NULL ) proc_list->last = tmp->last; > else - f tmp->last->next = tmp->next; if ( tmp->next != NULL ) tmp->next->last = tmp->last; > cfree(tmp); > C .2 . 1 1 N o d e L ib ra ry M o d u le /* n o d e . l i b . c */ / * T h is f i l e c o n t a i n s t h e node l i b r a r y f u n c t i o n s : t h e p r i n t * / /* f u n c t i o n s and t h e dynamic a l l o c a t i o n f u n c t i o n s . * / /* The p r i n t f u n c t i o n s a r e u s e d o n l y f o r debug p u r p o s e . * / /* I t i n c l u d e s : * / / * p r i n t _ e _ l i s t ( ) , p r i n t _ e v e n t ( ) , p r i n t _ p r o c ( ) , * / 284 / * p r i n t _ p r o c _ l i s t ( ) , p r i n t _ q ( ) , p r i n t _ e n t r y ( ) . * / / * p r i n t _ c a n t _ l i s t ( ) , e v e n t . a l l o c O , p r o c . a l l o c O , * / / * q . a l l o c O , r e _ a l l o c ( ) , p r _ a l l o c ( ) , * / / * c _ a l l o c ( ) . * / # i n c l u d e "nod©.h,‘ / * p r i n t _ e _ l i s t ( ) */ I* Print the whole event list. */ v o i d p r i n t _ e _ l i s t () { E tmp; /* print_proc_list() */ /* Print the process list. */ void print_proc_list() { PR tmp; for ( tmp = proc_list; tmp!= NULL; trap = tmp->next ) 285 if ( event_list != NULL ) i printfC"Event_list of node */,d\n" ,my_node) ; printf ("------------------------------\n") ; for ( tmp = event_list; tmp != NULL; tmp = tmp->next ) {. print.event(tmp); > > else printf("Event.list of node * / , d is NULL\n" ,my_node) ; > /* print_event() */ /* Print an event. */ void print.event(e) E e; ■ c printfC" ** event ** \n"); printf("time = ' / , f \n" ,e->time) ; printf (“type = ‘ /,d\n" ,e->type) ; print_proc(e->process); > send_proc(tmp->process); > /ale****************************************/ /* print_proc() */ / * Print a process. */ /*****************************************/ void print_proc(p) P p; { if ( p != NULL ) { printf(" -- proc --\n"); printf ("id — */,d\n" ,p->id) ; printf ("length -- ' / . f \n" ,p->length) ; printf ("arrival -- ’ / , f \n" ,p->arrival) ; printf ("depart -- * / . f \n" ,p->depart) ; printf ("start -- V.f \n" ,p->start) ; printf ("wait -- 7,f \n" ,p->wait) ; printf ("remain -- ’ / . f \n" ,p->remain) ; > else printf("Proc is NULL\n"); } /*****************************************/ /* print_q() */ /* Print the ready queue or the transfer */ /* queue. */ /*****************************************/ void print_q(qname) Q qname; { Q tmp; if ( qname == NULL ) printfC Q is NULL\n") ; else { if ( qname == ready_q ) printf("Ready queue\n"); else printf("Trans queue \n"); printfC'-------------------\n") ; for ( tmp = qname; tmp != NULL; tmp = tmp->next ) { print_entry(tmp); > > > /*****************************************/ 286 /* print.entry() */ /* Print a queue entry. */ void print.entry(q) Q q; - E printf("entry— >\n"); print_proc(q->process); > /* print_cand_list() */ /* print the candidate list for debug * / /* purpose. */ /ate*******************************:********/ void print_cand_list() - C C tmp; printf ("clist at n'/,d=" ,my_node) ; for ( tmp = cand.list; tmp != NULL; tmp = tmp->next) { printf ("node ‘ /.d, load %±, " , tmp->node ,tmp->load) ; > printf("\n"); } /* event.alloc() */ /* Allocate space for an event. */ E event_alloc() { char *calloc(); return( (E) calloc(l,sizeof(EVENT) ) ); > /ale****************************************/ /* proc.allocO */ /* Allocate space for a process */ P proc.allocO { char *calloc(); return( (P) callocO, sizeof (PROC) ) ); 287 > /*****************************************/ /* q_alloc() */ /* Allocate space for a queue entry */ Q q_alloc() • c char *calloc(); return( (Q) calloc(l.sizeof(QUEUE) ) ); > /* re_alloc() * / /* Allocate space for recording entry */ RE re_alloc() { char *calloc(); return( (RE) calloc(l,sizeof(RECORDING) ) ); > j 1f.1f.1f. if. if, 'jf.iif, If. •* (.* {. If ip. if.-iflf.'if. j /* pr_alloc() */ /* Allocate space for a process record. */ PR pr_alloc() ■ c char *calloc(); return( (PR) calloc(l.sizeof(PROC_RE) ) ); } /* c_alloc() */ /* Allocate space for a candidate. */ j If. if-if lif. ■ * £ ■ * £ If. If.-jf.1f.1f. j c C_alloc() ■ c char *calloc(); return( (C) calloc(l.sizeof(CAND) ) ); > 288 C .2.12 N o d e D e c la ra tio n M o d u le /* node_decl.c */ #include <stdio.h> #include "node.h" PARMMSG host_parm_msg; / * parameter msg from host */ PMSG host_parm_ptr; /* pointer to host_parm_msg */ HLOADMSG hostloadrasg; / * host load message */ HLMSG host_l_msg; /* pointer to hostloadmsg */ NLOADMSG nodeloadmsg; / * node load message */ NLMSG node_l_msg; / * pointer to nodeloadmsg */ GENMSG gmsg; / * generic message * / GMSG g-msg; /* pointer to gmsg * / long num.nodes; /* total number of nodes */ long cpu_quantum; / * CPU quantum */ long scheduling; /* CPU scheduling method */ long max.time; maximum time */ long w; I * time window * / long stop; /* node done flag */ long num_proc; /* number of process * / long status; /* node status */ long ok_to_recv; /* ready to receive a msg * / long my_node; /* my node number */ long host_node; / * host node number */ long debug; /* high level debug flag */ long debug1; /* low level debug flag * / long done; /* system done flag */ long ready.length; /* length of ready queue */ long trans.length; / * length of transfer queue */ long threshold; /* threshold value */ long neighbors[MAX.NODE] /* neighborhood flag array long DIM; / * hypercube dimension */ long LB; /* load balancing method */ long flag; unsigned long start.time; / * start time */ double lambda; / * node arrival rate * / double mu; /* node service time */ float timer; / * time clock */ float old.timer; /* last recorded time */ float cpu.clock; /* CPU clock */ float cpu.idle; /* CPU idle time */ float last.arrival; /* last arrival time * / float l_in; /* local load */ E event.list; /* event list */ Q ready.q; /* ready queue * / 289 Q trans_q; /* transfer queue */ PR proc_list; / * process list */ RE record_list; /* recording list */ C cand_list; /* candidate list */ C new_list; /* new list */ 290 Appendix D The DLB Code D L B (Dynamic Load Balancer) is a prototype dynam ic load balancer imple m ented on a 32-node iP S C /2 hypercube m ulticom puter. It consists of a host program and a node program . T he distributed load balancers are im plem ented by identical node program s. T he construction of DLB is presented in C hapter 5. The code presented here is used to execute th e takQ and fib ( ) benchm ark program s. To execute the qksortQ program , some defined constants and p art of the code need to be modified sightly. D .l H ost P rogram D.1.1 Host Header M odule /*********************************************************/ /* host.h */ /* Host header file for DLB ( Dynamic Load Balancer ) */ #include <stdio.h> #define ALL_TYPES -1 /* Symbol for all message types */ #define HOST.PID 100 /* Process id of the host process */ #define NODE.PID 0 /* Process id of the node processes */ #define ALL.NODES -1 /* Symbol for "all nodes" in load call #define ALL_PIDS -1 /* Symbol for “all processes */ #define PARM_TYPE 1 /* type of parameter msg */ #define GMSG.TYPE 2 /* type of generic msg */ 291 #define FINA_TYPE 3 /* type of final msg * / #define DONE.TYPE 4 /* shut down */ #define RMSG^TYPE 5 /* return result */ #define NMSG.TYPE 6 /* node msg */ #define AMSG_TYPE 7 /* ask for load */ #define TMSG.TYPE 8 /* pass threshold */ #define LMSG.TYPE 9 /* get load from node */ #define MMSG.TYPE 10 /* migrate process */ #define MAX_N0DE 32 /* maximum number of nodes */ #define MAXARG 10 / * maximum number of arguments */ #define FUNCLEN 10 /* length of the function name */ #define MSGLEN 300 /* message length */ #define YES 1 #define NO 0 #define STOP 4 #define DONE 9 #define NO.LB 0 /* no load balancing */ #define LRR 1 /* local round robin */ #define GRR 2 /* global round robin */ #define LML 3 /* local minimum load */ #define GML 4 / * global minimum load * / typedef struct parm_msg ■ c / * parameter message */ long node; /* node id */ int num_ arg; /* number of arguments */ int args[MAXARG]; /* argument array */ long num_nodes; / * number of nodes */ long dim; / * hardware dimension */ long lb; /* load balancing method */ float alpha; /* normalized constant */ long debug; /* high level debug flag * / long debug1; /'* low level debug flag */ char funname[FUNCLEN]; /* function name */ > PARMSG, *PMSG; ;ypedef struct node_msg - { /* node message */ long node; /* node id */ long num_proc; /* number process waiting *, long time; / * finishing time */ } NODEMSG, *NMSG; typedef struct final_msg { long node; char funname[FUNCLEN]; int result; int num_proc; > FINMSG,*FMSG; /* final message from node */ /* node id *1 / * function name */ /* result value */ /* number of processes executed * / 292 typedef struct generic_msg { /* generic message */- long type; /* message type */ long node; /* source node */ char body[MSGLEN]; /* message body */ > GENMSG, *GMSG; typedef struct hostload_msg - [ /* load information */ float load_avg; /* system load average */ float load_tab[MAX_NODE] ; /* node load table */ short tag[MAX_N0DE]; /* tag to show if node ack */ > HLOADMSG, *HLMSG; typedef struct nodeload_msg { long node; float load; > NLOADMSG, *NLMSG; extern PARMSG host_parm_msg; /* host parameter message * / extern PMSG host_parm_ptr; /* pointer to host_parm_msg */ extern HLOADMSG hostloadmsg; / * system load information */ extern HLMSG host_l_msg; / * pointer to hostloadmsg * / extern NLOADMSG nodeloadmsg; /* node load message */ extern NLMSG node_l_msg; /* pointer to nodeloadmsg * / extern FINMSG fmsg; /* final report message */ extern FMSG f_msg; /* pointer to fmsg */ extern GENMSG gmsg; /* generic message * / extern GMSG g-msg; /* pointer to gmsg */ extern NODEMSG nodemsg; /* node status message */ extern NMSG n_msg; /* pointer to nodemsg */ extern long msg_size; / * message size */ extern long host_node; / * host node id */ extern long num_nodes; /* number of nodes */ extern long host_pid; /* host process id */ extern long done; / * one node done */ extern long done_nodes; /* all nodes done */ extern unsigned long clock; /* time clock */ extern unsigned long start_time;/* start time */ extern unsigned long old_clock; /* last record clock */ extern long debug; /* high level debug */ extern long debug1; /* low level debug */ extern long max_proc; /* max number of process */ extern int results G ; /* result array */ extern int finish_nodeG I /* finished node flag array */ extern int finish; /* finish flag */ extern float w; /* update load information window */ /* node load message */ /* node id */ /* node load */ 293 extern float old_w; /* extern float old_load_avg; /* extern double k; / * extern void input(); / * extern void init_host(); / * extern void send_parm(); /* extern void done_signal(); /* extern void final_result(); /* extern void recv_nodestat(); /* extern void check_msg(); / * extern void ask_load(); /* extern long power(); /* extern void record_load(); /* extern void print_node(); / * extern void send_loadinfo(); / * extern void set_tim_w(); / * extern float flt_abs(); / * extern void print_load(); / * old window length */ old system load avg */ adjustable time window constant */ input LB parameters */ host program initialization */ send parameters to nodes */ send shut-down message * / report the compute result */ receive node status */ check msg type and take action */ ask for node load index */ power of 2 */ record load distribution */ print node information */ broadcast load distribution */ adjust time window */ absolute value of float */ print load distribution */ D.1.2 Host Main M odule / * host.c * / /* This file contains functions of the host program. It */ /* does the main control of the load balancer. */ #include "host.h" /* mainC) */ /* The main inputs the parameters, */ /* initialized the system, and loads */ J * the node programs onto the nodes. */ /* It collects and sends load infor. */ / * and shuts down the node execution */ / * when knowing all results received. */ mainCargc,argv) int argc; char argv [] ; - C if ( argc < 6 ) printf("Inappropriate paramters!\n"); else { input(argc,argv); init_host() ; printf("Loading the cube with the node processes ...\n"); 294 loadC'node", ALL.NODES, NODE.PID); num_nodes = numnodesO ; printf ("Total '/ .Id nodes loaded\n",num_nodes); clock = mclockO - start_tirae; printfC" clock = ‘ /,ld\n", clock); send_parm(host_parm_ptr); clock = mclockQ - start_time; printf(" clock = XldXn", clock); while ( ! done ) { /* not all nodes finished the execution */ clock = mclockO ~ start_time; if ( (clock - old_clock) >= w ) { if ( debug1 ) printf (“clock at host = ‘ /.Id old_clock = '/,ld\n", clock,old_clock); ask_load(); old_clock = clock; > if ( iprobe(GMSG_TYPE) H crecv(GMSG_TYPE,&gmsg,sizeof(gmsg)); check_msg(g_msg); > > done_signal(); while ( ! finish ) ■ { /* not received all final reports */ if ( iprobe(NMSG_TYPE) ) recv_nodestat(); > final_result() ; killcube(ALL_NQDES,ALL_PIDS); > > /* input() */ /* Input the parameters for LB control:*/ /* dim: hypercube dimension */ /* lb: load balancing method */ /* w: initial time window */ /* k: window adaptive constant */ /* alpha:threshold normalize const. */ / * debug: high level debug flag */ /* debug1: low level debug flag */ /***************************************/ void input(arg c,argv) int argc; char *argv [] ; 295 ■ c double atof(); long powerO; host_parm_ptr = &host_parm_msg; host_parm_ptr->dim = atoi( argv[l] ); host_parm_ptr->num_nodes = power(host_parm_ptr->dim); host_parm_ptr->lb = atoi ( argv[2] ); w = atof( argv[3] ); k = atof ( argv[4] ); host_parm_ptr->alpha = atof ( argv[5] ); host_parm_ptr->debug = atoi ( argv[6] ); host_parm_ptr->debugl = atoi ( argv[7] ); debug = host_parm_ptr->debug; debugl = host_parm_ptr->debugl; printf("Load Balancing Experiments on Multicomputer\n"); printfC-------------------------------------------------\n") ; printf("Number of processor nodes = 7,d\n",host_parm_ptr->num_nodes); printf ("DIM = 7,d\n" ,host_parm_ptr->dim) ; printf ("Update threshold window w = " / , f sec\n",w); printf ("CIC update time window coefficient k = . 2f \n" ,k) ; printf ("Normalized constant alpha = . 2f \n" ,host_parm_ptr->alpha) ; printf("Load Sharing Policy is "); switch ( host_parm_ptr->lb ) • { case N0_LB: printf("No load sharing\n"); break; case LRR: printf("LRR migration\n"); break; case GRR: printf("GRR migration\n"); break; case LML: printf("LML migration\n"); break; case GML: printf("GML migration\n"); break; default: break; > / * powerO */ I* Power of 2. */ long power(dim) long dim; long i,n; i f ( dim == 0 ) 296 return (1); else { n = 2; for ( i = 1; i < dim; i++ ) n = n * 2; return(n); > > /* init_hostO */ /* Initialize the host, setting clock, */ /* process id, time window, etc. */ void init_host() int i; start_time = mclockO; clock = mclockO - start_time; setpid(HOST_PID); /* set the host pid */ host_pid = mypidO; /* get the process id for this process */ host_node = mynodeO; /* get the number of this node */ printf ("host_node = * / , d, host_pid = */,d\n", host_node, host_pid) ; w = w * 1 0 0 0; old_w = w; old_load_avg = -1; done = NO; done_nodes = 0 ; host_l_msg = fehostloadmsg; node_l_msg = fcnodeloadmsg; f_msg = fefmsg; g_msg = fegmsg; n_msg = fenodemsg; finish = 0 ; max.proc = 0 ; host_l_msg->load_avg = 0 ; for ( i = 0 ; i < mm.nodes; i++) { finish_node[i] =0 ; host_l_msg->load_tab[i] = 0; host_l_msg->tag[i] = NO; results[i] = 0 ; > / * final_result() */ /* Report the final user program exe. */ /* results and maximum number of */ /* process executed at a node. * / 297 /lie**************************************/ void final_result() - C int i , r; r = 0; for { i = 0 ; i < num_nodes; i++ ) { r += results [i]; > printf ("Result = */,d\n",r); printf ("Max '/. Id processes at a node.\n", max_proc) ; D.1.3 Host-M ailman M odule /* host-mail.c */ /* This file contains functions for the host asynchronous*/ /* communication. */ #include "host.h" /a***************************************/ /* recv_nodestat() */ /* Receive the node status. */ void recv_nodestat() ■ c int i; int ok; int num_finish; crecv(NMSG_TYPE,fenodemsg,sizeof(nodemsg)); printf ("Total '/ .I d processes executed at node '/ ,l d using time ’ /,ld\n", n_msg->num_proc,n_msg->node, n_msg->time); if ( n_msg->num_proc > max_proc ) max_proc = n_msg->num_proc; num_finish = 0 ; for ( i = 0; i < num_nodes; i++ ) { if ( n_msg->node == i ) { if ( finish_node[i] == 0 ) { finish_node[i] = YES; ok = YES; > else{ ok = NO; > 298 b rea k ; > > for ( i = 0 ; i < mun.nodes; i++ ) { if ( finish.node[i] == YES ) num_finish++; } if (num.finish == num.nodes ) finish = YES; } /j|t>i(sic>ic*j|t>isiicaicj(e>ie****************************/ /* check_msg() */ /* Check the message type and take the */ /* responding actions. */ / * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / void check_msg(g_msg) GMSG g_msg; { int i; switch C g_msg->type ) { case LMSG_TYPE: memcpy((char *)&nodeloadmsg, (char *)g_msg->body, sizeof(nodeloadmsg)); record_load(node_l_msg); break; case FINA.TYPE: memcpy((char *)&fmsg, (char *)g_msg->body, sizeof(fmsg)); i = f_msg->node; results[i] = f_msg->result; printf (“result = * / , d from node * /, ld at time ’ /.Id An", results[i],i,mclock()-start_time); done_nodes++; if ( done_nodes == num.nodes ) done = 1; break; default: break; > } / * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / /* done_signal() */ /* Send a signal to all nodes saying */ /* that all nodes have finished the */ /* user program and can shut down. */ / * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / 299 void done.signalO { int i; g_msg = fegmsg; g_msg->node = host_node; g_msg->type = DONE.TYPE; for ( i = 0 ; i < num.nodes; i++ ) { csend(GMSG_TYPE,&gmsg,sizeof(gmsg),i,N0DE_PID); > /* ask.loadO * / /* Send a message to ask for local load*/ /* from each node. */ /***************************************/ void ask.loadO -c int i; g_msg->node = host.node; g_msg->type = AMSG.TYPE; for ( i = 0 ; i < num.nodes; i++) { csend(GMSG_TYPE,&gmsg,sizeof(gmsg),i,N0DE_PID); > > /* send.paramO */ /* Send load balancing parameters. */ void send.parm(host.parm.ptr) PMSG host.parm.ptr; int i,j,k,n; FILE *fp,*fopen(); fp = fopen("p.dat"," r " ); f scanf (f p, "7.s" ,host_parm_ptr->f unname) ; printf ("Function ' / , s with " ,host_parm_ptr->funname) ; f scanf (f p, "7.d" ,&n) ; printf ("‘ /.d parameters\n" ,n) ; for ( i = 0; i < num.nodes; i++ ) { printf ("Node 7 , Id : ",i); host_parm_ptr->node = i; host_parm_ptr->num_arg = n; for ( j = 0 ; j < n; j++ ) { 300 f scanf (f p, "‘ /.d" ,fck) ; host_parm_ptr->args[j] = k; printf ("7,d " ,host_parm_ptr->args £j] ) ; > printf("\n"); csend(PARM_TYPE.host.parm.ptr.sizeof(host_parm_msg),i.NODE.PID); > fclose(fp); D.1.4 Host-load M odule / ♦ I * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / /* host-load.c */ /* This file contains functions for the host processor to */ /♦adjust the time window and update the load distribution. */ j **>(!)(!* I * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / #include "host.h" /***************************************/ /* receive_load() */ /* Receive the local load from a node. */ / j t : * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / void record.load(node_l_msg) NLMSG node_l_msg; { long i.j; int done; float load; i = node_l_msg->node; done = 0; load = 0; host_l_msg->tagCi] = YES; host_l_msg->load_tab[i] = node_l_msg->load; for ( j = 0; j< num.nodes; j++) { if ( host_l_msg->tag[j] == YES ) { load += host_l_msg->load_tabCj]; done++; } > if ( done == num.nodes ) { host_l_msg->load_avg = load / num.nodes; set_time_w(old_load_avg,host_l_msg->load_avg); old.load.avg = host_l_msg->load_avg; if ( debugl ) print.load(); send.loadinfo(host.l.msg); for (j =0; j < num.nodes; j++) { 301 host_l_msg->load_tab[j] = 0 ; host_l_msg->tag[j] = NO; > > > /***************************************/ /* send_loadinfo() */ /* Send load information to each node */ void send_loadinfo(host_l_msg) HLMSG host_l_msg; { long i; g_msg->type = TMSG_TYPE; g_msg->node = host_node; memcpy((chax *)g_msg->body,(char *)host_l_msg, sizeof(HLOADMSG)); for ( i = 0 ; i< num_nodes; i++) ■ { csend(GMSG_TYPE,fegmsg,sizeof(gmsg),i,N0DE_PID); > /He**************************************/ /* set_time_w() */ /* Adaptively adjust the time window */ /***************************************/ void set_time_w(old_avg,new_avg) float old_avg; float neu_avg; - c float x; if ( old_load_avg != -1 && k != 0 ) - ( if ( old_avg ! = new_avg ) - ( if ( old_avg > new_avg && old_avg != 0 ) x = flt_abs( old_avg - new_avg ) / old_avg; else x = flt_abs( new_avg - old_avg ) / new_avg; if ( x <= k ) w=(l-x)* old_w; else { if ( (i-k) < x < k ) w = (l-k) * old_w; else w = x * old_w; > > 302 else w = (1+k) * old_w; old_w = w; > > /* flt_abs() */ /* The absolute value of a float. */ float flt_abs(i) float i; return( ( i >= 0 ) ? i : (-i) ); > /* print_load() * / /* Print out the load distribution. */ void print_load() int i; printf ("load avg = * / , f \n" ,host_l_msg->load_avg) ; for ( i = 0 ; i < num.nodes; i++ ) - f printf ("load [*/,d]='/,f " , i ,host_l_msg->load_tab [i] ) ; > printf("\n"); D .1 .5 H o s t D e c la ra tio n M o d u le /* host.decl.c */ /* This file contains the declarations of global variables. */ #include <stdio.h> #include "host.h" PARMSG PMSG HLOADMSG HLMSG NLOADMSG WLMSG FINMSG host_parm_msg; host_parm_ptr; hostloadmsg; host_l_msg; nodeloadmsg; node_l_msg; fmsg; /* host parameter message */ / * pointer to host_parm_msg */ /* system load information */ /* pointer to hostloadmsg * / / * node load message */ /* pointer to nodeloadmsg */ /* final report message */ 303 FMSG f_msg; / * pointer to fmsg * / GENMSG gmsg; I * generic message */ GMSG g-msg; / * pointer to gmsg * / NODEMSG nodemsg; /* node status message */ NMSG n_msg; /* pointer to nodemsg */ long msg_size; /* message size */ long host_node; / * host node id * / long num_nodes; / * number of nodes * / long host_pid; / * host process id */ long max_proc; /* max number of process at node*/ long done; f* one node done */ long done_nodes; /* all nodes done * / unsigned long clock; /* time clock */ unsigned long start_time; /* start time */ unsigned long old„clock; I * last record clock */ long debug; / * high level debug */ long debugl; /* low level debug */ int results[MAX_N0DE] ; /* result array */ int finish.node[MAX.NODE] /* finished node flag array */ int finish; /* finish flag */ float w; /* update load information window */ float old_w; /* old window length */ float old_load_avg; /* old system load avg */ double k; /* adjustable time window constant */ D . 2 N o d e P r o g r a m D .2 .1 N o d e H e a d e r M o d u le / * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / / * Node.h */ /* Node Header File for DLB ( Dynamic Load Balancing ) */ #include <stdio,h> /* define constants */ #define ALL_TYPES -i /* Symbol for all message types */ /* Process id of the host process */ /* Process id of the node processes */ /* Symbol for "all nodes" in load call */ /* Symbol for "all processes */ #define #d efin e #d efin e #define H0ST_PID 100 NODE.PID 0 ALL.NODES -1 ALL_PIDS -1 304 #define PARM.TYPE 1 /* Parameter message type */ ftdefine GMSG_TYPE 2 /* Generic message type * / #define FINA_TYPE 3 /* Final message type */ #define DONE.TYPE 4 /* shut down */ #define RMSG.TYPE 5 /* result return */ #define NMSG.TYPE 6 /* node status */ #define AMSG_TYPE 7 /* ask for load */ #define TMSG_TYPE 8 /* pass threshold */ #define LMSG_TYPE 9 /* get load from node */ #define MMSG_TYPE 10 / * migrate pcb */ #define MAX.NODE 32 /* maximum number of nodes */ #define MAXARG 10 /* maximum number of arguments */ #define FUNCLEN 10 /* function name length */ #define MSGLEN 300 /* message length */ #define MAXDATALEN 256 /* max length of each data argument #define IDLE 1 /* node status idle */ #define BUSY 2 /* node status busy */ #define NORMAL 3 /* node status normal */ #define BADNODE -1 /* bad candidate */ #define YES 1 #define NO 0 #define NONE -999999 /* dummy function return */ /* load balancing methods */ #define N0_LB 0 /* no load balancing */ #define LRR 1 /* local round robin */ #define GRR 2 /* global round robin */ #define LML 3 I * local minimum load * / #define GML 4 / * global minimum load */ /* process status */ #define READY 1 #define SUSPEND 2 /* data structures */ typedef struct parm_msg - C /* parameter message */ long node; /* node id */ int num_arg; /* number of arguments */ int args[MAXARG]; /* argument array */ long num.nodes; /* number of nodes */ long dim; /* hardware dimension */ long lb; /* load balancing method */ float alpha; /* normalized constant */ long debug; /* high level debug flag */ long debugl; /* low level debug flag */ 305 char funname[FUNCLEN]; /* } PARMSG, *PMSG; typedef struct node_msg { /* long node; /* long num_proc; /* long time; /* > NODEMSG, *NMSG; typedef struct final_msg ■ [ /* long node; /* char funname[FUNCLEN]; /* int result; /* int num_proc; /* } FINMSG,*FMSG; typedef struct generic_msg { /* long type; /* long node; /* char body[MSGLEN]; /* > GENMSG, *GMSG; typedef struct data_element {. / * int type; /* int length; /* char value[MAXDATALEN]; /* } DE, *DEP; typedef struct data - [ /* int num_arg; /* int arg_avail; /* DE arg_array[MAXARG]; /* } DS, *DSP; typedef struct hostload_msg { /* float load_avg; / * float load.tab[MAX.NODE]; / short tag[MAX.NODE]; /* > HLOADMSG, *HLMSG; typedef struct nodeload.msg { / * long node; / * float load; /* > NLOADMSG, *NLMSG; typedef struct candidate {. / * long node; /* float load; /* struct candidate *next,*last; > CAND, *C; fu n c tio n name */ node message */ node id */ number process waiting */ finishing time */ final message from node */ node id */ function name */ result value */ number of processes executed */ generic message */ message type */ source node */ message body */ data element */ data type */ length of the data */ space for store data value */ data of the PCB */ number of arguments */ number of available arguments */ argument array */ load information */ system load average */ 1 node load table */ tag to show if node ack */ node load message */ node id */ node load */ candidate of migration*/ node number */ local load */ /* pointer in the list */ 306 typedef struct pcblck { /* process control block */ int id; /* process id */ long . node; /* node id */ int state; /* process state */ long parent_node; /* parent node */ int parent.id; /* parent process id */ int port.id; /* port id */ int arg.count; /* argument counter */ char funname[FUNCLEN]; /* function name */ int (*func)(); /* code address */ int num.arg; /* number of arguments */ int args[MAXARG]; > PCBLCK, *PCB; /* argument array */ typedef struct queue { /* queue entry */ PCB pcb; struct queue ~*next; struct queue *last; > QUEUE, *Q; /* pointer to a PCB * / typedef struct result ■ { /* return of a process */ long node; / * source node */ long parent.node;. /* parent node */ int parent.id; /* parent process id */ int port_id; /* port to return */ int value; } RESULT, *R; / * actual value */ /****************************************/ /* global variables */ / ****************************************/ extern PARMSG host.parm.msg; /* host parameter message */ extern PMSG host.parm.ptr; / * pointer to host.parm.msg */ extern HLOADMSG hostloadmsg; /* host load info message */ extern HLMSG host.l.msg; /* pointer to hostloadmsg */ extern NLOADMSG nodeloadmsg; /* local load message */ extern NLMSG node.l.msg; /* pointer to nodeloadmsg */ extern GENMSG gmsg; /* generic message */ extern GMSG g.msg; /* pointer to gmsg */ extern N 0 DEMSG nodemsg; /* node status message */ extern NMSG n.msg; /* pointer to nodemsg */ extern long msg.size; /* length of message * / extern long host.node; /* host node id */ extern long host.pid; /* host process id */ extern long my.node; /* local node id */ extern long num.nodes; /* number of nodes in system */ extern long debug; /* high level debug */ extern long debugl; /* low level debug */ extern char f unname [] ; /* function name */ 307 extern int num.arg; * number of arguments */ extern int args [] ; * argument array * / extern int num.pid; * number process id's given */ extern long DIM; * hypercube dimension */ extern Q ready.q; * ready queue */ extern Q trans.q; * migration queue */ extern Q suspend.q; * suspend queue */ extern PCB current_pcb; * current PCB */ extern PCB root.pcb; * root PCB */ extern C cand.list; * candidate list */ extern C new.list; * new list */ extern int ready.length; * length of ready queue */ extern int trans.length; * length of migration queue */ extern int susp.length; * length of suspend queue */ extern int done; * shut down flag */ extern int ok.to.recv; * ready to receive load */ extern unsigned long start.time; * starting clock */ extern unsigned long old.timer; * last recorded clock */ extern unsigned long timer; * current clock */ extern long LB; * load balancing method * / extern long threshold; * load threshold */ extern int status; * load status */ extern float l.in; * local load */ extern long neighbors [] ; * neighboring nodes */ extern float alpha; * threshold normalize cons. */ extern int final.result; * user program result */ /* function returns * / /****************************************/ extern void node.initO ; / * initialize the node */ extern void make.run(); /* start executing user program */ extern void receive_host_parm();/* receive LB params. from host */ extern void init.kernel(); /* initialize the kernel */ extern void init.lbO; /* initialize load balancer */ extern void create_cand_list() ; / * create candidate list */ extern void report(); /* send result to host */ extern void kernel(); /* the control kernel */ extern void executeQ ; /* execute a process */ extern R make.result(); /* make result structure */ extern void awake(); /* awake a suspended process */ extern void send.resultO ; /* send results to remote node */ extern void kill.kernelO ; /* shut down */ extern void send_load(); /* send local load to host * / extern void get.msgO ; /* receive a generic message */ 308 extern void check_msg(); /* extern void recv_result(); /* extern void set_threshold(); /* extern void order_cand_list(); /* extern void sort_cand_list(); / * extern void print_cand_list(); /* extern void run(); / * extern void decision_make(); / * extern Q search.susO ; /* extern Q out_suspend(); /* extern void suspend(); /* extern void enqueue_ready(); /* extern void enqueue_suspend(); /* extern void enqueue_trans(); /* extern Q dequeue_ready(); /* extern Q dequeue_suspend(); /* extern Q dequeue_trans(); /* extern int tak() ; /* extern int fib(); /* extern int factO ; / * extern int nq() ; /* extern int timesC); /* extern int plus C); / * extern void trans(); /* extern long find_cand(); /* extern long get_RR(); /* extern long get_ML(); / * extern void migrate(); / * extern void add_cand_list(); / * extern PCB get_pcb() ; / * extern Q get_q(); /* extern C get_cand(); /* check message type and act */ receive result from remote node */ update threshold */ order candidate list */ sort candidate list */ print candidate list */ create a pcb, put it in queue */ enqueue decision making */ search suspend queue */ delete from suspend queue */ suspend current process */ put into ready queue */ put into suspend queue */ put into migrate queue */ take out from ready queue */ take out from suspend queue */ take out from migrate queue */ tak function */ fibonaicc function */ factorial function */ n_qeen function */ multiplication function */ addition function */ transfer load */ find candidate */ get candidate using RR */ get candidate using min load */ migrate a process */ insert a node into cand_list */ get a free pcb */ get a free queue entry */ get a free candidate entry */ D.2.2 Node Main M odule /************************************************************/ /* node.c */ /* This file contains functions for the main of node program*/ #include "nod©.h" /* mainQ */ /* The main initializes the node load */ /* balancer and start to run the user */ /* program. */ 309 main() • c node_init(); make_run(); > /* node_init() */ /* Initialize the global variables and */ /* the kernel and load balancer. * / void node_init() ■ c my_node = mynodeC); host_node = myhostO; receive_host_parm(); host_l_msg = fehostloadmsg; node_l_msg = fenodeloadmsg; n_msg = fenodemsg; g_msg = &gmsg; init_kernel(); init_lb(); f = 0 ; > /a****************************************/ /* receive_host_parm() */ /* Receive the load balancer parameters */ /* from the host. */ void receive_host_parm() ■ c PARMSG host_parm_msg; PMSG host_parm_ptr = &host_parm_msg; int i; crecv(PARM_TYPE,&host_parm_msg,sizeof(host_parm_msg)); DIM = host_parm_ptr->dim; num_arg = host_parm_ptr->num_arg; num.nodes = host_parm_ptr->num_nodes; debug = host_parm_ptr->debug; debugl = host_parm_ptr->debugl; LB = host_parm_ptr->lb; for(i = 0;i< num_arg; i++ ){ args[i] = host_parm_ptr->args[i]; } strcpy(funname,host_parm_ptr->funname); > 310 /****************************************/ /* init_kernel() */ /* Initialize the node kernel. */ /****************************************/ void init_kernel() • c ready_q = NULL ; suspend.q = NULL; trans_q = NULL; num.pid = 0 ; current_pcb = NULL; > /sit***************************************/ /* init_lb() */ /* Initialize the load balancer. */ /****************************************/ void init_lb() ■ c start_time = mclockO; old_timer = 0 ; l_in = 0 ; ready.length = 0 ; trans.length = 0 ; status = IDLE; threshold = 3; done = NO; new.list = NULL; create_cand_list(); y /****************************************/ /* create_cand_list() */ /* Create the candidate list. */ /*******************************♦********/ void create_cand_list() - c int i,j,x,y,hop,node; C cand,get_Ccind(); if ( debug1 ) printf (“DIM = '/.dW .DIM) ; if ( L B = = N0_LB ) - [ cand.list = NULL; > else { node = my.node; fox ( j = 0 ; j < num.nodes; j++ ) { 311 hop = 0 ; x = j node; for ( i = 0; i< DIM; i++ ) y = x >> (i); if ( ( y f t i > == 1 ) hop++; } neighbors [j] = hop; if ( debug1 ) printf ("neighbor [’ / .d ] = */,d\n" ,j .neighbors [j] ) ; > for ( j = 0 ; j< num_nodes; j++ ){ if ( LB == LRR II LB == LML ) { if (neighbors[j] == 1 ) { cand = get_cand(); cand->node = j; cand->load = 0; cand->next = NULL; cand->last = NULL; add_cand_list(cand); > > if ( LB == GRR II LB == GML ) { if ( j ! = my_node ) {. cand = get_cand(); cand->node = j; cand->load = 0 ; cand->next = NULL; cand->last = NULL; add_cand_list(ccind) ; > > } > if ( debug1 ) print_cand_list(); > /ate***************************************/ /* report() */ /* Send the user program result to the */ /* host. */ /****************************************/ void report(x) int x; FINMSG finalmsg; FMSG f_msg = fefinalmsg; f_msg->node = my_node; 312 f _ m s g - > r e s u l t = x; g_ m sg -> ty p e = FINA_TYPE; memcpy( ( ch ar * )g _ m sg -> b o d y , C ch ar * ) & f i n a l m s g , s i z e o f ( f i n a l m s g ) ); csend(GMSG_TYPE,&gmsg, s i z e o f ( g m s g ) , h o s t _ n o d e , H 0 S T _ P I D ) ; > D.2.3 Kernel M odule /* kernel.c */ /* This file contains the functions as the control of the */ /* dynamic load balancer. */ Z***********************************************************/ #include "node.h" /* kernelQ */ /* Node load balancer kernel. */ / * It initializes the root process, */ /* iteratively dispatches the processes,*/ /* receives messages, performs the load */ /* balancing, until receiving the done */ /* signal from the host. */ void kernel() root_pcb = ready_q->pcb; while ( ! done ) - [ get_msg(); execute(); if (LB != N0_LB ) transO ; } > / * e x e c u t e O * / / * D is p a t c h a p r o c e s s from t h e r ea d y * / / * queue and e x e c u t e i t . Then d e c i d e * / / * where t o r e t u r n t h e r e s u l t . * / v o i d e x e c u t e O { Q q , d e q u e u e _ r e a d y ( ) ; PCB p c b ; i n t r ; R r e s u l t ; RESULT r e s ; 313 i n t ( * f u n c ) ( ) ; l o n g n o d e ; i f ( r ea d y _ q != NULL ) -[ q = d e q u e u e _ r e a d y ( ) ; pcb = q->pcb; c f r e e ( q ) ; c u r r e n t_ p c b = pcb; fu n c = p c b -> fu n c ; s w i t c h ( pcb->num_axg ) ■ { c a s e 3: r = f u n c ( p c b - > a r g s [ 0 ] , p c b - > a r g s [ l ] , p c b - > a r g s [ 2 ] ) ; b r e a k ; c a s e 2: r = f u n c ( p c b - > a r g s [ 0 ] , p c b - > a r g s [ i ] ); b r e a k ; c a s e 1: r = f u n c ( p c b - > a r g s [ 0 ] ) ; b r e a k ; c a s e 0: r = f uncC); b r e a k ; d e f a u l t : if C r != NONE ) { if ( pcb->id == 0 && pcb->node == my_node ) { final_result = r; report(r); } else • { result = feres; node = pcb->node; result = make_result(r,pcb,result); if ( node == my_node ) break; > awake(result); else send_result(result); > } > > make_result() /* Make a data structure of the result */ / * from a process election. R make_result(r,pcb.result) int r; 314 PCB pcb; R result; ■ c result->node = pcb->node; result->parent_node = pcb->parent_node; result->pareut_id = pcb->parent_id; result->port_id = pcb->port_id; result->value = r; cfree(pcb); return(result); > / * awake() */ /* Awake a process in the suspend queue * / /* after receiving a result. */ void awake(result) R result; { PCB pcb; Q sq; int i; sq = search.sus(result->parent_id,result->parent_node); if ( sq == NULL t& debug ) printf("sq not found\n"); pcb = sq->pcb; pcb->arg_count~-; i = result->port_id; pcb->argsCi] = result->value; if ( pcb->arg_ count == 0 ) • { if ( debugl) for ( i = 0 ; i < num_arg; i++ ) printf("sus arg * / . d = 7,d\n" , i ,pcb->args [i] ) ; sq = out_suspend(sq); pcb->state = READY; enqueue_ready(sq); ■ } > ! # # # # * * * # f * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / /* send.result() */ / * Send a result to the remote node. */ void send_result(result) R result; { 315 g_msg->node = my_node; g_msg->type = RMSG_TYPE; m©mcpy( (char * )g_msg->body,(char *)result,sizeof(RESULT)); csend(GMSG_TYPE,&gmsg,sizeof(gmsg),result->node,N0DE_PID); /* kill_kernel() */ /* Shut down the kernel and send node */ /* message to host. */ void kill_kernel() - £ done = YES; n_msg->node = my_node; n_msg->num_proc = num_pid; n_msg->time = mclockO - start_time; csend(NMSG_TYPE,&nodemsg,sizeof(nodemsg),host_node,H0ST_PID); /* send^loadO */ /* Send local load to the host. */ void send_load() - c if ( debugl ) printf("send load at node */,ld,l_in='/.f ready'/,ld,trans 7,ld\n“, my_node,l_in,ready_length,trans_length); node_l_msg->node = my_node; node_l_msg->load = l_in; g_msg->type = LMSG.TYPE; g_msg->node = my_node; memcpy((char *)g_msg->body,(char *)&nodeloadmsg, sizeof(nodeloadmsg)); csend(GMSG_TYPE,&gmsg,sizeof(gmsg),host_node,HOST_PID); /* get_msg() */ /* Check to see if there is a message */ /* arriving. If so, receive it. */ void gat_msg() { if ( iprobe(GMSG_TYPE) ) { crecv(GMSG_TYPE,fegmsg,sizeof(gmsg)); 316 check_msg(g_msg); > > /****************************************/ /* check_msg() */ /* Check message type and take actions */ /* as response. */ void check_msg(g_msg) GMSG g_msg; { switch ( g_msg->type ) { case AMSG_TYPE: send_load(); break; case TMSG.TYPE: if (LB != N0_LB ) set_threshold(g_msg); break; case DONE.TYPE: if ( debugl ) printf ("receive final type at */,d\n" ,my_node) ; kill_kernel(); break; case MMSG.TYPE: recv_pcb(g_msg); break; case RMSG_TYPE: recv_result(g_msg); default: break; > > /* recv.result() */ /* Receive a result from remote node, */ /* and used it to awake a suspend proc. */ / * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / void recv_result(g_msg) GMSG g_msg; { RESULT res; R result = &res; if ( debugl ) printf (" receive result from n'/.ld at n7.1d\n" ,g_msg->node,my_node) ; memcpy( (char *) result, (char * )g_msg->body, sizeof(RESULT) ); 317 if ( debugl ) printf (" result : parent node ' /.Id, parent.id ‘ /.Id value '/.d\n", result->parent_node,result->parent_id, result->value); awake(result); > /* set_threshold() */ /* Set the threshold and update the */ /* candidate list or load table. */ void set.threshold(g_msg) GMSG g_msg; ■ c int i; float avg; C tmp; memcpy((char *)host_l_msg,(char *)g_msg->body.sizeof(HLOADMSG)); if ( LB == GRR II LB == GML ) threshold = ( 1 + alpha) * host_l_msg->load_avg; if ( LB == LRR II LB == LML ) { avg = 0 ; for ( tmp = cand_list; tmp != NULL; tmp = tmp->next ) avg += tmp->load; avg += host_l_msg->load_tabCmy_nodej; threshold = (1+ alpha) * (avg / ( DIM + 1) ); > if ( threshold == 0 ) threshold = 5; if ( debugl ) printf ("threshold at node */ ,l d is */,ld\n" ,my_node .threshold) ; order_cand_list(host_l_msg); > /****************************************/ /* order_cand_list() */ /♦Order the candidate list in increasing*/ /*of loads. */ void order_cand_list(host_l_msg) HLMSG host_l_msg; { C tmp; int i; if ( debugl ) • { printf("before sorting\n"); print_cand_list(); 318 > for ( tmp = cand_list; tmp != NULL; tmp = tmp->next ) { i = tmp->node; tmp->load = host_l_msg->load_tab[i]; > sort_cand_list(); if ( debug ) { print_cand_list C); > > /* sort_cand_list() */ / * Sort the candidate list. */ /I****************************************/ void sort_cand_list() { C tmp,tmpl; long node; float load; if ( cand_list->next != NULL ) { for ( tmp = cand_list; tmp != NULL; tmp = tmp->next) { for ( tmpl = tmp->next; tmpl != NULL; tmpl = tmpl->next ) { if ( tmpl != NULL kk tmp != tmpl ) { if C tmp->load > tmpl->load ) { node = tmp->node; load = tmp->load; tmp->node = tmpl->node; tmp->load = tmpl->load; tmpl->node = node; tmpl->load = load; > > > > > > /****************************************/ /* print_cand_list() * / / * Print the candidate list for debug. */ /St:*************************’ !'’ !'’ !'’ !'**********/ void print_cand_list() { C tmp; printf ("clist at n7.d=" ,my_node) ; 319 for ( tmp = cand_list; tmp != NULL; tmp = tmp->next) { printf ("node * / . d , load ' / . f ,tmp->node, tmp->load) ; } printf("\n"); > D .2 .4 R u n M o d u le /* runO */ / * This file contains functions to run a process block. */ #include "node.h" /* make_run() */ /* Run the user program, create the */ /* root process and start the kernel. */ void make_run() ■c int takO; int fib(); int *qksort(); if ( strcmp(funname,"tak") == 0 ) run (tak, 0 ,3, args [0] ,args[l] ,args [ 2] ) ; if ( strcmp(funname,"fib") == 0 ) run(fib,0 , 1,args[ 0] ); if ( strcmpCfunname,"qksort") == 0 ) run(qksort,0 ,1,args[0] ); kernel(); return; > /* run() * / / * Create a process control block and */ / * put it into the ready queue or the */ / * migration queue. */ void run(func,port_id,num_args,argl,arg2,arg3) int (*func)(); int port_id; int num_args; int argl; int arg2 ; int arg3; 320 { short i ; PCB pcb,get_pcb(); Q get_q(),q; pcb = get_pcb(); if ( pcb != NULL ) { pcb->id = num_pid; pcb->node = my_node; pcb->state = READY; num_pid++; pcb->port_id = port_id; if ( pcb->id == 0 ) { pcb->parent_id = 0 ; pcb->parent_node = my_node; > else • { pcb->parent_id = current_pcb->id; pcb->parent_node = current_pcb->node; > pcb->func = func; pcb->num_arg = num_args; pcb->args[0 ] = argi; pcb->args[l] - arg2 ; pcb->args[2] = arg3; q = get_q(); if ( q != NULL ) { q->pcb = pcb; decision_make(q) ; > > > /***************************************/ /* decision_make() */ /* I f load balancing applied, decide */ /* whether a process should be put in */ /* the ready queue or migrate queue. */ /***************************************/ void decision_make(q) Q q; ■ C i f ( LB == N0_LB ) enqueue_ready(q); else { i f ( ready_length < threshold ) enqueue_ready(q); else enqueue_trans(q); > 321 > D .2.5 S u sp e n d M o d u le /* susp.c */ /* This file contains functions which deal with the suspend*/ /* queue. */ #include "node.h" /* Search_sus() */ /* Search the suspend queue to find the*/ /* process with the desired process id */ /* node id. */ q search_sus(pid,node) int pid; long node; Q q; int found; found = 0 ; if ( debugl ) printf (" search at node ‘ /.Id,pid %d node 7,ld\n" ,my_node ,pid,node) ; for ( q = suspend_q; q != NULL; q = q->next ){ if C debugl ) printf ("C’ /.d, */.ld) " ,q->pcb->id,q->pcb->node) ; if ( q->pcb->id == pid && q->pcb->node == node ) - [ f ound = 1; break; > } if ( debugl ) printf("\n"); if ( found == 1 ) return(q); else return(NULL); > /* out_suspend() */ /* Take an entry from the suspend queue.*/ / a t e * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * : * : / P out_suspend(q) 322 Q q; { if ( q == suspend_q ) ■ [ suspend_q = q->next; if ( suspend_q != NULL ) suspend_q->last = q->last; > else { q->last->next = q->next; if ( q->next != NULL ) q->next->last = q->last; > q->last = NULL; q->next = NULL; return, (q) ; > /lie***************************************/ /* suspend() */ /* Suspend the current process and */ /* create child processes of the current*/ /* process. */ void suspend(func,num_args,num_avail,argl,arg2,arg3) int (*func)(); int num_args; int num_avail; int argl; int arg2 ; int arg3; { PCB pcb; Q get_q(),q; if C debugl ) printf (" num_arg = ’ /,d, count = ‘ /.d.argl = V.d, arg2=‘ /,d, arg3= 7,d\n", num.args, num.avail, argl, arg2, arg3); pcb = current_pcb; pcb->func = func; pcb->num_arg = num_args; pcb->arg_count = num_args - num_avail; pcb->args[0] = argl; pcb->args[l] = arg2 ; pcb->args[2] = arg3; q = get_q(); if ( q != NULL ) { q->pcb = pcb; if ( pcb->arg_count == 0 ) { pcb->state = READY; 323 enqueue_ready(q); > else { pcb->state = SUSPEND; enqueue_suspend(q); " L J > else kill_kernel(); > D . 2 . 6 L o ad T ra n s fe r M o d u le /* trans.c * / /* This file contains functions to transfer load to the */ /* remote using the heuristic load balancing method. */ #include "node.h" /* transQ */ /* Transfer by migrating a process from */ /* the migration queue to a candidate */ / * node chosen. */ void transC) s L long trans_node; PCB pcb; Q q; if ( ( trans_q != NULL ) && ( cand.list != NULL ) ) { if ( debug! ) printf ("Node “ /.Id: ready_length = '/ .Id , trans_length = '/,ld\n", my_node, ready_length,trans_length); trans_node = find_cand(); if ( debugl ) printf ("trans node = */,ld\n" ,trans_node) ; if ( trans_node != B1DN0DE ) {. q = dequeue_trans(); if ( q != NULL ) { pcb = q->pcb; cfree(q); migrate(pcb,trans_node); > } else printf("trans node not found\n"); 324 if ( debugl ) printf ("trans end'/,Id: ready_length='/,ld, trans_length=/.ld\n" , my node, ready_length,trans_length); > > / I * * * * * * * * * * * * * * * * * * ! ! ' * * * ’! ' * * * * ’) ' * * * * * * * * * ’! ' * * / /* find_cand() */ / * Find a good candidate node for load */ /* transfer. */ long find_cand() •c long node; if ( LB == LRR II LB == GRR ) node = get_RR(); if ( LB == LML II LB == GLM ) node = get_ML(); return(node); > /* get_RR * / / * Get a candidate using the Round Robin*/ /* heuristics. */ long get_RR() { long node; C cand; if ( cand_list != MULL ) { node = cand_list->node; if ( cand_list->next != NULL ) ■ { cand = cand_list; cand_list = cand->next; cand_list->last = NULL; cand->next = NULL; cand->last = NULL; add_cand_list(cand); } > else node = BADNODE; return(node); > 325 /* get_ML() */ /* Get a candidate by using the minimum */ /* load heuristics. */ long get_ML() ■ c long n.node; float load; C tmp,tmpl; if ( cand.list != NULL ) { n = cand_list->node; cand_list->load++; for ( tmp = cand_list; tmp->next != NULL; tmp = tmp->next ) • { tmpl = tmp->next; if ( tmp->load > tmpl->load ) { node = tmp->node; load = tmp->load; tmp->node = tmpl->node; tmp->load = tmpl->load; tmpl->node = node; tmpl->load = load; > > > else n = BADNODE; return(n); > /He***************************************/ /* migrate() */ /* Migrate a process to a selected node */ /****************************************/ void migrate(pcb,node) PCB pcb; long node; ■ c l' ' if ( debugl ) printf ("send pcb from * / 0 ld to 7.1d\n" ,my_node .node) ; g_msg->type = MMSG_TYPE; g_msg->node = my_node; memcpy((char *)g_msg->body,(char *)pcb, sizeof(PCBLCK)); cfree(pcb); csend(GMSG_TYPE,&gmsg,sizeof(gmsg),node,N0DE_PID); > 326 /* recv_pcb() */ /* Receive a process from remote node * / /* and put it in the ready queue. */ /****************************************/ void recv_pcb(g_msg) GMSG g_msg; ■ c PCB pcb,get_pcb(); Q q,get_q(); int tak ( ) ; int fib(); int *qksort(); if ( debug! ) printf (" at node 7 , Id receive trans from node 7,ld\n", my_node,g_msg->node); pcb = get_pcb(); if (pcb != NULL ) { memcpy((char *)pcb,(char *)g_msg->body,sizeof(PCBLCK)); if ( strcmp(pcb->funname,"tak") == 0 ) pcb->func = tak; if ( strcmp(pcb->funname,"fib") == 0 ) pcb->func = fib; if ( strcmp(pcb->funname,"qksort") == 0 ) pcb->func = qksort; q = get_q(); if ( q != NULL ) { q->pcb = pcb; enqueue ready(q); > > > /a;***************************************/ /* add_cand_list() */ /* Add a candidate to the candidate list*/ void add_cand_list(cand) C cand; C tmp; if ( cand_list == NULL ) { cand_list = cand; cand_list->last = cand; > else • { tmp = cand_list->last; tmp->next = cand; 327 can d -> last = tmp; c a n d _ lis t-> la s t = cand; > > D .2 .7 Q u e u e H a n d le r M o d u le / * queue. c */ /* This f i l e co n tain s fu n c tio n s which m anipulate queues. */ #include "node.h" I* enqueue_ready() */ /* Put an e n try in to ready queue. */ /****************************************/ void enqueue_ready(q) Q q; - c Q tmp; ready_length++; l_ in + +; i f ( ready_q == NULL ) • { ready_q = q; read y _ q -> last = q; } e ls e { /* FIFO */ tmp = re a d y _ q -> la st; tmp->next = q; q -> la s t = tmp; read y _ q -> last = q; /* stac k tmp = ready_q; q->next = tmp; tm p-> last = q; ready_q = q; */ > } /****************************************/ /* enqueue.suspend() */ /* Put an en try in to the suspend queue. */ /********-********************************/ void enqueue _ suspend(q) 328 Q q; { Q "tmp; susp_length++; i f C debugl ) p r in tf ("susp len g th = '/.d\n" , susp_length) ; i f ( suspend_q == NULL ) -[ suspend_q = q; su sp en d -q -> last = q; > e ls e { /* FIFO */ tmp = su sp en d _ q -> last; tmp->next = q; q -> la st = tmp; suspend_q->last = q; /* stack */ /* tmp = suspend_q; q->next = tmp; tm p-> last = q; suspend_q = q; */ > > /* enqueue_transO */ /* Put an en try in to th e m ig ratio n queue*/ /****************************************/ void enqueue_trans(q) Q q; ■ c Q tmp; i f ( tra n s_ q == NULL ) { tran s_ q = q; tra n s_ q -> la st = q; > e ls e { tmp = tra n s _ q - > la s t; tmp->next = q; q -> la st = tmp; tra n s q -> la st = q; } tran s_ len g th + + ; > /****************************************/ 329 /* dequeue.readyO */ /* Take an entry from the beginning of */ /* ready queue. */ Q dequeue_ready() ■ c q q; if ( ready_q != NULL ) ■ { ready.length--; l_in— ; q = ready_q; ready.q = q->next; if ( ready.q != NULL ) ready_q->last = q->last; q->last = NULL; q->next = NULL; return(q); > else return(NULL); > /****************************************/ / * dequeue.transO */ /* Take an entry from the beginning of */ /* migration queue. */ /****************************************/ Q dequeue.trans() { Q q; if ( trans.q != NULL ) { trans.length— ; q = trans.q; trans.q = q->next; if ( trains _q != NULL ) trans_q->last = q->last; q->last = NULL; q->next = NULL; return(q); > else return(NULL); } /* dequeue.suspendO */ /* Take an entry from the beginning of */ 330 /* the suspend queue. */ j J i t * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / Q dequeue_suspend() ■ c Q q; if ( suspend_q != NULL ) { q = suspend_q; suspend_q = q->next; if ( suspend_q != NULL ) suspend_q->last = q->last; q->last = NULL; q->next = NULL; susp_length--; return(q); > else return(NULL); > D .2 . 8 B e n c h m a rk P ro g ra m M o d u le /* prog.c */ /* This file contains the benchmark programs. */ /* They are: takO , fib() and qksort(). */ #include "node.h" /j****************************************/ / * t a k ( ) * / /* Tak function. */ /if:***************************************/ int tak(x,y,z) int x, y, z; ■c int r,xl,yl,zl; if ( debug ) printf (" tak ' / . d , ' / . d , ‘ / .d at node ‘ /.ld\n*',x,y ,z ,my_node) ; if (y >= x ) returnCz); else { suspend(tak,3,0); run(tak,0,3,x-l,y,z); run(tak,1,3,y-l,z,x); run(tak,2,3,z-l,x,y); return(NONE); 331 > > /* fib() */ /* Fibonacci function. */ /I!;#**************************************/ int fib(x) int x; { int plus(); int fibC); unsigned long ck; if ( debug) printf(" fib * / . d at node ‘ /.ldXn" ,x ,my_node) ; f++; ck = mclockO + 0 ; while ( mclockO < ck ) 1 if ( x <= 2 ) return(x); else { suspend(plus, 2 ,0) ; run(fib,0 ,2 ,x-l); run(fib,1, 2 ,x-2); return(NONE); } > /****************************************/ /* plusO */ /* Addition function. */ int plus(x,y) int x,y; - c return(x+y) ; > /* times() */ /* Multiplication function. */ int times(x.y) int x,y; { return(x * y); 332 > /****************************************/ /* qksortO */ /* Quick sort program. */ int *qksort(l,r,x) int 1 ; / * left index of the array */ int r; /* right index of the array */ int x[]; /* array to be sorted */ ■ c int j, partitionO; /* partition */ void insortQ; /* insertion sort */ int k; if ( 1 < r ) { if ( r - 1 < THRESHOLD ) { /* insertion for small size */ insort(l,r,x); return(x); > else • { j = partition(l,r,x); /* j is the split point * / suspend(merge,6,4,l,j,r,x,0,0); /* 6 args only 4 available */ runCqksort,4,3,1,j,x); /* port.id = 4, 3 arguments * / run(qksort,5,3,j+1,r,x); /* port_id = 5, 3 arguments */ return(&NONE); /* dummy return for suspend */ > > else return(x); /I****************************************/ /* mergeC) */ /* Merge two sorted sub-set to one. */ int *merge(l,j,r,z,x,y) /* put two sub-array together */ int l,j,r; /* j is the joint point */ int z[] ,x[] ,y [] ; - C void copyC); copy(l,j,x,z); /* copy 1 to j elements from x to z */ copyCj+l,r,y,z); /* copy (j+l) to r elements from y to z */ return(z); /* z is sorted */ > /****************************************/ /* copyO */ 333 /* Copy sub-array into array. */ /****************************************/ void copy(l,r,x,y) int l,r; /* left and right indices */ int x d ; /* source */ int y[] ; /* target */ { memcpy(&y[ 1 ],&x[l].sizeof(int) * (r-i+1)); } /****************************************/ /* exchangeC) */ /* Exchange two element values. */ /****************************************/ void exchange(x,y) int *x,*y; ■ c int t ; t = HcX; *x = *y; *y = t ; > /****************************************/ /* partitionO */ /* A mead-of-three partition, divides */ /* an array into two sub-arrays. */ / ***** ** ******* ****** H e * * ****** **** * ******/ int partition(l,r,A) int 1,r ; 1 — 1 i _1 • H { int v,i,j,t ; t = A[r - l] ; A[r - l] = A [(1 + r) / 2] ; A [ (1 + r) / 2] = t; if (A[r - 1] > A [1]) e x ch a n g e(&A[r-l],&A[1]); if (A[r] > A [l]) ex c h a n g e(&A[r],&A[l]); if (A[r - 1] > A[r]) e x ch a n g e(&A[r-l3,&A[r3); v = A[r3 ; i = 1 - i; j - r; do { 334 do { i++ ; } while (A[i] < v); do { j--; } while (A [j] > v); exchange(&A[i],&A[j] ); } while (j >= i); t = A[r] ; A[r] = A[j] ; Atj] = Ati] ; A [i] = t; return(i); } /****************************************/ /* insortO */ /* Insertion sort on small array. */ void insort(l,r,x) int l,r; int x [] ; { int i, j, v; if (r > 1) { for (i = r - 1; i >= 1; i— ) { if (x[i] > x[i + 1]) { v = x[i] ; j = i + 1; do { x[j - 1] = x [j ] ; j++; > while (x[j] < v) ; x[j - 1] = v; > > } > / * * * * * * * * * * * * * * * * * * * * * * * * % * * * * * * * * * * * * * * * / /* gen() */ /* Generate randomly distributed data */ /* need to be sorted. */ void gen(n,X) int n; int X[] ; ■ c 335 int i; srand(l); ICO] = n; /* data set size */ X[l] = -1; /* lower bound */ for (i = 2; i <= n+1; i++) /* will sort X[2] to X[n+l] */ X[i] = ( randO & 00000017777) >> 4; X[n+2] = 99999; /* higher bound */ } D .2 .9 N o d e L ib ra ry M o d u le /* node_lib.c */ /* This file contains node library functions: the allocation*/ /* and the printing functions. */ #include "node.h" /* get_pcb() */ /* Get a free process block by allocat-*/ /* ing the memory space. */ PCB get_pcb() { char * callocO; returnC (PCB) callocCl,sizeof(PCBLCK)) ); > /* get_q() */ /* Allocate a queuen entry space. */ Q get-q() •[ char * callocO; returnC (Q) callocCl,sizeof(QUEUE) )); > Z********************* ******************:*/ /* get_cand() */ /* Allocate a candidate space. */ c get_cand() 336 - e char * callocO; returnC (C) calloc(i,sizeof(CAND) )); } D .2.10 N ode Declaration M odule / * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * / /* node_decl.c */ /* This file contains the declarations of global variables.*/ /***********************************************************/ tfinclude "node.h" PARMSG host_parm_msg; /* host parameter message */ PMSG host_parm_ptr; /* pointer to host_parm_msg */ HLOADMSG hostloadmsg; /* host load info message */ HLMSG host_l_msg; /* pointer to hostloadmsg */ NLOADMSG nodeloadmsg; /* local load message */ NLMSG node_l_msg; /* pointer to nodeloadmsg */ GENMSG gmsg; /* generic message */ GMSG g_msg; /* pointer to gmsg */ NODEMSG nodemsg; /* node status message */ NMSG n_msg; /* pointer to nodemsg */ long msg_size; /* length of message */ long host_node; /* host node id */ long host_pid; /* host process id */ long my_node; /* local node id */ long num_nodes; /* number of nodes in system */ long debug; /* high level debug */ long debug1; /* low level debug */ char funname[FUWCLEN] ; /* function name */ int num_arg; /* number of arguments */ int args[MAXARG]; /* argument array */ int num_pid; /* number process idJs given */ long DIM; /* hypercube dimension */ Q ready_q; /* ready queue */ Q trans_q; /* migration queue */ q suspend_q; /* suspend queue */ PCB current_pcb; /* current PCB */ PCB root_pcb; /* root PCB */ C cand_list; /* candidate list */ C new_list; /* new list */ int ready_length; /* length of ready queue */ int trans_length; /* length of migration queue */ int susp_length; /* length of suspend queue */ 337 int done; /* int ok_to_recv; /* unsigned long start_time;/* unsigned long old_timer; /* unsigned long timer; /* long LB; /* long threshold; /* int status; /* float l_in; /* long neighbors[MAX_N0DE] float alpha; /* int final_result; /* shut down flag */ ready to receive load */ starting clock */ last recorded clock */ current clock * / load balancing method */ load threshold */ load status */ local load * / /* neighboring nodes * / threshold normalize cons. */ user program result */ 338
Linked assets
University of Southern California Dissertations and Theses
Conceptually similar
PDF
00001.tif
PDF
00001.tif
PDF
00001.tif
PDF
00001.tif
PDF
00001.tif
PDF
00001.tif
PDF
00001.tif
PDF
00001.tif
PDF
00001.tif
PDF
00001.tif
PDF
00001.tif
PDF
00001.tif
PDF
00001.tif
PDF
00001.tif
PDF
00001.tif
PDF
00001.tif
PDF
00001.tif
PDF
00001.tif
PDF
00001.tif
PDF
00001.tif
Asset Metadata
Core Title
00001.tif
Tag
OAI-PMH Harvest
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC11257159
Unique identifier
UC11257159
Legacy Identifier
DP22810