MAPPING METHODOLOGIES FOR HETEROGENEOUS COMPUTING

by

Muhammad Esmat Shaaban

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Engineering)

May 1994

Copyright 1994 Muhammad Esmat Shaaban

UMI Number: DP22891. All rights reserved. The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion. Published by ProQuest LLC (2014). Copyright in the Dissertation held by the Author. Microform Edition (c) ProQuest LLC. This work is protected against unauthorized copying under Title 17, United States Code. ProQuest LLC, 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106-1346.

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This dissertation, written by Muhammad E. Shaaban under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Dean of Graduate Studies

Date

DISSERTATION COMMITTEE

Chairperson

To my parents

Acknowledgments

I am forever indebted to my parents for their limitless kindness, love and support throughout the years. I am very grateful to my thesis advisor Dr. Viktor K. Prasanna for his guidance, encouragement and patience throughout this research. I would like to thank Dr. Richard Freund and Dr. Douglas Ierardi for taking the time and effort to serve on my thesis committee. For their true unconditional friendship through the ups and downs, and for their great help, technical and otherwise, very special thanks go to Mary Eshaghian and Ashfaq Khokhar. To all selfless kind souls who make life a little more bearable and give without expecting to receive, my deepest thanks and admiration.

Contents

Dedication
Acknowledgments
List of Figures
List of Tables
Abstract
1 Introduction
I Theory of Heterogeneous Computing
2 Hierarchy of Heterogeneous Computing
  2.1 Network Layer
  2.2 Communication Layer
  2.3 Intelligent Layer
    2.3.1 Code Analysis
    2.3.2 Partitioning, Mapping and Scheduling
    2.3.3 Programming Environments
3 HOST
  3.1 Heterogeneous Optimal Selection Theory (HOST)
  3.2 Modeling the Input to HOST
II Intelligent Mapping Methodologies
4 Automatic Fine Grain Mapping with Cluster-M
  4.1 Introduction
  4.2 Cluster-M
    4.2.1 Cluster-M Specifications
    4.2.2 Implementation of the Cluster-M constructs
    4.2.3 Cluster-M problem Specification macros
    4.2.4 Cluster-M Representations
    4.2.5 Implementing the representation algorithm
    4.2.6 Hierarchical Cluster-M Model
  4.3 Cluster-M Mapping Algorithm
    4.3.1 Preliminaries
    4.3.2 Mapping Algorithm
  4.4 Implementation and Experiment Results
  4.5 Conclusion
5 Optimal Partitioning for Two-machine Heterogeneous Suite
  5.1 Partitioning of Chain Structured Problems
    5.1.1 Image Processing Background
    5.1.2 The Partitioning Algorithm
    5.1.3 Vision Applications
  5.2 Partitioning Tree Structured Problems
    5.2.1 The Labelling of the Tree
    5.2.2 The Doubly Weighted Assignment Graph
    5.2.3 The Notion of Critical Nodes
  5.3 Conclusion
III Concluding Remarks
Appendices
Appendix A: PCN Cluster-M Constructs
Bibliography

List of Figures

1.1 A heterogeneous network-based parallel computing system
1.2 Performance improvement of a heterogeneous suite over a single supercomputer
1.3 The mapping problem as graph matching
1.4 Machine independent programming model
2.1 Intelligent layer services
3.1 Input format for OST and AOST
3.2 Input format for HOST
4.1 PCN System Structure
4.2 Cluster-M Specification of associative binary macro
4.3 Cluster-M Specification of broadcast macro
4.4 Cluster-M Representation of n-cube of size 8
4.5 Cluster-M Representation of Mesh of size 8
4.6 Cluster-M Representation of a ring of size 8
4.7 Cluster-M Representation of a completely connected system of size 8
4.8 Cluster-M Representation of an arbitrarily connected system of size 8
4.9 A heterogeneous parallel computing system
4.10 Hierarchical Cluster-M representation of a heterogeneous computing system
4.11 Hierarchical Cluster-M specification of input to HOST
4.12 Pseudo code of Implementation
4.13 Mapping associative binary operation onto cube
4.14 Mapping irregular problem onto mesh
4.15 Mapping binary-associative operation onto an irregular system
4.16 An example for irregular mapping
5.1 Levels of Processing in Vision
5.2 A seven-module chain mapped onto a two processor heterogeneous system
5.3 The layered graph and several paths between the start and the end node
5.4 Path P1 (bottom) and path P2 (top)
5.5 A pictorial representation of a Scene Description Task
5.6 An eight-module tree mapped onto a two processor heterogeneous system
5.7 An eight-module tree partitioned onto a two processor heterogeneous system
5.8 The eight-module tree of Figure 5.7 partitioned onto a two processor heterogeneous system
5.9 A labelled tree with critical nodes shown in black

List of Tables

3.1 Notations used in the HOST formulation

Abstract

The performance of an algorithm for solving a given problem depends strongly on its implementation and on its mapping onto the architecture being used. Moreover, algorithms have diverse computational requirements which may not be fulfilled by any single architecture. Heterogeneous Computing (HC) deals with the concurrent use of a heterogeneous suite of computers (scalar, parallel, vector, etc.) in solving a given problem. Motivated by the potential for increased computational performance and cost effectiveness, this computational model has recently been studied by several researchers. The main theme of this thesis is to concentrate on fundamental design issues in HC, in particular on developing suitable methodologies for mapping application tasks onto a heterogeneous suite of computers. Towards this goal, we present our results in two parts. In the first part, we present the theory of Heterogeneous Computing. This includes a presentation of a hierarchy of design issues in Heterogeneous Computing, comprised of three layers.
The first layer, the Network Layer, includes the physical design aspects of interconnecting the autonomous machines in the system. The second layer, the Communication Layer, is concerned with the communication and synchronization primitives required to exchange information among processes residing in various machines. The third layer, the Intelligent Layer, provides system-wide tools and techniques necessary to manage the suite of heterogeneous machines. The tasks handled by this layer include code analysis, partitioning, mapping, scheduling, and programming environment tools. In this thesis, we concentrate only on the Intelligent Layer. Heterogeneous Optimal Selection Theory (HOST), which is a proof of existence of an optimal selection of heterogeneous machines for a given task, is also presented in this part.

Having discussed the theory of Heterogeneous Computing and identified programming and mapping requirements, we next concentrate on mapping methodologies for heterogeneous computing in Part 2. The first mapping methodology described is based on the Cluster-M parallel programming model. This is an online mapping heuristic for porting an arbitrary algorithm onto some or all of the available machines in a heterogeneous suite. Before presenting this algorithm, we first introduce the Cluster-M parallel programming model. This model meets the requirements of HC by allowing a single program to be ported and shared among various machines in a heterogeneous suite. To exploit the multi-level parallelism of a task in a heterogeneous environment, Hierarchical Cluster-M (HCM) is also introduced in this part. In the second chapter of this part, an optimal mapping methodology is presented for a heterogeneous suite having only two machines of different types. The input to this mapping methodology is assumed to have either a chain or a tree structure. In conclusion, we define future research directions in the field of Heterogeneous Computing and mapping.

Chapter 1

Introduction

Today's supercomputing applications are characterized by a high level of diversity in terms of the type of embedded parallelism, and by an ever-increasing demand for computational performance. Conventional parallel supercomputing systems utilize a number of homogeneous processors cooperating on parallel tasks. These systems are usually classified according to the multiplicity of data and instruction streams [25]. Such homogeneous systems provide efficient solutions to tasks whose embedded parallelism matches that offered by the system (i.e., SIMD, MIMD, vector). If more than one type of parallelism is present in a task, system performance is greatly degraded. This can be explained by Amdahl's law, which states that the overall rate at which a machine computes a code or set of codes is the inverse of the sum of the times spent on each subportion. Thus the machine spends most of the execution time on code for which it is poorly suited. If greater computational power is needed, the complete system must be replaced by a more powerful homogeneous system, a costly solution.
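To make the Amdahl's-law argument concrete, suppose a fraction f_i of a code runs at rate R_i on a given machine. The overall rate is then

    R_overall = 1 / (Σ_i f_i / R_i).

The numbers that follow are illustrative assumptions, not measurements from this thesis. If 60% of a code is vector code running at 100 MFLOPS on a vector supercomputer, while the remaining 40% is scalar code running there at 2 MFLOPS, then R_overall = 1 / (0.6/100 + 0.4/2) ≈ 4.9 MFLOPS: the poorly matched 40% dominates the execution time. If instead the scalar portion is assigned to a machine suited to it running at 20 MFLOPS, the two-machine suite achieves R_overall = 1 / (0.006 + 0.02) ≈ 38.5 MFLOPS (ignoring inter-machine communication costs). This is exactly the improvement potential that Heterogeneous Computing aims to exploit.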
Heterogeneous Computing (HC) is a novel approach that has the potential to overcome several of the shortcomings of conventional homogeneous parallel systems. HC is an environment in which a parallel application runs on a number of cooperating autonomous high performance computers communicating over an intelligent network and offering several types of parallelism [27, 29, 38]. A typical HC environment is shown in Figure 1.1. This approach aims at providing high performance by executing each portion of code on a machine offering a matching type of parallelism. HC offers the greatest performance improvement potential over homogeneous systems when more than one type of parallelism is present in the task in question (see Figure 1.2).

[Figure 1.1: A heterogeneous network-based parallel computing system: an Alliant FX-80, a Cray Y-MP, a massively parallel processor (MPP), and an Image Understanding Architecture connected by a high-speed interconnect.]

[Figure 1.2: Performance improvement of a heterogeneous suite over a single supercomputer. For a baseline task mixing vector, MIMD, SIMD, and scalar code, a single vector supercomputer runs 2 times faster than the baseline, while a heterogeneous suite runs 20 times faster.]

This is achieved by mapping each task module to a machine in the heterogeneous suite with a matching type of parallelism. This process is closely related to the mapping problem in homogeneous parallel systems. In homogeneous systems, the mapping problem is usually concerned with finding an assignment of problem tasks onto processors which minimizes the total completion time or cost. The optimal solution of this problem, given arbitrary task and system structures, has been shown to be NP-hard even when the processors are identical [7]. This is due to the fact that an optimal solution to the mapping problem requires graph matching of the problem graph and the system graph (see Figure 1.3). To overcome the intractability of the mapping problem, several heuristic approaches have been proposed over the years [18, 5, 30, 62, 63, 48, 15, 57]. An alternate approach taken by several researchers is to find optimal mapping solutions by restricting the structures of the problem and the system [6, 46, 34, 33].

[Figure 1.3: The mapping problem as graph matching: a problem graph is matched against a system graph.]

HC, while promising increased computational performance, adds additional constraints to the mapping problem. This stems from the HC requirement that task modules be mapped onto machines with suitable types of parallelism. In addition, HC systems are dynamic in nature, with machines possibly being added to or removed from the heterogeneous suite. Thus mapping methodologies developed for homogeneous parallel systems, where such constraints do not exist, may not be suitable for HC.

In this thesis we investigate various design issues involved in HC, especially those concerning the mapping of tasks onto the various machines in the heterogeneous suite. We classify such issues into a hierarchy comprised of three layers. The first layer, the Network Layer, includes the physical design aspects of interconnecting autonomous machines in the system. The second layer, the Communication Layer, is concerned with communication and synchronization primitives required to exchange information among the processes residing in various machines. The third layer, the Intelligent Layer, provides system-wide tools and techniques necessary to manage the suite of heterogeneous machines, ensure proper operation, and provide transparency to users. The tasks handled by this layer include code analysis, partitioning, mapping, scheduling, and programming environment tools. In this thesis we concentrate on design issues for the Intelligent Layer in general, and on its mapping aspects in particular.
In this thesis, we develop two mapping methodologies for HC. The first methodology is based on the concept of generating a machine-independent specification of the parallel task (see Figure 1.4). To aid in developing this methodology, we design a machine-independent programming model for HC, called Cluster-M. An efficient heuristic mapping methodology based on Cluster-M is then developed. This is a general mapping methodology with no restriction on the structure of either the heterogeneous suite or the task.

[Figure 1.4: Machine independent programming model: a problem is given a machine-independent specification through the parallel programming model, from which mappings are generated for the individual machines.]

The second mapping methodology we propose in this thesis is restricted to heterogeneous suites with two machines, and to tasks with chain or tree structures. Under these constraints this methodology provides optimal mappings in polynomial time.

We present our research results in two parts. In the first part, we present the theory of HC. In Chapter 2, a hierarchy of design issues in Heterogeneous Computing, comprised of the three layers, is presented. HOST, which is a proof of existence of an optimal selection of heterogeneous machines for a given task, is presented in Chapter 3. Having discussed the theory of Heterogeneous Computing and identified programming and mapping requirements, we next concentrate on mapping methodologies for Heterogeneous Computing in Part 2. The first mapping methodology described is based on Cluster-M parallel programming. This is an online mapping heuristic for porting an arbitrary algorithm onto any or all of the available machines in a heterogeneous suite. We also present Cluster-M in detail in this part. Cluster-M meets the requirements of HC by utilizing a machine-independent specification of the task which can be ported and shared among various machines in a heterogeneous suite. To exploit the multi-level parallelism of a task in a heterogeneous environment, Hierarchical Cluster-M (HCM) is also introduced as a more restricted form of the Cluster-M model. The main components of HCM are the HCM system Representations and the HCM problem Specifications. In the second chapter of this part, we present an optimal mapping methodology for two-machine heterogeneous suites. The input to this mapping methodology is assumed to have either a chain or a tree structure. In conclusion, we define future research directions in the field of Heterogeneous Computing and mapping.

Part I

Theory of Heterogeneous Computing

In this part, we present the theory of heterogeneous computing. This includes a presentation of a hierarchy of design issues in heterogeneous computing, comprised of three layers. The first layer, the Network Layer, includes the physical design aspects of interconnecting autonomous machines in the system. The second layer, the Communication Layer, is concerned with communication and synchronization primitives required to exchange information among processes residing in various machines. The third layer, the Intelligent Layer, provides system-wide tools and techniques necessary to manage the suite of heterogeneous machines. The tasks handled by this layer include code analysis, partitioning, mapping, scheduling, and programming environment tools. HOST, which is a proof of existence of an optimal selection of heterogeneous machines for a given task, is also presented in this part.
Chapter 2

Hierarchy of Heterogeneous Computing

The HC environment is comprised of several hardware and software components that manage the suite of heterogeneous machines in the system and enable applications to be run efficiently. The hardware and software requirements for HC can be classified into three layers: the network layer, the communication layer, and the intelligent layer. In this thesis, we concentrate on issues related to the intelligent layer. We next describe each of these layers.

2.1 Network Layer

The network layer in HC includes the physical aspects of interconnecting the autonomous high performance machines in the system, including low-level network protocols and machine interfaces. Current Local Area Networks (LANs) can be used to connect existing machines, but this approach is not suitable for HC. In order to realize an HC environment, higher bandwidth and lower latency networks are required. The bandwidth of commercially-available LANs is limited to about 10 Mbits/sec. On the other hand, in HC, assuming machines operating at a 25 MHz clock with a 40 MIPS instruction rate and a 16-bit word length, a bandwidth on the order of 1 Gbit/sec is required to match computation and communication (40 x 10^6 instructions/sec x 16 bits per word is already 640 Mbits/sec of potential operand traffic per machine).

Recent advances in network technology have made it feasible to build gigabit LANs. Links in these networks are capable of operating at rates on the order of 1 Gbit/sec or higher, and thus have at least 100 times the bandwidth of today's 10 Mbits/sec Ethernets. Gigabit LAN standards are emerging. The High Performance Parallel Interface (HiPPI), whose physical layer has been approved as an ANSI standard, will likely become the backbone for interconnecting machines in HC. HiPPI-based LANs support data rates of 800 Mbits/sec and 1.6 Gbits/sec. Such networks are being used to interconnect a CRAY-2 and a CM-2 at the Minnesota Supercomputer Center [58]. A similar project using a CRAY Y-MP and a CM-2 is well underway at the Pittsburgh Supercomputing Center [45].

Even with high bandwidth networks, there are three main sources of inefficiency in current network implementations. First, existing application interfaces incur excessive overhead due to context switching and data copying between the user process and the machine's operating system. Second, each machine must incur the overhead of executing high-level protocols that ensure reliable communication between tasks. Finally, the network interface burdens the machine with interrupt handling and header processing for each packet.

Nectar [4] is an example of a network backplane for heterogeneous multicomputers. It consists of a high-speed fiber-optic network, large crossbar switches and powerful network interface processors. Protocol processing is offloaded to these interface processors.

In HC, modules from various vendors share physical interconnections. Since different manufacturers usually use different communication protocols, the network management problem becomes more complex [47]. The following three general approaches to dealing with network heterogeneity are given in [60]:

1. treat the heterogeneous network as a partitioned network, where each partition employs a uniform set of protocols,

2. have only a single "visible" network management console, and

3. integrate the heterogeneous management functions at a single management console.
2.2 Communication Layer

The HC environment achieves efficient execution of parallel tasks by decomposing the task into several modules which are assigned to machines in the system with a matching mode of embedded parallelism. The task modules run on their assigned machines as local processes. These processes need to exchange intermediate results and process synchronization information, either with processes residing in the same machine or with processes residing on other machines, using the network. Since each machine in the system may utilize different process communication and synchronization primitives, a uniform system-wide communication mechanism operating above the native operating systems is needed to facilitate this exchange of information. Due to the networked nature of HC and the lack of shared memory, such a communication mechanism must support message passing.

An example communication mechanism for HC is provided by PVM [56]. The PVM system emulates a virtual concurrent computing machine on a suite of networked machines by executing a system-level process on each machine. A process running on a local machine can access the virtual machine via library routines embedded in imperative procedural languages such as C. Communication support is provided for process management via datagram- or stream-oriented message-passing, synchronization based on barriers or variants of rendezvous, and auxiliary tasks. These library routines interact with the PVM system process on each machine, which then provides the requested actions in cooperation with the PVM system processes running on the other machines in the system.

2.3 Intelligent Layer

The intelligent layer of the HC environment provides system-wide tools and techniques necessary to manage the suite of heterogeneous machines and to ensure the proper and efficient execution of tasks. Such tools operate over the native operating systems of the individual machines and utilize the process communication primitives provided by the communication layer. The services provided by this layer include language support, user interface, task decomposition, mapping, and scheduling, as illustrated in Figure 2.1. We next briefly describe these services.

[Figure 2.1: Intelligent layer services: a parallel task written using programming environment tools is divided into task modules by code analysis and partitioning, then mapped and scheduled onto the machines.]

2.3.1 Code Analysis

Traditional program profiling involves testing a program, assumed to be comprised of several modules, by running it on some test data. The profiler monitors the execution of the program and gathers statistics including the running time of each program module. This information is then utilized to modify different modules, improving the overall execution time. In HC, profiling is done not only to estimate the execution time of code; the type of the code is also considered. This is achieved by code-type profiling. Code-type profiling, introduced in [28], is a code-specific function to determine the code-type (e.g., SIMD, MIMD, vector, scalar, etc.).

Analytical benchmarking provides a measure of how well the available machines perform on a given code-type [28]. While code-type profiling identifies the type of code, analytical benchmarking ranks machines in terms of efficiency in executing a given code. Thus, analytical benchmarking techniques determine the relative effectiveness of a given parallel machine on various computation types.
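As an illustration of how code-type profiling and analytical benchmarking could feed a machine-selection step, the following Python sketch ranks machine types for each profiled segment. The code types, benchmark factors, segment data, and function names here are illustrative assumptions, not data or tools from this thesis.

    # A minimal sketch of combining code-type profiling with analytical
    # benchmarking to rank machines per code segment. All numbers below
    # are illustrative assumptions.

    # Analytical benchmarking: relative speed of each machine type on each
    # code type (baseline scalar machine = 1.0).
    BENCHMARK = {
        "SIMD":   {"simd": 25.0, "mimd": 3.0,  "vector": 2.0,  "scalar": 1.0},
        "MIMD":   {"simd": 4.0,  "mimd": 20.0, "vector": 3.0,  "scalar": 1.5},
        "VECTOR": {"simd": 2.0,  "mimd": 2.0,  "vector": 30.0, "scalar": 1.0},
    }

    # Code-type profiling output: dominant code type of each segment and its
    # measured running time on the baseline machine.
    profile = [
        {"segment": "histogram",    "code_type": "simd",   "baseline_time": 40.0},
        {"segment": "region_match", "code_type": "mimd",   "baseline_time": 35.0},
        {"segment": "fft",          "code_type": "vector", "baseline_time": 25.0},
    ]

    def best_machine(segment):
        """Return the machine type with the lowest estimated execution time."""
        estimates = {
            m: segment["baseline_time"] / speeds[segment["code_type"]]
            for m, speeds in BENCHMARK.items()
        }
        return min(estimates.items(), key=lambda kv: kv[1])

    for seg in profile:
        machine, t = best_machine(seg)
        print(f"{seg['segment']}: assign to {machine} (estimated time {t:.1f})")

Note that this greedy per-segment choice ignores machine availability and contention; resolving those constraints is exactly what the partitioning, mapping and scheduling services described next must handle.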
2.3.2 Partitioning, Mapping and Scheduling

In HC, as in homogeneous systems, the problems of partitioning a parallel task into several modules, mapping the resulting modules onto the various machines, and scheduling the execution of each module are closely related. In the past, partitioning and mapping problems for homogeneous parallel environments have been investigated extensively [8, 9, 17, 40, 24, 42, 51, 53, 54]. However, HC poses new constraints. In the following, we define partitioning and mapping as two different problems and also differentiate between the contexts of these terms in homogeneous and heterogeneous environments.

In a homogeneous environment, the partitioning problem addressed in [39, 12, 31] can be divided into two sub-problems. Parallelism detection determines the parallelism in a program. Clustering combines several operations into tasks and thus partitions the application into several tasks. Each cluster is then assigned to a processor. Both of these sub-problems can be performed by the user, by the compiler, or by the machine at run time. In HC, parallelism detection is not the only objective; classification of the code into different kinds of parallelism is also required. This job is accomplished by code-type profiling. Code-type profiling also poses additional constraints on clustering, and in some cases it may involve an exhaustive search over the available code types. This exhaustive search may introduce extra overhead if it is to be performed at run time, so an efficient strategy to curtail the search space is needed.

Mapping (allocation) of program modules to processors has been addressed by many researchers in the past [8, 17, 40, 24, 42, 51]. Informally, in homogeneous environments the mapping problem can be defined as assigning program modules to processors such that the number of pairs of communicating modules that fall on pairs of directly connected processors is maximized [8]. In HC, machines are globally connected through a high bandwidth network, so the assignment of communicating modules to directly connected machines is not an issue. However, other objective functions for mapping, such as matching code-type to machine-type, add additional constraints. If such mapping has to be performed at run time for load balancing purposes, or due to the failure of a machine, mapping becomes more complex.

In homogeneous environments, scheduling assigns each task to a processor in order to achieve better processor utilization and high throughput. Three levels of scheduling are generally employed. High-level scheduling selects a subset of all submitted jobs competing for the available resources. Intermediate-level scheduling responds to short-term fluctuations in the system load by temporarily suspending and activating processes to achieve smooth system operation. Low-level scheduling determines the next ready process to be assigned to a processor for a certain duration.

In HC, while all three of the above levels of scheduling may reside in each machine, a fourth level of scheduling is needed. This level deals with scheduling at the system level. The scheduler maintains a balanced system-wide workload by monitoring the progress of all the tasks in the system. The scheduler needs to know the different task-types and available machine-types (i.e., SIMD, MIMD, mixed-mode, etc.)
in the system, since tasks may have to be reassigned due to changes in the system configuration or due to overload problems. Communication bottlenecks and queueing delays incurred due to the heterogeneity of the hardware add additional constraints on the scheduler. The scheduler also needs to use information from code-type profiling and analytical benchmarking.

2.3.3 Programming Environments

A parallel programming environment provides a set of tools for programming and debugging application programs on parallel machines. It includes parallel languages, intelligent compilers, parallel debuggers, syntax-directed editors, configuration management tools, and other programming aids.

In HC, machine-independent and portable parallel programming languages and tools are required. This is due to the fact that any portion of the task has the potential to be executed on any of the machines present in the suite. To provide such portability, certain tools are needed to act as an intermediate medium, based on which machine-independent algorithms can be designed using a single programming language and then mapped onto the desired architecture. One such programming model, Linda [14, 11], defines a logically shared data-structuring memory mechanism called tuple space. However, Linda is difficult to implement on architectures not supporting a shared memory structure. In contrast to Linda, the programming model Express supports a distributed memory system organization. However, algorithms coded using Express are machine dependent and therefore are not fully portable. A few other candidate parallel programming environments for HC are the Actors programming model [1, 2, 3] and the Tool for Large-Grained Concurrency (TLC). TLC, developed by BBN, employs implicitly parallel constructs to specify the dependencies among a set of coarse-grained remote computations. The Actors model, on the other hand, allows massively parallel execution of algorithms. At the overhead cost of implementing such a system, Actors is machine independent: it can be executed on shared memory computers and over distributed networks.

Cluster-M, introduced in [23], is a novel parallel programming model which allows parallel programs to be written independent of the underlying structure, and thus is a possible candidate for HC. The components of Cluster-M, and of Hierarchical Cluster-M, a restricted version of Cluster-M designed specifically for HC, are presented in this thesis in Chapter 4. Recognizing its suitability for HC, we utilize Cluster-M as the basis for the general heuristic mapping methodology for HC presented in Chapter 4.

Chapter 3

HOST

In this chapter, an existence theory for selecting an optimal suite of computers for solving problems with diverse computational requirements, called Heterogeneous Optimal Selection Theory (HOST), is presented. HOST is an extension of Augmented Optimal Selection Theory in two ways: it incorporates the heterogeneous parallelism embedded in the tasks, and it reflects the costs associated with using various fine grain mapping strategies at the individual machine level. In the second part of this chapter, we concentrate on modeling this mathematical formulation so that it can be used as the input to the mapping methodologies presented later in this thesis.
3.1 Heterogeneous Optimal Selection Theory (HOST)

Freund [27] proposed an Optimal Selection Theory (OST) to choose an optimal configuration of machines for executing an application task on a heterogeneous suite of computers, under the assumption that the number of machines available is unlimited. An application task is assumed to be comprised of S non-overlapping code segments. Each segment has homogeneous parallelism embedded in its computations. Also, code segments are considered to be executed serially. A code segment is decomposable if it can be partitioned into different code blocks (see Figure 3.1). All code blocks within a code segment have the same type of embedded parallelism and can be executed concurrently.

[Figure 3.1: Input format for OST and AOST: a task is a serial sequence of code segments, each of which decomposes into concurrent code blocks of a single type (e.g., SIMD, MIMD).]

The goal of OST is to assign the code blocks within each homogeneous code segment to the available matching machine types such that the segment is executed optimally. The execution time for a decomposable code segment is equal to the longest execution time among all the code blocks of that code segment. A machine type is identified based on the underlying machine architecture, for example, SIMD, MIMD, scalar, vector, etc. Similarly, each machine type can have more than one model, e.g., Ncube and Mesh for SIMD. Machine choices in OST are always optimal and decompositions of code segments are uniform. According to OST, there exists an assignment such that the total time spent on all code segments is minimized, subject to a fixed constraint such as cost. The assumption here is that there is always a sufficient number of machines of each type available to which a code block can be assigned. Therefore, a code segment can always be assigned to its matching machine type.

OST was augmented by Wang et al. [59] to incorporate the performance of code segments on non-optimal machine choices, assuming that the number of available machines of each type is limited. Under this assumption, a code segment which is most suitable for one type of machine may have to be assigned to another type. For example, consider a code segment consisting of 5 code blocks suitable for execution on MIMD machines. Assuming there are 2 MIMD machines and 6 SIMD machines available, it is impossible to assign all the blocks to MIMD machines. On the other hand, assigning all 5 blocks to 5 SIMD machines may result in the shortest execution time, or decomposing the segment into 2 blocks and assigning them to the 2 MIMD machines may give better performance. In Augmented Optimal Selection Theory (AOST), non-optimal machine choices and non-uniform decompositions of code segments are incorporated.

In the formulation of OST and AOST, it has been assumed that the execution of all code segments of a given task is totally ordered in time. However, there may exist different execution interdependencies among a set of code segments. Also, parallelism may be present between code segments, resulting in concurrent execution of several code blocks of different code segments on a suite of heterogeneous machines. Furthermore, the effect of the mapping techniques available for different problems on individual machines has not been considered in the formulation of selection theory. Consider the following example. Given a code segment consisting of two SIMD code blocks, one code block sorts N elements and the other adds N elements.
Assume there are two SIMD machines available, an N-processor hypercube and an N-processor Mesh Connected Computer (MCC). Efficient assignment of code blocks to the two SIMD machines depends upon the mapping techniques used on the machines for the problem under consideration. The assignment problem becomes more interesting if the machines are of different sizes in terms of the number of processors and/or the memory available with each processor.

HOST, as presented in this section, is an extension of AOST in two ways: it incorporates the effects of the various fine grain mapping techniques available on individual machines, and the task is assumed to have heterogeneous embedded parallelism. The input format is relaxed to allow concurrent execution of mutually independent code segments. An application task is divided into subtasks. Subtasks are executed serially. Each subtask may contain a collection of code segments which can be executed in parallel. A code segment consists of homogeneous parallel instructions. Each code segment is further decomposed into several code blocks which can be executed concurrently. These code blocks are to be assigned to machines of the same type. The general input format is illustrated in Figure 3.2. Figure 3.1 is a special case of Figure 3.2, in which each subtask contains only one code segment. In HOST, heterogeneous code blocks of different code segments can be executed concurrently on different types of machines, exploiting the heterogeneous parallel computations embedded in the application.

Let S be the number of code segments of the given task, and M be the number of different machine types to be considered. Furthermore, let η[t] be the number of machine models of type t, α[t] be the number of mappings available on machine type t, and β[t,l] be the number of machines of model l of type t available. Assume v[t,j] is the maximum number of code blocks that code segment j can be decomposed into. Define γ[t,j] to be the number of machines of type t that are actually used to execute code segment j. Therefore, γ[t,j] equals the minimum of v[t,j] and the number of machines of type t available, i.e.,

    γ[t,j] = min( Σ_{l=1}^{η[t]} β[t,l], v[t,j] ).

A parameter m[t,k] is defined to specify the mapping technique used for code block k on machine type t. Let us further assume that for a particular mapping m on machine type t, the best matched code segment can obtain the optimal speedup θ[t,m] in comparison to a baseline system. A real number π[t,j] indicates how well code segment j matches machine type t, and λ[t,k] is a utilization factor when running code block k on a machine of type t. We have 0 < π[t,j] ≤ 1 and 0 < λ[t,k] ≤ 1. Let p[j] be the percentage of time spent executing code segment j within the overall execution of a given subtask on the baseline machine, so that Σ_{j=1}^{S} p[j] = 1. Similarly, let p[j,k] be the percentage of time spent executing code block k within the overall execution of code segment j on the baseline machine, so that Σ_k p[j,k] = 1.
Therefore, different mappings, p, available on machine type t result in dif ferent execution tim e of segment j. Let A[£,ji] be the m inim um execution tim e of segment j among all the possible mappings on type t. A [t,j] = m in ^[t,j]S[tJ,p[t,j]]. Let the machine type selection vector r indicate the selection of machine types for code segment 1 to S , such th at r = (£[1],£[2], • • • ^ [S 1 ]). Define x[T] to be the execution tim e of the given subtask with heterogeneous m achine type selection r on all the code segments, such th at x[T] = m axi<i< 5 A[i[j], j], then HOST is form ulated as follows: For any subtask, there exists a r with m in x M (3.1) subject to 5I)^ 1 (m axi< ;7< 5 ■y[t,j]) x c[t] < C For an easy reference, all the notations used in HOST form ulation are presented in Table 3.1. Based on this formulation, it is evident th at given a decomposition of the task, as shown in Figure 3.2, to be executed on a desired heterogeneous suite of machines, an optim al execution tim e is achievable. In the following section, we present a paradigm suitable for modeling both the input form at of the task and the underlying heterogeneous suite of machines. This modeling will then be utilized as a platform to develop m apping methodologies for HC environm ent. 20 s No. of code segments of a given task M No. of different machine types to be considered V [t] No. of machine models of type t a £ ] No. of mappings available on machine type t No of available machines of model I of type t Max. No. of code blocks code segment j can be decomposed 1 [t,j] No. of machines of type t to execute code segment j m [f, k] Mapping technique for code block k on machine type t 0 [f, m] O ptim al speedup for m apping m on machine type t *[tj] How well a code segment j m atches machine type t A[<, fc ] Utilization factor for code block k on a m achine of type t PlJ] Percentage execution tim e of code segment j within a subtask p[j, k] Percentage execution tim e of block k within code segment j M apping vector for code segment j on machine type t n] Execution tim e of segment j with m apping /i on type t Min. execution tim e of segment j for mappings on type t T Machine type selection vector x [t] execution tim e of a subtask with m achine type selection r Table 3.1: Notations used in HOST form ulation 3.2 Modeling the Input to HOST HOST, as described in the section above, is an existence proof for an optim al selection of processors for a given task in HC. In this section, we present a tool for modeling the input to HOST. The input form ulation in HOST assumes th at a parallel task T is divided into subtasks f,-, 1 < i < N. Each subtask U is further divided into code segments tij, 1 < j < S, which can be executed concurrently. Each code seg m ent within a subtask can belong to a different type of parallelism(i.e. SIMD, MIMD, vector, etc.), and thus should ideally be m apped onto a m achine with a m atching type of parallelism. Each code segment may further be decomposed into several concurrent code blocks with the same type of parallelism. These code blocks tijk, 1 < k < B, are suited for parallel execution on machines hav ing the same type of parallelism. This decomposition of the task into subtasks, code segments, and code blocks is shown in Figure 3.2. A good model of this input form at is needed to facilitate the m apping of 21 Subtasks Code Segments (SIMD, MIMD, vector etc.) Code blocks (homogeneous) Figure 3.2: Input form at for HOST tasks onto a heterogeneous architecture. 
Based on this formulation, it is evident that, given a decomposition of the task as shown in Figure 3.2, to be executed on a desired heterogeneous suite of machines, an optimal execution time is achievable. In the following section, we present a paradigm suitable for modeling both the input format of the task and the underlying heterogeneous suite of machines. This model will then be utilized as a platform to develop mapping methodologies for the HC environment.

3.2 Modeling the Input to HOST

HOST, as described in the section above, is an existence proof for an optimal selection of processors for a given task in HC. In this section, we present a tool for modeling the input to HOST. The input formulation in HOST assumes that a parallel task T is divided into subtasks t_i, 1 ≤ i ≤ N. Each subtask t_i is further divided into code segments t_ij, 1 ≤ j ≤ S, which can be executed concurrently. Each code segment within a subtask can embody a different type of parallelism (i.e., SIMD, MIMD, vector, etc.), and thus should ideally be mapped onto a machine with a matching type of parallelism. Each code segment may further be decomposed into several concurrent code blocks with the same type of parallelism. These code blocks t_ijk, 1 ≤ k ≤ B, are suited for parallel execution on machines having the same type of parallelism. This decomposition of the task into subtasks, code segments, and code blocks is shown in Figure 3.2.

[Figure 3.2: Input format for HOST: a task divided into serial subtasks, each containing concurrent code segments (SIMD, MIMD, vector, etc.), each of which decomposes into homogeneous code blocks.]

A good model of this input format is needed to facilitate the mapping of tasks onto a heterogeneous architecture. In addition to modeling the input format, the architecture being considered for the execution of the task should also be modeled. Several requirements for this model are identified as follows:

• The modeling of the input format should handle the decomposition of the task into subtasks, code segments, and code blocks, while preserving the information regarding the type of parallelism present in each portion of the task. This is essential to match the type of each code block with a suitable machine type in the system.

• The model should handle parallelism at both fine grain and coarse grain levels.

• The modeling of the input code should emphasize the communication requirements of the various code segments.

• The modeling of the input code should be independent of the underlying architecture.

• The modeling of the system should capture the mode of computation of each machine in the system.

• The interconnection topology of the individual architectures should be systematically represented in the model at both the system and machine levels.

In Chapter 4, we introduce the Cluster-M parallel programming paradigm, which meets most of the above requirements. Cluster-M models a parallel task as a problem specification, independent of the underlying architecture. However, Cluster-M has no provision to model the heterogeneity present in the task. We therefore also extend the Cluster-M model to accommodate the requirements of heterogeneous supercomputing. This extended model is called Hierarchical Cluster-M.

Part II

Intelligent Mapping Methodologies

In this part, we concentrate on the proposed mapping methodologies for heterogeneous computing. The first mapping methodology described is based on the Cluster-M model. This is an online mapping heuristic for porting an arbitrary algorithm onto any or all of the machines in a heterogeneous suite. We first present the Cluster-M model in detail. This includes the main components of Cluster-M: the Cluster-M Specification and the Cluster-M Representation. We also present Hierarchical Cluster-M (HCM) as a restricted form of Cluster-M which meets the input requirements of HOST. In the second chapter of this part, an optimal mapping methodology is presented for a heterogeneous suite with only two machines of different types. The input to this mapping methodology is restricted to either a chain or a tree structure.

Chapter 4

Automatic Fine Grain Mapping with Cluster-M

In this chapter, a generic technique for fine grain mapping of portable parallel algorithms onto multiprocessor architectures is presented. The proposed mapping methodology is based on Cluster-M, which we also present in detail in this chapter. Cluster-M is a novel parallel programming tool which facilitates the design and mapping of portable software onto various parallel systems. The main components of Cluster-M are the Specifications and the Representations. Using the Specifications, machine-independent parallel algorithms are presented in a "clustered" fashion specifying the concurrent computations and communications at every step of the overall execution. The Representations, on the other hand, are a form of clustering of the underlying architecture that simplifies the mapping process. The mapping algorithm presented in this chapter is an efficient method for matching the Specification clusters to the Representation clusters.
4.1 Introduction

An efficient parallel algorithm designed for a parallel architecture includes a detailed outline of the accurate assignment of the concurrent computations onto processors, and of the data transfers onto communication links, such that the overall execution time is minimized. This process may be complex for many application tasks with respect to the target multiprocessor architecture. Furthermore, this process must be repeated for every architecture even when the application task is the same. Consequently, this has a major impact on the ever-increasing cost of software development for multiprocessor systems. In this chapter, we concentrate on the design of portable parallel algorithms and present a methodology for fine grain mapping of these algorithms onto various parallel machines. Towards this goal we first introduce Cluster-M, a parallel programming model which facilitates the design and mapping of portable software onto various multiprocessor systems and allows parallel programs to be written independent of the underlying structure. Cluster-M has two main components: the Cluster-M Representation of an architecture, and the Cluster-M Specification of a problem. The Cluster-M Representation of an architecture incorporates the processor interconnection topology. A parallel program executable by this model is called the Cluster-M Specification, which represents the communication and computation needs of a solution to the problem. We also present Hierarchical Cluster-M as a more restricted form of Cluster-M conforming to the requirements of HOST presented in Chapter 3.

Portable algorithms are specified in Cluster-M format in a way that represents the concurrent computations and communications at every step of the overall execution. To map the Cluster-M Specification onto the target architecture, the processors of the underlying system are clustered in a hierarchical fashion such that all those in the same cluster have an efficient communication medium. The mapping methodology outlined previously specifies a direction towards a good matching of the concurrent tasks (Specification clusters) and interconnected processors (Representation clusters) [21]. In this chapter, we present an efficient algorithm for fine grain mapping of Specifications onto Representations.

A number of other parallel programming tools have also been developed recently which provide an environment for the design and automatic mapping of portable algorithms [1, 5, 43, 63]. These tools can be classified into two groups.
This problem has been proven to be com putationally equivalent to the graph isomorphism problem and hence is an NP-com plete optim ization problem [7]. To reduce the complexity of the m apping problem, a num ber of approaches such as graph contraction and clustering have been studied [18, 5, 30, 62, 63, 48]. In graph contraction, a pair of connected nodes are merged into a supernode [5, 48], while in clustering, a set of connected nodes are m erged into one new node [18, 30, 62, 63]. This process continues until a graph with desired order and pattern is reached. In all these graph m atching based techniques, the entire problem graph is considered against the entire system graph. The im portant observation th at can be m ade here, is th a t it is not necessary to m ap all the steps of an algorithm which are present in a problem graph, onto the system graph all at once. An algorithm represents a step by step procedure for solving a problem. Therefore, it is sufficient to m ap one step of an algorithm at a tim e onto available processors. However, this assignment should be m ade intelligently so th at it minimizes the total execution tim e of the future steps and the overall execution tim e. For this reason, in this paper, we present a m apping algorithm which not only considers the problem graph in a layered (clustered) fashion, but also layers the target system graph in a clustered form. This Cluster-M based approach is the process of finding a good m atching between the two sets of clusters. The rest of the chapter is organized as follows. In section 4.2, we introduce 27 the Cluster-M parallel programming model and the more restricted HCM. The proposed Cluster-M based m apping algorithm is detailed in section 4.3. An im plem entation of this algorithm presented in section 4.4. A brief conclusion is given in section 4.5. 4.2 Cluster-M In this section we introduce a novel programming model called Cluster-M which allows parallel programs to be w ritten independent of underlying structure. Cluster-M has two m ain components: the Cluster-M Representation of an architecture, and the Cluster-M Specification of a problem. The Cluster-M Representation of an architecture incorporates the processor interconnection topology. A parallel program executable by this model is called the Cluster- M Specification which represents the communication and com putation needs of a solution to the problem. This Specification is m apped onto the Cluster- M Representation of the underlying architecture. Cluster-M provides efficient means for designing parallel algorithm s which can be ported and m apped onto various multiprocessor organizations. Cluster-M is utilized in section 4.3 in developing a general heuristic m apping methodology for HC. 4 .2 .1 C lu ste r -M S p e c ific a tio n s A Cluster-M Specification of a problem is a high level m achine-independent program th at specifies the com putation and communication requirem ents of a given problem. A Cluster-M Specification consists of m ultiple levels of clus tering. In each level, there is a num ber of clusters representing concurrent com putations. Clusters are merged when there is a need for comm unication among concurrent tasks. For example, if all n elements of an array are to be squared, each element is placed in a cluster, then the Cluster-M specification would state: For all n clusters, square the contents. 28 Note, th at since no comm unication is necessary, there is only one level in the Cluster-M Specification. 
The mapping of this Specification onto any architecture having n processors would be identical.

Cluster-M constructs

The basic operations on the clusters and their contained elements are performed by a set of constructs which form an integral part of the Cluster-M model. The following is a list and description of the constructs essential for writing Cluster-M Specifications.

• CMAKE(LVL, x, ELEMENTS)
This construct creates a cluster x at level LVL which contains ELEMENTS as its initial elements. ELEMENTS is an ordered tuple of the form ELEMENTS = [e1, e2, ..., en], where n is the total number of components of ELEMENTS. The components of ELEMENTS can be scalar, vector, mixed-type, or any type of data structure required by the problem.

• CELEMENT(LVL, x, j)
This construct yields the j-th element of cluster x of level LVL. If j is replaced by "-", then CELEMENT yields all the elements of cluster x. If x is replaced by "-", then CELEMENT yields all the elements of all clusters of level LVL.

• CSIZE(LVL, x)
Yields the number of elements of cluster x (i.e., ||CELEMENT(LVL, x, -)||).

• CMERGE(LVL, x, y, ELEMENTS)
This construct merges clusters x, y of level LVL into cluster min(x, y) of level LVL+1. The elements of the new cluster are given by ELEMENTS. If ELEMENTS in CMERGE is replaced by "-", the elements of the new cluster are given by [CELEMENT(LVL, x, -), CELEMENT(LVL, y, -)] (i.e., the elements of x are concatenated to the elements of y to form the ELEMENTS of the combined cluster).

• CUN(LVL, *, x, i)
This construct applies the unary operation * to the i-th element of cluster x. If i is replaced by "-", then the operation is applied to all elements of x. If both i and x are set to "-", then the operation is applied to all elements of all clusters of level LVL.

• CBI(LVL, *, x, i, y, j)
This construct applies the binary operation * to the i-th element of cluster x and the j-th element of cluster y. If i and j are replaced by "-", then the binary operation is applied to all elements of x and y. CBI returns the resulting components.

• CSPLIT(LVL, x, k)
This construct splits cluster x of level LVL at the k-th element into two clusters of level LVL+1.

Using these constructs, the previous problem Specification can be written as:

begin
    LVL = 1
    CUN(LVL, Square, -, -)
end

4.2.2 Implementation of the Cluster-M constructs

In this section, we first give a brief introduction to Program Composition Notation (PCN), a parallel programming system selected as the implementation medium for the various components of Cluster-M. We then discuss the Cluster-M constructs implemented in PCN.

Program Composition Notation (PCN)

Program Composition Notation is a system for developing and executing parallel programs [16, 26]. It comprises a high-level programming language with C-like syntax, tools for developing and debugging programs in this language, and interfaces to Fortran and C allowing the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. The code portability aspect of PCN makes it suitable as an implementation system for Cluster-M.

PCN focuses on the notion of program composition and emphasizes techniques of using combining forms to put individual components (blocks, procedures, modules) together.
4.2.2 Implementation of the Cluster-M constructs

In this section, we first give a brief introduction to Program Composition Notation (PCN), a parallel programming system selected as the implementation medium for the various components of Cluster-M. We then discuss the Cluster-M constructs implemented in PCN.

Program Composition Notation (PCN)

Program Composition Notation is a system for developing and executing parallel programs [16, 26]. It comprises a high-level programming language with C-like syntax, tools for developing and debugging programs in this language, and interfaces to Fortran and C allowing the reuse of existing code in multilingual parallel programs. Programs developed using PCN are portable across many different workstations, networks, and parallel computers. The code portability of PCN makes it suitable as an implementation system for Cluster-M. PCN focuses on the notion of program composition and emphasizes the use of combining forms to put individual components (blocks, procedures, modules) together. This encourages the reuse of parallel code, since a single combining form can be used to develop many different parallel programs. In addition, it facilitates the reuse of sequential code and simplifies development, debugging and optimization by exposing the basic structure of parallel programs. PCN provides a core set of three primitive composition operators: parallel, sequential, and choice composition, represented by "||", ";", and "?" respectively. More sophisticated combining forms can be implemented as user-defined extensions to this core notation; such extensions are referred to as templates or user-defined composition operators. Program development, both with the core notation and with templates, is supported by a portable toolkit. The three main components of the PCN system are illustrated in Figure 4.1. We have implemented and tested the Cluster-M constructs in PCN successfully. The PCN implementation of the constructs is given in Appendix A.

[Figure 4.1: PCN system structure: the portable toolkit, application-specific composition operators, and the core programming notation.]

4.2.3 Cluster-M problem Specification macros

Several operations are frequently encountered in designing parallel algorithms. Macros can be defined using the basic Cluster-M constructs to represent such common operations. We next present several macros, their coding in terms of Cluster-M constructs, and their PCN implementations.

Associative binary operation

Performing an associative binary operation on N elements is a common operation in parallel applications. The Cluster-M Specification graph for input size 8 is given in Figure 4.2. The resulting Specification graph is an inverted tree, with the input values each in a leaf cluster at level 1 and the result in the root cluster at level log n + 1. Using Cluster-M constructs, the macro ASSOC_BIN, written in PCN, applies an associative binary operation op to the N elements of input A and returns the resulting value as follows:

ASSOC_BIN(op, N, A, Z)
int N, A[];
{; lvl = 0,
   make_tuple(N, cluster),
   {; i over 0 .. N-1 ::
      {; CMAKE(lvl, [A[i]], c),
         cluster[i] = c
      }
   },
   Binary_Op(cluster, N, op, Z)
}

Binary_Op(X, N, op, B)
int N, n;
{? N > 1 ->
   {; n := N/2,
      make_tuple(n, Y),
      {; i over 0 .. n-1 ::
         {; BI_MERGE(op, X[2*i], X[2*i+1], Z),
            Y[i] = Z
         }
      },
      Binary_Op(Y, n, op, B)
   },
   default ->
      B = X
}

BI_MERGE(op, X1, X2, M)
int e;
{; CBI(op, X1, 1, X2, 1, e),
   CMERGE(X1, X2, [e], M)
}

[Figure 4.2: Cluster-M Specification of the associative binary macro for input size 8: an inverted tree from the eight input clusters a_1, ..., a_8 at level 1 to the single result cluster a_1*a_2*...*a_8 at level 4.]
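The behaviour of ASSOC_BIN can be sketched in a few lines of Python (again my own rendering, assuming n is a power of two and op is associative): leaf clusters are merged pairwise, applying op to one element from each pair, until a single result cluster remains.

def assoc_bin(op, values):
    clusters = [[v] for v in values]          # level 1: one leaf cluster per value
    while len(clusters) > 1:                  # each pass builds the next level
        clusters = [[op(a[0], b[0])]
                    for a, b in zip(clusters[0::2], clusters[1::2])]
    return clusters[0][0]                     # the root cluster holds the result

print(assoc_bin(lambda a, b: a + b, [1, 2, 3, 4, 5, 6, 7, 8]))   # 36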
Vector dot product

As a representative example of vector operations (Vecops), we consider here the dot product of two vectors. The vector dot product of two n-element vectors A and B is defined as d = Σ_{i=1}^{n} (a_i · b_i). The Cluster-M Specification graph of this operation is similar to that shown in Figure 4.2. This macro can be written in terms of Cluster-M constructs and the above ASSOC_BIN macro as follows:

/* VECTOR DOT PRODUCT */
DOT_PRODUCT(N, op, A, B, Z)
int N, A[], B[], C[N], e;
{; lvl = 0,
   make_tuple(N, A1),
   make_tuple(N, B1),
   {|| i over 0 .. N-1 ::
      {; CMAKE(lvl, [A[i]], a),
         CMAKE(lvl, [B[i]], b),
         A1[i] = a,
         B1[i] = b
      }
   },
   {; j over 0 .. N-1 ::
      {; CBI("*", A1[j], 1, B1[j], 1, e),
         C[j] := e
      }
   },
   ASSOC_BIN("+", N, C, Z)
}

SIMD data parallel operations

In this class of operations, an operation is applied to all the input elements without any communication. In this case each operand is assigned one cluster in the problem Specification, and the desired operation is applied to all clusters. The macro DATA_PAR applies operation op to all N elements of input A, as follows:

DATA_PAR(op, n, N, A, Z)
int A[];
{; lvl = 1,
   make_tuple(N, cluster),
   {; i over 0 .. N-1 ::
      {; CMAKE(lvl, [A[i]], c),
         cluster[i] = c
      }
   },
   make_tuple(N, Z),
   {; j over 0 .. N-1 ::
      {; CUN(op, n, cluster[j], 1, e),
         Z[j] = e
      }
   }
}

Broadcast operation

This is a frequently encountered operation in parallel programs: one value is to be broadcast to all processors in the system. The problem Specification for a macro that broadcasts one value a from processor x to N recipient clusters or processors can be written in terms of Cluster-M constructs as follows:

BROADCAST(N, e, Z)
{; lvl = 0,
   make_tuple(N, Z),
   {|| i over 0 .. N-1 ::
      {; CMAKE(lvl, [e], c),
         Z[i] = c
      }
   }
}

The Specification graph for the broadcast operation when N = 8 is shown in Figure 4.3.

[Figure 4.3: Cluster-M Specification of the broadcast macro for N = 8.]

4.2.4 Cluster-M Representations

For every architecture, at least one corresponding Cluster-M Representation can be constructed. The Cluster-M Representation of an architecture is a multi-level nested clustering of processors. To construct a Cluster-M Representation, initially every processor forms a cluster; then clusters which are completely connected are merged to form a new cluster. This is continued until no more merging is possible. In other words, at level LVL of clustering there are multiple clusters, such that each cluster contains a collection of clusters from level LVL - 1 which form a clique. The highest level consists of only one cluster if there exists a connecting sequence of communication channels between any two processors of the system. A Cluster-M Representation is said to be complete if it contains all the communication channels and all the processors of the underlying architecture. For example, the Cluster-M Representation of the n-cube architecture is constructed as follows. At the lowest level, every processor belongs to a cluster which contains just itself. At the second level, every two processors (clusters) which are connected are merged into the same cluster. At the third level, clusters of the previous level which are connected belong to the same cluster, and so on until level n. The complete Cluster-M Representations of a 3-cube, a 2 x 4 mesh, a ring of size 8, a completely connected system of size 8, and a system with arbitrary interconnections are shown in Figures 4.4, 4.5, 4.6, 4.7 and 4.8 respectively.
[Figure 4.4: Cluster-M Representation of an n-cube of size 8.]
[Figure 4.5: Cluster-M Representation of a mesh of size 8.]
[Figure 4.6: Cluster-M Representation of a ring of size 8.]
[Figure 4.7: Cluster-M Representation of a completely connected system of size 8.]
[Figure 4.8: Cluster-M Representation of an arbitrarily connected system of size 8.]

A Cluster-M Representation with k nested subcluster levels represents a connected network of processors whose diameter grows with k. To investigate the relationship between the clustering levels of an architecture and its diameter, let us define D_LVL, the diameter of the Cluster-M Representation at clustering level LVL, as the maximum number of communication steps needed between any two processors contained in any single cluster at level LVL. The diameter of the Representation at level LVL + 1 can then be expressed as:

D_{LVL+1} = D_{LVL} + (communication overhead of level LVL + 1)

For example, let us consider a ring-connected architecture with N processors, where k = log N. The Cluster-M Representation of this architecture is given in Figure 4.6. In this case every two adjacent clusters are merged, so the size of the clusters is doubled at level LVL compared to level LVL - 1, and k + 1 such levels result. The diameter of the network can be found by examining D_LVL for several levels:

D_2 = 2 - 1
D_3 = 4 - 1
...
D_i = 2^{i-1} - 1

Thus at the maximum level k = log N, the network diameter is O(2^k). The relationship between the network diameter and the number of clustering levels depends on the degree of connectivity of the processor nodes and on the connection patterns at each level.

Before presenting an algorithm to find Cluster-M Representations, we define several terms and identify some clustering properties:

• The system graph of an N-processor system S = (P, E) is an undirected graph represented by the adjacency matrix A, where A(i, j) = 1 indicates a communication link between processors i and j, 1 <= i, j <= N.

• A clique in an undirected graph G = (V, E) is a subset V' of V of vertices, each pair of which is connected by an edge in E. In other words, a clique is a complete subgraph of G.

• A system processor is contained in only one cluster at level LVL. Let PC(LVL, x) designate all processors belonging to cluster x of level LVL. Thus for distinct clusters x, y of level LVL, PC(LVL, x) ∩ PC(LVL, y) = ∅. At LVL = 1, PC(1, x) = [x].

• Each cluster is identified by the lowest-numbered processor contained in it (i.e. for cluster x, x = min PC(LVL, x)). Let CLUSTERS(LVL) = [c_1, ..., c_m] be an ordered tuple designating the clusters at level LVL, with m being the number of such clusters.

• The clusters of level LVL form an undirected graph in which two clusters x, y are connected if there exist processors p_x in PC(LVL, x) and p_y in PC(LVL, y) with A(p_x, p_y) = 1.

• Define C(LVL, p) = c to indicate that processor p belongs to cluster c of level LVL, 1 <= LVL <= k, where k is the maximum number of clustering levels.

With the aid of the above properties and definitions, we next present an algorithm to generate the Cluster-M system Representation.

Cluster-M system Representation algorithm

The following pseudo-code algorithm, SYS_REP, constructs the Cluster-M Representation of a connected system of N processors. Initially, all clustering levels are empty. At clustering level 1, each system processor is in a cluster by itself. For each clustering level, the clique containing the lowest-numbered un-merged cluster is obtained using procedure CLIQUE. The details of finding cliques are omitted, as any of several existing algorithms can be utilized.
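The connectivity and clique tests used below follow directly from these definitions. The following small Python sketch makes them concrete (the helper names are mine); A is the adjacency matrix, and pc_x, pc_y are the processor sets PC of two clusters:

def clusters_connected(A, pc_x, pc_y):
    # two clusters are connected if any pair of their processors share a link
    return any(A[p][q] for p in pc_x for q in pc_y)

def is_clique(A, pcs):
    # a set of clusters forms a clique if every pair of them is connected
    return all(clusters_connected(A, pcs[i], pcs[j])
               for i in range(len(pcs)) for j in range(i + 1, len(pcs)))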
All clusters in the obtained clique are then merged into one cluster of the next clustering level using procedure MERGE. This is continued until all clusters of the current level have been merged into clusters of the next level. The algorithm halts when a clustering level is reached which consists of a single cluster with label 1.

PROCEDURE SYS_REP(A)
  For all i, LVL do
  begin
    C(LVL, i) = 0
    PC(LVL, i) = ∅
    CLUSTERS(LVL) = []
  end
  LVL = 1                              {clustering level set to 1}
  For all processors i, 1 <= i <= N, do
  begin
    C(LVL, i) = i                      {each processor is in a cluster by itself at level 1}
    PC(LVL, i) = [i]
    CLUSTERS(LVL) = CLUSTERS(LVL) + i
  end
  While CLUSTERS(LVL) != [1] do
  begin
    For all c in CLUSTERS(LVL), starting with min(c), do
    begin
      For all x, y in CLIQUE(LVL, c) do
        MERGE(LVL, x, y)
    end
    LVL = LVL + 1
  end

PROCEDURE CLIQUE(LVL, c)
begin
  Find CLIQUE such that c is in CLIQUE and, for all x, y in CLIQUE,
  clusters x and y of level LVL are connected, i.e. A(p_x, p_y) = 1
  for some p_x in PC(LVL, x) and p_y in PC(LVL, y)
end

PROCEDURE MERGE(LVL, x, y)
begin
  CLUSTERS(LVL + 1) = CLUSTERS(LVL + 1) + min(x, y)
  PC(LVL + 1, min(x, y)) = PC(LVL, x) + PC(LVL, y)
  For all p with C(LVL, p) = x or y do
    C(LVL + 1, p) = min(x, y)
end

4.2.5 Implementing the representation algorithm

We have implemented the Cluster-M Representation algorithm presented in the previous section, with identical modules, in C under Unix. The algorithm takes the adjacency matrix of the processing elements of the system as input and generates the clusters in each level, including the processing elements belonging to each cluster. For this implementation, we assume that the input graph is connected.

We now analyze the running time of this implementation. Since the number of clusters in each level decreases as the level number increases, we designate the factor by which the number of clusters decreases as c. For each level, the number of clusters is given by N/c^{LVL-1}, which indicates that c^{LVL-1} processing elements belong to each cluster. For each level, we compare each cluster in that level with the clusters numbered higher than itself in the same level and check whether they form a clique. Since there are c^{LVL-1} processing elements in each cluster, c^{LVL-1} * c^{LVL-1} comparisons are required for each of the (N/c^{LVL-1}) * (N/c^{LVL-1}) cluster pairs. Therefore, for each cluster at each of the log_c N levels, O(N * N) comparisons are needed. The total complexity of the program is thus found to be O(N^3):

order = N * N^2 + (N/c) * N^2 + (N/c^2) * N^2 + ... + 1 = O(N^3)

Since each cluster is compared to clusters numbered higher than itself, the implementation is sensitive to processor IDs. The accuracy of this implementation was verified by generating system Representations identical to those presented in the last section. The following is an output of the program for a hypercube of size 8 when processor IDs are hashed:

lvl : 1  (1 : 1) (2 : 2) (3 : 3) (4 : 4) (5 : 5) (6 : 6) (7 : 7) (8 : 8)
lvl : 2  (1 : 1 3) (2 : 2 5) (4 : 4 6) (7 : 7 8)
lvl : 3  (1 : 1 2 3 4 5 6 7 8)

We are planning to implement this algorithm in parallel in Program Composition Notation (PCN).
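The following compact Python sketch (my own, reusing clusters_connected from the earlier sketch) captures the essence of SYS_REP for a connected system: starting from singleton clusters, it repeatedly merges a greedily grown clique of mutually connected clusters until a single cluster remains. It is illustrative only, and makes no attempt to reproduce the C implementation's cluster labelling.

def sys_rep(A):
    n = len(A)
    levels = [[[p] for p in range(n)]]        # level 1: singleton clusters
    while len(levels[-1]) > 1:
        clusters, nxt, used = levels[-1], [], set()
        for i, c in enumerate(clusters):
            if i in used:
                continue
            group = [c]                       # greedy clique containing cluster i
            for j in range(i + 1, len(clusters)):
                if j not in used and all(clusters_connected(A, g, clusters[j])
                                         for g in group):
                    group.append(clusters[j])
                    used.add(j)
            nxt.append(sum(group, []))        # merge the clique's processors
        if len(nxt) == len(clusters):         # disconnected input; assumed not to occur
            break
        levels.append(nxt)
    return levels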
4.2.6 Hierarchical Cluster-M Model

To exploit the multi-level parallelism of a task in heterogeneous environments, the Hierarchical Cluster-M model is proposed as a more restricted form of the Cluster-M model. Hierarchical Cluster-M (HCM) exploits parallelism at the subtask, code segment, code block, and instruction levels. This is accomplished by modifying both the Cluster-M system representation and problem specification processes. The modification to the system representation takes into account the presence of several interconnected machines in the system, providing a spectrum of computational modes. The problem specification takes into account the type of parallelism present in each portion of the task.

Hierarchical Cluster-M system representation

The Hierarchical Cluster-M representation of a system consists of two layers of clustering: a system layer and a machine layer. System layer clustering consists of several levels of nested clusters. At the lowest level of clustering, each machine in the system is assigned a cluster by itself. Completely connected clusters are merged to form the next level of clustering; this process is continued until no more merging is possible. Machine layer clustering is obtained in a similar way, with individual processors replacing system machines in the clustering process.

For a heterogeneous suite of interconnected computers, the HCM system representation is obtained as follows:

1. The HCM system representation algorithm is first applied to the system as a whole. At the first level of clustering, each computer in the system is in a cluster by itself. Each clustering level is constructed by merging clusters from the lower level that are completely connected. This is continued until no more clustering is possible. The resulting clustering levels are called system level clusters.

2. Each resulting cluster is labelled according to the type of parallelism present in the cluster (i.e. SIMD, MIMD, vector, etc.).

3. For each computer in the system, apply the Cluster-M system representation algorithm. At the lowest level, each processor in the computer is in a cluster by itself. All completely connected clusters are merged to form the next level of clustering. The highest level of clustering consists of one cluster containing all processors in the computer. This results in the Cluster-M representation for each individual computer in the system.

Note that the collection of machine clusters at the highest level is equivalent to the lowest system clustering level. A heterogeneous parallel computing system is shown in Figure 4.9, and its HCM representation is shown in Figure 4.10.

[Figure 4.9: A heterogeneous parallel computing system: a 3-cube MIMD machine, a 4 x 4 SIMD mesh, four scalar machines, and a 4-processor vector machine.]

[Figure 4.10: Hierarchical Cluster-M representation of a heterogeneous computing system.]

Hierarchical Cluster-M problem specification

The Cluster-M specification of a parallel task is a program that specifies the communication and computation requirements of the task. The Cluster-M specification consists of several levels of clusters, with the input level being lowest and the final result level being highest. At the lowest level, each cluster contains one computational operand. All initial clusters involved in a computation are merged into one cluster in the next clustering level. Clusters in intermediate levels are merged, split, and/or their elements manipulated according to the computation and communication requirements. Several essential Cluster-M constructs needed to formulate the Cluster-M specification of a task are discussed in [20]. The Cluster-M specification represents the communication needs of the problem at the instruction level and has no provision to identify parallelism at higher levels (i.e. the subtask and code segment/block levels). This specification can be written for any parallel problem, regardless of the communication or computation types present.

The HCM task specification is obtained using Cluster-M constructs. We assume here that the input is a task T in a form similar to the input to HOST, i.e. the following has been done:

• The task is divided into sequential subtasks t_i, 1 <= i <= N.

• Each subtask is divided into several concurrent code segments t_ij, 1 <= j <= S. The type of each code segment has been identified.
The HCM task specification is obtained using Cluster-M constructs. We assume here th at the input is a task T in a form sim ilar to the input to HOST, i.e. the following has been done: • The task is divided into sequential subtasks U, 1 < i < N. • Each subtask is divided into several concurrent code segments tij, 1 < j < S. The type of each of each code segment has been identified. 46 Scalar 3-cube M IM D 4 Processor V ector Figure 4.10: Hierarchical Cluster-M representation of a heterogeneous com put ing system. 47 • Each code segment is further decomposed into several concurrent homo geneous code blocks Ujk, 1 < k < B. T he Hierarchical Cluster-M specification of task T has several layers of clustering: subtask, code segment, code block, and instruction clustering layers. HCM specification of task T is com puted as follows: 1. S u b ta sk c lu ste r in g layer: At the subtask clustering layer, each subtask is represented by single cluster level i , with subtask forming the lowest such level. 2. C o d e se g m e n t c lu ste r in g layer: • For all subtask clustering levels, each level i contains a num ber of clusters at the code segment clustering layer. Each such cluster contains code segments tij of subtask t{. Each cluster is labelled w ith the parallelism type of its corresponding code segment. • Code segment clusters, in the same subtask clustering level i , are connected if results from the clusters are used by a single cluster of subtask clustering level i -f 1. 3. C o d e b lo ck c lu ste r in g layer: Each code segment cluster j in subtask clustering level i contains several clusters at the code block clustering layer. Each cluster in this layer corresponds to a code block tijk- Each code block cluster is labelled with the type of parallelism present in the block. 4. In str u c tio n c lu ste r in g layer: For each cluster of the code-block clustering levels find its Cluster-M problem specification. This step yields the lowest layer of HCM cluster ing, namely the instruction-level clustering layer. Note th at if the input is comprised of only one subtask containing one code * segment w ith one code block, then the resulting HCM specification is identical to the more general Cluster-M specification. This is due to the fact th a t in such 48 a case no code type restrictions are imposed. The input to HOST as shown in Figure 3.2 and the corresponding HCM specification is shown in Figure 4.11. From the above steps, each layer in HCM specification corresponds to a decomposition level in the input to HOST, as follows: S ubtask clustering layer I I t U Instruction clustering layer e=> c > - o > - c=> i i Figure 4.11: Hierarchical Cluster-M specification of input to HOST • Each subtask U in the input to HOST is represented in HCM specifi cation by level i in the subtask clustering layer containing one cluster corresponding to subtask i. • Each code segment t{j in the input to HOST is represented by a cluster in code segment clustering level j in subtask clustering level i. • Each code block tijk in the input to HOST is represented by a cluster in code block clustering level k in code clustering level j and subtask level i. 49 Thus a one-to-one correspondence exists between the decomposition levels of the input code to HOST and its corresponding HCM specification. The HOST input form at assumes th at subtask U cannot start unless all the code of subtask ti- j is completed. The HCM specification m ake no such assum ption. 
All possible inputs to HOST form a subset of the inputs that can be represented by HCM.

Having presented Cluster-M as a programming model suitable for HC, we next present a heuristic mapping algorithm based on Cluster-M.

4.3 Cluster-M Mapping Algorithm

Given a Specification graph and a Representation graph as input to the mapping module, the mapping process proceeds as explained in this section. The mapping procedure presented here has a much lower time complexity than traditional mappings, since it contains a graph matching procedure which considers the input graphs level by level. In the following, we first present a set of definitions and preliminaries; then, in Section 4.3.2, we present a high-level description of the mapping algorithm.

4.3.1 Preliminaries

Definition 4.1
• Let a Specification cluster at level LVL be denoted by κ_S[i_1, i_2, ..., i_LVL], where i_LVL is the cluster number at level LVL and i_l (1 <= l <= LVL - 1) is the cluster number of its parent cluster at level l.
• If a Specification cluster κ_S[i_1, i_2, ..., i_LVL] cannot be further decomposed into sub-clusters, i.e. if this cluster corresponds to a fine grain subtask T_k, let κ_S[i_1, i_2, ..., i_LVL] = T_k.

Definition 4.2
• Let a Representation cluster at level LVL be denoted by κ_R[i_1, i_2, ..., i_LVL], where i_LVL is the cluster number at level LVL and i_l (1 <= l <= LVL - 1) is the cluster number of its parent cluster at level l.
• If a Representation cluster κ_R[i_1, i_2, ..., i_LVL] cannot be further decomposed into sub-clusters, i.e. if this cluster corresponds to a processor p_j, let κ_R[i_1, i_2, ..., i_LVL] = p_j.

Definition 4.3
• Let the computation requirement of Specification cluster κ_S[i_1, i_2, ..., i_LVL] be denoted by σ_S[i_1, i_2, ..., i_LVL].
• The computation requirement of any fine grain subtask T_k, σ_S[T_k], is specified in the problem Specification.
• For any cluster κ_S[i_1, i_2, ..., i_LVL] which contains sub-clusters at a lower level LVL + 1, σ_S[i_1, i_2, ..., i_LVL] = Σ_i σ_S[i_1, i_2, ..., i_LVL, i].

Definition 4.4
• Let the computation capacity of Representation cluster κ_R[i_1, i_2, ..., i_LVL] be denoted by σ_R[i_1, i_2, ..., i_LVL].
• The computation capacity of any processor p_j, σ_R[p_j], is given.
• For any cluster κ_R[i_1, i_2, ..., i_LVL] which contains sub-clusters at a lower level LVL + 1, σ_R[i_1, i_2, ..., i_LVL] = Σ_i σ_R[i_1, i_2, ..., i_LVL, i].

Definition 4.5
• Let the clustering degree of Specification cluster κ_S[i_1, i_2, ..., i_LVL] be denoted by δ_S[i_1, i_2, ..., i_LVL], defined to be the number of levels down to its deepest sub-cluster κ_S[i_1, i_2, ..., i_LVL, ..., i_LVLs], i.e. δ_S[i_1, i_2, ..., i_LVL] = LVLs - LVL, where for any sub-cluster κ_S[i_1, i_2, ..., i_LVL, ..., i_LVLs'], LVLs' <= LVLs.
• The clustering degree of a fine grain subtask is 0.

Definition 4.6
• Let the clustering degree of Representation cluster κ_R[i_1, i_2, ..., i_LVL] be denoted by δ_R[i_1, i_2, ..., i_LVL], defined to be the number of levels down to its deepest sub-cluster κ_R[i_1, i_2, ..., i_LVL, ..., i_LVLr], i.e. δ_R[i_1, i_2, ..., i_LVL] = LVLr - LVL, where for any sub-cluster κ_R[i_1, i_2, ..., i_LVL, ..., i_LVLr'], LVLr' <= LVLr.
• The clustering degree of a processor is 0.

A cluster of lower clustering degree has more communication requirement/capacity than a cluster with the same computation requirement/capacity.
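The recursive definitions of σ and δ can be sketched in Python (my own encoding: a cluster is a nested structure whose leaves carry the given weights):

def sigma(cluster):
    if "weight" in cluster:          # a fine grain subtask or a processor
        return cluster["weight"]
    return sum(sigma(c) for c in cluster["subs"])

def delta(cluster):
    if "weight" in cluster:          # the clustering degree of a leaf is 0
        return 0
    return 1 + max(delta(c) for c in cluster["subs"])

spec = {"subs": [{"weight": 2}, {"subs": [{"weight": 1}, {"weight": 1}]}]}
print(sigma(spec), delta(spec))      # 4 2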
According to the above definitions of clusters, we have the following propositions.

Proposition 4.1
• Specification clusters κ_S[i_1, i_2, ..., i_LVL, i] and κ_S[i_1, i_2, ..., i_LVL, j] have a communication need.
• Representation clusters κ_R[i_1, i_2, ..., i_LVL, i] and κ_R[i_1, i_2, ..., i_LVL, j] have communication links.

Proposition 4.2
• δ_S[i_1, i_2, ..., i_l] >= δ_S[i_1, i_2, ..., i_l, ..., i_{l+m}] + m.
• δ_R[i_1, i_2, ..., i_l] >= δ_R[i_1, i_2, ..., i_l, ..., i_{l+m}] + m.

Definition 4.7
• Let S be the total computation requirement of the whole task, i.e. S = Σ_i σ_S[i].
• Let R be the total computation capacity of the whole system, i.e. R = Σ_i σ_R[i].
• Let f be the reduction factor, which indicates how much Specification computation is to be mapped onto a Representation processor: f = R/S. Therefore, if f >= 1, then the computation capacity of the system is greater than what is required to solve the problem as outlined in the Cluster-M Specification. Otherwise, 1/f of the computations specified are to be mapped onto each of the computational units represented.

Definition 4.8
The measure of mapping quality we use can be formulated as:

|f_m| = F_σ Σ_i | f × σ_S[i] - σ_R[f_m(κ_S[i])] | + F_δ Σ_i g( δ_S[i] - δ_R[f_m(κ_S[i])] )    (4.1)

where f_m is the mapping function for the Specification clusters at the top level, and g is a function defined as g(x) = 1 if x < 0 and g(x) = 0 if x >= 0. F_σ and F_δ are the favor factors for computation and communication matching respectively (F_σ >= 0, F_δ >= 0). The best mapping is the one with the minimum |f_m|.

4.3.2 Mapping Algorithm

Given a Specification graph to be mapped onto a Representation graph, the mapping procedure starts at the top layer (level) of the Specification graph. To map every Specification cluster κ_S[i] at the top level onto a Representation cluster, we search for the best matched Representation cluster, with a computation capacity closest to f × σ_S[i] and a clustering degree equal to or less than δ_S[i]. When the mapping at the top level is done, for each pair of mapped Specification and Representation clusters, the same mapping procedure is continued (recursively) at a lower level, until the mapping is fine grained to the processor level. A high-level description of the mapping algorithm is given below.

1. Sort all κ_R[i_1, ..., i_LVL] in descending order of the value of σ_R[i_1, ..., i_LVL].

2. Sort all κ_S[i] in descending order of the value of σ_S[i].

3. Calculate S, R and f. If f > 1, let f = 1. Calculate the required computation capacity of κ_S[i] to be f × σ_S[i].

4. Find a virtual Representation layer consisting of non-overlapping clusters such that, when the Specification clusters at the current level are mapped onto these clusters, the measure of quality |f_m| is minimized.

5. For each pair κ_S[k] and κ_R[i_1, i_2, ..., i_LVLk] = f_m(κ_S[k]): if κ_R[i_1, i_2, ..., i_LVLk] is a single processor p_j, then stop. Otherwise, take the sub-clusters κ_S[k, i_2, ..., i_l] of κ_S[k] as the new Specification clusters and the sub-clusters κ_R[i_1, ..., i_LVLk, ..., i_l] of κ_R[i_1, i_2, ..., i_LVLk] as the new Representation clusters, for all existing l, and go to step 2.

The total time complexity of this algorithm is analyzed as follows. For each level, excluding step 4, the complexity of steps 2 to 5 is dominated by the complexity of sorting, which can be done trivially in O(K^2) time sequentially for K inputs, where K is the number of Specification clusters at the current level. An optimal solution for step 4 can have exponential time complexity. However, since the number of clusters being mapped at every level is usually constant, this leads to an average linear time performance. Steps 2 to 5 may be repeated for J iterations, where J is the number of nested levels in the Cluster-M Specification graph.
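A sketch of the quality measure of Definition 4.8 in Python (my encoding: spec and rep are lists of (σ, δ) pairs for the top-level clusters, and mapping[i] is the index of the Representation cluster chosen for Specification cluster i):

def quality(spec, rep, mapping, f, F_sigma=1.0, F_delta=1.0):
    total = 0.0
    for i, (s_sigma, s_delta) in enumerate(spec):
        r_sigma, r_delta = rep[mapping[i]]
        total += F_sigma * abs(f * s_sigma - r_sigma)            # computation match
        total += F_delta * (1 if s_delta - r_delta < 0 else 0)   # communication match
    return total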
4.4 Implementation and Experiment Results

We have implemented the algorithm described above. In our implementation, we have used a heuristic for finding f_m in step 4 which has a time complexity of O(KN), where K is the number of Specification clusters at the current level, the total number of clusters in the Representation graph is O(N), N is the number of processors, and M is the number of Specification clusters. The total time complexity of the entire implementation, steps 1 to 5, is O(T^2), where T = max{M, N}. The pseudo code of the implementation is shown in Figure 4.12.

As shown in the pseudo code, to implement step 4 of the algorithm in O(KN) time, for every Specification cluster we consider all Representation clusters and select the one which minimizes |f_m|; we do this for all the Specification clusters at that level. Furthermore, in calculating |f_m|, we only consider matching the computation capacity, by letting F_δ = 0. The results are good even though F_δ is forced to be 0, due to the effects of clustering. Clustering has two effects. First, it partitions the problem graph vertically, to indicate groups of computations which have data dependencies. Second, it partitions the problem graph horizontally, to create independent layers such that all the computations in a layer are to be computed concurrently. As a result, the concurrent computations (Specification clusters) are mapped onto concurrent processors (Representation clusters), such that clustered data-dependent computations are mapped onto groups of processors having an efficient communication medium.

program mapping;
var Spec: entire problem Specification;
    Rep:  entire system Representation;
    {Spec and Rep are represented by sets of multi-level lists}
begin
  sort all the clusters of Spec and Rep at each level;
  map(Spec, Rep)
end.

procedure map(Spec, Rep);        {recursive mapping procedure}
var S: integer;                  {total computation requirement of Spec}
    R: integer;                  {total computation capacity of Rep}
    f: real;                     {reduction factor}
begin
  if Spec or Rep is null then return;
  calculate S of Spec;  calculate R of Rep;
  f := R/S;  if f > 1 then f := 1;
  while Spec not empty do
  begin
    for each Spec cluster i at top level do
    begin
      {calculate the required computation capacity R_i of the Rep cluster
       to be mapped onto; the implementation uses a running version that
       is more accurate than R_i = f * σ_S[i]}
      R_i := f * σ_S[i];
      search for a Rep cluster at top level of computation capacity R_i;
      if such a Rep cluster j is found then
      begin
        delete cluster i from Spec;  delete cluster j from Rep;
        map(Spec cluster i, Rep cluster j)
      end {if}
    end; {for}
    {now there is no best match between Spec and Rep clusters at top level}
    delete header cluster sh from the Spec list;
    initialize Rep cluster h to be empty;  σ_R[h] := 0;
    repeat
      delete header cluster rh from the Rep list;
      σ_R[h] := σ_R[h] + σ_R[rh];
      merge cluster rh into cluster h
    until σ_R[h] >= R_sh;
    if σ_R[h] = R_sh then
      map(Spec cluster sh, Rep cluster h)
    else
    begin
      extra := σ_R[h] - R_sh;
      delete the last cluster rh from h;
      split(cluster rh, extra);
      {returns the extra part and the required part of cluster rh}
      merge the extra part into Rep;  merge the required part into cluster h;
      map(Spec cluster sh, Rep cluster h)
    end {else}
  end {while}
end; {map}

function split(cl: cluster, extra: integer);
{splits cl into two parts and returns both clusters: one with computation
 capacity extra, the other with the required capacity; also recursive}
begin
  go to a lower level of cl;
  search for a cluster of capacity extra;
  if such a cluster ca is found then
  begin
    delete ca from cl;  cluster cb := cl - ca;
    return ca as the extra part and cb as the required part
  end
  else
  begin
    initialize a cluster h to be empty;  σ_R[h] := 0;
    repeat
      delete header cluster rh from cl;
      σ_R[h] := σ_R[h] + σ_R[rh];
      merge cluster rh into cluster h
    until σ_R[h] >= extra;
    if σ_R[h] = extra then
      return h as the extra part and the rest of cl as the required part
    else
    begin
      extra' := σ_R[h] - extra;
      delete the last cluster rh from h;
      split(cluster rh, extra');
      {returns the extra part and the required part of cluster rh}
      merge the extra part into cl;  merge the required part into cluster h;
      return h as the extra part and the rest of cl as the required part
    end {else}
  end {else}
end; {split}

Figure 4.12: Pseudo code of the implementation.
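The O(KN) heuristic for step 4 with F_δ = 0 amounts to a greedy best fit, which the following Python sketch makes explicit (mine, assuming at least as many Representation clusters as Specification clusters at the current level):

def greedy_match(spec_sigmas, rep_sigmas, f):
    order = sorted(range(len(spec_sigmas)), key=lambda i: -spec_sigmas[i])
    free = set(range(len(rep_sigmas)))
    mapping = {}
    for i in order:                      # descending computation requirement
        target = f * spec_sigmas[i]
        # take the unused Representation cluster whose capacity is closest
        j = min(free, key=lambda r: abs(rep_sigmas[r] - target))
        mapping[i] = j
        free.remove(j)
    return mapping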
delete th e last cluster rh from h; split,(cluster rh, extra )', {returns th e ex tra part, and the required p a rt o f cluster rh} merge ex tra p art into Rep; merge required p a rt into cluster h; m ap(Spec cluster xk, R ep cluster h) end {else} end {while} end; {map} procedure m ap(Spec, Rep); {recursive m apping procedure} var S: integer; {total com putation capacity of Spec} R.: integer; {total com putation capacity o f Rep} f: real; {reduction factor} begin if Spec or Rep null then return calculate S of Spec; calculate R. of Hep; f:=H./S; if f> l then f:= l; while (Spec not em pty) do begin for each Spec cluster 2 a t to p level begin {calculate the required com putation capacity of th e Rep cluster to be m apped onto} « • ==*-(££4** +/ * £ L i+, "*[*]); { This is a more accurate version of Ri = / x «•,? [» ] } search for Hep cluster at to p level of com putation capacity o f Ri; if found such Rep cluster j then begin delete cluster t from Spec; delete cluster j from R.ep; m ap(Spec cluster » , Hep cluster j ) end {if} end; {for} {Now there is no best m atch between Spec and Rep clusters at to p level} delete header cluster .?/, from Spec list; initialize Hep cluster h to be em pty; <rn[h] := 0; repeat delete header cluster rh from R ep list; trri[h] := < rR [h] + <rn[rh]; m erge cluster rh into cluster h; until > R.,h; if = Ra ji then m ap(Spec cluster sh, Rep cluster h) else begin extrat= tra[h] — ; function split.(rl:cluster, e.rtro:integer); {to split cl into two parts and retu rn these two clusters. O ne w ith com putation capacity of e x tra . The other one then has th e required rapacity. This function is also a recursive one. } begin go to a lower level o f cl', search fbr a cluster of capacity e x tra ; if found such a cluster ca then begin delete ca from cl; cluster ch := cl — a; return the extra p a rt as ca and th e required p art as cb end else begin initialize a Rep cluster h to be em pty; <rn[h] := 0; repeat delete header cluster rh from cl; <Tn{h] := <tr[/i] + <xR[rh]; merge cluster rh into cluster h; until <ra[h] > e xtra ; if = e x tra then return h as extra p art and th e rest of c l as required p art else begin e x tr a t := (Tr[/(] — extra ; delete th e last cluster rh from h; split.(cluster rh , e x tr a 1); {returns th e extra part and the required p art o f cluster rh} merge extra p art into cl; merge required p a rt into cluster h; return h as extra p art and th e rest of cl as required part end {else} end {else} end; {split.} Figure 4.12: Pseudo code of Im plem entation 56 Cluster— M S pecificatio n : C lu ster-M R e p resen tatio n : mapped unto Step 1 : Step 2 : © S = 8 f = l mapped R = 8 mapped mapped mapped onto mapped onto SEES!. © © mapped © © © © S ,ep 3 :©^ ^ L _ © © — © © - © © ' © mapped © Figure 4.13: Mapping associative binary operation onto cube. subtask d and / are of 3, and subtask c of 4. The output of our m apping algorithm shown here is actually an optim al solution. R e g u la r p r o b le m v s. irreg u la r s y s te m Figure 4.15 illustrates the output of our algorithm in m apping an associative binary operation onto an irregular system. The problem contains 8 subtasks of com putation requirem ent 1 , while the system has 8 processors of com putation capacity 1 , except processor A has capacity 2 (e.g. A is a m aster processor of the completely connected subsystem of A, B , C and D ). 
In this particular example, the system provides more computation capacity than the problem requires while still meeting the problem's communication needs. Therefore, processor D is not used.

[Figure 4.15: Mapping a binary-associative operation onto an irregular system (S = 8, R = 9, f = 9/8, so f is set to 1).]

Irregular problem vs. irregular system

Such an example is given in Figure 4.16. In this example, all the fine grain subtasks have the same computation requirement of 1, and all the processors have a computation capacity of 1. However, the total Specification computation requirement and the total Representation computation capacity are different, and the communication patterns of the subtasks and of the processors differ as well.

[Figure 4.16: An example of an irregular mapping, carried out in four steps.]

4.5 Conclusion

In this chapter we have proposed a programming model for HC called Cluster-M. Based on Cluster-M, we proposed and implemented a methodology for mapping fine grain computations of portable Cluster-M algorithms onto various multiprocessor systems. The input to the mapping algorithm is a Cluster-M Specification graph, which corresponds to a layered problem graph, and a Cluster-M Representation, which corresponds to a layered system graph. Unlike other mapping approaches, which map the entire problem graph onto the entire system graph, this algorithm reduces the complexity of the problem by only mapping the corresponding layers of the two graphs. We presented our experimental results in using the implemented algorithm for mapping various types of problems onto systems. Our mapping algorithm produces good sub-optimal mappings, and does so quickly. Further performance studies are being carried out to obtain more concrete performance data [22].

Chapter 5

Optimal Partitioning for Two-machine Heterogeneous Suite

Partitioning problems for two-processor heterogeneous systems have recently become a focus of interest. A great deal of experimental research has been conducted over the last few years on the development of user-friendly distributed computing environments and on the efficient utilization of heterogeneous computing systems. These experiments have been conducted [58, 45] to exploit the best features of a variety of computer architectures, and to understand the fundamental problem: how to optimally partition application programs across two machines interconnected by a high speed network. Feedback from such experimentation has provided valuable information required to accurately model the behavior of such systems.
It has thus become feasible to conduct further theoretical research and to design efficient algorithms which can assist the programmer or the compiler in determining how to partition an application program across the processors in such a heterogeneous network. The following are some of the research sites working with two-processor heterogeneous environments.

• The Minnesota Supercomputer Center (MSC) has several computer systems with High Performance Parallel Interface (HiPPI) support [58]. The HiPPI hardware and software were recently installed to provide high speed internetworking capabilities between the Connection Machine model CM-2 and the CRAY-2. This high speed interconnectivity allows application programs to be partitioned and run simultaneously on more than one supercomputer at a time.

• The Distributed High Speed Computing (DHSC) environment at the Pittsburgh Supercomputing Center allows applications to be partitioned between the CRAY Y-MP and the CM-2. Algorithms appropriate to the massively parallel machine may run on the CM-2, while serial or vector algorithms may be run on the CRAY Y-MP [45].

In this chapter, we consider optimal partitioning for two classes of parallel problems onto a two-machine heterogeneous system. The first case is where the input is a chain-structured application. Since this type of application is typical of image processing, we discuss our results in the context of image processing in the first part of this chapter. In the second part, we concentrate on the optimal partitioning of tree-structured problems onto a two-machine heterogeneous system.

5.1 Partitioning of Chain Structured Problems

In this section we study the problem of partitioning a chain-structured parallel or pipelined program over a two-processor heterogeneous system and show that it is possible to solve this problem approximately. The algorithm presented in this section is based on a fully polynomial time approximation scheme, and its time complexity is polynomial both in the size of the problem and in 1/ε, where ε is the relative error bound of the approximation scheme. We start by presenting background on image processing tasks which have chain structures. Then, in Section 5.1.2, the algorithm for finding the near-optimal partition of a chain-structured parallel or pipelined program over a two-processor system is presented. A chain-structured program is made up of m modules numbered t_1...t_m and has an intercommunication pattern such that module t_i is connected to modules t_{i-1} and t_{i+1}. Our solution technique involves creating a doubly weighted assignment graph (i.e., one which has two weights associated with each edge) and finding a path for which the maximum of (1) the sum of one kind of weight, and (2) the sum of the other kind of weight, is minimum. As it is difficult to design an exact algorithm to find such a path, we have devised a fully polynomial time approximation scheme to solve this problem. We conclude this section with a discussion, in Section 5.1.3, of applications in vision problems.

5.1.1 Image Processing Background

Image processing is so rich in its diversity of methodology, and so inherently computationally intensive, that it naturally demands a very heterogeneous mix of parallelism [61].
It is easy to conceive of vision systems that make use of every known form of parallelism simultaneously, and maximum efficiency cannot be obtained for all aspects of vision using a homogeneous parallel processor. Beyond simply accommodating many different forms of parallelism, vision actually demands that different forms be combined in order to satisfy its computational requirements [38]. Vision processing is classified into a number of categories or levels; all of these levels can utilize a tremendous amount of parallelism, and the levels themselves can operate in parallel [38]. It has been observed that machines in the SIMD class are well suited for early processing of raw images, often termed low level processing. At this level the input is an image, the output is an image of the same size, and small-kernel convolutions are applied to each pixel in parallel [38, 61]. High level vision tasks, such as image understanding, recognition, and symbolic processing, exhibit coarse-grain or medium-grain MIMD characteristics (see Figure 5.1). Thus, by exploiting the different features and capabilities of a heterogeneous environment consisting of SIMD and MIMD machines integrated by a high bandwidth network, higher levels of performance can be attained than is possible using any single type of parallel machine.

[Figure 5.1: Levels of processing in vision. High level: knowledge-based processing (models, schemas, rules, blackboards, frames, etc.), fed by knowledge sources and hypothesis test results. Intermediate level: symbolic processing (token grouping, graph matching, hypothesis testing, etc.), producing instantiated models and object hypotheses from extracted tokens, token attributes and statistics. Low level: sensory or iconic processing (numerical arrays, statistical data, image events, etc.), governed by processing parameters, local constraints and selection criteria.]

Many computer vision tasks, such as image understanding, pattern recognition, and dynamic scene analysis, can be expressed as pipelined algorithms [49]. A common requirement in such a system is to repeatedly apply a fixed sequence of operations (or transforms) to an essentially unending series of images. Given the local neighborhood communication requirements, this kind of application has a serial or chain-like structure and naturally lends itself to pipelining [50]. Should we choose to carry out all these processes on a single type of machine, the maximum rate at which we can process incoming signals is determined by the time required for that processor to apply all the processing steps to each signal. As a result, any single type of machine often spends its time executing code for which it is poorly suited. By partitioning the application task across different machines that communicate via high speed links, each step or stage of pipeline processing can be executed simultaneously on the machine to which it is best suited. The maximum rate of processing is then determined by the processor that takes the longest amount of time to perform its part of the application task, known as the bottleneck processor [9].

The following problem then emerges. Given a set of m subtasks connected in a chain-like fashion, and a heterogeneous computer system consisting of different machines, find the assignment of subchains of subtasks to processors that minimizes the load on the most heavily loaded processor. If there are only two processors and the program is serial (i.e. even though there are m modules, only one is active on one processor at any one time), this problem can be solved efficiently using the network flow approach pioneered by Stone [53]. If the program is serial and the interconnection structure of the modules is tree-like, it is possible to solve it for any number of processors using a shortest tree algorithm [6]. A number of research sites are actually working with this distributed environment for a two-processor heterogeneous system. Experiments have been conducted to determine whether a task from a distributed program should be executed on the sequential front-end of a Connection Machine CM-200, or whether the total execution time can be reduced by executing the task remotely on the parallel back-end processor [41].
even though there are m modules, only one is active on one processor at one tim e), this problem can be solved efficiently using the network flow approach pioneered by Stone [53]. If the program is serial and the interconnection structure of the m od ules is tree-like, it is possible to solve it for any num ber of processors using a shortest tree algorithm [6 ]. A num ber of research sites are actually working under this distributed environm ent for a two processor heterogeneous system. Experim ents have been conducted in order to determ ine whether a task from a distributed program should be executed on the sequential front-end of a Con nection Machine CM-200, or whether the total execution can be reduced by executing the task rem otely on the parallel back-end processor [41]. 65 If the modules are executable in parallel, it is very difficult to find effi ciently the optim al solution, given a variety of criteria for optim ality. This is because the problem is com putationally equivalent to one or the other of the notorious NP-com plete graph partitioning or m ultiprocessor scheduling prob lems [9]. This explains why most of the work in this field focused on heuristic techniques [15, 57]. However, under certain constraints on the structure of the program and/or the m ulticom puter system, this problem can indeed be solved in polynomial tim e. It has been shown by Bokhari [9] th at a chain structured parallel or pipelined program can be partitioned optim ally over a chain or ring of processors. Nicol [46] and Iqbal [33] have improved the complexity of ear lier algorithm s for partitioning chain structured problems w ith restrictions on the type of mappings a n d / or on the weights assigned to different modules. Iqbal [36] has also solved a num ber of partitioning problems in heterogeneous environments. All these researchers worked under the constraint th at each processor has a contiguous subchain of program modules assigned to it. T hat is, partitions of chains have to be such th at modules i and * + 1 are assigned to same or adjacent processors. They called this the contiguity constraint. Some of the graph theoretical research, conducted in the past, was directed to find a general m ethod for partitioning the vertices of a graph into two sets of prescribed sizes by the removal of m inim um num ber of edges [34, 44]. A num ber of researchers have also developed parallel algorithm s for partitioning some classes of graphs [13]. Approxim ate schemes have also been designed to partition some families of graphs [35]. The solution of such problems are help ful in designing Imprecise Partitioning Schemes for heterogeneous com puter systems [35]. In an imprecise partition, i.e., an (n ± e)-partition of a graph, the error e is a function of the num ber of edges removed, it thus provides a convenient platform to bargain between the accuracy (which determ ines the degree of load balancing), and the cost of partition (which accounts for in tercom m unication). In order to appreciate the usefulness of the approxim ate solutions, one should bear in m ind th at data for the problem being solved is often only approxim ately known as the procedures of code type profiling and 66 analytical benchm arking [38] are still in their infancy. Hence approxim ate so lutions m ay be as meaningful as an exact solution for m any of the practical problems where the extra accuracy of the exact solution is not needed and where the approxim ate solution can be obtained in a relatively short tim e [32]. 
5.1.2 The Partitioning Algorithm

We discuss here a simple algorithm for finding an optimal partitioning of a chain-structured parallel program, belonging to an integrated vision system, over a dual-processor heterogeneous system. A chain-structured program is made up of m modules numbered t_1...t_m, and has an intercommunication pattern such that module t_i is connected only to modules t_{i+1} and t_{i-1}. The optimal assignment of subchains to the two processors is influenced by the following:

• The time required to run a module on a processor (which will vary across processors, as we are working in a heterogeneous environment). We represent the time of execution of module t_i on processor x (y) by w_xi (w_yi); it depends upon the following:
  - Both the type of parallelism of the subtask and the type of machine executing the module or subtask.
  - The actual number of blocks into which the subtask can be decomposed, i.e. the number of processors in a parallel machine that are actually used to execute it.

• The time required for communication between modules t_i and t_{i+1}, provided the two modules are assigned to different processors. This time is represented by c_i and depends upon the following:
  - The amount of intermodule communication, which, in general, can be nonuniform.
  - The speed of the link between the two processors.
  - The amount of data format conversion between the two dissimilar processors. This additional communication overhead is incurred when data originating from one type of machine must be converted into the format of the receiving machine before it is processed. This overhead is peculiar to heterogeneous environments, where machines are most likely to have different data formats, and is known as the data format conversion overhead.

The partitioning problem can be expressed in the following manner: given a set of m modules connected in a chain-like fashion, and a two-processor heterogeneous system, find the assignment of subchains of modules to processors that minimizes max(W_x, W_y), where W_x (W_y) is the load on processor x (y). Our approach to the solution of this problem is to first draw up a doubly weighted assignment graph; a path in this graph corresponds to an assignment of subsequences of modules to processors.

[Figure 5.2: A seven-module chain mapped onto a two-processor heterogeneous system.]

Execution & Communication Costs

In the case of pipelined processing, the time required for a processor to finish executing the work assigned to it is taken to be equal to the sum of the times to execute all of the modules that reside on it. A communication overhead is added to this sum; it accounts for the time taken to transmit the final result to the next processor. Thus, in Figure 5.2, the execution time for processor x is the time required for processor x to execute modules 1, 4, and 5 on a frame of data. The communication time is the time to transmit information from module 1 on processor x to module 2 on processor y, and from module 5 on processor x to module 6 on processor y. It is important to note that while processor x is executing modules 4 and 5 of frame i and module 1 of frame i + 2, processor y is executing modules 6 and 7 of frame i - 1 and modules 2 and 3 of frame i + 1. The time for all processors to finish processing one frame of information each is determined by the most heavily loaded, i.e. the bottleneck, processor.
When dealing with parallel processing, costs are added up in exactly the same fashion. However, execution and intermodule communication need not occur in well-defined phases, as they may be distributed over the entire lifetime of the program. At any one point in time, all processors are working on different parts of the same problem, unlike the pipelined case, where each processor works on a distinct frame of data. Interprocessor communication in the parallel processing case is bidirectional, as processors need to exchange information.

The Doubly Weighted Assignment Graph

In this section we discuss the concept of a doubly weighted assignment graph G that contains all information about the execution and communication times of the modules. There are two weights associated with each edge of this graph:

• a Δx weight, corresponding to the additional (or incremental) load assigned to the x processor, and
• a Δy weight, corresponding to the additional (or incremental) load assigned to the y processor.

Thus, instead of a single weight on each edge as in traditional weighted graphs, we have an ordered pair of weights on each edge. As usual, a path P between two distinguished nodes start and end in this graph is composed of a sequence of edges e_1, e_2, e_3, .... There are two kinds of sum weights associated with each path P: one is the familiar sum of all Δx(e_i); the other is the sum of all Δy(e_i). A path in this graph corresponds to an assignment of subsequences of modules to processors. Thus a path for which the maximum of (Σ Δx(e_i), Σ Δy(e_i)) is minimal corresponds to the optimal assignment of modules on the two-processor heterogeneous system.

The Structure of the Assignment Graph

There are two layers in the graph G: the x layer corresponds to the x processor, and the y layer corresponds to the y processor. Both layers contain all the modules of the application program, connected in a chain-like fashion. The j-th node in the x (y) layer of this graph corresponds to module t_j in the application program and is represented by t_j^x (t_j^y). The start node is connected to every node in the x layer as well as in the y layer except t_1^x (t_1^y). Every node t_j^x (t_j^y) in the x (y) layer is connected to each node t_k^y (t_k^x) in the y (x) layer, provided j < k <= m. Every node in the x (y) layer is connected to the end node.

The Labelling Technique

An edge (start, t_j^x), 1 < j <= m, i.e. an edge between the start node and a node t_j^x in the x layer, corresponds to a partitioning p in which modules t_1...t_{j-1} are assigned to processor x, while at least module t_j is assigned to processor y. The Δx and Δy weights of this edge are given below:

Δx(start, t_j^x) = Σ_{i=1}^{j-1} w_xi + c_{j-1}
Δy(start, t_j^x) = c_{j-1}

An edge (t_j^x, t_k^y), where j < k <= m, i.e. an edge from a node in the x layer to another node in the y layer, corresponds to a partitioning in which modules t_j...t_{k-1} are assigned to processor y, while at least module t_k is assigned to processor x. The Δx and Δy weights associated with this edge are given below:

Δx(t_j^x, t_k^y) = c_{k-1}
Δy(t_j^x, t_k^y) = Σ_{i=j}^{k-1} w_yi + c_{k-1}

An edge (t_j^x, end), i.e. an edge between a node t_j^x in the x layer and the end node, corresponds to a partitioning in which modules t_j...t_m are assigned to processor y. The corresponding Δx and Δy weights are listed below:

Δx(t_j^x, end) = 0
Δy(t_j^x, end) = Σ_{i=j}^{m} w_yi

Symmetric labels apply to the edges incident on nodes of the y layer.
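The labelling can be exercised with a short Python sketch (my own encoding): given the execution costs wx, wy and communication costs c as 1-indexed lists (index 0 unused), the function below sums the Δx and Δy weights along the path described by its switch points, where cuts[r] is the first module of the (r+1)-th subchain and the first subchain runs on machine first. With the costs of Example 1 below, the Figure 5.2 partition (cuts [2, 4, 6], starting on x) yields the loads (9, 10).

def path_loads(wx, wy, c, cuts, first="x"):
    m = len(wx) - 1                        # modules are numbered 1..m
    bounds = [1] + cuts + [m + 1]
    Wx = Wy = 0
    side = first
    for r in range(len(bounds) - 1):
        lo, hi = bounds[r], bounds[r + 1]
        w = wx if side == "x" else wy
        load = sum(w[lo:hi])               # execution cost of the subchain
        comm = c[hi - 1] if hi <= m else 0 # both machines incur the link cost
        if side == "x":
            Wx, Wy = Wx + load + comm, Wy + comm
        else:
            Wx, Wy = Wx + comm, Wy + load + comm
        side = "y" if side == "x" else "x"
    return Wx, Wy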
Example 1. Consider the seven-module chain shown in Figure 5.2. Assume that w_xi = 1, w_yi = 4 for i = 1, 4, and 5; w_xi = 4, w_yi = 1 for i = 2, 3, 6, and 7; and c_i = 2 for 1 <= i <= 6. In short, the grey modules have an execution cost of 1 (4) on processor x (y), the white modules have execution costs of 4 (1) on processor x (y), and the communication cost is uniform and equal to 2. The assignment graph corresponding to the seven-module chain is shown in Figure 5.3; the top layer is the x layer, while the y layer is shown at the bottom. The partitioning of the chain-structured parallel program shown in Figure 5.2 is represented by a path (shown in bold) between the start node and the end node in the assignment graph of Figure 5.3.

The edge between the start node and module 2 in the x layer corresponds to a partitioning in which module 1 is assigned to processor x, while at least module 2 is assigned to processor y. The loads assigned to processors x and y corresponding to this partitioning are 3 and 2 respectively. The edge between module 2 in the x layer and module 4 in the y layer corresponds to a partitioning in which modules 2 and 3 are assigned to processor y, and thus the ordered pair (2, 4) is associated with this edge. The sum of the Δx weights of all edges encountered in the path shown in Figure 5.3 is 9, while the corresponding Δy sum is equal to 10. Thus the loads assigned to processors x and y corresponding to the partitioning shown in Figure 5.2 are 9 and 10 respectively.

[Figure 5.3: The layered assignment graph for the seven-module chain (x layer on top, y layer at the bottom, modules 1-7 in each layer) and several paths between the start and the end node.]

Discussion

It is now clear that there is a distinct path P between the start node and the end node in the doubly weighted assignment graph corresponding to each assignment p of the chain-structured parallel program over the two-processor system, and:

• The sum of all the Δx weights of the edges encountered in path P corresponds to the load assigned to processor x in the partitioning p of the chain-structured program; the same holds for the sum of the Δy weights. Thus W_x, the load assigned to processor x in assignment p, is in fact equal to the sum weight Σ Δx(e_i) of path P in the assignment graph. Similarly, W_y, the load assigned to processor y in the partitioning, is equal to the sum weight Σ Δy(e_i) of path P.

• The assignment that minimizes the load on the most heavily loaded processor can be found by finding a path P for which max(Σ Δx(e_i), Σ Δy(e_i)) is minimal.

• The incoming degree of a node t_i^x in the assignment graph is 2^{i-2}; the same is true for a node t_i^y in the y layer. The degree of node end is thus 2m - 2, and the total number of distinct paths between the start and the end nodes is precisely 2^m - 2. If m is small, it may be convenient to consider all possibilities in order to find the optimal assignment. If, however, m is large, then more efficient methods must be used to solve this problem.
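For small m, the exhaustive search suggested above is easy to write down. The following Python sketch (mine, reusing path_loads from the earlier sketch, and also enumerating the two trivial all-on-one-machine assignments) scans every choice of switch points and starting machine and keeps the path minimizing the bottleneck load max(W_x, W_y), here with the costs of Example 1:

from itertools import combinations

def best_partition(wx, wy, c):
    m = len(wx) - 1
    best = None
    for first in ("x", "y"):
        for r in range(m):                          # number of switch points
            for cuts in combinations(range(2, m + 1), r):
                Wx, Wy = path_loads(wx, wy, c, list(cuts), first)
                if best is None or max(Wx, Wy) < max(best[0], best[1]):
                    best = (Wx, Wy, list(cuts), first)
    return best

wx = [0, 1, 4, 4, 1, 1, 4, 4]                       # Example 1 costs, 1-indexed
wy = [0, 4, 1, 1, 4, 4, 1, 1]
c  = [0, 2, 2, 2, 2, 2, 2]
print(best_partition(wx, wy, c))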
The Approximate Assignment Scheme

It has been shown in the last section that the number of distinct paths between the start node and a node $t_i^x$ in the assignment graph is $2^{i-2}$. Each such path corresponds to an assignment in which the modules $t_1 \ldots t_{i-1}$ have already been assigned in some fashion while the assignment of the remaining modules, $t_i \ldots t_m$, is yet to be made.

Let $\langle W_x(i), W_y(i) \rangle$ represent the ordered pair associated with a path terminating at node $t_i^x$, where $W_x(i)$ ($W_y(i)$) represents the sum of all $\Delta_x$ ($\Delta_y$) weights of all the edges in the path between the start node and $t_i^x$. Thus $W_x(i)$ ($W_y(i)$) is in fact the load assigned to processor $x$ ($y$) due to the assignment of modules $t_1 \ldots t_{i-1}$ over the two-processor system. It is important to note that each path between the start node and $t_i^x$ may have a distinct ordered pair, and thus, in general, each node $t_i^x$ can have as many as $2^{i-2}$ different $\langle W_x(i), W_y(i) \rangle$ ordered pairs.

In the Approximate Assignment Scheme, described below, we try to restrict the total number of distinct paths between the start node and any node in the $x$ or $y$ layer, thereby drastically reducing the size of our search space. Using our approximate scheme it is thus possible to efficiently find an approximate partition of the chain-structured program with the guarantee that the maximum percentage error in the load assigned to the bottleneck processor is within a fixed bound.

An upper bound on the maximum load which can be assigned to processor $x$ in any assignment is reached when all modules are processed sequentially on processor $x$, assuming that $w_{xi} > 0$, where $1 \le i \le m$. If this upper bound is represented by $W_T$ then

$$W_T = \sum_{i=1}^{m} w_{xi}$$

Let us resolve $W_T$ to an accuracy of $\epsilon$, i.e., two successive permissible levels for the load on processor $x$ are separated by $\epsilon$. In other words, $W_x(i)$ is restricted to have only $\lceil W_T/\epsilon \rceil$ distinct values in the range of zero and $W_T$.

This operation of restricting the number of possible paths between the start node and any other node in the $x$ or $y$ layer in the assignment graph is performed by the procedure Restrict(p). The input parameters of the procedure Restrict are: (1) the doubly weighted assignment graph, and (2) a selected node $p$, where $p$ can be either $t_i^x$ or $t_i^y$ in the assignment graph. The procedure looks at each incoming path from the start node to node $p$. It rejects every incoming path $P_2$ in comparison with an incoming path $P_1$ provided:

$$W_{x1}(i) \le W_{x2}(i) \quad \text{and} \quad W_{y1}(i) \le W_{y2}(i)$$

where $\langle W_{x1}(i), W_{y1}(i) \rangle$ ($\langle W_{x2}(i), W_{y2}(i) \rangle$) is the ordered pair associated with the incoming path $P_1$ ($P_2$) at node $p$. Out of all the remaining paths between the start node and node $p$ for which the actual value of $W_x(i)$ is in between two successive permissible levels, we select the one with minimal value of $W_y(i)$, and reject all others (see Lemma 1). In this manner we restrict the number of ordered pairs associated with each node $t_i^x$. Similarly, all paths between the start node and $t_i^y$ are also restricted to a maximum of $\lceil W_T/\epsilon \rceil$ distinct paths. The number of paths between the start and the end node is limited by the procedure Limit as described below. It is obvious that the procedure Limit calls the procedure Restrict $2(m-1)$ times in order to restrict the total number of paths in the assignment graph.

Procedure Limit(A Doubly Weighted Assignment Graph)
begin
    For i = 1 to (m - 1) do
    begin
        Restrict(t_i^x)
        Restrict(t_i^y)
    end
end

There would be at most $\lceil W_T/\epsilon \rceil$ paths between the start node and any node in the $x$ or $y$ layer after the application of the procedure Limit. Thus there will be a maximum of $2m \lceil W_T/\epsilon \rceil$ distinct paths between the start node and the end node in the assignment graph.
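The whole scheme can be sketched compactly in Python, assuming the same list-based cost encoding as before. The restrict step below keeps, within each $\epsilon$-wide level of $W_x$ values, only the pair with the smallest $W_y$, which is the pruning rule just described; everything else is a straightforward dynamic program over the layered graph. This is our own illustration of the method, not code from the thesis.

    def approx_chain_partition(wx, wy, c, eps):
        # Returns the approximate minimal bottleneck load max(Wx, Wy).
        m = len(wx)

        def restrict(pairs):
            # Keep one ordered pair per eps-wide level of Wx: the one
            # with minimal Wy.  Bounds each list by ceil(W_T/eps) pairs.
            best = {}
            for Wx, Wy in pairs:
                b = int(Wx // eps)
                if b not in best or Wy < best[b][1]:
                    best[b] = (Wx, Wy)
            return list(best.values())

        # pairs_x[j] (pairs_y[j]): pruned <Wx, Wy> pairs of the paths
        # terminating at node t_j in the x (y) layer, 1-based labels.
        pairs_x = [[] for _ in range(m + 1)]
        pairs_y = [[] for _ in range(m + 1)]
        for j in range(2, m + 1):
            cc = c[j - 2]                       # conversion cost c_{j-1}
            px = [(sum(wx[:j - 1]) + cc, cc)]   # edge (start, t_j^x)
            py = [(cc, sum(wy[:j - 1]) + cc)]   # edge (start, t_j^y)
            for k in range(2, j):
                dx = sum(wx[k - 1:j - 1]) + cc  # edge (t_k^y, t_j^x)
                px += [(Wx + dx, Wy + cc) for Wx, Wy in pairs_y[k]]
                dy = sum(wy[k - 1:j - 1]) + cc  # edge (t_k^x, t_j^y)
                py += [(Wx + cc, Wy + dy) for Wx, Wy in pairs_x[k]]
            pairs_x[j] = restrict(px)
            pairs_y[j] = restrict(py)

        finals = []                             # edges into the end node
        for j in range(2, m + 1):
            tail_y = sum(wy[j - 1:])            # t_j..t_m on y, from t_j^x
            tail_x = sum(wx[j - 1:])            # t_j..t_m on x, from t_j^y
            finals += [(Wx, Wy + tail_y) for Wx, Wy in pairs_x[j]]
            finals += [(Wx + tail_x, Wy) for Wx, Wy in pairs_y[j]]
        return min(max(Wx, Wy) for Wx, Wy in finals)

Since the error is bounded by $m\epsilon$, for integer costs and any eps well below $1/m$ the value returned coincides with the exact optimum found by the exhaustive search above.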
A path in which $\max(W_x, W_y)$ is minimal can thus easily be found in time proportional to $O(m \lceil W_T/\epsilon \rceil)$, the total time being limited by the complexity of the procedure Limit, which is $O(m^2 \lceil W_T/\epsilon \rceil)$. This guarantees that the maximum difference between the load on the bottleneck processor in the approximate assignment and the one in the optimal assignment is at most equal to $m\epsilon$ (see Lemma 2). If the relative error bound for the Approximate Scheme is $\epsilon$ then the time complexity of our algorithm is bounded by $O(m^3(1/\epsilon))$. The Approximate Assignment Scheme is thus a fully polynomial time approximation scheme, in which the time complexity is a polynomial function of the size of the problem as well as of the inverse of the percentage error, i.e., $1/\epsilon$.

Lemma 1 Assume that $P_1$ and $P_2$ are two paths between the start node and the end node in the assignment graph with the following properties:

- Path $P_1$ as well as $P_2$ consists of two subpaths: one from the start node to $t_i^x$, and the other from $t_i^x$ to the end node.
- The subpath from the start node to $t_i^x$ in path $P_1$ is different from the corresponding subpath in path $P_2$. Let $\langle W_{x1}(i), W_{y1}(i) \rangle$ represent the ordered pair associated with the subpath in $P_1$. Similarly, the subpath in $P_2$ between the start node and $t_i^x$ is represented by $\langle W_{x2}(i), W_{y2}(i) \rangle$.
- Paths $P_1$ and $P_2$ share a common subpath from node $t_i^x$ to the end node.

Under the conditions stated above, the load on the bottleneck processor corresponding to path (or partitioning) $P_1$ would always be less than or equal to the corresponding load in partitioning $P_2$ provided:

$$W_{x1}(i) \le W_{x2}(i) \quad \text{and} \quad W_{y1}(i) \le W_{y2}(i)$$

Lemma 2 Suppose that the optimal path from the start node to the end node consists of the edges $e_1(start, t_j^x), e_2(t_j^x, t_k^y), e_3(t_k^y, t_l^x), \ldots$. Also assume that at each node $t_j^x, t_k^y, t_l^x, \ldots$, instead of selecting the ordered pair $\langle W_{x,opt}(l), W_{y,opt}(l) \rangle$ corresponding to the optimal path, we select a path (we call this the approximate path) with ordered pair equal to $\langle W_x(l), W_y(l) \rangle$ such that

$$W_x(l) \le W_{x,opt}(l) + \delta$$
and
$$W_y(l) \le W_{y,opt}(l)$$

Under the conditions stated above, the load on the bottleneck processor in the approximate path would be no larger than the corresponding load in the optimal assignment by an amount equal to $m\delta$, where $m$ is the number of modules in the chain-structured program.

Example 2 Consider the assignment graph (Figure 5.4) for the seven-module chain shown in Figure 5.2. Paths $P_1$ and $P_2$ are two paths between the start and the end nodes in the assignment graph, shown in Figure 5.4 (bottom) and (top) respectively. The corresponding partitionings of the chain are also shown in Figure 5.4. It is important to note that the subpath in $P_1$ from the start node to node 3 in the $x$ layer is different from the corresponding subpath in $P_2$. The two paths, however, share a common subpath from node 3 in the $x$ layer to the end node. The ordered pair $\langle W_x(3), W_y(3) \rangle$ associated with each subpath is also indicated in Figure 5.4. As $W_{x1}(3) \le W_{x2}(3)$ and $W_{y1}(3) \le W_{y2}(3)$, it can be deduced that the load assigned to processor $x$ or $y$ in the entire path (or partitioning) $P_1$ would be less than the corresponding loads in path $P_2$.
Thus when the procedure Restrict looks at the two ordered pairs associated with the two paths terminating at module 3 in the $x$ layer, it immediately recognizes that one of the subpaths, when extended to the end node, would always result in a costlier assignment (as compared to the other path), and thus should be rejected immediately.

Suppose there is a third path ($P_3$) between the start and the end node in the assignment graph of Figure 5.4, and it also passes through node 3 in the $x$ layer. Assume that $W_{x3}(3) = 6.999$ and $W_{y3}(3) = 4$. Thus $W_{x1}(3) = W_{x3}(3) + 0.001$, and $W_{y1}(3) \le W_{y3}(3)$. It is obvious that if the procedure Restrict rejects path $P_3$ in comparison with path $P_1$ then the total load assigned to processor $x$ in $P_1$ would be larger than the load assigned to $x$ in $P_3$ by at most 0.007, while the load assigned to processor $y$ in $P_1$ would be less than the corresponding load in $P_3$. The above statement will be true provided $W_{x1}(i) \le W_{x3}(i) + 0.001$ and $W_{y1}(i) \le W_{y3}(i)$ for each common node $i$ between paths $P_1$ and $P_3$ in the assignment graph.

[Figure 5.4: Path $P_1$ (bottom) and path $P_2$ (top).]

In this section we have described an efficient algorithm which can partition a chain-structured application program consisting of several heterogeneous modules onto a two-processor system. Our approach takes care of the heterogeneous nature of each module as well as the conversion overheads involved when modules residing on different types of machines communicate with each other. The results of this research are directly applicable for optimally partitioning tasks in an integrated vision system consisting of computationally diverse modules onto a dual-processor heterogeneous system. Heterogeneous sites consisting of two-processor systems can directly benefit from this research, and can use our crossover strategy, which allows application programs to be optimally partitioned and run simultaneously on more than one supercomputer at a time. It is possible to extend this approach to three- or four-processor heterogeneous computer systems, and currently we are working to apply similar techniques to related problems with less restricted structures.

5.1.3 Vision Applications

Most vision computations are symbolic in nature. Recently, there has been considerable work done in understanding the inherent parallel complexity of these generic problems as well as in developing efficient parallel solutions to these problems on homogeneous parallel machines [37]. However, not much is known about exploiting the embedded heterogeneous parallelism in such tasks. In the following we have selected a few sample vision tasks and elaborate on the "heterogeneous computing" requirements of such tasks.

Integrated Motion Estimation

There are four major steps in integrated motion estimation [55]. The goal of the first step is to detect points, lines, and regions in each frame. Optical flow is also computed between each pair of adjacent frames. All these subtasks fall in the category of low-level vision. The input to these subtasks is an image and the output is a list of desired features. These subtasks can be further subdivided depending upon the granularity of the computations. The second step establishes correspondence between the features in each pair of adjacent images. In the third step, the detected correspondences are used to obtain a segmentation of the scene.
The objective of the fourth step is to determine the integrated motion and structure using the feature correspondences established in the third step.

Scene Description

The scene description task consists of three major steps: feature extraction, object recognition, and inference-based structure description. As described below, feature extraction and object recognition themselves consist of several subtasks. One approach to scene description is depicted in Figure 5.5.

[Figure 5.5: A pictorial representation of a Scene Description Task.]

Object Recognition

In the object recognition task, the main steps include feature extraction and matching. However, recognition approaches vary depending upon the feature types to be matched, the search space used for matching, and the search techniques employed.

Grouping Based on Topological Relations

This task has been divided into four steps [52]: preprocessing, local groupings, perceptual groupings, and high-level groupings. In preprocessing, curves are detected. In the local grouping and perceptual grouping steps, curves and line segments are grouped into more structured features. Such groupings are governed by geometric relationships, and the computations are symbolic in nature. Finally, the results from the previous steps are further refined into high-level groupings to form topological graphs which are used as directives for matching processes [52].

In the above tasks, processing starts on raw images as input and moves up the vision pyramid. The needed processing power and the type of computations vary along this pyramid. Initially, the input consists of arrays of image pixels; in the later stages the input is in the form of lists, graphs, and other structures. The parallelism is fine grained, and the computations are regular and local in nature. In the later stages of processing, the data structures become more complex and the amount and type of parallelism changes. In general, in high-level vision the computations are symbolic. The characteristics of these computations include complex data structures, search and pattern matching operations, and irregular data flow. In order to cope with such a scenario, heterogeneous computing seems to be well suited.

5.2 Partitioning Tree Structured Problems

Our approach to the solution of this problem is very similar to the one described in the previous section for partitioning chain-structured problems. A tree is similar to a chain in the sense that by removing a single edge it can be divided into two parts. There are, however, important differences between the two structures, and these should be kept in mind while designing the new algorithm. We shall emphasize the similarities as well as the differences as we describe the partitioning algorithm.

We first traverse the given tree and place consecutive labels on the nodes of the tree visited according to the procedure Label described below. The resulting path (of traversal) is treated as a chain of modules, which is then partitioned by drawing a doubly weighted assignment graph. In order to partition the tree-structured parallel program using our previous techniques of partitioning a chain, we introduce the concept of critical nodes in the next section.
It has been shown that the time complexity of the modified partitioning algorithm is proportional to $2^{c_{max}}$, where $c_{max}$ is the maximum number of critical nodes affecting a node in the assignment graph. By using an intelligent scheme for traversing (and labelling) the nodes of the tree we limit the value of $c_{max}$ to $\log_2 m$.

5.2.1 The Labelling of the Tree

The modules of a tree-structured problem are labelled by the following procedure Label. Last is a local variable used by the procedure, and it represents the last label assigned to a node; $j$ is another local variable, and it represents the last node labelled. A tree is divided into two parts by cutting an edge $(j, k)$, where nodes $j$ and $k$ are adjacent to each other [34, 35]. One half of the tree, which includes node $j$, is known as $subtree_{jk}$, while the other half, which includes node $k$, is called $subtree_{kj}$. The number of nodes in a subtree is denoted by $nodes[subtree]$. The key idea behind this labelling technique is as follows: the next node to be labelled would be a node $k$ adjacent to node $j$ such that $nodes[subtree_{kj}]$ is minimal. The label of node $k$ would be Last + 1.

Procedure Label(A tree of m nodes, m >= 2)
begin
1. Start with any leaf node i, and label it with 1, i.e., Label(i) := 1, Last := 1, and j := i.
2. Find a node k adjacent to node j which is not yet labelled; Label(k) := Last + 1; j := k; Last := Label(k). If all the m nodes are labelled then return.
3. Let d denote the degree of node j;
   (a) If d = 1 then go to step (5).
   (b) If d = 2 then find a node k adjacent to node j which is not yet labelled.
   (c) If d > 2 then find a node k out of all unlabelled nodes adjacent to node j such that nodes[subtree_kj] is minimal. If no such node k exists then go to step (5).
4. Label(k) := Last + 1; j := k; Last := Label(k). If all the m nodes are labelled then return, otherwise go to step (3).
5. Backtrack to node b, where b is the last labelled node with degree larger than two; j := b; go to step (3).
end.
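A compact Python rendering of procedure Label is sketched below, assuming the tree is given as an adjacency dictionary and the traversal starts from any leaf; the helper subtree_size(k, j) computes $nodes[subtree_{kj}]$ by a traversal blocked at $j$. This is our own illustration, not code from the thesis.

    def label_tree(adj, leaf):
        def subtree_size(k, j):
            # nodes[subtree_kj]: size of the half containing k
            # when the edge (j, k) is cut.
            seen, stack, n = {j, k}, [k], 1
            while stack:
                for w in adj[stack.pop()]:
                    if w not in seen:
                        seen.add(w)
                        stack.append(w)
                        n += 1
            return n

        label = {leaf: 1}
        branches = []                  # labelled nodes of degree > 2
        j = leaf
        while len(label) < len(adj):
            cands = [k for k in adj[j] if k not in label]
            if not cands:              # step 5: backtrack
                j = branches.pop()
                continue
            if len(adj[j]) > 2:        # step 3(c): branch node
                branches.append(j)
                k = min(cands, key=lambda v: subtree_size(v, j))
            else:                      # steps 2 and 3(b)
                k = cands[0]
            label[k] = len(label) + 1
            j = k
        return label

    # A small hypothetical tree: 'a' is a leaf, 'c' has degree 3.
    adj = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b', 'd', 'e'],
           'd': ['c'], 'e': ['c', 'f'], 'f': ['e']}
    print(label_tree(adj, 'a'))   # {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6}

Note how, at the branch node 'c', the smaller subtree ('d', of size 1) is labelled before the larger one ('e'-'f'); this is the choice that later bounds the number of critical nodes.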
5.2.2 The Doubly Weighted Assignment Graph

There are two layers in the graph $G$: the $x$ layer corresponds to the $x$ processor, while the $y$ layer corresponds to the $y$ processor. Both layers contain all the nodes (i.e., modules) of the tree, with labels $1 \ldots m$, connected in a chain-like fashion. The $j$th node in the $x$ ($y$) layer of this graph corresponds to the module with label $j$ in the application program and is thus represented by $t_j^x$ ($t_j^y$). The start node is connected to every node in the $x$ layer as well as in the $y$ layer except $t_1^x$ ($t_1^y$). Every node $t_j^x$ ($t_j^y$) in the $x$ ($y$) layer is connected to each node $t_k^y$ ($t_k^x$) in the $y$ ($x$) layer provided $j < k \le m$. Every node in the $x$ ($y$) layer is connected to the end node. A path in this graph between the start and the end node corresponds to an assignment of subsequences of modules to processors.

An edge $(start, t_j^x)$, $1 < j \le m$, i.e., an edge between the start node and a node $t_j^x$ in the assignment graph, corresponds to a partitioning $p$ in which the modules with labels $1 \ldots j-1$ in the application program are assigned to processor $x$, while at least module $j$ is assigned to processor $y$. The $x$ and $y$ weights of this edge are given below:

$$\Delta_x(start, t_j^x) = \sum_{i=1}^{j-1} w_{xi} + c_j$$
$$\Delta_y(start, t_j^x) = c_j$$

An edge $(t_j^x, t_k^y)$, where $j < k \le m$, i.e., an edge from a node in the $x$ layer to another node in the $y$ layer in the assignment graph, corresponds to a partitioning in which modules $j+1 \ldots k$ are assigned to processor $x$ or $y$ in the following manner (a sketch of these two rules is given below, after the end-node case): (1) For each $i$, where $j+1 \le i \le k-1$, starting from $i = j+1$, find an adjacent node which has already been assigned to either the $x$ processor or the $y$ processor. If the adjacent node is assigned to the $x$ ($y$) processor then module $i$ should also be assigned to the $x$ ($y$) processor, i.e., the two modules should be assigned to the same processor. (2) Find a node adjacent to node $k$ which has already been assigned to a processor. Node $k$ is assigned to $x$ ($y$) if the adjacent node is assigned to processor $y$ ($x$), i.e., the two nodes are assigned to different processors. The $x$ and $y$ weights associated with this edge can then be evaluated.

An edge $(t_j^x, end)$, i.e., an edge between a node $t_j^x$ and the end node in the assignment graph, corresponds to a partitioning in which the nodes $j+1 \ldots m$ are assigned to processor $x$ or $y$ in the following manner: for each node $i$, where $j+1 \le i \le m$, starting from $i = j+1$, find an adjacent node which has already been assigned to either the $x$ processor or the $y$ processor. If the adjacent node is assigned to the $x$ ($y$) processor then node $i$ should also be assigned to the $x$ ($y$) processor, i.e., the two nodes should be assigned to the same processor. The corresponding $x$ and $y$ weights associated with this edge can then be found.
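The two assignment rules can be sketched in Python as follows, assuming that `assigned` already maps the labels 1..j to 'x' or 'y' (including module j, whose side was fixed by the "at least module j is assigned to ..." clause of the incoming edge) and that `adj` maps each label to the labels of its tree neighbours. Because of the way the tree is labelled, each module has exactly one neighbour with a smaller label, so the look-up below is unambiguous. Again, this is our own illustration, not code from the thesis.

    def apply_edge_rules(adj, assigned, j, k):
        # Rules for an edge (t_j, t_k) of the assignment graph, j < k.
        for i in range(j + 1, k):
            nbr = next(w for w in adj[i] if w in assigned)
            assigned[i] = assigned[nbr]      # rule (1): same processor
        nbr = next(w for w in adj[k] if w in assigned)
        assigned[k] = 'y' if assigned[nbr] == 'x' else 'x'   # rule (2)
        return assigned

For an edge $(t_j, end)$, only rule (1) is applied, for $i = j+1, \ldots, m$. The $\Delta$ weights of the edge can then be accumulated from the execution costs of the newly assigned modules on their respective processors, plus the conversion overheads of the tree edges that the assignment cuts, as in the chain case.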
Example 3 Consider the tree-structured parallel program shown in Figure 5.6. Assume that the black modules have an execution cost of 1 (4) on processor $x$ ($y$), while the white modules have execution costs equal to 4 (1) on processor $x$ ($y$), and the communication cost is assumed to be uniform and equal to 2. The tree is labelled using the labelling technique described in the previous section. Note that once the tree is labelled it can be treated as a chain structure with certain special, or critical, nodes, defined in the next section. The assignment graph corresponding to the eight-module tree is shown in Figure 5.6 (bottom); the top layer is the $x$ layer, while the $y$ layer is shown at the bottom. Both layers in the assignment graph contain all 8 modules of the tree, connected in a chain-like fashion.

The partitioning of the tree-structured parallel program shown in Figure 5.6 (top) is represented by a path between the start node and the end node in the assignment graph shown at the bottom of the figure. The edge between the start node and module 3 in the $x$ layer corresponds to a partitioning in which modules 1 and 2 are assigned to processor $x$, while at least module 3 is assigned to processor $y$. The loads assigned to processors $x$ and $y$ corresponding to this partitioning will be 7 and 2 respectively. The edge between module 3 in the $x$ layer and module 5 in the $y$ layer corresponds to a partitioning in which modules 3 and 4 are assigned to processor $y$, and thus the ordered pair (2, 7) is associated with this edge. The edge between module 5 in the $y$ layer and the end node corresponds to a partitioning in which module 5 is assigned to processor $x$ while modules 6, 7, and 8 are assigned to processor $y$. The sum of all $x$ weights of all edges encountered in the path shown in Figure 5.6 (bottom) is 10, while the corresponding $y$ sum is equal to 12. Thus the loads assigned to processors $x$ and $y$ corresponding to the partitioning shown in Figure 5.6 (top) will be 10 and 12 respectively.

[Figure 5.6: An eight-module tree mapped onto a two-processor heterogeneous system.]

5.2.3 The Notion of Critical Nodes

In the doubly weighted assignment graph designed for a chain-structured parallel program, we have noted that the load on the bottleneck processor corresponding to path $P_1$ would always be less than or equal to the corresponding load in partitioning $P_2$ provided $W_{x1}(i) \le W_{x2}(i)$ and $W_{y1}(i) \le W_{y2}(i)$, where the partitionings (or paths) $P_1$ and $P_2$ are defined in Lemma 1. This useful property was exploited to restrict the total number of distinct paths between the start node and any node, thereby drastically reducing the size of our search space. Thus it became possible for us to efficiently find an approximate partition of the chain-structured parallel or pipelined program with the guarantee that the maximum percentage error in the load assigned to the bottleneck processor is within a fixed bound.

The above-mentioned property for a chain-structured program, as described in Lemma 1, does not hold as such in the doubly weighted assignment graph designed for a tree-structured parallel program. Consider, for example, the tree-structured parallel program shown in Figure 5.7. Partitionings $P_1$, $P_2$, and the corresponding paths in the doubly weighted assignment graph are also shown in the figure. Note that the subpath in $P_1$, shown in bold, from the start node to node 5 in the $x$ layer is different from the corresponding subpath in $P_2$. The two paths, however, share a common subpath from node 5 in the $x$ layer to the end node. It is important to appreciate here that the load assigned to processor $x$ or $y$ in the entire path (or partitioning) $P_1$ would not necessarily be less than the corresponding loads in path $P_2$ even if $W_{x1}(5) \le W_{x2}(5)$ and $W_{y1}(5) \le W_{y2}(5)$. This is because node 3 in the tree-structured parallel program is assigned to processor $y$ in partitioning $P_1$ while it is assigned to processor $x$ in partitioning $P_2$. This node is critical for node 5 because it decides the future of the nodes with labels 6, 7, and 8, which are yet unassigned. It is important to note that these nodes (with labels 6, 7, and 8) are assigned to processor $y$ in $P_1$, while they are assigned to processor $x$ in $P_2$, in spite of the fact that both paths $P_1$ and $P_2$ share a common subpath from node 5 in the $x$ layer to the end node.

[Figure 5.7: An eight-module tree is partitioned onto a two-processor heterogeneous system.]

If, however, node 3 in the tree is assigned to the same processor in both partitionings $P_1$ and $P_2$, then the load assigned to the bottleneck processor corresponding to $P_1$ would always be less than or equal to the corresponding load in $P_2$ provided $W_{x1}(5) \le W_{x2}(5)$ and $W_{y1}(5) \le W_{y2}(5)$. This extra constraint requires us to modify Lemma 1, and we do so by presenting the concept of critical nodes. Note that node 3 was a critical node for node 5 of the tree-structured parallel program as shown in Figure 5.7.

Now consider the same tree-structured program as shown in Figure 5.7, with two different partitionings $P_3$ and $P_4$ as shown in Figure 5.8. The corresponding paths in the doubly weighted assignment graph are also shown at the bottom of the figure.
It is important to appreciate here that the load assigned to processor $x$ or $y$ in the entire path (or partitioning) $P_3$ would be less than the corresponding loads in path $P_4$ if $W_{x3}(7) \le W_{x4}(7)$ and $W_{y3}(7) \le W_{y4}(7)$. This is because there is no critical node for node 7 in the doubly weighted assignment graph, and consequently Lemma 1 holds for such a node.

[Figure 5.8: The eight-module tree, as shown in Figure 5.7, is partitioned onto a two-processor heterogeneous system.]

Once a tree is labelled using the labelling procedure described before, the critical nodes, and the nodes affected by these critical nodes, are determined using the simple procedure described in Lemma 3.

Lemma 3
1. For each node (or module), labelled as $i$, with degree 3 or more in the tree, find a node adjacent to $i$ with a label $j$ such that $j$ is maximal.
2. Node $i$ is a critical node affecting only those nodes which are not adjacent to node $i$ and have labels from $i+2$ to $j-1$.

Example 4 Consider the tree-structured parallel program shown in Figure 5.9. Note that the tree is labelled using the labelling technique described in the previous section. We shall now find the critical nodes and also the nodes affected by these critical nodes. There are only four nodes with degree 3 in the tree-structured program, and thus there are only four critical nodes, with labels equal to 3, 8, 11, and 18.

Nodes affected by critical node 3: There is only one node affected by this critical node, and that is the node with label 5.

Nodes affected by critical node 8: These are the nodes with labels equal to 10, 11, 12, 13, 14, and 15. It is important to note that node 11 is itself a critical node. Thus a sort of nesting of critical nodes is witnessed here.

Nodes affected by critical node 11: The node with label 13 is affected by critical node 11. It is interesting to find that node 13 is affected by two critical nodes: one is node 11 and the other is node 8.

Nodes affected by critical node 18: There is only one node affected by this critical node, and that is the node with label 20.

Note that there are certain nodes which are not affected by any critical node; the nodes with labels 1, 2, 3, 4, 6, 7, 8, 9, 16, 17, 18, 19, 21, and 22 come under this category. On the other hand, there are nodes which are affected by more than one critical node, e.g., the node with label 13.

[Figure 5.9: A labelled tree with critical nodes shown in black.]

Lemma 4 Assume that $P_1$ and $P_2$ are two paths between the start node and the end node in the assignment graph as defined in Lemma 1, with the extra constraints:

- Each critical node $c$ affecting node $i$ of the tree-structured program is assigned in a similar fashion in both partitionings $P_1$ and $P_2$, i.e., if $c$ is assigned to processor $x$ ($y$) in $P_1$ then it should be assigned to $x$ ($y$) in $P_2$ as well.
- The node with label $i$ in the tree-structured program is assigned in a similar fashion in both partitionings $P_1$ and $P_2$, i.e., if $i$ is assigned to processor $x$ ($y$) in $P_1$ then it should be assigned to $x$ ($y$) in $P_2$ as well.

Under the conditions stated above, the load on the bottleneck processor corresponding to path (or partitioning) $P_1$ would always be less than or equal to the corresponding load in partitioning $P_2$ provided $W_{x1}(i) \le W_{x2}(i)$ and $W_{y1}(i) \le W_{y2}(i)$.
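Before bounding the number of critical nodes (Lemma 5 below), note that the detection rule of Lemma 3 translates directly into code. The sketch below, our own illustration, returns for every node of degree 3 or more the set of labels it affects, assuming the nodes are identified by their labels 1..m and `adj` is the adjacency dictionary of the labelled tree.

    def critical_nodes(adj):
        affected = {}
        for i, nbrs in adj.items():
            if len(nbrs) >= 3:            # step 1: degree 3 or more
                j = max(nbrs)             # adjacent label j, maximal
                # step 2: labels i+2 .. j-1 not adjacent to node i
                affected[i] = {v for v in range(i + 2, j)
                               if v not in nbrs}
        return affected

Applied to the labelled tree of Figure 5.9, this should reproduce the sets listed in Example 4, e.g., critical node 3 affecting only node 5.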
Lemma 5 If a tree is labelled using the labelling procedure described earlier in this section, then the maximum number of critical nodes affecting any node in the tree would be at most equal to $\log_2 m$.

The Approximate Assignment Scheme

An upper bound on the maximum load which can be assigned to processor $x$ in any assignment is reached when all modules are processed sequentially on processor $x$, assuming that $w_{xi} > 0$, where $1 \le i \le m$. If this upper bound is represented by $W_T$ then

$$W_T = \sum_{i=1}^{m} w_{xi}$$

Let us resolve $W_T$ to an accuracy of $\epsilon$, i.e., two successive permissible levels for the load on processor $x$ are separated by $\epsilon$. In other words, $W_x(i)$ is restricted to have only $\lceil W_T/\epsilon \rceil$ distinct values in the range of zero and $W_T$. The approximate partitioning of the tree-structured parallel program can now be found as follows:

1. Label the tree-structured parallel program using the procedure described earlier in this section.
2. For each node $i$, $1 \le i \le m$, find whether it is affected by any critical node(s).
3. Look at each incoming path from the start node to node $p$, where $p$ can be either $t_i^x$ or $t_i^y$ in the assignment graph. Select those paths (or partitionings) in which:
   - all critical nodes influencing node $i$ are assigned to processor $x$ or $y$ in a similar manner, and
   - node $i$ of the tree-structured program is assigned to processor $x$ or $y$ in a similar manner.
   Out of all the selected paths between the start node and node $p$ for which the actual value of $W_x(i)$ is in between two successive permissible levels, select the one with the minimal value of $W_y(i)$, and reject all others.

In this manner we restrict the number of ordered pairs associated with each node $t_i^x$ (or $t_i^y$) to $O(2^{c_{max}} \lceil W_T/\epsilon \rceil)$. There will be a maximum of $2m \cdot 2^{c_{max}} \lceil W_T/\epsilon \rceil$ distinct paths between the start node and the end node in the assignment graph. A path in which $\max(W_x, W_y)$ is minimal can thus easily be found in time proportional to $O(m \cdot 2^{c_{max}} \lceil W_T/\epsilon \rceil)$, but the complexity of the algorithm would be limited by the restricting procedure, which is $O(m^2 \cdot 2^{c_{max}} \lceil W_T/\epsilon \rceil)$; since $c_{max} \le \log_2 m$ (Lemma 5), this is $O(m^3 \lceil W_T/\epsilon \rceil)$. This guarantees that the maximum difference between the load on the bottleneck processor in the approximate assignment and the one in the optimal assignment is at most equal to $m\epsilon$. If the relative error bound for the Approximate Scheme is $\epsilon$ then the time complexity of our algorithm is bounded by $O(m^4(1/\epsilon))$. The Approximate Assignment Scheme is thus a fully polynomial time approximation scheme, in which the time complexity is a polynomial function of the size of the problem as well as of the inverse of the percentage error, i.e., $1/\epsilon$.

5.3 Conclusion

In this chapter we have described an efficient algorithm which can partition a chain- or tree-structured application program consisting of several heterogeneous modules onto a two-processor system. Our approach takes care of the heterogeneous nature of each module as well as the conversion overheads involved when modules residing on different types of machines communicate with each other. The results of this research are directly applicable for optimally partitioning tasks in an integrated vision system consisting of computationally diverse modules onto a dual-processor heterogeneous system. Heterogeneous sites consisting of two-processor systems can directly benefit from this research, and can use our crossover strategy, which allows application programs to be optimally partitioned and run simultaneously on more than one supercomputer at a time.
It is possible to extend this approach to three- or four-processor heterogeneous computer systems, and currently we are working to apply similar techniques to related problems with less restricted structures.

Part III

Concluding Remarks

This thesis stands as one of the first extensive research documents in the novel field of heterogeneous computing. In the first part of this thesis, the theory of heterogeneous computing was analyzed in two chapters. In the first chapter, we defined the design issues to lie in three levels. The first layer, the network layer, includes the physical design aspects of interconnecting autonomous machines in the system. The second layer, the communication layer, is concerned with the communication and synchronization primitives required to exchange information among processes residing in various machines. The third layer, the intelligent layer, provides system-wide tools and techniques necessary to manage the suite of heterogeneous machines. The tasks handled by this layer include code analysis, partitioning, mapping, scheduling, and programming environment tools. The focus of this thesis is on the intelligent layer. In the second chapter, we presented a theory of heterogeneous optimal selection which establishes the existence of an optimal match of a task onto a heterogeneous suite.

In part two, concentrating on the intelligent layer, which deals basically with efficient mapping techniques, we first introduced Cluster-M as a programming environment for designing machine-independent portable parallel software. Cluster-M allows a single program to be ported and shared among various machines. To exploit multi-level parallelism in a task, and to provide better support for heterogeneous computing, a more specialized form of Cluster-M called Hierarchical Cluster-M (HCM) was introduced. The main components of HCM are the HCM system representation and the HCM problem specification. Two mapping methodologies for HC were developed in this part. The first mapping methodology presented is based on the Cluster-M model. It is an online mapping heuristic for porting an arbitrary algorithm onto any or all the machines in a heterogeneous suite. In the second chapter of this part, an optimal mapping methodology is presented for a heterogeneous suite of two machines of different types. The input to this mapping methodology is assumed to have either a chain or a tree structure. We have shown that this chain mapping technique is suitable for image processing applications.

In conclusion, we state the following problems to be open:

1. Implementation of a Hierarchical Cluster-M based mapping tool.
2. Extension of the two-machine optimal mapping to other forms of input tasks.
3. Performance evaluation of various heterogeneous mapping tools.
4. Developing "good" special-purpose mapping heuristics.

Appendix A

PCN Cluster-M constructs

The seven Cluster-M constructs are implemented in PCN as follows:

/* 1. Makes given elements into one cluster */
CMAKE(LVL, ELEMENTS, x)
{|| MIN_ELEMENT(ELEMENTS, n),   /* n is the smallest number in ELEMENTS */
   x = [LVL, n, ELEMENTS]
}

MIN_ELEMENT(E, n)
{? E ?= [m|E1] ->
     {; MIN_ELEMENT1(E1, m, min),
        n = min
     }
}

MIN_ELEMENT1(E1, m, min)
{? E1 ?= [h|E2] ->
     {; {? h < m -> m1 = h,
         default -> m1 = m
        },
        MIN_ELEMENT1(E2, m1, min)
     },
 default -> min = m
}
/* 2. Yields an element of the cluster */
CELEMENT(x, j, e)
{; CSIZE(x, s),
   {? j <= s, x ?= [_, _, x1] -> CELEMENT1(x1, j, e),
    default -> e = []
   }
}

CELEMENT1(x, j, e)
{? j > 1 ->
     {? x ?= [_|x1] -> CELEMENT1(x1, j - 1, e) },
 default ->
     {; CSIZE(x, s),
        {? s == 1 -> e = x,
         default -> e = x[0]
        }
     }
}

/* 3. Yields the size of the cluster */
CSIZE(x, s)
{? x ?= [_, _, x2] -> CSIZE1(x2, 0, s),
 default -> s = 0
}

CSIZE1(x, acc, s)
{? x ?= [_|x1] -> CSIZE1(x1, acc + 1, s),
 default -> s = acc
}

/* 4. Merges clusters x and y */
CMERGE(x, y, ELEMENTS, z)
{? x ?= [LVL_x|x1], y ?= [LVL_y|y1] ->
     {? x1 ?= [nx|x2], y1 ?= [ny|y2] ->
          {|| MIN(nx, ny, min),
             z = [LVL_x + 1, min, ELEMENTS]
          }
     },
 default -> z = []
}

MIN(nx, ny, min)
{? ny >= nx -> min = nx,
 default -> min = ny
}

/* 5. Does the unary operation */
CUN(op, x, i, e)
{|| CELEMENT(x, i, e1),
   e = op(e1)
}

/* 6. Does the binary operation */
CBI(op, x, i, y, j, e)
{|| CELEMENT(x, i, e1),
   CELEMENT(y, j, e2),
   {? op == "+" -> e = e1 + e2,
    op == "-" -> e = e1 - e2,
    op == "*" -> e = e1 * e2,
    op == "/" -> e = e1 / e2,
    op == "%" -> e = e1 % e2
   }
}

/* 7. Does the split operation */
CSPLIT(x, k, p, q)
{|| CSIZE(x, s),
   {? x ?= [LVL, n, E] ->
        {? k == s ->
             {|| p = [LVL + 1, n, E],
                q = [LVL + 1, 0, []]
             },
         k < s ->
             {|| CSPLIT1(E, k, E1, E2),
                MIN_ELEMENT(E1, n1),
                MIN_ELEMENT(E2, n2),
                p = [LVL + 1, n1, E1],
                q = [LVL + 1, n2, E2]
             }
        }
   }
}

CSPLIT1(E, k, E1, E2)
{? k > 0 ->
     {? E ?= [h|t] ->
          {|| CSPLIT1(t, k - 1, E3, E2),
             E1 = [h|E3]
          }
     },
 default ->
     {|| E1 = [],
        E2 = E
     }
}

Bibliography

[1] G. Agha. Actors: A Model of Concurrent Computation in Distributed Systems. MIT Press, Cambridge, Mass., 1986.
[2] G. Agha, C. Houck, and R. Panwar. Distributed execution of actor systems. In Proceedings of the Fourth Workshop on Languages and Compilers for Parallel Computing, Santa Clara, 1991.
[3] G. Agha and R. Panwar. An actor-based framework for heterogeneous computing systems. In Proc. Workshop on Heterogeneous Processing, pages 35-42, Mar. 1992.
[4] E. Arnould, F. Bitz, E. Cooper, H. Kung, R. Sansom, and P. Steenkiste. The design of Nectar: A network backplane for heterogeneous multicomputers. ACM, pages 205-216, 1989.
[5] F. Berman and B. Stramm. Prep-p: Evolution and overview. Technical Report CS89-158, Dept. of Computer Science, University of California at San Diego, 1987.
[6] S. Bokhari. Dual processor scheduling with dynamic reassignments. IEEE Trans. Software Eng., SE-5, July 1979.
[7] S. Bokhari. On the mapping problem. IEEE Trans. on Computers, C-30(3):207-214, March 1981.
[8] S. Bokhari. A shortest tree algorithm for optimal assignments across space and time in a distributed processor system. IEEE Trans. on Software Engineering, (6):583-589, 1981.
[9] S. Bokhari. Partitioning problems in parallel, pipelined, and distributed computing. IEEE Trans. on Computers, pages 48-57, January 1988.
[10] S. Bokhari. Assignment Problems in Parallel and Distributed Computing. Kluwer Academic Publishers, 1990.
[11] L. Borrman, M. Herdieckerhoff, and A. Klein. Tuple space integrated into Modula-2: implementation of the Linda concept on a hierarchical multiprocessor. In Jesshope and Reinartz, eds., Proc. CONPAR '88. Cambridge Univ. Press.
[12] B. Buckle and D. Hardin.
Partitioning and allocation of logical resources in a distributed computing environment. In Tutorial: Distributed System Design, IEEE Comput. Soc. EHO 151-1, 1979.
[13] T. Bui and C. Jones. Parallel algorithms for partitioning simple classes of graphs. In Proc. ICPP, August 1990.
[14] N. Carriero, D. Gelernter, and J. Leichter. Distributed data structures in Linda. In Proceedings of the Thirteenth ACM Symposium on Principles of Programming Languages, January 1986.
[15] C. Castro and S. Yalamanchili. Partitioning algorithms for a class of application specific multiprocessor architectures. In Proc. IPPS Workshop on Heterogeneous Processing, April 1993.
[16] K. Chandy and S. Taylor. An Introduction to Parallel Programming. Jones and Bartlett Publishers, Boston, 1992.
[17] W. Chu, L. Holloway, M. Lan, and K. Efe. Task allocation in distributed data processing. IEEE Computer, pages 57-69, November 1980.
[18] K. Efe. Heuristic models of task assignment scheduling in distributed systems. IEEE Computer, 15(6):50-56, 1982.
[19] H. El-Rewini and T. G. Lewis. Scheduling parallel program tasks onto arbitrary target machines. Journal of Parallel and Distributed Computing, 9:138-153, 1990.
[20] M. Eshaghian and R. Freund. Cluster-M paradigms for high-order heterogeneous procedural specification computing. In Proc. Workshop on Heterogeneous Processing, pages 47-49, Mar. 1992.
[21] M. Eshaghian and M. Shaaban. A Cluster-M based mapping methodology. In Proc. International Parallel Processing Symposium, pages 213-221, April 1993.
[22] M. Eshaghian, M. Shaaban, and S. Chen. An algorithm on the mapping problem. Technical report, Dept. of Computer and Information Science, New Jersey Institute of Technology, 1993.
[23] M. M. Eshaghian. Cluster-M parallel programming model. In Proceedings of the International Parallel Processing Symposium, pages 462-465, March 1992.
[24] D. Fernandez-Baca. Allocating modules to processors in a distributed system. IEEE Trans. on Software Engineering, (11):1427-1436, November 1989.
[25] M. Flynn. Very high-speed computing systems. Proc. IEEE, pages 1901-1909, 1966.
[26] I. Foster and S. Tuecke. Parallel programming with PCN. Technical report, Argonne National Laboratory, University of Chicago, January 1993.
[27] R. Freund. Optimal selection theory for superconcurrency. In Supercomputing '89, pages 699-703, Nov. 1989.
[28] R. Freund. Superconcurrent processing: a dynamic approach to heterogeneous parallelism. In Proceedings of the Parallel/Distributed Computing Networks Seminar, February 1990.
[29] R. Freund and D. Conwell. Superconcurrency: A form of distributed heterogeneous supercomputing. Supercomputing Review, 3:47-50, Oct. 1990.
[30] A. Gerasoulis, S. Venugopal, and T. Yang. Clustering task graphs for message passing architectures. In ACM International Conference on Supercomputing, June 1990.
[31] V. Gylys and J. Edwards. Optimal partitioning of workload for distributed systems. In Tutorial: Distributed System Design, IEEE Comput. Soc. EHO 151-1, 1979.
[32] M. Iqbal. Approximate algorithms for partitioning problems. International Journal of Parallel Programming, October 1991.
[33] M. Iqbal. Efficient algorithms for partitioning problems. In International Conference on Parallel Processing, 1991.
[34] M. Iqbal. Mapping and Assignment Problems in Multiple Computer Systems. PhD thesis, Department of Electrical Engineering, Engineering University, Lahore, Pakistan, 1991.
[35] M. Iqbal.
Partitioning a tree structured problem on a heterogeneous computer system. Technical report, Department of Electrical Engineering, Engineering University, Lahore, Pakistan, 1993.
[36] M. Iqbal. Partitioning problems for heterogeneous computer systems. In Proc. IPPS Workshop on Heterogeneous Processing, April 1993.
[37] A. Khokhar, W. Lin, and V. Prasanna. Stereo and image matching on fixed size mesh arrays. In Proc. of the IAPR International Conference on Computer Architectures for Machine Perception, December 1991.
[38] A. Khokhar, V. Prasanna, M. Shaaban, and C. Wang. Heterogeneous computing: Challenges and opportunities. IEEE Computer, 26:18-27, June 1993.
[39] J. Lawson and M. Mariani. Distributed data processing system design - a look at the partitioning problem. In Tutorial: Distributed System Design, IEEE Comput. Soc. EHO 151-1, 1979.
[40] S. Lee and J. Aggarwal. A mapping strategy for parallel processing. IEEE Trans. on Computers, pages 433-442, April 1987.
[41] D. J. Lilja. Experiments with a task partitioning model for heterogeneous computing. In Proc. IPPS Workshop on Heterogeneous Processing, April 1993.
[42] V. Lo. Heuristic algorithms for task assignment in distributed systems. IEEE Trans. on Computers, (3):1384-1397, 1988.
[43] V. M. Lo, S. Rajopadhye, S. Gupta, D. Keldsen, M. A. Mohamed, and J. A. Telle. OREGAMI: Software tools for mapping parallel computations to parallel architectures. In Proc. International Conference on Parallel Processing, 1990.
[44] R. M. Macgregor. On partitioning a graph: a theoretical and empirical study. PhD thesis, University of California, Berkeley, 1978.
[45] J. Mahdavi, G. L. Huntoon, and M. B. Mathis. Development of a HiPPI-based distributed supercomputing environment at the Pittsburgh Supercomputing Center. In Proc. Workshop on Heterogeneous Processing, pages 93-96, Mar. 1992.
[46] D. Nicol and D. O'Hallaron. Improved algorithms for mapping pipelined and parallel computations. IEEE Trans. on Computers, 40(3).
[47] D. Notkin, A. Black, E. Lazowska, H. Levy, J. Sanislo, and J. Zahorjan. Interconnecting heterogeneous computer systems. Communications of the ACM, (3):258-273, 1988.
[48] R. Ponnusamy, N. Mansour, A. Choudhary, and G. C. Fox. Mapping realistic data sets on parallel computers. In Proc. 7th International Parallel Processing Symposium, pages 123-128, April 1993.
[49] V. Prasanna, editor. Parallel Architectures and Algorithms for Image Understanding. Academic Press, 1991.
[50] C. Reinhart. Specifying Parallel Processor Architectures for High-level Computer Vision Algorithms. PhD thesis, University of Southern California, Los Angeles, 1978.
[51] J. Sinclair. Efficient computation of optimal assignments for distributed tasks. Journal of Parallel and Distributed Computing, (4):342-362, 1987.
[52] F. Stein and G. Medioni. Recognition of 3-D objects from 2-D grouping. In DARPA Image Understanding Workshop, 1992.
[53] H. Stone. Multiprocessor scheduling with the aid of network flow algorithms. IEEE Trans. on Software Eng., SE-3:85-93, January 1977.
[54] H. Stone. Critical load factors in two-processor distributed systems. IEEE Trans. on Software Engineering, pages 254-258, 1978.
[55] S. Sull and N. Ahuja. Integrated 3D recovery and visualization of flight image sequences. In Image Understanding Workshop, 1992.
[56] V. S. Sunderam. PVM: A framework for parallel distributed computing. Concurrency: Practice and Experience, 2(4):315-339, December 1990.
[57] L.
Tao, B. Narahari, and C. Zhao. Heuristics for mapping parallel computations to heterogeneous parallel architectures. In Proc. IPPS Workshop on Heterogeneous Processing, April 1993.
[58] R. Vetter, D. Du, and A. Klietz. Network supercomputing: Experiments with a CRAY-2 to CM-2 HiPPI connection. In Proc. Workshop on Heterogeneous Processing, pages 87-92, Mar. 1992.
[59] M. Wang, S. Kim, M. Nichols, R. Freund, and H. Siegel. Augmenting the optimal selection theory for superconcurrency. In Proc. Workshop on Heterogeneous Processing, pages 13-21, Mar. 1992.
[60] U. Warrier and C. Sunshine. A platform for heterogeneous interconnection network management. IEEE Journal on Selected Areas in Communications, (1):119-126, January 1990.
[61] C. Weems. Image understanding: A driving application for research in heterogeneous parallel processing. In Proc. IPPS Workshop on Heterogeneous Processing, April 1993.
[62] J. Yang, L. Bic, and A. Nicolau. A mapping strategy for MIMD computers. In Proc. International Conference on Parallel Processing, 1991.
[63] T. Yang and A. Gerasoulis. A parallel programming tool for scheduling on distributed memory multiprocessors. In Proc. IEEE Scalable High Performance Computing Conference, April 1992.