OCCAMFLOW: PROGRAMMING A MULTIPROCESSOR SYSTEM IN A
HIGH-LEVEL DATA-FLOW LANGUAGE
by
Liang-Teh Lee
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Engineering)
August 1989
Copyright 1989 Liang-Teh Lee
UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90089
This dissertation, written by
Liang-Teh Lee
under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of
DOCTOR OF PHILOSOPHY
Dean of Graduate Studies
Date
DISSERTATION COMMITTEE
Acknowledgements
I would like to express my deepest gratitude to Prof. Jean-Luc Gaudiot, the chairman of my dissertation committee, for his invaluable guidance, support and encouragement throughout the course of my doctorate study at USC. I am also grateful to the other dissertation committee members, Prof. Kai Hwang and Prof. Christoph von der Malsburg, for their many helpful comments and suggestions.
This research was conducted while I was a recipient of the Tatung Faculty Development Fellowship and the Scholarship of the Ministry of Education, the Republic of China. Thanks to Dr. T. S. Lin, president of the Tatung Institute of Technology, and Prof. C. C. Huang, dean of studies, Tatung Institute of Technology, for their continuous support and encouragement.
Some of the SISAL programs which have been used in this research (such as the histogramming program presented in Appendix E) were originally developed by Professor Gaudiot in collaboration with Alan Chudnow, who kindly arranged for our use of his lab's Transputer system at the TRW Defense Systems Group. I gratefully acknowledge the assistance of the Computer Research Group at the Lawrence Livermore National Laboratory. I would also like to thank all my friends and colleagues in the Department of Electrical Engineering at USC for their valuable discussions and constant friendship. Among them, I would like to thank Paraskevas Evripidou, Chih-Ming Lin, Dr. Christoph Scheurich, and Jin-Chin Wang. Thanks also to William Bates and Mary Zittercob for their administrative help.
Finally, I am grateful to my mother and my wife. Without their consistent support and patience, this work would not have been possible.
Contents

Acknowledgements ii
List of Figures ix
Abstract x

1 INTRODUCTION 1
1.1 A Multiprocessor Architecture and Its Programmability 3
1.2 Thesis Outline and Organization 8

2 BACKGROUND 11
2.1 Data-flow Model of Computation 11
2.1.1 Data-flow Principles 12
2.1.2 Structure Handling in A Data-flow Architecture 15
2.1.3 Partitioning and Allocation in Data-flow Systems 17
2.2 Parallel Programming Languages 18
2.2.1 SISAL: A High-level Data-flow Language 18
2.2.2 Intermediate Form 1 (IF1) 19
2.2.3 OCCAM 21
2.3 Mapping Mechanism 26

3 OCCAMFLOW TRANSLATOR 29
3.1 Overview 29
3.1.1 The Translator Environment 30
3.1.2 Intermediate Files 32
3.2 Graph Generation 33
3.2.1 Simple Nodes 35
3.2.2 Compound Nodes 38
3.2.3 Function Node and Partition Node 41
3.3 Basic Partitioning 42
3.4 Implementation of Data Structures 44
3.4.1 Representation and Implementation of Arrays 45
3.4.2 Representation and Implementation of Streams 46
3.5 Occam Code Generation 49
3.5.1 Fine-grain Approach 50
3.5.2 Macro-actor Approach 52
3.5.3 Example: Translation of Matrix Multiplication Program 53

4 OPTIMIZATION: PARTITIONING AND ALLOCATION 59
4.1 Communication Cost Thresholding 60
4.1.1 Communication Cost Matrix 60
4.1.2 Threshold Method 61
4.1.3 Example 62
4.2 Code Replication 65
4.2.1 Forall Construct and Loop Unrolling 65
4.2.2 Function Calls 68
4.3 Process Allocation and Data Allocation 69
4.3.1 Multiplexing, Demultiplexing, and Routing 70
4.3.2 Static Allocation 74
4.3.3 Dynamic Allocation 76
4.3.4 Data Allocation 78

5 EXPERIMENTS AND PERFORMANCE EVALUATION 80
5.1 Experiment I: Livermore Loops, Histogramming, and Matrix Multiplication 80
5.1.1 Experiments and Experimental Results 81
5.1.2 Interpretation of Results 89
5.2 Experiment II: Shallow 93
5.2.1 Experiment 93
5.2.2 Performance Results 94
5.3 Discussion 95

6 CONCLUSIONS AND SUGGESTIONS FOR FUTURE RESEARCH 110
6.1 Conclusions 110
6.2 Suggestions for Future Research 112

Appendix 114
A USING THE TRANSLATOR 114
A.1 Introduction 114
A.2 Data Structures 115
A.3 Example 117
B IF1 ACTORS AND THEIR OCCAM DEFINITIONS 119
C DISTRIBUTION OF LOOP BODIES 125
D A ROUTER 128
E BENCHMARK PROGRAM: LIVERMORE LOOPS 133
F BENCHMARK PROGRAM: HISTOGRAMMING 149
G BENCHMARK PROGRAM: SHALLOW 158

Bibliography 169
List of Figures

1.1 Overall architecture of the TX16 4
1.2 Block diagram of a T800 Transputer 5
1.3 A simplified diagram of the programming environment 7
2.1 Execution of a data-flow actor 12
2.2 A sample data-flow graph 14
2.3 Addition of two arrays in IF1 22
2.4 Matrix multiplication in IF1 23
2.5 A simple occam program 25
2.6 (a) The occam process (b) Corresponding macro-actor 28
3.1 The translator 31
3.2 A simple IF1 graph 35
3.3 A simple node 37
3.4 An example with loop operation 40
3.5 An example with select operation 41
3.6 Implementation of a two-dimensional array 46
3.7 Array operations (a) Create (b) Select (c) Replace (d) Concatenate 47
3.8 Stream operations (a) Append (b) Select first element (c) Select all but first element 48
3.9 (a) A macro instruction for Plus (b) Corresponding occam code 51
4.1 Communication cost matrix (Basic partitioning) 63
4.2 Communication cost matrix (2nd partitioning) 64
4.3 Communication cost matrix (3rd partitioning) 65
4.4 A function call actor 68
4.5 Router configuration 71
4.6 MUX and DMUX processes 72
5.1 Speedup for Loop1 85
5.2 Speedup for Loop7 86
5.3 Speedup for Loop12 87
5.4 Speedup for Histogramming 88
5.5 Speedup for Matrix Multiplication 90
5.6 Speedup for Shallow with 1 iteration 98
5.7 Speedup for Shallow with 2 iterations 99
5.8 Speedup for Shallow with 4 iterations 100
C.1 A timing diagram 127
Abstract
The purpose of the research described in this thesis is to investigate software methodologies for programming multiprocessor systems by using a data-driven approach to solve the problem of runtime scheduling. Indeed, the data-flow model of computation is an attractive methodology for multiprocessor programming, for it offers the potential for virtually unlimited parallelism detection at little or no programmer's expense. It is here applied to a distributed architecture based on a commercially available microprocessor (the Inmos Transputer). We have integrated the high-level data-driven principles of scheduling within the Transputer architecture so as to provide high programmability of our multicomputer system. A programming environment which translates a complex data-flow program graph into occam has been developed and is presented in this thesis. We describe here in detail the mapping from the SISAL high-level constructs into the low-level mechanisms of the Transputer. Several features such as synchronization among different processes, the array handling scheme, the function call mechanism, and the routing process are implemented and discussed. The partitioning issues (granularity of the graph) are presented, and several solutions based upon both data-flow analysis (communication costs) and program syntax (program structure) are proposed and have been implemented in our programming environment. To evaluate the performance of the overall system, we run several benchmark programs on our Transputer network.
Chapter 1
INTRODUCTION
While improvements in device technology will soon no longer be able to meet the ever increasing performance demands of supercomputer users, architectural solutions such as multiprocessor systems promise to deliver the improvements required. In fact, many multiprocessor systems have been proposed and some are already commercially available [9,39,40]. However, high-level programming languages have evolved much more slowly than the machine architectures on which they are to be implemented. Thus, the programmability of the multiprocessors of the next generation is generally recognized to be one of the major issues to confront designers.
Several language approaches for the safe and efficient programming of parallel architectures have been proposed in order to deal with these problems. These languages include concurrent PASCAL [11] and ADA [5]. Among the constructs which enable the specification of parallelism, operations may authorize the execution in parallel of two or more processes. Similarly, processes may be synchronized on data dependencies by the introduction of shared variables. In order to ensure the proper ordering of the various updates to a specific cell of memory, the programmer must specify critical sections for the program which essentially lock out access to some specific cells of memory until they have been safely updated by a single process.
As technology progresses, it will be possible to integrate very large numbers of processors in the same machine. Explicit parallelism specification is therefore a complicated problem with large multiprocessor systems since the number of tasks that must be kept concurrently active becomes very large. This demonstrates the need for a new approach in programming large multiprocessor systems.
Backus has presented functional languages as a way to unshackle the programmer from the structure of the machine [10]. Indeed, instead of considering instructions which can modify memory cells, the functional model of computation assumes functions which are applied to values, producing result values. This model of execution does not depend upon architectural notions. Moreover, the executability of an instruction is decided by the availability of its operands. This can be implemented in a distributed fashion, thereby obviating the need for the central program counter of the von Neumann paradigm.
Data-flow systems obey functional principles of execution [18]. They implement a low-level description of the execution mechanisms which govern a functional multicomputer architecture. In addition, high-level languages such as VAL [1,48], SISAL [49] and HDFL [26] have been proposed as a high-level interface.
While this approach brings a solution to many multiprocessing problems [7], several data-flow projects exploit the parallelism inherent in data-flow programming by the introduction of complex custom-made Processing Elements. We describe here a multiprocessor architecture based on off-the-shelf components which can be programmed under the data-flow model of computation. The research presented here demonstrates the applicability of the functional mode of execution to such a problem. We have chosen for this work the high-level functional language SISAL (Streams and Iterations in a Single Assignment Language) developed at the Lawrence Livermore National Laboratory in collaboration with other institutions (Colorado State University, University of East Anglia, University of Manchester, and Digital Equipment Corporation) [49]. This language has also been chosen for high-level language programming of the University of Manchester data-flow machine [35].
1.1 A Multiprocessor Architecture and Its
Programmability
The machine called TX16 has been described in detail by Gaudiot et al. [29]. Figure 1.1 shows the overall architecture of the TX16. It is a very simple multiprocessor structure similar to the Cm*, an early experimental multiprocessor developed at Carnegie Mellon University [33]. In our design, programs are written in a functional language; they are then compiled into a data-flow graph which is partitioned to define processes and to allocate processors and memory. In this methodology, the actual structure of the multiprocessor itself is transparent to the programmer.
[Figure: sixteen Transputers, T0 through T15, connected by a communication network and a shared memory system]
Figure 1.1: Overall architecture of the TX16
The elementary building block of the TX16 is the Transputer [17], manufactured by Inmos, Ltd. The Transputer is a new generation microprocessor, available as a 16- or 32-bit processor, providing 10 million instructions per second of processing power with 2 or 4 Kbytes of on-chip memory. It possesses not only a memory bus (data, address, and control) for transmission of data to and from the on-chip memory, but also four serial links for communication with the outside (see Figure 1.2).
[Figure: block diagram showing a 32-bit processor, a floating point unit, system services, 4K bytes of on-chip RAM, and four link interfaces connected by 32-bit internal buses]
Figure 1.2: Block diagram of a T800 Transputer
Special on-chip interface circuitry allows the transmission of data packets
between two Transputers connected by serial links. Because of these links, any number of Transputers can be joined together to form a multiprocessor system with all the Transputers capable of operating concurrently. In addition, the processors are organized around a central memory system for the transmission of large data structures. In this fashion, scalar data as well as synchronization messages can be sent over the serial interconnection links, while larger arrays can be exchanged and broadcast through the shared memory system.
At the execution level, the Transputer reflects the structure of the occam language. It allows the presence of several different processes, although only one can be active at a time. Message transmission with other processes is based on the synchronous principles of CSP as described by Hoare [36]. Since the number of processors which may ultimately be contained in this system is quite large, it appears that an explicit approach to parallelism specification by the programmer is infeasible. The programmability afforded by functional principles of execution has therefore been adopted for this research. Figure 1.3 shows the functional programming environment which translates SISAL into occam.
The output of the SISAL compiler, IF1 (Intermediate Form 1), is essentially a high-level data dependency graph which contains information concerning the structure of the original user's program. A high-level partitioning of the original program is made, based on the program structure as well as on heuristics. This creates processes which can be directly translated into occam constructs.
[Figure: SISAL source is compiled into a data-flow graph; together with the network topology, structure information, and an advice file, it is processed by the allocator, compiled by the occam compiler, and loaded onto the Transputer network]
Figure 1.3: A simplified diagram of the programming environment
The approach has been termed occamflow, for it integrates principles of data-flow
programming with the programming of a parallel architecture which supports the occam language.
1.2 Thesis Outline and Organization
The rest of the thesis is organized as follows:
Chapter 2 contains the research background. We briefly introduce the data-flow model of computation: the basic data-flow principles, structure handling schemes, as well as partitioning and allocation problems in data-flow environments. In this chapter we also describe the parallel programming languages which have been used in the course of this research: high-level programmability is afforded by the data-flow language SISAL, and the low-level model of execution of the Inmos Transputer (the language occam), upon which the architecture is based. The mapping mechanism which transforms data-flow graphs into occam is also presented in this chapter.
Chapter 3 contains the implementation issues of the translator. It describes the generation and translation of the data-flow graph as well as the program structure graph, a high-level partitioning algorithm based on the program structure and heuristics, and two code generation approaches: fine-grain and macro-actor.
Chapter 4 presents the partitioning and allocation mechanisms. We describe a communication cost thresholding approach for partitioning to reduce the total communication cost, discuss the loop unrolling and function call schemes used to obtain a higher degree of parallelism, and present static allocation and dynamic allocation methods for task assignment to achieve better performance. In addition, a routing mechanism and three data allocation strategies that are applied to improve system performance are also presented in this chapter.
Chapter 5 describes the experiments for evaluating the system performance. We show the procedure of the experiments and the experimental results obtained by running a set of benchmark programs (Livermore Loops, histogramming, matrix multiplication, and a weather program called shallow) on the Transputer network. The analysis of the experimental results and the performance evaluation are presented. An overall discussion of the experiments is also made in this chapter.
Chapter 6 addresses the conclusions of this dissertation and the possible future directions in this area.
Appendix A describes the usage of the occamflow translator.
Appendix B presents the implemented data-flow actors as well as their sample occam definitions.
Appendix C explains the method used for finding an optimal solution to determine the number of loop bodies to be distributed to different transputers in a transputer network for loop unrolling.
Appendix D presents a router which is implemented to allow communication between non-neighboring transputers and to multiplex multiple communication channels through a single physical channel between two transputers.
Appendix E shows three SISAL programs from the Livermore Loops which have been used to evaluate our system. Their translated occam code is also shown.
Appendix F presents a SISAL program for histogramming and its corresponding occam code.
Appendix G contains a medium scale SISAL program, called shallow, which is one of our benchmark programs.
Chapter 2
BACKGROUND
In our research, we have applied the data-flow principles of execution to program in a high-level data-flow language (SISAL) for a distributed architecture based on a commercially available microprocessor, the Inmos Transputer. We describe here the basic ideas of the data-flow model of computation, the parallel programming languages used in the course of the research, and the mapping mechanisms used to transform data-flow graphs into occam.
2.1 Data-flow Model of Computation
The data-flow model of computation is a simple and powerful way of describing parallel computations. Some related research on data-flow principles of execution, array handling in data-flow environments, as well as partitioning and scheduling issues, is described in this section.
2.1.1 Data-flow Principles
A data-flow program is a directed graph made of nodes (actors) connected by arcs (links). The input and output arcs can carry tokens bearing data and control values. The arcs define paths over which tokens are conveyed from one actor to other actors. An actor is enabled when all its input arcs carry tokens and no token exists on its output arcs. It can then fire (i.e., the instruction is executed), placing tokens on the output arcs. Programs can be constructed by putting together these actors as well as conditional actors. When loops are involved, it is necessary to distinguish between the data for different iterations. Figure 2.1 shows the execution of an actor. In the first diagram the input arcs of the actor carry tokens that bear data values. After execution, only the output arcs carry data tokens. These data tokens are sent only to those instructions in the program that need them. In other words, the execution of instructions is based on the flow of data.
[Figure: an addition actor before firing, with operand tokens on its input arcs, and after firing, with the result token on its output arc]
Figure 2.1: Execution of a data-flow actor
The basic principles involved in data-flow processing can be summarized as follows:
• Data-flow programs are made of actors (instructions).
• Instructions communicate data over data arcs between data-flow actors.
• Data is transmitted in packets called tokens.
• When an operation is executed, the input data values are consumed, and the results are formatted into tokens that are placed on the output arcs of the actor.
• Operation sequencing is based on the availability of data values.
• Operations are functional and do not produce side effects.
The data-flow processor operations [2] can be classified into four groups based on their use: function operators (add, subtract, etc.), conditional operations (relational and boolean), structure operations (append, select, etc.) used for the data structuring facility, and special operations (merge, true and false gates, and procedure application). As an example, Figure 2.2 shows a data-flow program graph for the following expression:
let
  x := a - b
  y := 2 + c
in
  if y < 1
    then a
    else x
  end if
end let
[Figure: a data-flow graph with actors computing x := a - b and y := 2 + c, a comparison of y with 1, and a Merge actor selecting between a and x]
Figure 2.2: A sample data-flow graph
Many data-flow architectures have been designed to execute the data-flow model of computation [8,19,35,62,64]. They can be classified into two abstract models: static and dynamic. Static data-flow models allow at most one token per arc in data-flow graphs, whereas the dynamic model of data-flow allows tagged tokens and thus permits more than one token per arc. In [59] several data-flow architectures in these two categories have been compared.
2.1.2 Structure Handling in A Data-flow Architecture
According to the data-flow principles of execution, for each path in the data-flow graph, only one actor can put values on that path. Once a value has been placed on a path, no actor can modify that value, i.e., it can only be read. In a pure data-flow system, a data structure is viewed as a single value which is defined and referenced as a unit; the entire structure must be passed to each referencing actor. Obviously, this can impose a large overhead. Therefore, several schemes have been developed in the past in order to reduce the overhead of transmitting data structure values [6,7,18,28,50].
Heaps  Dennis has proposed to represent arrays by directed acyclic graphs (or heaps) [18]. Each element of the array is represented as a leaf of a tree. On heaps, the SELECT operation receives a pointer to the root node of the array and an index value; it returns a pointer to the substructure, which may be a leaf node, associated with the given index. The APPEND actor takes the same two operands in addition to the value of the element to append to the structure. The structure that results includes a subset of the old tree. When a single array element must be modified, only a sequence of pointer modifications is involved on heaps. Thus, for an n-element array, the cost of modifying a heap is on the order of log n, which compares favorably with the cost n of recopying the entire array. However, the sequentialization of some array operations, the centralization of array accesses, and storage overhead are disadvantages of heaps [23,28].
I-Structures  I-Structures [6] allow structure access before all elements are available. In this scheme, the memory cell of each structure element contains a presence bit which indicates availability. SELECT operations can be processed before structure completion. In the processing of a SELECT operation, the presence bit is checked, and if the element is unavailable, the request is deferred in a queue. When the element arrives, the presence bit is set and any deferred requests are processed. The I-Structure scheme requires a check of the presence bit for each SELECT operation, a check for queued requests for each element, and additional overhead to implement the request queues.
Static-VAL  Mowbray has described a scheme for handling arrays in a VAL high-level environment [50]. In this approach, arrays are allocated in a sequential fashion, as they are in a conventional von Neumann environment. SELECT actors can readily be implemented as they consist of a simple READ operation. Note, however, that some additional information must be associated with the array in order to indicate its dimensionality and its distribution across the processors of the machine. Safety and correct execution of WRITE operations are a compile-time task. This scheme reduces the number of memory accesses and improves the possibility of distributing an array across the PEs. However, these undeniable improvements come at the price of a possible induction of spurious data dependencies, or loss of run-time parallelism.
2.1.3 Partitioning and Allocation in Data-flow Systems
In multiprocessor systems, partitioning is necessary to ensure that the granularity of the parallel program is coarse enough for the system, without losing too much parallelism, while scheduling is necessary to achieve good processor utilization and to minimize inter-processor communication cost in the system. Sarkar has carried out a thorough analysis of these problems [58]. He compares two approaches, macro-dataflow scheduling (compile-time partitioning and run-time scheduling) and compile-time scheduling (compile-time partitioning and scheduling), by executing SISAL programs on a simulated multiprocessor. The simulation results show a lower speed-up for the macro-dataflow approach, due to the fact that macro-dataflow incurs run-time scheduling overhead.
Gaudiot et al. have proposed a static allocation method for data-flow multiprocessors and implemented it in the Hughes Data-Flow Machine [13,26,31]. In this approach, the allocator uses different heuristic functions to minimize communication costs and to maximize the speed-up.
2.2 Parallel Programming Languages
Many parallel programming languages have been developed in the past [54]. We describe in this section both the high-level language (SISAL) and the low-level language (occam) which have been used in the course of this research. Particular attention is given in this section to the first step of the translation: the compilation of SISAL into the intermediate data-flow graph form IF1. Note that we used for this purpose the SISAL compiler supplied by the Lawrence Livermore National Laboratory.
2.2.1 SISAL: A High-level Data-flow Language
SISAL is the high-level data-flow language which has been used in the course of this research. SISAL has six basic scalar types: boolean, integer, real, double real, null and character. The data structures of SISAL are records, unions, arrays and streams. Each basic data type has its associated set of operations, while record, union, array and stream types are treated as mathematical sets of values just as the basic scalar types are. Under the forall construct, these types can be used to support the identification of concurrency for execution on a highly parallel processor.
Since SISAL is a single assignment language, it greatly facilitates the detection of parallelism in a program. A SISAL program comprises a set of functions. The input and output of the program are passed through a main program which is one of these functions. A SISAL program which performs the addition of two arrays is shown below.
%
% Addition of two arrays
%
type OneDim = array[ real ];
function Aadd (A, B: OneDim; N: integer
  returns OneDim)
  for i in 1,N
    c := A[i] + B[i];
  returns array of c
  end for
end function
Note that according to the SISAL grammar, the left-hand side of an assignment statement must be a variable name, either a simple variable or an array name. For an array name, e.g., the statement c := A[i] + B[i] in Figure 2.3, the statement "returns array of c" is used to obtain the entire value of the array. To modify an element of an array in SISAL one should use array operations such as array_fill or replace.
2.2.2 Intermediate Form 1 ( IF1 )
An IF1 graph [60] is produced by the SISAL compiler. It is a direct reflection of the original SISAL input. The IF1 graph corresponds to a combined graph of a PSG (Program Structure Graph) and a DFG (Data-Flow Graph). There are two kinds of nodes in IF1: compound nodes and simple nodes. A compound node can be considered as a control point which affects a sequence of actors in its controlled range. On the other hand, a simple node is an elementary processing actor; it carries the information of its input and output arcs.
The PSG is a tree structure. It describes the relationships among the compound nodes and the simple nodes, according to the original user's program. The root and internal nodes of the tree are compound nodes, while the leaves are all simple nodes. In addition to the compound nodes and simple nodes of IF1, we define a third kind of node, called a block node. It is an internal node of the PSG. The block node is a dummy node which is created for the convenience of partitioning only.
Figure 2.3 is such a combined graph. It corresponds to an IF1 description and describes the addition of two arrays. Solid lines in the graph represent the edges of the PSG and the dashed lines link the leaves to form a DFG.
In the graph, the root of the PSG is a forall node. Its first child, the "RangeGenerator" actor, broadcasts index values to the "AElement" actors. Once an "AElement" actor receives an index value, it obtains an element of the array and forwards it to the next actor. The "Plus" actor receives two elements from two separate "AElement" actors and sends the sum to the "AGather" actor, which then generates the result array.
The iteration is controlled by the forall node. To process a multidimensional array, we can use multilevel forall nodes to control all successors. For instance, a SISAL program which performs the multiplication of an M x N matrix and an N x L matrix is shown below, while its corresponding IF1 graph is shown in Figure 2.4.
%
% Matrix Multiplication
%
type OneDim = array[ real ];
type TwoDim = array[ OneDim ];
function MatMult( A, B: TwoDim; M,N,L: integer
  returns TwoDim)
  for i in 1,M cross j in 1,L
    S := for K in 1,N
           R := A[i,K] * B[K,j]
         returns value of sum R
         end for
  returns array of S
  end for
end function % MatMult
2.2.3 OCCAM
Occam [41,47] is the programming language of the Transputer and is directly related to CSP (Communicating Sequential Processes) as introduced by Hoare [36]. It allows the specification of processing modules and their intercommunication patterns. The basic construct in occam is the process.
[Figure: the combined PSG/DFG for the addition of two arrays; solid lines denote PSG edges, dashed lines the DFG]
Figure 2.3: Addition of two arrays in IF1
[Figure: nested forall compound nodes with RangeGenerator, AElement, Times, Reduce, and AGather actors under block nodes; solid lines denote PSG edges, dashed lines the DFG]
Figure 2.4: Matrix multiplication in IF1
Communication between processes takes place over links. This model is evidently based upon communication by message passing. It allows the execution of a process when its arguments have been received. This corresponds to executability by data availability.
There are three basic commands that can be invoked in occam:
1. The assignment statement, whose function is similar to assignment in conventional languages: variable := expression, where the variable is set to the value of the expression.
2. The input command, which can be stated as: channel ? variable, meaning that a value is sought from the channel named channel and will be stored in the variable variable.
3. The output command, which can be stated as: channel ! expression, where the value of the expression named expression is output to the channel channel.
In addition to the above basic commands, three important constructs should be introduced: the SEQ declaration signifies that the statements inside the process are to be executed sequentially, the PAR keyword indicates that the following processes are to be executed in parallel, and the ALT construct is employed in cases where a subset of the input channels may be used to initiate computations within a process.
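To make the ALT construct concrete, here is a minimal sketch (not taken from the dissertation; the channel names C0001, C0002, C0003 and the variables are hypothetical) of a process that waits on two input channels and proceeds with whichever input becomes ready first:

INT x, y:
ALT
  C0001 ? x        -- guard: an input on C0001 selects this branch
    C0003 ! x + 1
  C0002 ? y        -- guard: an input on C0002 selects this branch
    C0003 ! y * 2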
The synchronization between occam processes is accomplished by the transmission and reception of data to and from channels. When an output command is encountered in a process, the process is halted until another process has executed the corresponding input command. Note that the same would occur if the process with the input command were executed first. In other words, communication can occur between two processes only when they are both ready to perform the I/O transfer. For the input case, when an input command such as C ? x is encountered, the process has to wait until a value sent over the channel C is received.
An occam program can be represented as a graph where processes are nodes
interconnected by links. Figure 2.5 represents a simple occam program as well
as its corresponding graphical representation.
INT x:
SEQ
  Chan1 ? x
  x := x * x
  Chan2 ! x
[Figure: the program and its graphical representation as a node with input channel Chan1 and output channel Chan2]
Figure 2.5: A simple occam program
It should be noted that the data-flow principles of execution can be directly supported by occam. The converse is not true, however, since it is possible to design unsafe occam programs which would have no corresponding part in the data-flow world. In a data-flow graph, a node corresponds to an operator, arcs are pointers for forwarding data tokens and control tokens (acknowledgement arcs), and the execution relies upon the availability of data. Correspondingly, an occam process can describe the operation of a node of the data-flow graph. The function of the node is the main operation of the occam process. The input and output arcs of the node respectively correspond to the input and output channels of the occam process. At the same time, detection of data availability is achieved by the synchronization of occam channels. This mapping is made possible by the fact that both programming approaches rely upon the principle of scheduling upon data availability.
2.3 Mapping Mechanism
From the previous section, we know that a leaf of the PSG is an actor, and that its input and output arcs correspond to channels in occam. If two arguments x and y are passed through the two input arcs of a "Plus" actor, to be the operands of this "Plus" operation, then this part of the graph could be translated as:
C0001 ? x
C0002 ? y
The above discussion assumes that the addition actor is mapped into a single occam process. In this discussion, the "Plus" actor receives data x and y from the channels C0001 and C0002 respectively. After the actual operation has been completed, the sum of x and y is passed through the output of the process to the next process. This could be channel C0003. This action is translated into the occam code:
C0003 ! x + y
If this result is to be sent to several actors, more than one channel should be created, and data will be sent out through all the channels that have been so declared.
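As an illustration of this fan-out (a hedged sketch only; the extra channel C0004 is hypothetical), an add actor whose result is needed by two successor actors would simply output on both declared channels:

INT x, y:
SEQ
  PAR
    C0001 ? x
    C0002 ? y
  PAR               -- send the same result to both consumers
    C0003 ! x + y
    C0004 ! x + y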
In summary, a simple add actor, if allowed its own occam process, will be translated as:
INT x, y:
SEQ
  PAR
    C0001 ? x
    C0002 ? y
  C0003 ! x + y
Note that relatively more complex notations will be needed for more involved program constructs and will be explained in the next section. In the above process, the operation x + y can be executed only after both values of x and y have been received. Thus, the two channel input operations, "C0001 ? x" and "C0002 ? y", are allowed to execute in parallel so as to enable partial execution before a complete set of input operands has arrived.
Furthermore, if several actors are lumped together, the group can be mapped into a single block of occam code, while some channels between the actors can be eliminated. For example, assume that we need to perform the function (a + b) x (c + d) in an occam process. The occam program is shown in Figure 2.6a, and the corresponding graph is shown in Figure 2.6b. The input values of a, b, c, and d are received from the channels C0001, C0002, C0003, and C0004 respectively. As discussed above, these input operations can be performed either sequentially or in parallel. Note that the output value of (a + b) x (c + d) is sent out through the C0005 channel. This operation can be executed only after the values a, b, c and d have been received. In this process, no channel is needed to transfer the values from the outputs of the two "Plus" actors to the "Times" actor.
INT a, b, c, d:
SEQ
  PAR
    C0001 ? a
    C0002 ? b
    C0003 ? c
    C0004 ? d
  C0005 ! (a + b) * (c + d)
[Figure: the corresponding macro-actor, two Plus actors feeding a Times actor]
Figure 2.6: (a) The occam process (b) Corresponding macro-actor
Chapter 3
OCCAMFLOW TRANSLATOR
A compiler must perform two major tasks: the analysis of a source program and the synthesis of its corresponding object program. Basically, a compiler consists of five components: lexical analyzer, syntactic analyzer, semantic analyzer, code generator and optimizer [3,63]. Since our translator accepts the IF1 code, the output of the SISAL compiler, as a source program, the two analysis phases, lexical analysis and syntactic analysis, can be bypassed. This chapter presents the issues involved in the implementation of the occamflow translator.
3.1 Overview
The translator initially extracts the information from IF1, the output of the SISAL compiler, to create a set of PSG and DFG. Macro-actors are then generated by lumping simple nodes based upon the source code structure, locality, and data traffic between actors. By building this knowledge into a communication cost matrix, the data-flow graph can be further partitioned and translated into partitioned occam processes to be placed on transputers.
3.1.1 The Translator Environment
The translator can be logically divided into five phases: SISAL compilation, graph generation, basic partitioning, optimization, and occam code generation. Figure 3.1 gives an overview flowchart of the translator, depicting its passes rather than its five logical phases:
• Pass 1 corresponds to the SISAL compilation phase. It translates the SISAL user program into IF1.
• Pass 2 corresponds to the graph generation phase. It scans the IF1 description and generates a graph which consists of two subgraphs, the PSG (Program Structure Graph) and the DFG (Data-Flow Graph).
• Pass 3 corresponds to the basic partitioning phase. Based on the PSG and DFG, it generates a partitioned data-flow graph (PDFG), a channel table and a communication cost matrix.
• Passes 4 through N-1 correspond to the optimization phase. Each separate type of optimization may require several passes of re-partitioning. These passes also need resource information (the number of PEs available, etc.) to facilitate the work. A study of optimization will be presented later.
• Pass N corresponds to the occam code generation phase. It traverses the PSG and PDFG, considers each partitioned group, eliminates the unnecessary channels, and generates the object occam code.
[Figure: flowchart of the translator: SISAL compilation, graph generation (producing the PSG and DFG), basic partitioning (Pass 3, producing the channel table, PDFG, and communication cost matrix), optimization (Passes 4 through N-1, using resource information), and occam code generation (Pass N, using the macro definition table and producing the occam program)]
Figure 3.1: The translator
Pass 1 is supported by the SISAL compiler and needs no further explanation, as it has been covered in Chapter 2. Passes 4 through N-1 will be discussed in Chapter 4. In the following sections, Pass 2, Pass 3 and Pass N will be described in detail. Appendix A describes the usage of the occamflow translator.
3.1.2 Intermediate Files
Through the translation from the IF1 graph to occam code, a sequence of files is generated. In addition to the translated occam code, which can be run on transputers, other intermediate files generated by the translator can also be used for further applications.
• PSG (Program Structure Graph) and DFG (Data-Flow Graph): This graph file directly reflects the IF1 descriptions. It contains a combined graph of the PSG and DFG. The structure information is carried by the compound nodes while the dependency information is carried by the simple nodes. Since this file contains the required information for data-flow processing, it can be considered as another intermediate form for transformation to other object languages.
• PDFG (Partitioned Data-flow Graph): Based on the PSG and DFG, a basic partitioning process is performed to lump together those simple nodes that potentially have high communication cost [32]. This graph information can be modified after further partitioning, and can also be used to generate a partitioned object code, i.e., a data-flow program with macro-actors.
• Communication cost matrix: This file describes the communication cost between partitions. In order to achieve better performance, re-partitioning of the graph based on a specified optimization technique is required. According to the number of available PEs, the topology of the interconnection network, and the current communication cost matrix, a partitioning process is performed and a new communication cost matrix is generated.
• Allocation information: This file is generated after the optimization phase. It indicates the PE on which each process should be allocated. This information can also be used to load occam processes onto their corresponding PEs automatically.
• Macro instruction: According to the PDFG, each simple node is translated into a macro instruction which contains the actor code, arc information and partition information. The arc information provides each arc name, arc type and the data passing method, while the partition information indicates the partition to which the actor belongs. Applying a macro definition table, macro instructions can be expanded into an object program.
3.2 Graph Generation
IF1 is similar to an assembly code. A sample SISAL program and its corresponding IF1 statements are shown below, while the corresponding data-flow graph is shown in Figure 3.2.
*** A simple SISAL program ***

define demo
function demo(x,y: real; returns real)
  x*x + y*y
end function

*** IF1 descriptions ***

T 1 1 0 %na=Boolean
T 2 1 1 %na=Character
T 3 1 2 %na=Double
T 4 1 3 %na=Integer
T 5 1 4 %na=Null
T 6 1 5 %na=Real
T 7 1 6 %na=Wild
T 8 10
T 9 8 6 0
T 10 8 6 9
T 11 3 10 9
C$ C Faked IF1CHECK
C$ D Nodes are DFOrdered
C$ E Common Subs Removed
C$ F Livermore Frontend Version 1.5
C$ G Constant Folding Completed
C$ L Loop Invars Removed
C$ O Offsets Assigned
X 11 "demo" %ar=5 %sl=6
E 3 1 0 1 6 %of=1 %mk=V
N 1 152 %sl=7
E 0 1 1 1 6 %na=x %of=2 %mk=V %sl=7
E 0 1 1 2 6 %na=x %of=2 %mk=V %sl=7
N 2 152 %sl=7
E 0 2 2 1 6 %na=y %of=3 %mk=V %sl=7
E 0 2 2 2 6 %na=y %of=3 %mk=V %sl=7
N 3 141 %sl=7
E 1 1 3 1 6 %of=4 %mk=V
E 2 1 3 2 6 %of=5 %mk=V
[Figure: two Times nodes (Node 1 and Node 2) feeding a Plus node (Node 3)]
Figure 3.2: A simple IF1 graph
Each line of IF1 corresponds to a statement, and the first character of the line distinguishes nodes from edges, types, and so on. According to the first character of each IF1 statement, the graph generator performs the corresponding action to generate the PSG and DFG. Table 3.1 lists the types of IF1 statements and the corresponding semantic actions performed by the graph generator.
3.2.1 Simple Nodes
A simple node in IF1 corresponds to an actor in the data-flow graph. It will be translated into an occam process in the code generation phase. There are 54 simple nodes in IF1, coded from 100 to 153. A listing of the implemented nodes and their occam definitions is given in Appendix B.
Table 3.1: Semantic actions

Character   Meaning                        Action
C           comment                        no action required
T           type description               update the type table
N           node description               create a simple node
{           beginning of a compound node   create a compound node
}           ending of a compound node      terminate the current compound node
E           edge description               create an edge and attach it to the corresponding node
L           literal edge                   create a literal edge and attach it to the corresponding node
X           beginning of a graph           create a function node
G           beginning of a subgraph        terminate the previous subgraph
I           an external function           record the external function name
A simple node is created as a linked list. It contains an operation code to identify the function of the actor, a partition number to indicate the partition to which this actor belongs, the number of input arcs, and the number of output arcs. The simple node has three link fields that link to its first input arc, its first output arc, and the next simple node with the same partition number. The type of the operation can be determined by examining the types of the input arcs and output arcs. Figure 3.3 shows such a linked list, representing an "AElement" node that selects one element from an array.
[Figure: linked-list representation of an AElement simple node: an opcode field (105), a partition number, the numbers of input and output arcs, and links to the first input arc (a real array and an integer index), the first output arc (a real), and the next simple node in the same partition; each arc record carries a name, sequence number, channel number, type, and link field]
Figure 3.3: A simple node
3.2.2 Compound Nodes
Compound nodes contain subgraphs in IF1. They carry the information of the program constructs. In the PSG, a compound node has an opcode to identify its function, a clink that links to other compound nodes directly under its control, and an slink that links to its subordinate simple nodes. Three important compound nodes are described below to illustrate their control operations: forall operations, while repeat/repeat until loops, and select operations.
FORALL operation  A forall node contains three parts: "RangeGenerator", "Body", and "Gather". We described earlier a SISAL function for the addition of two arrays. There, a forall node controls the addition of the vectors: all N values of the sum of arrays A and B could be computed in parallel, and the results collected into an array C. After the SISAL compilation, an IF1 graph which includes the PSG and DFG, as shown earlier in Figure 2.3, is produced. The graph can be divided into three partitions. The first is a "RangeGenerator" actor; it broadcasts the index values to the second partition. The second partition is a group of three simple nodes which are all children of a block node: two "AElement" actors receive the indices sent from partition 1 to select the array elements A[i] and B[i], and a "Plus" actor adds these two elements and sends the result to the "AGather" actor to form the resulting array.
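As a rough illustration (a hedged sketch, not the translator's actual output; the channel names index.in and sum.out and the locally held arrays A and B are hypothetical), the body partition of this forall could be rendered as a single occam process of the following shape:

INT i, ai, bi:
SEQ
  index.in ? i       -- index broadcast by the RangeGenerator partition
  PAR                -- the two AElement selections are independent
    ai := A[i]
    bi := B[i]
  sum.out ! ai + bi  -- Plus actor; the sum goes on to the AGather actor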
WHILE REPEAT/REPEAT UNTIL loops  "LoopA" and "LoopB" in IF1 are used to implement the SISAL constructs "while ... repeat" and "repeat ... until" respectively. They both contain four parts: "initial", "test", "body", and "result", and the testing is performed in the "test" part of both compound nodes. The only difference between the two is that LoopA performs a leading test, whereas LoopB performs a trailing test. According to the data-flow principles of execution, the leading test and trailing test are determined by data dependencies only; the choice therefore does not affect the correctness of the execution of the process. Thus, we may treat them as the same node, Loop, in the occam code generation.
Consider a SISAL function that uses a "Loop" construct to generate a sequence of values to form an array; its translated graph is shown in Figure 3.4. The "initial" part of the "Loop" broadcasts the initial value to the other three parts, and the "test" part checks the condition and sends the boolean value to the "body" and "result" respectively. Once the "body" receives the proper condition, it performs the multiplication to generate a new value and sends this value to the "result" part. At the same time, the "result" part of the "Loop" gathers the values sent from the "body" to form an array.
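For illustration only (a hedged sketch; the channel names, the doubling body and the termination condition are hypothetical, since the original SISAL function is not reproduced here), a leading-test Loop node of this kind maps naturally onto an occam WHILE:

INT x, limit:
SEQ
  init.in ? x          -- "initial" part: starting value
  init.in ? limit
  WHILE x < limit      -- "test" part
    SEQ
      x := x * 2       -- "body": generate the next value by multiplication
      result.out ! x   -- "result": stream each value to the gathering actor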
[Figure: a Loop compound node with its initial, test, body, and result parts; the test contains a Less actor, the body a Times actor and a NoOp, and the result an AGather actor]
Figure 3.4: An example with loop operation

SELECT operation  "Select" nodes are used in IF1 to implement the SISAL construct "if ... then ... else". They contain three parts: "predicate", "then" and "else". Consider a simple SISAL program with a "Select" node; it can be translated by the SISAL compiler into an IF1 graph as
shown in Figure 3.5.
[Figure: a Select compound node whose predicate, then, and else parts appear under a block node]
Figure 3.5: An example with select operation
The "predicate" part of the "Select" node tests whether a is less than b, and sends the result to the "Int" actor to generate a boolean value that activates the execution of the "then" part or the "else" part.
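A minimal sketch of how such a Select might look if lumped into one occam process (hypothetical channel names and branch values; this folds the predicate and both branches into a single process, which need not match the translator's actual output):

INT a, b, result:
SEQ
  PAR
    C0001 ? a
    C0002 ? b
  IF
    a < b            -- "predicate" part
      result := a    -- "then" part
    TRUE
      result := b    -- "else" part
  C0003 ! result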
3.2.3 Function Node and Partition Node
In addition to the simple nodes and compound nodes of the PSG and DFG, two auxiliary nodes are created by the graph generator to facilitate the partitioning and code generation work.
Function node  A function node is created as a head which points to the beginning of the PSG and to the first partition node of the function; it also serves as a delimiter to separate functions in a program. A function node contains the function name, input/output arguments, and the channels used within the function. A function is the basic unit of a SISAL program to be translated into IF1, and the code generator of our system translates the entire graph of a function into an occam process. All information of the function node is collected during the graph generation phase.
Partition node  Before any partitioning has been performed, each simple node belongs to a separate partition which is pointed to by a link of a partition node. The partition node is constructed as a linked list; it links to the next partition and to the simple nodes inside the partition. A partition node also holds a record that provides the information of both its input and output arcs: a set of input arc names, a set of output arc names, the internal input/output arcs and the external input/output arcs. This information can be used to build up a communication cost matrix so as to facilitate the partitioning tasks.
3.3 Basic Partitioning
The translation described in the previous section could lead to some inefficient code generation. In order to better match the resulting code to the structure of the machine and to make a better identification of the structure of the original problem, we here describe an approach which is used in the first partitioning stage of the translation.

To proceed with the partitioning step, the IF1 graph is translated into the PSG and DFG. A high-level partitioning algorithm, based on the program structure and on heuristics, which generates a PDFG (Partitioned Data-Flow Graph), can be described as follows:

1. Traverse the PSG, starting from T, the root of the PSG.

2. If T is a compound node or a block node with compound nodes as children, then go to step (3) to traverse this subtree until the tree is exhausted. Otherwise go to step (5).

3. If the current child of T is a simple node, then place this simple node in a separate partition, collect its input and output information, store the communication links (I/O channels in occam) of this partition into the channel table, and traverse the next child of T. If T has no other children then go to step (6); otherwise repeat this step until the child of T is not a simple node.

4. Set T = child of T, go to step (2).

5. Process the block node which contains only simple nodes: place all the children of this block node in the same partition, collect the input and output information of this partition, and store the communication links of this partition.

6. With T = previous T, return to step (2).

This partitioning algorithm proceeds recursively: it starts from the root of the PSG and traverses the PSG until the tree is exhausted. Applying this algorithm, a block node which contains only simple nodes will be grouped into one partition. The partition becomes an occam process which may contain several macro-actors. Therefore, the communication costs between simple nodes of this partition are eliminated. However, some degree of parallelism will be sacrificed, since the execution of this process will be sequential. Further optimization work will be discussed in the next chapter. In addition to the PDFG, this procedure also creates a "Channel Table" which contains all the I/O links of a partitioned group. After partitioning, communication costs between partitioned groups can be calculated by using the channel table and the communication cost matrix which is also generated by the partitioning procedure. We can optimize the PDFG by considering the communication costs and the load balancing as well as the degree of parallelism. In addition, the communication cost matrix can be used by the allocator for optimal task assignment. The allocator repartitions the PDFG to obtain an optimal occam process and to determine the allocation of the process. This will be discussed in the next chapter.
3.4 Implementation of Data Structures
The method adopted to handle arrays in this system is similar to that used in the HDFM [26]. As opposed to the complex system of heaps [18] or I-structures [6], we have chosen the simplified option of von Neumann arrays which are never updated until it is determined that no more read accesses can be made to the current value of the array. Only then can the array be modified and become a new array. This sequence of reads followed by one write is compiler controlled. This method brings the very important advantage that no complex mechanisms are needed to ensure the safety of array operations. This comes at the expense of a possible compiler-induced loss of parallelism.

3.4.1 Representation and Implementation of Arrays

In our system, a multi-dimensional array is represented by a one-dimensional array of pointers to "row arrays" having the next lower dimension. Multi-dimensional arrays are stored as arrays of arrays. This representation corresponds to the occam array construct. Figure 3.6 shows an IF1 graph for reading an element from a two-dimensional array. The first read access (AElement in IF1) to the two-dimensional array provides a pointer to the "row array", and the second read access gets the value of the element.
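In occam this two-step access is simply two levels of subscripting. The short sketch below (the procedure name and parameters are assumptions for illustration, not part of the generated code) mirrors the two AElement actors of Figure 3.6:

PROC read.element (VAL [][]INT a, VAL INT i, j, INT element)
  SEQ
    -- the first AElement actor selects the row array a[i];
    -- the second AElement actor selects the element within that row
    element := a [i][j]
: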
Figure 3.6: Implementation of a two-dimensional array

Some basic array operations are described below. Their corresponding data-flow actors are shown in Figure 3.7.

• Create: using the AFill actor to create an array filled with the value given on input arc three. The integer on input arc one gives the lower bound, and the integer on arc two gives the upper bound of the array being built.

• Select: using the AElement actor to extract the element of the array at the index position given on input arc two. Note that only one level of subscripting is done.

• Replace: using the AReplace actor to produce an array that differs from the input array at a given index. The integer on input arc two gives the index position to be changed.

• Concatenate: using the ACatenate actor to produce the catenation of its input arrays.

3.4.2 Representation and Implementation of Streams

In IF1, streams have nearly the same set of operations as arrays. Since the only data structure provided in occam is the array, all stream manipulations are implemented on top of array operations (a small sketch of this follows the list below). Figure 3.8 shows some important actors for stream operations, which are described as follows.
Figure 3.7: Array operations (a) Create (b) Select (c) Replace (d) Concatenate

Figure 3.8: Stream operations (a) Append (b) Select first element (c) Select all but first element

• Create: similar to the case of arrays, using the AFill actor to create a stream filled with the value on input arc three, but the integer on arc one is ignored and the integer on arc two provides the length of the stream.

• Append: using the AAddh actor to add an element to the high end of the stream. The input stream is on arc one and the element to be appended is on arc two.

• Select first element, i.e., the stream_first operation: using the AElement actor to select the first element of the stream. Note that the input stream is on arc one, and the second input arc is a constant with value 1.

• Select all but first element, i.e., the stream_rest operation: using the ARemL actor to remove the lowest element of the stream.

• Concatenate: similar to the case of arrays, using the ACatenate actor to produce the concatenation of its input streams.
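As a rough sketch of that idea (the procedure names, the use of a separate length count, and 0-based occam indexing are assumptions for illustration), stream_first and stream_rest can be expressed directly as array operations:

PROC stream.first (VAL []INT s, INT first)
  first := s [0]          -- the first element of the underlying array
:
PROC stream.rest (VAL []INT s, VAL INT len, []INT rest)
  SEQ i = 0 FOR len - 1   -- copy everything but the lowest element
    rest [i] := s [i + 1]
: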
3.5 Occam Code Generation
The occam code is generated in two steps: the macro instruction generation step, which creates macro instructions directly from the PDFG, and the code generation step, which expands the macro instructions to generate the object code. Note that the object code selected here is occam, the low-level language of the Transputer. As described before, there exist occam constructs which can imitate the behavior of the data-flow model of computation.

In the first step of the code generation, the PDFG is read to create a sequence of macro instructions which can be considered as an intermediate representation code to be used for generating the target language [24,25]. The code generator references the channel table and takes the I/O information from the simple node as the arguments of the macro instruction. For instance, a macro instruction which represents a "Plus" operation and its corresponding occam code are shown in Figure 3.9.

From the id of each arc, we can determine the type of each arc (a global variable, a literal, a function name, or a channel), so as to find out the type of the instruction. We can then use the code of the instruction as an index and the type of the instruction as a sub-index to access a predefined macro definition table for finding the corresponding occam code. Two approaches for the code generation step have been applied:
3.5.1 Fine-grain Approach
This approach expands each macro instruction into a separate occam process. Each process corresponds to an actor. We can lump several processes together to build a bigger process (a macro-actor), to which some optimization techniques [30,32] can be applied. However, in order to expand the macro instruction of an actor which is controlled by some compound nodes, the number of data transfers through the arcs of this actor must be known at compile time. The number of data transfers can be considered as the number of tokens which should appear on the arc at run time; e.g., for n iterations of a loop the arc needs to transfer data n times, i.e., the sending actor must be translated to be able to send n tokens on its output arc while the receiving actor will expect to receive n tokens.

Figure 3.9(a) shows the macro instruction generated for a Plus actor that belongs to partition 3 and is under the control of a loop: it carries the instruction code and flag, the partition number, the numbers of input and output arcs, and, for each I/O arc, its name, type, and id (type 6 denotes real; an id below 10000 denotes a channel, and here all the I/O arcs are channels). Figure 3.9(b) shows the corresponding occam code:

REAL32 sys01 :
REAL32 sys02 :
SEQ
  C00005 ? sys01
  C00006 ? sys02
  C00007 ! sys01 + sys02

Figure 3.9: (a) A macro instruction for Plus (b) Corresponding occam code
Depending on the application program, the number of data transfers can be expressed either as a constant or as an expression. Since an unmatched number of data transfers between the sending and the receiving processes in a data-driven machine will result in a deadlock, the exact number of data transfers must be generated at compile time in this approach. Thus, for a program consisting of nested loops, the number of data transfers of an actor in each level of the loops must be carefully calculated at compile time. Moreover, if there is an if..then..else construct, we must prevent actors, either in the then part or the else part, from starving. Since only one branch of the if..then..else construct will be executed at run time, actors in the unexecuted branch would still be waiting for data input in the data-driven environment and could stop the program from continuing. Thus, to ensure that neither branch starves, in addition to sending data to the TRUE branch, a message must be sent to the FALSE branch to match the input requests of actors in the branch.
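The sending side of that arrangement might look roughly as follows (a sketch only; the procedure name, the channel names, and the simple comparison are assumptions made for the example). The predicate process sends the data token to the branch that is selected and a dummy token to the other branch, so that the input request of the unexecuted branch is also satisfied:

PROC predicate (VAL INT a, b, CHAN OF INT to.then, to.else)
  IF
    a < b
      SEQ
        to.then ! a      -- real data to the TRUE branch
        to.else ! 0      -- dummy token so the FALSE branch does not starve
    TRUE
      SEQ
        to.else ! b
        to.then ! 0
: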
3.5.2 Macro-actor Approach
Three compound nodes of IF1 have been implemented in the system: Forall facilitates the vector operations, Loop is used for the while..repeat and repeat..until constructs, and the Select node is used to implement the if..then..else statement. In this approach we consider the first actor under the control of a compound node as a control actor. The control actor is translated into a control process which not only performs the original function of the actor but is also used to control all other actors which are under the control of the same compound node. On the same PE, the control process initiates the other sub-processes which are under its control by a procedure call, and then transfers data through channels. When processes are allocated to different PEs, the data transfer between the control process and the other processes simply passes through external channels. In this approach, the number of data transfers between two processes need not be generated at compile time. The control process will automatically initiate or terminate the processes which are under its control at run time. This approach is especially useful for translating complex compound nodes in IF1.
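A minimal sketch of the idea is given below (the procedure names, the square-of-index body, and the channel layout are assumptions for illustration, not the system's generated code). The control process plays the part of the range generator, starts the sub-process it controls by a procedure call placed in a PAR, and then drives it through internal channels, so the number of transfers never has to be fixed inside the sub-process itself:

PROC body (VAL INT n, CHAN OF INT index, result)
  SEQ k = 0 FOR n
    INT i:
    SEQ
      index ? i
      result ! i * i             -- stand-in for the real loop body
:

PROC control (VAL INT n, CHAN OF INT gather)
  CHAN OF INT index, result:
  PAR
    body (n, index, result)      -- sub-process started by a procedure call
    SEQ k = 0 FOR n              -- original function: generate the indices
      INT r:
      SEQ
        index ! k
        result ? r
        gather ! r               -- passed on to a result-collector elsewhere
: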
3.5.3 Example: Translation of Matrix Multiplication
Program
As we mentioned earlier, a SISAL program which performs the multiplication of two matrices was shown in section 2.2.2. After the SISAL compilation, the generated IF1 graph was shown in Figure 2.4. In Figure 2.4, actors "RangeGenerator1" and "RangeGenerator3", first children of nodes "forall1" and "forall3" respectively, broadcast the index values i and k to the actors "AElement1" (Array Element select) and "AElement2" respectively. Once actors "AElement1" and "AElement2" receive the index values, they forward the pointers A[i,*] and B[k,*] to the actors "AElement3" and "AElement4". "AElement3" and "AElement4" are also waiting for the index values k and j, which are sent from the actors "RangeGenerator3" and "RangeGenerator2", in order to generate the elements A[i,k] and B[k,j] respectively. The "Times" actor receives the two elements A[i,k] and B[k,j] and sends the product to the "Reduce" actor, which accumulates the received data and forwards the result to actors "AGather1" as well as "AGather2" to form a two-dimensional array.

Without further partitioning, the above graph can be translated into an occam program as shown below.
PROC MatMult (VAL [][]INT a, VAL [][]INT b, VAL INT m,
              VAL INT n, VAL INT l, [][]INT OUTPTA)
  CHAN OF ANY C00024, C00007, C00010, C00014, C01401, C00017,
              C00011, C00018, C00019, C00016, C00999, C00013,
              C01301, C99901, C00009, C99902, C01001, C01002,
              C01101:
  PROC proc00013 ()
    PAR
      INT sys01, sys02:
      BOOL systest, sys03:
      [M][N]INT sys04:
      SEQ                              -- AGather2 (107)
        systest := TRUE
        sys01 := 0
        WHILE systest
          ALT
            C00009 ? sys04 [ sys01 ]
              SEQ
                sys01 := sys01 + 1
            C99901 ? sys03
              SEQ
                systest := sys03
        OUTPTA := sys04
  :
  PROC proc00012 ()
    PAR
      INT sys01, sys02:
      BOOL sys03, systest:
      [N]INT sys04:
      SEQ                              -- AGather1 (107)
        systest := TRUE
        sys01 := 0
        WHILE systest
          ALT
            C01301 ? sys04 [ sys01 ]
              SEQ
                sys01 := sys01 + 1
            C00999 ? sys03
              SEQ
                systest := sys03
        C00009 ! [ sys04 FROM 0 FOR sys01 ]
  :
  PROC proc00011 ()
    PAR
      INT sys01, sys03:
      BOOL systest, sys02:
      SEQ                              -- Reduce
        systest := TRUE
        sys01 := 0
        WHILE systest
          ALT
            C00016 ? sys03
              SEQ
                sys01 := sys01 + sys03
            C99902 ? sys02
              SEQ
                systest := sys02
        C01301 ! sys01
  :
  PROC proc00007 ()
    PAR
      [N]INT sys01:
      INT sys02:
      SEQ                              -- AElement2
        C00010 ? sys01
        C00014 ? sys02
        C00018 ! sys01 [ sys02 ]
      INT sys02:
      SEQ                              -- AElement3
        C01401 ? sys02
        C00017 ! b [ sys02 ]
      [N]INT sys01:
      INT sys02:
      SEQ                              -- AElement4
        C00017 ? sys01
        C00011 ? sys02
        C00019 ! sys01 [ sys02 ]
      INT sys01, sys02:
      SEQ                              -- Times
        C00018 ? sys01
        C00019 ? sys02
        C00016 ! sys01 * sys02
  :
  PROC proc00006 ()
    PAR
      [N]INT sys01:
      INT sys03:
      INT sys02:
      INT sys04:
      SEQ                              -- RangeGenerator3 (142)
        C01101 ? sys02
        C01001 ? sys01
        sys04 := ( n - 1 ) + 1
        SEQ sys03 = 0 FOR sys04
          PAR
            C00014 ! sys03
            C00011 ! sys02
            C01401 ! sys03
            C00010 ! sys01
            proc00007 ()
        C99902 ! FALSE
  :
  PROC proc00005 ()
    PAR
      [N]INT sys01:
      INT sys03:
      INT sys04:
      SEQ                              -- RangeGenerator2 (142)
        C01002 ? sys01
        sys04 := ( l - 1 ) + 1
        SEQ sys03 = 0 FOR sys04
          PAR
            C01101 ! sys03
            C01001 ! sys01
            proc00006 ()
            proc00011 ()
        C00999 ! FALSE
  :
  PROC proc00003 ()
    PAR
      INT sys02:
      SEQ                              -- AElement1
        C00007 ? sys02
        C01002 ! a [ sys02 ]
  :
  PROC proc00002 ()
    PAR
      INT sys03:
      INT sys04:
      SEQ                              -- RangeGenerator1 (142)
        sys04 := ( m - 1 ) + 1
        SEQ sys03 = 0 FOR sys04
          PAR
            C00007 ! sys03
            proc00005 ()
            proc00003 ()
            proc00012 ()
        C99901 ! FALSE
  :
  PAR
    proc00002 ()
    proc00013 ()
:
Actors "AElement2", "AElement3", "AElement4" and "Times" are in the same partition, actors "AGather1" and "AGather2" are grouped into another partition, and the remaining actors are in separate partitions. Each partition corresponds to an occam process and can be executed in parallel.

Chapter 5

OPTIMIZATION: PARTITIONING AND ALLOCATION

Optimization operations such as constant folding, redundant-subexpression elimination, invariant operation elimination, and strength reduction in loop optimization have been done in the SISAL compilation phase of the system [61]. The optimization strategies discussed here are mainly concerned with the minimization of communication costs as well as with the efficient implementation of array operations and function calls.
4.1 Communication Cost Thresholding
In order to match the number of partitions to the number of available PEs, a repartitioning process is required. In this section, we present the communication cost thresholding method, which lumps together those partitions with communication costs greater than a specified value; this allows for a decrease in the number of partitions and a reduction in the total communication costs.
4.1.1 Communication Cost Matrix
As described in section 3.3, after the basic partitioning step, a communication cost matrix is generated. For a PDFG with n partitions, the communication cost matrix is a 2-dimensional n x n array, say A, with the property that A(i,j) is the communication cost between processes i and j. The communication cost can be estimated as the amount of data to be transmitted multiplied by the path length. In a mesh-connected network with 16 PEs, the path length is 0, 1, 2, or 3. Note that, in actuality, after the basic partitioning phase the communication cost matrix contains the data size only, while the path length will be determined after the allocation phase. The communication cost matrix is built during the partitioning phase and modified in the optimization phase. Before allocation, basically, the communication cost between two partitions is one unit for transferring each simple value. Thus, in a loop, the cost of communication is proportional to the number of iterations. These data can be obtained at compile time, but they may be either a constant or a variable name.
4.1.2 Threshold Method
When the number of PEs is much less than the number of partitions, a repartitioning technique will be helpful in reducing the number of partitions. Assuming that all PEs are completely connected, i.e., the path length between different PEs is 1, we can then define a threshold value for repartitioning in order to allocate all the processes to the PEs and reduce the communication costs. In other words, the goal of the algorithm is to lump together those partitions between which communication costs greater than a specified value have been found. The algorithm of the threshold method is described below:
1. Let D := n - 1

2. Determine the threshold value: S := ( Σ_{i=1}^{n-1} Σ_{j=i+1}^{n} A[i,j] ) / D

3. For i = 1 to n - 1 do the following steps:

4. In row i of the communication cost matrix, lump partition i with those partitions which have communication costs > S

5. Calculate the communication costs between the new partitions

In the algorithm, we have selected D = n - 1 to ensure that the threshold value S will be in the range between the minimum and maximum values of A[i,j] (i,j = 1,2,..,n) and to prevent combining all partitions into a single partition. After repartitioning, if the number of partitions is still much greater than the number of available PEs, we may let D = D - 1 and repeat this procedure for further partitioning until the number of partitions is less than or equal to the number of available PEs and the largest communication cost in the matrix is less than the execution time of the biggest actor. Since step (2) has O(n²) complexity and steps (3) to (5) need O(n²) operations, the time complexity of the algorithm is O(n²), where n is the number of partitions.
4.1.3 Example
In order to match the number of processes to the number of PEs without losing excessive parallelism, the communication cost thresholding method is applied to repartition the PDFG. For the matrix multiplication example described in section 3.5.3, after basic partitioning the matrix multiplication program needs 7 processes in occam. If the threshold method is applied, the required processes are reduced to 5, and the communication costs among the "RangeGenerator3" actor, the "Reduce" actor, and the "AElement2, AElement3, AElement4, Times" macro-actor are eliminated. With further repartitioning, the processes can be reduced from 5 to 3 by lumping the macro-actor and "AGather1,2" together. Figures 4.1, 4.2, and 4.3 show the communication cost matrices generated in each step. Note that after each partitioning process has been performed, channel communications within a partition should be eliminated to reduce the internal communication costs. For instance, the occam code of process proc00007 in the occam program of the matrix multiplication example shown in section 3.5.3 is re-generated below, in which the internal channels C00017, C00018, and C00019 have been eliminated. Note that the maximal communication cost is now less than the execution time of the biggest actor "RG2, RG3, AE2,3,4, Times, Reduce, AG1,2," so that the algorithm can terminate.
PROC proc00007 ()
  PAR
    [N]INT sys01:
    INT sys02:
    INT sys03:
    SEQ
      C00010 ? sys01
      C00014 ? sys02
      C00011 ? sys03
      C00016 ! sys01 [ sys02 ] * b [ sys02 ][ sys03 ]
:
                RG1   AE1   RG2   RG3   AE2,3,4,Times   Reduce   AG1,2
RG1              0     m     0     0         0             0       0
AE1              0     0     0     0         m             0       0
RG2              0     0     0     0         ml            0       0
RG3              0     0     0     0         mnl           0       0
AE2,3,4,Times    0     0     0     0         0             mnl     0
Reduce           0     0     0     0         0             0       ml
AG1,2            0     0     0     0         0             0       0

Figure 4.1: Communication cost matrix (Basic partitioning)

                           RG1   AE1   RG2   RG3,AE2,3,4,Times,Reduce   AG1,2
RG1                         0     m     0              0                  0
AE1                         0     0     0              m                  0
RG2                         0     0     0              ml                 0
RG3,AE2,3,4,Times,Reduce    0     0     0              0                  ml
AG1,2                       0     0     0              0                  0

Figure 4.2: Communication cost matrix (2nd partitioning)

                                     RG1   AE1   RG2,RG3,AE2,3,4,Times,Reduce,AG1,2
RG1                                   0     m                    0
AE1                                   0     0                    m
RG2,RG3,AE2,3,4,Times,Reduce,AG1,2    0     0                    0

Figure 4.3: Communication cost matrix (3rd partitioning)
4.2 Code Replication
In most numerical computation programs, many portions of the program are executed repeatedly, such as the bodies of loops and the functions. These repeatedly executed codes may be replicated and allocated to separate PEs to achieve a higher degree of parallelism.
4.2.1 Forall Construct and Loop Unrolling
SISAL has two forms of for: the product form and the non-product form. The latter is equivalent to the conventional for loop with regard to the range of the indices generated. The product form corresponds to the IF1 forall node, which is a compound node containing a range-generator, a block which actually performs the operations, and a gather node. The body of the loop can be executed in parallel, as this construct ensures that there are no data dependencies between two iterations of the loop. Our approach consists in using the concept of loop unrolling [3,55], a very efficient optimization approach for array operations. This is particularly true for operations which are under the control of the IF1 forall pseudo-node.

As mentioned in the previous chapter, a forall node contains three parts: an index-generator, a body, and a result-collector. If p PEs are available for loop unrolling, then the index-generator process may generate p indices and propagate them through p independent channels simultaneously. The body and the result-collector parts of the forall node are grouped in the same partition and replicated p times. They are allocated to p separate processes and receive indices from the index-generator process through separate channels. In this case, the number of iterations is decreased from n to n/p, and p results can be created simultaneously. Thus, the speedup may approach p.
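A minimal sketch of such an unrolled forall is given below (the procedure names, the square-of-index body, and the round-robin dealing of indices are assumptions for illustration, and n is taken to be a multiple of p). One branch of the PAR plays the index-generator, p replicated branches play the body, and a last branch plays the result-collector:

PROC unrolled.body (CHAN OF INT index, result)
  INT i:
  SEQ
    index ? i
    result ! i * i                    -- stand-in for the real body
:

PROC unrolled.forall (VAL INT n, p, []INT out,
                      []CHAN OF INT index, result)
  PAR
    SEQ k = 0 FOR n / p               -- index-generator
      SEQ j = 0 FOR p
        index [j] ! ((k * p) + j)
    PAR j = 0 FOR p                   -- p replicated copies of the body
      SEQ k = 0 FOR n / p
        unrolled.body (index [j], result [j])
    SEQ k = 0 FOR n / p               -- result-collector (gather)
      SEQ j = 0 FOR p
        result [j] ? out [(k * p) + j]
: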
Two kinds of problems are encountered here. The first one deals with the amount of data to be sent to each PE. One has to take care of overlapping when several elements of the same array are needed to perform a specified operation. This problem can be solved by sending the same data to every processor.

The second problem concerns the determination of the number of iterations to be distributed to the different PEs where the unrolled bodies are located. Since the transputer network is not completely connected, some unrolled processes may require more than one hop to communicate with the root transputer. Depending upon the size of the array and the distance between the root transputer and the transputer where the unrolled process is allocated, we can determine the load of each replicated process to achieve an optimal solution by solving the following linear programming problem.
Minimize

    f = x_0 * t

subject to the constraints:

    x_i >= x_{i+1} >= 0                                                 for i = 0, 1, ..., n
    k*C_0 + 2*C_d * Σ_{j=n-k+1}^{n} x_j + x_n*t <= x_{n-k}*t            for k = 0, 1, ..., n
    Σ_{i=0}^{n} h_i * x_i = N

where

    x_i = size of the array to be executed in an unrolled process
          with i hops to the root transputer,
    t   = execution time of an unrolled process,
    n   = maximum number of hops allowed,
    C_d = unit cost of data transfer,
    C_0 = communication overhead,
    N   = size of the original array,
    h_i = number of transputers with i hops to the root transputer.

A detailed explanation of this model can be found in Appendix C.
4.2.2 Function calls
In the data-flow scheme, a function call can be considered as an actor which requires a function name and arguments on its input arcs to generate results. Figure 4.4 shows such an actor.

Figure 4.4: A function call actor

When the function and the calling process are located on the same PE, the calling scheme in occam can be expressed as follows:

    function_name (argument1, argument2, ..., result1, result2, ...)

As in other languages, the call actor receives the arguments and passes them to a procedure, named function_name, for computation to generate the results. This scheme can be implemented easily, but carries the penalty of losing a lot of parallelism. In this scheme, to call a function the calling process has to wait until the results have been generated completely. Moreover, the function cannot be called by other calling processes which are located on remote PEs.

When the function and the calling process are located on different PEs, it is possible to execute the function and the calling process in parallel. In this scheme, the communication between the calling process and the function is done through external channels. Once a function call operation is encountered, i.e., a call actor is fired, the calling process just passes the arguments to the specified function through an external channel; the other processes, which are not waiting for the results of the function, can be executed in the meantime. On the other hand, when the function process has received all of its input arguments, the specified operations can be executed, and the results are sent back to the calling process through an external channel. The major problem of this scheme is that a function cannot be called by several calling processes simultaneously if the function is not reentrant. However, non-reentrant functions can be duplicated according to the number of calling processes to achieve a higher degree of parallelism.
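The following is a minimal sketch of that remote calling scheme (the procedure names, the squaring body, and the single INT argument are assumptions for illustration): the caller fires the call by sending the argument, may carry on with work that does not depend on the result, and picks the result up later on a second external channel.

PROC square.function (CHAN OF INT args, results)
  WHILE TRUE                -- a resident function process on a remote PE
    INT x:
    SEQ
      args ? x              -- receive the input arguments
      results ! x * x       -- send the results back to the caller
:

PROC caller (CHAN OF INT to.func, from.func)
  INT r:
  SEQ
    to.func ! 7             -- the call actor fires: pass the arguments
    SKIP                    -- work not depending on r could be done here
    from.func ? r           -- receive the result when it is actually needed
: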
4.3 Process Allocation and Data Allocation

In order to achieve highly efficient computation in a multiprocessor system, we must develop an approach that avoids processor contention (instruction execution conflicts) and reduces communication costs. In addition to the basic partitioning, the loop unrolling mechanisms, the function call scheme, and the communication cost thresholding method described above, allocation is necessary to achieve good processor utilization and to minimize inter-processor communication costs.
4.3.1 Multiplexing, Demultiplexing, and Routing
After partitioning, all of the occam processes will be loaded onto transputers for execution. In order to ensure proper data transfer among processes, a router is required to handle the communication among those processes which are allocated on separate transputers. The functions of the router include multiplexing, demultiplexing, and routing. The router configuration is shown in Figure 4.5, and the router program is presented in Appendix D.

Multiplexing and Demultiplexing   In our research, the transputer network is constructed as a mesh topology for full utilization of the four links of each transputer. Since there is only one full-duplex physical link between each pair of transputers, it is not possible to consider direct folding of several (that is, more than one in each direction) occam channels onto one physical link. Instead, one must introduce the notion of multiplexer MUX and demultiplexer DMUX processes (see Figure 4.6).

While "logical" connections are used for transferring data between actual processes, these must be folded onto one physical link (and therefore one single occam channel). The core of the MUX process is an ALT construct, which authorizes the OR on the inputs and enables the activation of MUX when an input from any of the processes (P1, P2, etc.) is present. The input value is then transmitted transparently to the output channel mapped onto the inter-transputer physical link.
Figure 4.5: Router configuration (the router kernel on each transputer comprises an external channel receiver, a decoder, an internal channel sender, an internal channel receiver, and an external channel sender, serving the user's processes)

Figure 4.6: MUX and DMUX processes (logical links from processes P1, P2, ..., Pn on one transputer are folded by MUX onto the single physical link and dispatched by DMUX to Q1, Q2, ... on the other)
In the receiving transputer, the DMUX process accepts the messages and dispatches them to the proper receiving process (Q1, Q2, etc.). This dispatching is based on information contained in the message itself. Note that the DMUX process actually accomplishes a switching function based on the received data. The MUX process, on the other hand, performs no visible function except that of a merge operator. It is not possible to merge the data paths directly into a single occam channel due to hardware restrictions in the current implementation of the transputer: there is no provision to arbitrate conflicts between two processes attempting to take over the link at the same time.
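A minimal sketch of the pair is given below (the message format, the channel names, and the number of local processes are assumptions made for the example; the real router in Appendix D is more elaborate). Each message is taken to be two INTs, a destination tag followed by a value; MUX merges whichever producer is ready onto the link channel, and DMUX uses the tag to dispatch:

PROC mux ([3]CHAN OF INT from.procs, CHAN OF INT link.out)
  WHILE TRUE
    INT dest, value:
    ALT i = 0 FOR 3
      from.procs [i] ? dest          -- first word: destination tag
        SEQ
          from.procs [i] ? value     -- second word: the data itself
          link.out ! dest
          link.out ! value
:

PROC dmux (CHAN OF INT link.in, [3]CHAN OF INT to.procs)
  WHILE TRUE
    INT dest, value:
    SEQ
      link.in ? dest                 -- dispatching information in the message
      link.in ? value
      to.procs [dest] ! value
: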
Routing   In addition to implementing the MUX and DMUX processes for solving the problem of transferring data between two adjacent transputers, a routing process is necessary to accomplish the data transfer between two non-neighbor transputers. The router implemented in our system can be logically divided into five parts: "external.channel.receiver", "internal.channel.receiver", "external.channel.sender", "internal.channel.sender" and "decoder". In addition, a predefined routing table based on the network topology and the shortest path rule, a communication protocol, and buffering techniques are applied to complete the router.

• External.channel.receiver: It waits for a message arriving on one of the four external links of the transputer. Once data has been received, this receiver passes the destination address to the "decoder".

• Decoder: It takes the destination address to decide where the received message should go (a sketch of this step follows the list below). The basic behavior is to use the destination address as an index into the routing table to determine the direction of forwarding, either a physical channel through the "external.channel.sender" or a local channel through the "internal.channel.sender".

• Internal.channel.receiver: It polls all the non-local channels and accepts a message from one of them. Once data has been received, this receiver forwards it to the "external.channel.sender" along with a channel number.

• External.channel.sender: It accepts messages from the "decoder" or from the "internal.channel.receiver" and sends them to a specified external channel.

• Internal.channel.sender: This process delivers a message coming from a remote transputer to a local process. Once the message is determined by the "decoder" to be for a local process, the channel number in the message is used to identify the channel on which the data is to be forwarded; the data is stored in a buffer, and a corresponding flag is set to indicate that the data in the buffer is available. "Internal.channel.sender" checks the flags and forwards the data through the corresponding channel.
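A minimal sketch of the decoder step follows (the routing-table layout, the two-word message format, and the convention that a table entry of 4 means "deliver locally" are all assumptions made for illustration):

PROC decoder (VAL []INT route, CHAN OF INT from.receiver,
              CHAN OF INT to.external, to.internal)
  WHILE TRUE
    INT dest, value:
    SEQ
      from.receiver ? dest           -- destination address of the message
      from.receiver ? value
      IF
        route [dest] = 4             -- assumed convention: 4 means local
          to.internal ! value
        TRUE                         -- otherwise forward on an external link
          SEQ
            to.external ! dest
            to.external ! value
: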
4.3.2 Static Allocation
In order to achieve an efficient task assignment, we use a nearest-neighbor algorithm for our allocation scheme, i.e., based on the PDFG, for every allocated partition p we allocate all partitions that have connections to p on the neighboring transputers of p, as near as possible. A static allocator has been implemented in our system to determine the appropriate locations for allocating the partitioned processes to transputers.

According to the PDFG, we can build n 1-level trees. Each tree contains four leaves. Initially, all leaves of the tree are empty. The root of each tree corresponds to a partitioned node of the PDFG. These are represented as p[1], p[2], ..., and p[n]. The four leaves of the tree with root p[i] (represented as p[i,1], p[i,2], p[i,3], and p[i,4]) are arbitrarily selected predecessors or successors of the node p[i] in the PDFG and correspond to the east, north, west, and south links of the root p[i].

A static allocation algorithm applied to the mesh-connected transputer network is described below.

1. Allocate the first node to a PE arbitrarily, and put its adjacent nodes in its neighboring PEs.

2. Starting from the first child of the allocated nodes, do the following steps until all nodes have been allocated.

3. Allocate the neighbors of the selected node one by one, except the nodes that have already been allocated.

4. Adjust the location of the node to be allocated, so as to fit the links of this node to the other nodes already allocated.

In this algorithm, in order to allocate all partitions to the proper PEs, the number of partitions should not be greater than the number of available PEs. However, if the number of partitions is much greater than the number of available PEs, a dynamic allocation method (described later) should be applied together with this method to achieve an effective allocation. The number of iterations in the procedure for allocating a certain node to a PE is a constant, 4. However, in the loop, the procedure is called n times, where n is the number of PEs. Thus, the total processing time is proportional to the number of PEs, and the time complexity of this algorithm is O(n).
The maximum path length between two partitions in a mesh-connected network with p PEs is √p - 1. Thus, for random allocation, the expected path length is (√p - 1)/2, and the total communication costs are ((√p - 1)/2) * Σ_{i=1}^{n-1} Σ_{j=i+1}^{n} A[i,j]. Applying the static allocation method, the average path length between two partitions approaches 1 in the simulation of the examples. Therefore, the total communication costs after the static allocation will be Σ_{i=1}^{n-1} Σ_{j=i+1}^{n} A[i,j]. This corresponds to an improvement by a factor of (√p - 1)/2.
4.3.3 Dynamic Allocation
When the number of partitions is much greater than the number of available PEs, we may use at runtime a dynamic allocation method for task assignment. Assuming that p PEs are available, we can use a topological sorting procedure to obtain the topological order of the nodes of the PDFG, assign the first p nodes to the available PEs, and then start to execute the user's program. During the execution of the user's program, the system program collects the PEs which have completed their operations and assigns new actors to those PEs. This procedure is described below:
Procedure DAllocate;
begin
  repeat
    pick the nodes which have no predecessors (according to the PDFG);
    allocate these nodes using the static allocation method;
    delete these nodes and all their links from the PDFG;
  until (number of available PEs < 4);
end;

Cobegin
  execute the allocated actors; once the operation of an actor has finished,
  collect this used PE;
  DAllocate;
Coend;
The complexity of this algorithm is O(n), since it is based on the static allocation method. In the algorithm, there is a system program which collects idle PEs and performs the procedure "DAllocate" to assign the unallocated actors to these collected PEs. Note that this system program and the user program can be executed concurrently. Note also that this function is a potential bottleneck. However, since the allocation is performed by a separate processor, the efficiency can be improved by multiprocessing the allocation program. This approach can be achieved basically on top of the "dynamic code loading" mechanism of the transputer development system [42].
4.3.4 Data Allocation
In order to reduce the overhead of processing data structures, three ways of allocating data, depending on the characteristics of the application programs, are applied in our system:

1. Globally allocated: Data is allocated in a shared storage, while each process in a separate PE is intended to share the entire array. This is similar to the conventional approach, where array elements are stored in a "structure storage."

2. Locally replicated: Each PE has a copy of the array; each process in a separate PE requires access to the array but no modification is needed, or there exist no data dependencies among processes in different PEs. This approach takes advantage of the characteristics of the distributed memory multiprocessor system, in that it eliminates unnecessary data transfer over external channels.

3. Locally distributed: In this approach, the entire array is divided into several parts which are then distributed to different PEs. It is applied when each process in a separate PE requires access to only one portion of the array. This approach is especially useful for loop unrolling. It not only eliminates unnecessary data transfer over external channels, but also reduces the array size for allocation.

These schemes have been applied to various application programs. The experimental results are described in the following chapter.
Chapter 5

EXPERIMENTS AND PERFORMANCE EVALUATION

We have chosen to evaluate the performance of our system directly by observing the experiments on a certain number of test cases. This was done using our transputer multiprocessor architecture. Two sets of experiments have been performed, on T800 and T414 transputer networks respectively.

5.1 Experiment I: Livermore Loops, Histogramming, and Matrix Multiplication

In order to verify the correctness of the translator and to evaluate the performance of the optimization schemes, we first select a set of small benchmark programs, written in SISAL, to be translated into occam and executed on 8 T800 transputers.
5.1.1 Experiments and Experimental Results
The benchmark programs used in this experiment are listed below:

1. Livermore Loops [22]:

   • Loop1: Hydro fragment. This program returns an array of size n. The ith element is set to
     Q + Y(k) * (R * Z(k+10) + T * Z(k+11))

   • Loop7: Equation of state fragment. This program returns an array of size n. The ith element is set to
     U(i) + R*Z(i) + R²*Y(i) + T*(U(i+3) + R*U(i+2) + R²*U(i+1)) + T²*(U(i+6) + R*U(i+5) + R²*U(i+4))

   • Loop12: First difference. This program returns an array of size n. The ith element is set to
     Y(i+1) - Y(i)

   The SISAL versions of these programs as well as their translated occam codes are shown in Appendix E.
2. Histogramming: This is an image-processing algorithm [39]. A picture can be represented by a rectangular array of pixels. Each pixel has a small integer value between 0 and n-1 that represents a gray scale value of the picture. Histogramming involves keeping track of the frequency of occurrence of each gray scale value of the picture. In the program the gray scale value of each pixel is stored in an array digit, and slot represents the array that keeps the count of the number of occurrences of the gray scale values 0, 1, ..., n-1. A SISAL version of this program and its corresponding occam code can be found in Appendix F.

3. MMULT: Matrix multiplication

We measure the speedup, the ratio of the execution time of a program on a single transputer over the execution time of the same program on multiple transputers. The unit of execution time is a tick, 64 μs. The different data allocation methods described previously are also applied.

1. Livermore Loops: Two array sizes, 1000 and 50000, are used in each loop. The locally distributed data allocation method has been applied. Table 5.1, Table 5.2, and Table 5.3 show the experimental results of Loop1, Loop7, and Loop12 respectively, and the corresponding speedups are shown in Figure 5.1, Figure 5.2, and Figure 5.3.

2. Histogram: In this experiment two different sizes of digit, 1000 and 50000, are applied to 16 slots. Slots are evenly distributed to each transputer. The data allocation method is locally replicated. Table 5.4 shows the experimental result, and the speedup is shown in Figure 5.4.

3. MMULT: In this experiment, we compare two different sizes of matrices, 16 x 16 and 64 x 64. Data of one matrix is locally distributed, while data of the other matrix is locally replicated. Table 5.5 and Figure 5.5 show respectively the experimental result and the speedup for this experiment.
Table 5.1: Performance parameters for Loop1

                            problem size 1000       problem size 50000
number of PEs               exe. time   speedup*    exe. time   speedup
                            (ticks)                 (ticks)
1 PE                        2440        1.0         123212      1.0
2 PEs                       1211        2.0         61820       1.9
3 PEs                       572         4.2         42070       2.92
4 PEs                       468         5.2         32832       3.75
8 PEs (without buffering)   961         2.5         18255       6.7
8 PEs (single buffering)    248         9.0         3487        7.0

*For an explanation of the superlinear speedup, see section 5.1.2

Table 5.2: Performance parameters for Loop7

                            problem size 1000       problem size 50000
number of PEs               exe. time   speedup*    exe. time   speedup
                            (ticks)                 (ticks)
1 PE                        6711        1.0         332289      1.0
2 PEs                       3339        2.0         178488      1.86
3 PEs                       1649        4.0         114978      2.89
4 PEs                       1225        5.4         87675       3.79
8 PEs (without buffering)   662         10.1        46081       7.2
8 PEs (single buffering)    642         10.4        45214       7.3

*For an explanation of the superlinear speedup, see section 5.1.2

Table 5.3: Performance parameters for Loop12

                            problem size 1000       problem size 50000
number of PEs               exe. time   speedup*    exe. time   speedup
                            (ticks)                 (ticks)
1 PE                        1343        1.0         68176       1.0
2 PEs                       514         2.6         35226       1.93
3 PEs                       315         4.2         23895       2.85
4 PEs                       403         3.4         18500       3.68
8 PEs (single buffering)    397         3.3         9557        7.13

*For an explanation of the superlinear speedup, see section 5.1.2

Table 5.4: Performance parameters for Histogramming

                            problem size 1000       problem size 50000
number of PEs               exe. time   speedup     exe. time   speedup
                            (ticks)                 (ticks)
1 PE                        16014       1.0         801890      1.0
2 PEs                       8023        1.99        401900      1.99
4 PEs                       4071        3.93        203825      3.93
8 PEs                       2078        7.71        103836      7.72
Figure 5.1: Speedup for Loop1 (array sizes 1000 and 50000)

Figure 5.2: Speedup for Loop7 (array sizes 1000 and 50000)

Figure 5.3: Speedup for Loop12 (array sizes 1000 and 50000)

Figure 5.4: Speedup for Histogramming (array sizes 1000 and 50000)
Table 5.5: Performance parameters for Matrix Multiplication

                 problem size 16 x 16     problem size 64 x 64
number of PEs    exe. time   speedup      exe. time   speedup
                 (ticks)                  (ticks)
1 PE             4911        1.0          616901      1.0
2 PEs            2413        2.03         310719      1.99
4 PEs            1235        3.98         155711      3.96
8 PEs            622         7.89         77639       7.95
5.1.2 Interpretation of results
Here we discuss the issues that concern the results of the above experiments.

Topology of the network: The transputer network used for this experiment is mesh-connected. However, for the purpose of implementing loop unrolling, we use the network as a hierarchical one. The root processor acts as a dispatcher: it splits the data and sends them to several processors, as explained in section 4.2.1. The transputer has four links to communicate with the external world, and since the root transputer has to talk with the host computer, it can only have three direct slaves (first-level slaves), each one being able to have only one or two direct slaves (second-level slaves), because of a problem inherent to the transputer architecture (the boot path).

Figure 5.5: Speedup for Matrix Multiplication (problem sizes 16 x 16 and 64 x 64)
Speedup: The measured speedups have to be analyzed separately for the different problem sizes. Note that the transputer has an on-chip memory of 4 Kbytes which has a three times faster access rate than the off-chip memory (120 Mbytes/s compared to 40 Mbytes/s). This feature caused some interesting results to occur.

• Large problem size: The required data size is too big to fit in the on-chip memory, even after the partitioning. In this case the speedups obtained are close to linear, since the execution time is proportional to the amount of data to be processed.

• Small or intermediate problem size: The required data size is greater than the capacity of the on-chip memory, but it will fit in the on-chip memory after unrolling the loop. In this case a superlinear speedup may occur, since the operations in an unrolled loop need less memory access time, which shortens the entire execution time.

Computation/communication equilibrium: Depending on the ratio of the computation time over the communication cost of a problem, one can observe good performance even if some of the data are required to reach a distant processor to be processed. As one can observe in the LOOP1 case, the system achieves a superlinear speedup when two transputers are used, and the speedup remains superlinear even if data have to perform a second hop to reach their processor. There is a similar behavior in the LOOP7 case. Because they need more computation time, LOOP1 and LOOP7 achieve their best performance when using 8 PEs, where the computations overwhelm the communications. On the other hand, LOOP12 shows a different behavior. Having a smaller computation requirement, the system can achieve a superlinear speedup as long as the data fit in the on-chip memory and need only one hop to reach their target. However, when some data elements need several hops to reach their assigned processor, performance degrades significantly, since in this situation the communication costs are much greater than the computations.

Forwarding policy: In these experiments, we did not apply the complete router, but a simpler routing procedure on the first-level slaves. This procedure is executed in parallel with the processing function. We can observe the different behaviors of static and dynamic forwarding policies. For the static, non-buffering, forwarding policy the first-level PE returns its own results and then forwards the second-level results, while for the dynamic forwarding policy the first completed process on either the second level or the first level will be served by the forwarding procedure, i.e., the procedure can forward a result coming from the second level to the root transputer even if the result of the first level has not yet been generated. The latter scheme is closer to the real router, which will be able to serve any request at any time, provided that the physical path is free.
5.2 Experiment II: Shallow
In this section we present an experiment on a medium scale SISAL program, called shallow (see Appendix G), which is a weather program provided by the University of Leicester, England. A mesh-connected transputer network with 16 T414 transputers has been used for this experiment.

5.2.1 Experiment

The shallow program contains seven functions, main, initial, flux, height, potential, timestep, and smooth, where the initial function applies two built-in functions, sin and cos, to generate initial values. The program exhibits a sequential loop with parallel loops inside it. Initially, the main function calls initial to set up initial values, and then calls the other six functions within a sequential loop to generate the results. The program has been translated into occam and the loop unrolling technique has also been applied for parallel processing. The locally distributed allocation method has been applied for data allocation, i.e., all loops in the program are unrolled evenly across the transputers.
As performance measurements, we report execution time (in ticks, 64 μs), speedup, parallel effectivity, and distribution of load. Specifically,

    T(n) = execution time on n transputers
    T_i  = execution time on the ith transputer
    S(n) = speedup = T(1) / T(n)
    E    = effectivity = (Σ_{i=1}^{n} T_i) / (n * T(n))
    D_i  = distribution of load = T_i / ((1/n) * Σ_{i=1}^{n} T_i)
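As a worked example of these definitions (taking the entries of Table 5.6 for the 64 x 64 problem with 4 iterations), S(16) = T(1)/T(16) = 1124674/81353 ≈ 13.8, which matches the speedup of 13.82 listed in the table.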
Since the computations in the loops of the functions initial, height, and timestep require data from their neighboring PEs, i.e., each PE has to receive data from its right neighbor PE and send data to its left neighbor PE, the PE-to-PE traffic, the total amount in bytes transferred across links from one transputer to another, has also been measured in this experiment.
5.2.2 Performance results
The allocated shallow program has been run on our transputer network with different problem sizes, 16 x 16, 32 x 32, 48 x 48, and 64 x 64, for 1, 2, and 4 iterations. The performance results are described in this section.

Speedup   For measuring the speedup of the system, the translated shallow program has been executed on 1, 2, 4, 8, and 16 transputers and the loops in the program are unrolled accordingly. Table 5.6 gives the experimental results, and the speedups for 1, 2, and 4 iterations are shown in Figures 5.6, 5.7, and 5.8 respectively.

Effectivity   The effectivity of the system is measured as the percentage of the actual MIPS (million instructions per second) over the number of transputers times the maximum MIPS rate. It can also be measured as the percentage of the total time spent on all transputers over the number of transputers times the actual execution time, i.e., E = effectivity = (Σ_{i=1}^{n} T_i) / (n * T(n)). The effectivity for running the shallow program on 2, 4, 8, and 16 transputers with different problem sizes is shown in Table 5.7.

Distribution of load   As defined in the previous section, the distribution of load for a transputer is the ratio of the execution time on that transputer over the average execution time on all transputers. As mentioned earlier, owing to the characteristics of the shallow program, each loop in the program can be unrolled evenly. The expected distribution of load on each transputer is therefore 1. Since the distances between the root transputer and the other transputers are not equal, the communication cost on each transputer may differ. Thus, the measured distribution of load may turn out slightly different on each transputer. Tables 5.8 to 5.17 show the distributed load on each transputer for running the shallow program on 16, 8, 4, and 2 transputers.
5.3 Discussion
From the above experiments we conclude the following:

• As mentioned in section 3.2.2, the actors of vector operations are under the control of a forall compound node in IF1. Since such vector operations are easily detectable in IF1, the improvement by loop unrolling came at a low compiler cost.

• For loop unrolling, the limited number of communication links of a transputer restricts the number of replications of the loop: since excessive replication of loops requires longer communication paths, the communication costs would be increased. The method described in section 4.2.1 is useful for distributing the data sizes among the replications.

• In order to decrease the communication overhead for array operations, different types of data allocation are needed, according to the properties of the application programs. The locally replicated method does not affect the ratio of the data sizes stored in on-chip and off-chip memory, i.e., the speedup is not affected by the memory access time. However, in the locally distributed method the data size distributed to each PE decreases in proportion to the increase in the number of available PEs. Thus the ratio of the data size stored in on-chip to off-chip memory is increased by increasing the number of PEs.

• According to the experiments and the earlier speedup analysis, when the problem size is relatively large the actual speedup approximates the linear speedup. In fact, parallel processing is worthwhile only when the problem size is large enough.
Table 5.6 Execution time and speedup

problem size            16 x 16                      32 x 32
# PEs  # iterations     1       2       4            1        2        4
1      exe. time        28432   43971   75499        111193   169393   287891
       speedup          1.00    1.00    1.00         1.00     1.00     1.00
2      exe. time        14837   22766   38776        57139    86494    146185
       speedup          1.92    1.93    1.95         1.95     1.96     1.97
4      exe. time        8025    12304   20941        29750    45005    76119
       speedup          3.54    3.57    3.61         3.74     3.76     3.78
8      exe. time        4595    6999    11901        16008    24264    40949
       speedup          6.19    6.28    6.34         6.95     6.98     7.03
16     exe. time        2957    4442    7468         9302     13992    23465
       speedup          9.62    9.90    10.11        11.95    12.11    12.27

problem size            48 x 48                      64 x 64
# PEs  # iterations     1       2       4            1        2        4
1      exe. time        248105  376461  636331       439410   664612   1124674
       speedup          1.00    1.00    1.00         1.00     1.00     1.00
2      exe. time        126694  191246  322520       223812   337051   567821
       speedup          1.96    1.97    1.97         1.96     1.97     1.98
4      exe. time        65106   98332   165833       114158   171785   289383
       speedup          3.81    3.83    3.84         3.85     3.87     3.89
8      exe. time        34173   51677   87060        59203    89450    150656
       speedup          7.26    7.28    7.31         7.42     7.43     7.47
16     exe. time        19042   28661   48064        32225    48548    81353
       speedup          13.03   13.13   13.24        13.64    13.69    13.82

Figure 5.6: Speedup for Shallow with 1 iteration (problem sizes 16 x 16, 32 x 32, 48 x 48, and 64 x 64)

Figure 5.7: Speedup for Shallow with 2 iterations (problem sizes 16 x 16, 32 x 32, 48 x 48, and 64 x 64)

Figure 5.8: Speedup for Shallow with 4 iterations (problem sizes 16 x 16, 32 x 32, 48 x 48, and 64 x 64)

Table 5.7 Effectivity

problem size   number of PEs    2       4       8       16
16 x 16        1 iteration      1.000   0.999   0.997   0.996
               2 iterations     1.000   1.000   0.998   0.997
               4 iterations     1.000   1.000   1.000   0.999
32 x 32        1 iteration      1.000   1.000   0.999   1.000
               2 iterations     1.000   1.000   1.000   0.998
               4 iterations     1.000   1.000   1.000   1.000
48 x 48        1 iteration      1.000   1.000   0.999   0.998
               2 iterations     1.000   1.000   0.999   0.998
               4 iterations     1.000   1.000   1.000   0.999
64 x 64        1 iteration      1.000   1.000   0.999   0.998
               2 iterations     1.000   1.000   0.999   0.998
               4 iterations     1.000   1.000   1.000   0.999
Table 5.8 Distribution of load for 16 PEs
problem size = 16 x 16

              1 iteration           2 iterations          4 iterations
transputer    exe. time   distri.   exe. time   distri.   exe. time   distri.
                          of load               of load               of load
T 0 2923 0.992 4420 0.998 7451 0.999
T 1 2928 0.994 4420 0.998 7450 0.999
T 2 2931 0.995 4416 0.997 7449 0.999
T 3 2939 0.998 4422 0.998 7453 0.999
T 4 2937 0.997 4430 1.000 7455 1.000
T 5 2938 0.998 4426 0.999 7452 0.999
T 6 2947 1.001 4433 1.000 7452 0.999
T 7 2952 1.002 4436 1.001 7459 1.000
T 8 2953 1.003 4438 1.002 7462 1.001
T 9 2956 1.004 4438 1.002 7461 1.000
T10 2956 1.004 4435 1.001 7461 1.000
T11 2957 1.004 4433 1.000 7466 1.001
T12 2954 1.003 4437 1.001 7468 1.001
T13 2949 1.001 4432 1.000 7465 1.001
T14 2953 1.003 4436 1.001 7463 1.001
T15 2951 1.002 4442 1.003 7461 1.000
Table 5.9 Distribution of load for 16 PEs
problem size = 32 x 32

              1 iteration           2 iterations          4 iterations
transputer    exe. time   distri.   exe. time   distri.   exe. time   distri.
                          of load               of load               of load
T 0 9239 0.996 13930 0.997 23441 0.999
T 1 9241 0.996 13941 0.998 23444 1.000
T 2 9241 0.996 13954 0.999 23454 1.000
T 3 9262 0.998 13957 0.999 23461 1.000
T 4 9255 0.998 13962 1.000 23465 1.000
T 5 9275 1.000 13958 0.999 23465 1.000
T 6 9283 1.001 13976 1.001 23463 1.000
T 7 9280 1.000 13971 1.000 23460 1.000
T 8 9291 1.002 13987 1.001 23462 1.000
T 9 9296 1.002 13992 1.002 23459 1.000
T10 9294 1.002 13990 1.002 23456 1.000
T11 9302 1.003 13987 1.001 23454 1.000
T12 9293 1.002 13979 1.001 23455 1.000
T13 9287 1.001 13958 0.999 23446 1.000
T14 9293 1.002 13960 1.000 23452 1.000
T15 9288 1.001 13955 0.999 23451 1.000
Table 5.10: Distribution of load for 16 PEs
problem size = 48 x 48
                      1 iteration            2 iterations           4 iterations
transputer       execution   distri.    execution   distri.    execution   distri.
                 time        of load    time        of load    time        of load
T 0 18928 0.996 28529 0.997 48009 0.999
T 1 18936 0.997 28563 0.998 48011 0.999
T 2 18934 0.997 28578 0.999 48011 0.999
T 3 18949 0.998 28608 1.000 48035 1.000
T 4 18964 0.998 28608 1.000 48039 1.000
T 5 18983 0.999 28619 1.000 48021 1.000
T 6 18999 1.000 28611 1.000 48039 1.000
T 7 19008 1.001 28584 0.999 48034 1.000
T 8 19017 1.001 28610 1.000 48045 1.000
T 9 19029 1.002 28637 1.001 48046 1.000
T10 19027 1.002 28648 1.001 48052 1.000
T11 19042 1.002 28661 1.002 48062 1.000
T12 19026 1.002 28650 1.001 48062 1.000
T13 19020 1.001 28650 1.001 48048 1.000
T14 19035 1.002 28647 1.001 48064 1.001
T15 19026 1.002 28615 1.000 48045 1.000
Table 5.11: Distribution of load for 16 PEs
problem size = 64 x 64
                      1 iteration            2 iterations           4 iterations
transputer       execution   distri.    execution   distri.    execution   distri.
                 time        of load    time        of load    time        of load
T 0 32050 0.997 48242 0.996 81231 0.999
T 1 32063 0.997 48343 0.998 81221 0.999
T 2 32062 0.997 48391 0.999 81230 0.999
T 3 32078 0.998 48432 1.000 81238 0.999
T 4 32119 0.999 48419 1.000 81234 0.999
T 5 32143 1.000 48398 0.999 81260 1.000
T 6 32165 1.000 48378 0.999 81299 1.000
T 7 32154 1.000 48342 0.998 81309 1.000
T 8 32179 1.001 48415 1.000 81334 1.000
T 9 32196 1.001 48478 1.001 81334 1.000
T10 32200 1.001 48525 1.002 81340 1.001
T11 32225 1.002 48548 1.002 81353 1.001
T12 32206 1.002 48544 1.002 81351 1.001
T13 32196 1.001 48517 1.002 81350 1.001
T14 32210 1.002 48495 1.001 81339 1.001
T15 32202 1.002 48403 0.999 81312 1.000
Table 5.12: Distribution of load for 8 PEs
problem size = 16 x 16
                      1 iteration            2 iterations           4 iterations
transputer       execution   distri.    execution   distri.    execution   distri.
                 time        of load    time        of load    time        of load
T 0 4554 0.995 6977 0.999 11881 0.999
T 1 4569 0.998 6975 0.999 11892 1.000
T 2 4570 0.998 6977 0.999 11892 1.000
T 3 4581 1.000 6988 1.000 11900 1.000
T 4 4580 1.000 6985 1.000 11899 1.000
T 5 4591 1.003 6986 1.000 11898 1.000
T 6 4595 1.003 6991 1.001 11901 1.000
T 7 4592 1.003 6999 1.002 11899 1.000
Table 5.13: Distribution of load for 8 PEs
problem size = 32 x 32
                      1 iteration            2 iterations           4 iterations
transputer       execution   distri.    execution   distri.    execution   distri.
                 time        of load    time        of load    time        of load
T 0 15935 0.997 24237 0.999 40935 1.000
T 1 15952 0.998 24252 1.000 40945 1.000
T 2 15967 0.999 24259 1.000 40933 1.000
T 3 16004 1.001 24263 1.000 40935 1.000
T 4 16002 1.001 24263 1.000 40939 1.000
T 5 16002 1.001 24261 1.000 40946 1.000
T 6 16008 1.001 24264 1.000 40946 1.000
T 7 16005 1.001 24237 0.999 40940 1.000
Table 5.14: Distribution of load for 8 PEs
problem size = 48 x 48
                      1 iteration            2 iterations           4 iterations
transputer       execution   distri.    execution   distri.    execution   distri.
                 time        of load    time        of load    time        of load
T 0 34083 0.998 51615 0.999 87053 1.000
T 1 34096 0.999 51643 1.000 87060 1.000
T 2 34099 0.999 51641 1.000 87048 1.000
T 3 34171 1.001 51667 1.000 87053 1.000
T 4 34158 1.001 51677 1.001 87049 1.000
T 5 34170 1.001 51668 1.000 87051 1.000
T 6 34173 1.001 51642 1.000 87055 1.000
T 7 34163 1.001 51602 0.999 87049 1.000
Table 5.15: Distribution of load for 8 PEs
problem size = 64 x 64
                      1 iteration            2 iterations           4 iterations
transputer       execution   distri.    execution   distri.    execution   distri.
                 time        of load    time        of load    time        of load
T 0 59051 0.998 89224 0.999 150619 1.000
T 1 59069 0.999 89331 1.000 150639 1.000
T 2 59070 0.999 89398 1.001 150626 1.000
T 3 59203 1.001 89447 1.001 150650 1.000
T 4 59180 1.001 89450 1.001 150656 1.000
T 5 59185 1.001 89418 1.001 150650 1.000
T 6 59199 1.001 89334 1.000 150637 1.000
T 7 59185 1.001 89208 0.998 150587 1.000
Table 5.16: Distribution of load for 4 PEs
problem size = 16 x 16
                      1 iteration            2 iterations           4 iterations
transputer       execution   distri.    execution   distri.    execution   distri.
                 time        of load    time        of load    time        of load
T 0 8029 1.001 12303 1.000 20941 1.000
T 1 8022 1.000 12291 0.999 20937 1.000
T 2 8025 1.000 12297 1.000 20937 1.000
T 3 8023 1.000 12304 1.000 20940 1.000
problem size = 32 x 32
                      1 iteration            2 iterations           4 iterations
transputer       execution   distri.    execution   distri.    execution   distri.
                 time        of load    time        of load    time        of load
T 0 29750 1.000 44994 1.000 76118 1.000
T 1 29738 1.000 45005 1.000 76117 1.000
T 2 29746 1.000 45001 1.000 76117 1.000
T 3 29737 1.000 44988 1.000 76119 1.000
problem size = 48 x 48
                      1 iteration            2 iterations           4 iterations
transputer       execution   distri.    execution   distri.    execution   distri.
                 time        of load    time        of load    time        of load
T 0 65106 1.000 98310 1.000 165833 1.000
T 1 65092 1.000 98311 1.000 165827 1.000
T 2 65103 1.000 98332 1.000 165823 1.000
T 3 65091 1.000 98281 1.000 165815 1.000
problem size = 64 x 64
                      1 iteration            2 iterations           4 iterations
transputer       execution   distri.    execution   distri.    execution   distri.
                 time        of load    time        of load    time        of load
T 0 114158 1.000 171744 1.000 289383 1.000
T 1 114136 1.000 171762 1.000 289376 1.000
T 2 114148 1.000 171781 1.000 289370 1.000
T 3 114142 1.000 171785 1.000 289369 1.000
Table 5.17: Distribution of load for 2 PEs
problem size = 16 x 16
                      1 iteration            2 iterations           4 iterations
transputer       execution   distri.    execution   distri.    execution   distri.
                 time        of load    time        of load    time        of load
T 0 14837 1.000 22766 1.000 38776 1.000
T 1 14829 1.000 22762 1.000 38765 1.000
problem size = 32 x 32
                      1 iteration            2 iterations           4 iterations
transputer       execution   distri.    execution   distri.    execution   distri.
                 time        of load    time        of load    time        of load
T 0 57128 1.000 86494 1.000 146169 1.000
T 1 57139 1.000 86529 1.000 146185 1.000
problem size = 48 x 48
                      1 iteration            2 iterations           4 iterations
transputer       execution   distri.    execution   distri.    execution   distri.
                 time        of load    time        of load    time        of load
T 0 126664 1.000 191179 1.000 322484 1.000
T 1 126694 1.000 191246 1.000 322520 1.000
problem size = 64 x 64
                      1 iteration            2 iterations           4 iterations
transputer       execution   distri.    execution   distri.    execution   distri.
                 time        of load    time        of load    time        of load
T 0 223746 1.000 336920 1.000 567747 1.000
T 1 223812 1.000 337051 1.000 567821 1.000
Chapter 6

CONCLUSIONS AND SUGGESTIONS FOR FUTURE RESEARCH

6.1 Conclusions
Our research efforts as described in this thesis have focused on demonstrating a practical approach to providing high programmability to the user of a homogeneous, asynchronous multiprocessor architecture. For this purpose, we have chosen a functional high-level language interface (the high-level data-flow language SISAL) and have mapped it onto the low-level principles of execution of a commercially available microprocessor.
The contributions are summarized as follows:

• A translator from IF1, the intermediate form produced by the SISAL compiler, to occam has been developed. As demonstrated in this thesis, a programming environment spanning the high-level language down to the low-level principles of execution has been realized, illustrating the applicability of the data-flow approach to both low-level and high-level parallelism specification in a multiprocessor system.

• A structure handling scheme has been developed to ensure the safety of array operations without complex mechanisms. A function call scheme has also been developed to allow a high degree of parallelism by replicating the required functions.

• In the transputer network, a general router has been developed to allow communication between non-neighboring transputers and to multiplex multiple communication channels through a single physical channel between two transputers.

• We have developed several graph optimization schemes which deliver higher performance at execution time. These include dynamic and static allocation approaches, partitioning by communication cost thresholding, as well as loop unrolling of parallel loop operations. The performance of these methods has been evaluated, and the throughput improvement delivered by the latter approach has been shown to be substantial.

• The methodology developed in this work can be applied to many multiprocessor architectures because the programming environment is very modular. Program decomposition and allocation can be tailored to the target machine without modifying the initial stages of the mapping mechanisms.
6.2 Suggestions for Future Research

This thesis provides a starting point for future work in integrating the high-level data-driven principles within the transputer architecture, so as to provide high programmability of multiprocessor systems with automatic partitioning and allocation. Some related issues for future research are described below:
• Enhancing the function of the translator:
  Due to the limitations of the current features of the occam language, there are several restrictions on the preparation of SISAL programs, such as recursive calls, record and union data structures, etc. These features could be implemented in transputer assembly language. Thus, instead of mapping the IF1 graph onto occam code entirely, the translator could be enhanced to translate the IF1 graph into object code that combines occam and assembly code.
• Improving system performance by introducing shared memory:
  The transputer network used for our research is a distributed memory system. It has proven effective for passing short messages through the transputer links for communication among transputers. However, when passing arrays, the system performance degrades significantly. Since transmitting a large amount of data through transputer links introduces an expensive communication cost, a shared memory can be implemented in a transputer network, as shown in Figure 1.1, to reduce the communication cost.
• Emulating a transputer-based dynamic data-flow system:
  The methodology presented in this thesis is based on a static data-flow model of computation, since occam is a static language. As described in the previous paragraph, part of the IF1 actors can be translated into transputer assembly language to fit the dynamic features of the source language. A dynamic data-flow system can be emulated by using transputers as functional units to handle operations such as token matching, instruction fetching, computation, and input/output.
Appendix A

USING THE TRANSLATOR

A.1 Introduction

The Occamflow program is designed for translating a high-level data-flow language, SISAL (Streams and Iterations in a Single Assignment Language), into occam, the low-level model of the Inmos transputer. The program was written in PASCAL; it accepts the output of the SISAL compiler, IF1 (Intermediate Form 1), and generates a sequence of occam processes which can be allocated to the PEs that are connected with a specified topology.
The program proceeds in the following steps:

• graph generation step
• basic partitioning step
• occam code generation step
There should be an "optimization and allocation step" between the second and third steps for the complete Occamflow process. Since optimization and allocation can proceed according to different types of approaches, this step is kept separate from the program.
There are some restrictions in the current OCCAMFLOW translator, due to the limitations of the occam language and some unimplemented features. To apply the translator, the following restrictions must be considered when preparing SISAL programs. All implemented actors are listed in Appendix B.
• No recursive functions.
• No record or union data structures.
• Nested functions cannot be used.
• No more than one output per function.
• Arrays are limited to three dimensions.
A.2 Data Structures

The main data structures of the program include five kinds of nodes, all implemented as linked lists, and four files. The five kinds of nodes are described below:

• fnode: function node; contains the function name of the SISAL program and links to the next function node, and to the first compound node and first partition node of this function.
• pnode: partition node; contains the ID (partition number) and the input and output arcs of the partition. It links to the first simple node of the partition.

• cnode: compound node; contains the ID (code of this compound node) and links to its children, which may be compound and/or simple nodes. It also links to the next compound node (its brother, if any).

• snode: simple node; describes a real actor. It contains the partition ID to which it belongs, the sequence number of the actor, the ID (code of this simple node), and the numbers of input and output arcs. It links to its input and output arcs, and to the next simple node in the same partition.

• process: array of pointers, one to each process in a function. Each process represents an actor and corresponds to an occam process. Several processes (actors in the same partition) may be lumped together to form a macro actor and correspond to one PROC in the occam program (see the sketch below).
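The following is a minimal sketch, in Pascal (the language the translator is written in), of how these five node kinds could be declared as linked-list records. The field names, the fixed-length name field, and the omission of the arc lists are illustrative assumptions, not the translator's actual declarations.

program NodeSketch;

(* Illustrative declarations only; the arc lists attached to simple and
   partition nodes are omitted for brevity.                             *)
type
  SNodePtr = ^SNode;
  PNodePtr = ^PNode;
  CNodePtr = ^CNode;
  FNodePtr = ^FNode;

  SNode = record                  { simple node: one IF1 actor           }
    partId    : integer;          { partition the actor belongs to       }
    seqNo     : integer;          { sequence number of the actor         }
    actorCode : integer;          { IF1 actor code, e.g. 135 for Minus   }
    nIn, nOut : integer;          { number of input / output arcs        }
    next      : SNodePtr          { next simple node in same partition   }
  end;

  PNode = record                  { partition node                       }
    id          : integer;        { partition number                     }
    firstSimple : SNodePtr;       { first simple node of the partition   }
    next        : PNodePtr
  end;

  CNode = record                  { compound node                        }
    code       : integer;         { code of this compound node           }
    firstChild : SNodePtr;        { children (simple nodes; compound     }
                                  {  children would need a second link)  }
    brother    : CNodePtr         { next compound node, if any           }
  end;

  FNode = record                  { function node                        }
    name       : packed array [1..20] of char;  { SISAL function name    }
    nextFn     : FNodePtr;        { next function node                   }
    firstCNode : CNodePtr;        { first compound node of the function  }
    firstPNode : PNodePtr         { first partition node of the function }
  end;

begin
end.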
There are four files directly related to the translator. They are described as follows:

• *.if1: input file; contains the IF1 descriptions.

• *.out: output file; contains the communication cost matrix and the testing outputs.

• macrodef: input file; a predefined macro definition table.

• *.occ: output file; contains the translated occam code.

Note that after running the translator, the user should insert the translated program *.occ into a "txt" file in order to run it on the transputer.
A.3 Example

The procedure to run the program is as follows:

1. Before running the program, the user must prepare the IF1 descriptions in the file "*.if1" (the output of the SISAL compiler).

2. To run, type "xocc"; the program will ask the user to enter the input file name (the input file must be stored in the file "file.if1"). Note that the "macrodef" file must be stored in the same directory as xocc.

3. The translated occam program is generated in the file "file.occ". The program also generates a file "file.out" for checking purposes.

4. Since the translated occam program contains only the program body, it must be inserted into the file "file.txt" for running on the transputer.
Example: translating a simple SISAL program "demo.sis", which was shown in section 3.2, to perform the computation x*x + y*y.

1. Prepare the IF1 descriptions in "demo.if1", which is the output of the SISAL compiler for the program demo.sis and was shown in section 3.2.

2. After running the Occamflow program, the output files are stored in "demo.out", which contains the internal representation of the graph, a communication cost matrix after basic partitioning, the macro instructions, etc. The translated occam program is stored in "demo.occ".

3. Insert demo.occ into demo.txt by adding the declarations, the array sizes (if any), and the main calling process to demo.txt for running on the transputer. The occam program "demo.txt" is shown below.
-- insert declarations here
INT TermChar :
REAL32 z :
#USE userio

-- insert translated program here
PROC demo ( VAL REAL32 x, VAL REAL32 y, REAL32 OUTPTA)
  CHAN OF ANY C00004, C00005 :
  PROC proc00001()
    PAR
      SEQ
        C00004 ! x * x               -- Times
      SEQ
        C00005 ! y * y               -- Times
      REAL32 sys01 :
      REAL32 sys02 :
      SEQ                            -- Plus
        C00004 ? sys01
        C00005 ? sys02
        OUTPTA := sys01 + sys02
  PAR
    proc00001()

-- insert main procedure here
SEQ
  demo ( 1.2, 3.4, z )
  newline ( screen )
  write.real32 ( screen, z, 0, 0 )
  newline ( screen )
  write.full.string ( screen, "type CR to continue" )
  read.char ( keyboard, TermChar )
Appendix B

IF1 ACTORS AND THEIR OCCAM DEFINITIONS

The input and output arcs of the actors defined by the following occam code are assumed to be channels.

Code   Mnemonic name     Occam definition
104    ACatenate         SEQ
                           #1 ? s1
                           #2 ? s2
                           [ s3 FROM 0 FOR SIZE s1 ] := s1
                           [ s3 FROM SIZE s1 FOR SIZE s2 ] := s2
                           #3 ! s3

105    AElement          SEQ
                           #1 ? sys01
                           #2 ? sys02
                           #3 ! sys01 [ sys02 ]
107    AGather           BOOL systest :
                         INT sys01 :
                         SEQ
                           systest := TRUE
                           sys01 := 0
                           WHILE systest
                             ALT
                               #2 ? sys04 [ sys01 ]
                                 SEQ
                                   sys01 := sys01 + 1
                               #3 ? systest
                                 SKIP
                           #4 ! sys01 - 1
                           #4 ! [ sys04 FROM 0 FOR (sys01 - 1) ]
       AReplace          SEQ
                           #2 ? sys02
                           #3 ? #4 [ sys02 ]

       ASetL             INT sys04 :
                         SEQ
                           #1 ? sys04
                           #1 ? sys01
                           [ #3 FROM #2 FOR sys04 ] :=
                                       [ sys01 FROM 0 FOR sys04 ]

       ASize             SEQ
                           #1 ? sys01
                           #2 ! SIZE sys01

       Abs               SEQ
                           #1 ? sys01
                           IF
                             sys01 < 0
                               #2 ! - sys01
                             TRUE
                               #2 ! sys01

       Call              ( assumes two input and one output arguments )
                         SEQ
                           #2 ? sys02
                           #3 ? sys03
                           #1 ( sys02, sys03, sys04 )
                           #4 ! sys04
       Div               SEQ
                           #1 ? sys01
                           #2 ? sys02
                           #3 ! sys01 / sys02

       Equal             For IF construct
                         SEQ
                           #1 ? sys01
                           #2 ? sys02
                           IF
                             sys01 = sys02
                               PAR
                                 %1 ( THEN part )
                             TRUE
                               PAR
                                 %2 ( ELSE part )

                         For LOOP construct
                         SEQ
                           #1 ? sys01
                           #2 ? sys02
                           WHILE sys01 = sys02
                             PAR
                               $1 ( LOOP body )
       Less              For IF construct
                         SEQ
                           #1 ? sys01
                           #2 ? sys02
                           IF
                             sys01 < sys02
                               PAR
                                 %1 ( THEN part )
                             TRUE
                               PAR
                                 %2 ( ELSE part )

                         For LOOP construct
                         SEQ
                           #1 ? sys01
                           #2 ? sys02
                           WHILE sys01 < sys02
                             PAR
                               $1 ( LOOP body )

132    LessEqual         For IF construct
                         SEQ
                           #1 ? sys01
                           #2 ? sys02
                           IF
                             sys01 <= sys02
                               PAR
                                 %1 ( THEN part )
                             TRUE
                               PAR
                                 %2 ( ELSE part )

                         For LOOP construct
                         SEQ
                           #1 ? sys01
                           #2 ? sys02
                           WHILE sys01 <= sys02
                             PAR
                               $1 ( LOOP body )

135    Minus             SEQ
                           #1 ? sys01
                           #2 ? sys02
                           #3 ! sys01 - sys02

       Mod               SEQ
                           #1 ? sys01
                           #2 ? sys02
                           #3 ! sys01 \ sys02

       Neg               SEQ
                           #1 ? sys01
                           #2 ! - sys01

       Not               SEQ
                           #1 ? sys01
                           #2 ! NOT sys01

       NotEqual          For IF construct
                         SEQ
                           #1 ? sys01
                           #2 ? sys02
                           IF
                             sys01 <> sys02
                               PAR
                                 %1 ( THEN part )
                             TRUE
                               PAR
                                 %2 ( ELSE part )

                         For LOOP construct
                         SEQ
                           #1 ? sys01
                           #2 ? sys02
                           WHILE sys01 <> sys02
                             PAR
                               $1 ( LOOP body )

       Plus              SEQ
                           #1 ? sys01
                           #2 ? sys02
                           #3 ! sys01 + sys02

142    RangeGenerator    INT sys03 :
                         INT sys04 :
                         SEQ
                           sys04 := ( #2 - #1 ) + 1
                           SEQ sys03 = #1 FOR sys04
                             PAR
                               #3 ! sys03

       Times             SEQ
                           #1 ? sys01
                           #2 ? sys02
                           #3 ! sys01 * sys02
Appendix C

DISTRIBUTION OF LOOP BODIES
When loop unrolling is applied for parallel processing of loop bodies, and the transputer network is not completely connected, some unrolled processes may require more than one hop to communicate with the root transputer. Depending on the size of the array and the distance between the root transputer and the transputer where the unrolled process is allocated, we can determine the number of loop bodies to be distributed to the different transputers in a transputer network by solving a linear programming problem with the simplex method to obtain an optimal solution [46].
Let

    $x_i$ = size of the array to be executed in an unrolled process
            with $i$ hops to the root transputer,
    $t$   = execution time of an unrolled process,
    $n$   = maximum number of hops allowed,
    $C_d$ = unit cost of data transfer,
    $C_0$ = communication overhead,
    $N$   = size of the original array,
    $h_i$ = number of transputers with $i$ hops to the root transputer.
A loop body on a transputer with 1 hop to the root transputer can then be executed after the $C_0 + C_d \sum_{i=1}^{n} x_i$ time units needed to receive the data. It needs $x_1 t$ time units for computation. In addition, $C_d \sum_{i=1}^{n} x_i$ time units are required to send the result back to the root transputer. Similarly, a loop body on a transputer with $k$ hops to the root transputer requires $C_0 + C_d \sum_{i=k}^{n} x_i$ time units after its neighbor transputer begins to send data. It takes $x_k t$ time units for computation and $C_0 + C_d \sum_{i=k}^{n} x_i$ time units to send the result to its neighbor transputer. Since all unrolled loops are executed in parallel and the root transputer has to collect the results, the total execution time is the time spent on the root transputer, i.e., $x_0 t$. We may draw a timing diagram as shown in Figure C.1.
In the timing diagram,

    $t_1 - t_0 = C_0 + C_d \sum_{i=1}^{n} x_i$ ,
    $t_2 - t_1 = C_0 + C_d \sum_{i=2}^{n} x_i$ ,
    ...
    $t_6 - t_5 = C_d \sum_{i=2}^{n} x_i$ ,
    $t_7 - t_6 = C_d \sum_{i=1}^{n} x_i$ .
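Reading these per-hop delays cumulatively (an added illustration, under the assumption that the $k$ hops are traversed one after the other), the data destined for the transputer $k$ hops away has completely arrived

    $$\sum_{j=1}^{k}\Bigl(C_0 + C_d \sum_{i=j}^{n} x_i\Bigr) \;=\; k\,C_0 + C_d \sum_{j=1}^{k}\,\sum_{i=j}^{n} x_i$$

time units after the root starts sending; that transputer then spends $x_k t$ time units computing, and its result is relayed back toward the root hop by hop in the same fashion.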
According to the timing diagram, the problem can be formulated as follows:
Minimize

    $f = x_0 t$

subject to the constraints:

    $x_i \ge x_{i+1} \ge 0$                                              for $i = 0, 1, \ldots, n$
    $k C_0 + 2 C_d \sum_{i=n-k+1}^{n} x_i + x_n t \;\le\; x_{n-k}\, t$   for $k = 0, 1, \ldots, n$
    $\sum_{i=0}^{n} h_i x_i = N$
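As a minimal illustration (an added example, not part of the original derivation), consider the case in which every non-root transputer is a direct neighbor of the root, i.e., $n = 1$ with $h_0 = 1$ and $h_1$ transputers one hop away. The formulation then reduces to

    minimize $x_0 t$  subject to  $x_0 \ge x_1 \ge 0$,  $C_0 + 2 C_d x_1 + x_1 t \le x_0 t$,  $x_0 + h_1 x_1 = N$,

which states that the round trip to a neighbor (send the data, compute, return the result) must fit within the root transputer's own computation time.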
Figure C.1: A timing diagram (activity of PE0 through PEn over time, marking the instants $t_0$ through $t_7$)
Appendix D

A ROUTER

The following is a sample router program with multiplexing, demultiplexing, and routing functions, as described in section 4.3.2.

A router program
-- Communication protocol
PROTOCOL comm
  CASE
    tag1 ; INT ; INT ; BOOL
    tag2 ; INT ; INT ; INT
    tag3 ; INT ; INT ; REAL32
    tag4 ; INT ; INT ; INT :: []INT
    tag5 ; INT ; INT ; INT :: []REAL32

-- Routing table
VAL Rtable IS [[0,2,2,3,3,3,2,1,3,3,0,1,1,1,0,0],   -- from west or local
               [0,2,2,0,3,2,2,1,3,3,0,0,1,2,0,0],   -- from north
               [0,2,2,3,3,3,3,1,3,3,0,1,1,1,0,0],   -- from east
               [0,2,2,0,3,2,2,1,3,3,0,0,1,2,0,0]] : -- from south

-- External channel definition
[4] CHAN OF comm InCh, OutCh, Out :
PLACE InCh AT 4 :
PLACE OutCh AT 0 :
VAL N IS 16 :

-- Routing buffers
INT EXseq, EXsize :
BOOL buf1 :
INT buf2 :
REAL32 buf3 :
[N]INT buf4 :
[N]REAL32 buf5 :

-- Local channel definition
CHAN OF ANY C00010, C00020 :        -- input channels
CHAN OF ANY C00011, C00021 :        -- output channels
[N]INT Lbuf1 :                      -- local buffers
[N]INT Lbuf2 :
INT Lbuf3 :
INT Lbuf4 :
INT idx1a, idx1b :                  -- buffer indices
INT idx2a, idx2b :

-- Initial value setting
[10][N+1]BOOL flag :
SEQ
  SEQ i = 0 FOR 10
    SEQ j = 0 FOR N+1
      flag[i][j] := FALSE
  idx1a, idx1b := 0, 0
  idx2a, idx2b := 0, 0
  PAR
    -- External channel receiver
    VAL destIn1 IS 6 :              -- destination 1
    VAL destIn2 IS 9 :              -- destination 2
    INT i1, i2, j, destEx :
    INT ckVal :
    SEQ
      i1 := Rtable[0][(16 + (destIn1 - home)) \ 16]
      i2 := Rtable[0][(16 + (destIn2 - home)) \ 16]
      WHILE TRUE
        ALT
          ALT i = 0 FOR 4                       -- data received
            InCh[i] ? CASE                      -- from external channel
              tag1 ; destEx ; EXseq ; buf1
                SEQ
                  ckVal := (16 + (destEx - home)) \ 16
                  j := Rtable[i][ckVal]
                  Out[j] ! tag1 ; destEx ; EXseq ; buf1
              tag2 ; destEx ; EXseq ; buf2
                SEQ
                  ckVal := (16 + (destEx - home)) \ 16
                  IF
                    ckVal = 0                   -- meet destination
                      SEQ
                        IF                      -- check channel number
                          EXseq = 10
                            SEQ
                              Lbuf1[idx1a] := buf2
                              flag[1][idx1a] := TRUE
                              idx1a := (idx1a + 1) \ N
                          EXseq = 20
                            SEQ
                              Lbuf2[idx2a] := buf2
                              flag[2][idx2a] := TRUE
                              idx2a := (idx2a + 1) \ N
                          TRUE
                            SKIP
                    TRUE                        -- forwarding
                      SEQ
                        j := Rtable[i][ckVal]
                        Out[j] ! tag2 ; destEx ; EXseq ; buf2
              tag3 ; destEx ; EXseq ; buf3
                SEQ
                  ckVal := (16 + (destEx - home)) \ 16
                  j := Rtable[i][ckVal]
                  Out[j] ! tag3 ; destEx ; EXseq ; buf3
              tag4 ; destEx ; EXseq ; EXsize :: buf4
                SEQ
                  ckVal := (16 + (destEx - home)) \ 16
                  j := Rtable[i][ckVal]
                  Out[j] ! tag4 ; destEx ; EXseq ; EXsize :: buf4
              tag5 ; destEx ; EXseq ; EXsize :: buf5
                SEQ
                  ckVal := (16 + (destEx - home)) \ 16
                  j := Rtable[i][ckVal]
                  Out[j] ! tag5 ; destEx ; EXseq ; EXsize :: buf5
          -- Internal channel receiver
          C00011 ? Lbuf3
            Out[i1] ! tag2 ; destIn1 ; 11 ; Lbuf3
          C00021 ? Lbuf4
            Out[i2] ! tag2 ; destIn2 ; 21 ; Lbuf4
    -- External channel sender ( multiplexing )
    [4]INT EXseq, EXsize, destEx :
    [4]BOOL buf1 :
    [4]INT buf2 :
    [4]REAL32 buf3 :
    [4][N]INT buf4 :
    [4][N]REAL32 buf5 :
    PAR i = 0 FOR 4
      WHILE TRUE
        Out[i] ? CASE
          tag1 ; destEx[i] ; EXseq[i] ; buf1[i]
            OutCh[i] ! tag1 ; destEx[i] ; EXseq[i] ; buf1[i]
          tag2 ; destEx[i] ; EXseq[i] ; buf2[i]
            OutCh[i] ! tag2 ; destEx[i] ; EXseq[i] ; buf2[i]
          tag3 ; destEx[i] ; EXseq[i] ; buf3[i]
            OutCh[i] ! tag3 ; destEx[i] ; EXseq[i] ; buf3[i]
          tag4 ; destEx[i] ; EXseq[i] ; EXsize[i] :: buf4[i]
            OutCh[i] ! tag4 ; destEx[i] ; EXseq[i] ; EXsize[i] :: buf4[i]
          tag5 ; destEx[i] ; EXseq[i] ; EXsize[i] :: buf5[i]
            OutCh[i] ! tag5 ; destEx[i] ; EXseq[i] ; EXsize[i] :: buf5[i]
    -- User program
    INT sys01, sys02 :
    SEQ
      PAR
        C00010 ? sys01
        C00020 ? sys02
      PAR
        C00011 ! sys01 + sys02
        C00021 ! sys01 - sys02
Appendix E

BENCHMARK PROGRAM
LIVERMORE LOOPS

%
% Loop 1 - Hydro Fragment
%
define Loop1

type OneDim = array[real] ;

function Loop1 ( n : integer ; Q, R, T : real ;
                 Y, Z : OneDim ; returns OneDim )

  for k in 1,n returns
    array of Q + (Y[k] * (R * Z[k+10] + T * Z[k+11]))
  end for

end function % Loop 1

%
% Loop 7 - Equation of State Fragment
%
define Loop7

type OneDim = array[real] ;

function Loop7 ( n : integer ; R, T : real ;
                 U, Y, Z : OneDim ; returns OneDim )

  for k in 1,n returns
    array of U[k] + R * (Z[k] + R * Y[k])
             + T * (U[k+3] + R * (U[k+2] + R * U[k+1])
             + T * (U[k+6] + R * (U[k+5] + R * U[k+4])))
  end for

end function % Loop 7

%
% Loop 12 - First Difference
%
define Loop12, main

type OneDim = array[real] ;

function Loop12 ( n : integer ; Y : OneDim ;
                  returns OneDim )

  for i in 1,n returns
    array of Y[i+1] - Y[i]
  end for

end function % Loop 12
Loop1 Occam program
PROC Loopl C VAL INT n, VAL REAL32 q, VAL REAL32 r.
VAL REAL32 t. VAL []REAL32 y,
VAL []REAL32 z. []REAL32 OUTPTA)
CHAN OF ANY C00022, C00008, C00801. C00802, C00013, C00014
CHAN OF ANY C00015, C00016, C00017, C00018, C00019, C00020
CHAN OF ANY C00021. C00999, COOOIO :
PROC proc00013()
PAR
REAL32 sys02
BOOL sys03
[]REAL32 sys04
BOOL systest :
INT sysOl :
SEQ -- AGather (107)
systest := TRUE
sysOl := 0
WHILE systest
ALT
COOOIO ? sys04 [ sysOl ]
SEQ
sysOl := sysOl + 1
C00999 ? sys03
SEQ
systest := sys03
C00022 ! sysOl - 1
C00022 ! [ sys04 FROM 0 FOR ( sysOl - 1 ) ]
PROC proc00003()
PAR
INT
SEQ
C00008 ?
C00019 !
INT
SEQ
C00801 ?
C00013 !
INT
SEQ
C00802 ?
C00014 !
INT
SEQ
C00013 ?
C00015 !
INT
SEQ
C00014 ?
C00016 !
REAL32
SEQ
C00015 ?
C00017 !
REAL32
SEQ
C00016 ?
C00018 !
REAL32
REAL32
sys02
sys02
y [ sys02 ]
sysOl :
sysOl
sysOl + 10
sysOl :
sysOl
sysOl + 11
sys02
[ sys02 ]
[ sys02 ]
sys02
z
sys02
sys02
z
sys02 :
sys02
r * sys02
sys02 :
sys02
— Plus
-- Plus
--Times
--Times
sys02
sysOl :
sys02
- AElement
- AElement
- AElement
136
SEQ
C00017 ?
C00018 ?
C00020 !
—Plus
sysOl
sys02
sysOl + sys02
REAL32
REAL32
SEQ
C00019
C00020
C00021
sysOl :
sys02 :
sysOl
sys02
sysOl * sys02
— Times
REAL32
SEQ
C00021 ?
COOOIO !
sys02
—Plus
sys02
q
sys02
PROC proc00002()
PAR
INT sys03 : -- RangeGenerator (142)
INT sys04 :
SEQ
sys04 :=(n -1 ) + 1
SEQ sys03 = 1 FOR sys04
PAR
C00008 ! sys03
C00802 ! sys03
C00801 ! sys03
proc00003()
C00999 ! FALSE
proc00013()
137
PROC procOOOOlC)
PAR
[N]REAL32 sysOl : '
INT sys04 :
SEQ
C00022 ? sys04 '
C00022 ? sysOl
[OUTPTA FROM 0 FOR sys04] := [sysOl FROM 0 FOR sys04]
i
i
1
PAR |
proc00002 ()
procOOOOl ()
Loop7 Occam program
PROC Loop7 ( VAL INT n , VAL REAL32 r , VAL REAL32 t ,
VAL [] REAL32 u , VAL []REAL32 y ,
VAL HREAL32 z . [] REAL32 OUTPTA)
CHAN OF ANY C00047, C00008, C00801, C00802, C00803. C00804
CHAN OF ANY C00806, C00807, C00808, C00017, C00018, C00019
CHAN OF ANY C00021. C00022, C00023, C00024. C00025, C00026
CHAN OF ANY C00028. C00029. C00030, C00031, C00032. C00033
CHAN OF ANY C00035, C00036, C00037. C00038, C00039, C00040
CHAN OF ANY C00042, C00043. C00044. C00045. C00046, C00999
CHAN OF ANY C00805. C00020. C00027, C00034, C00041. C00010
PROC proc00034()
PAR
REAL32 sys02 :
BOOL sys03 :
[N]REAL32 sys04 :
BOOL systest :
INT sysOl :
SEQ -- AGather (107)
systest := TRUE
sysOl := 0
WHILE systest
ALT
C00010 ? sys04 [ sysOl ]
SEQ
sysOl := sysOl + 1
C00999 ? sys03
SEQ
systest := sys03
C00047 ! sysOl - 1
C00047 ! [ sys04 FROM 0 FOR ( sysOl - 1 ) ]
PROC proc00003()
PAR
INT sys02 :
SEQ — AElement
i
I
L
139
C00008 ?
C00033 !
INT
SEQ
C00801 ?
C00024 !
INT
SEQ
C00802 ?
C00017 !
INT
SEQ
C00803 ?
C00018 !
INT
SEQ
C00804 ?
C00019 !
INT
SEQ
C00805 ?
C00020 !
INT
SEQ
C00806 ?
C00021 !
INT
SEQ
C00807 ?
C00022 !
INT
SEQ
sys02
u [ sys02 ]
sys02 :
sys02
z [ sys02 ]
sys02 :
sys02
y [ sys02 ]
sysOl
sysO l
sysOl + 3
sysOl :
sysOl
sysOl + 2
sysO l :
sysOl
sysOl + 1
sysOl :
sysOl
sysOl + 6
sysO l :
sysOl
sysOl + 5
sysO l :
-- AElement
— AElement
-- Plus
-- Plus
— Plus
-- Plus
— Plus
-- Plus
C00808 ? sysOl
C00023 ! sysOl + 4
REAL32 sys02 :
SEQ — Times
C00017 ? sys02
C00025 ! r * sys02
INT sys02 :
SEQ — AElement
C00018 ? sys02
C00037 ! u [ sys02 ]
INT sys02 :
SEQ -- AElement
C00019 ? sys02
C00029 ! u C sys02 ]
INT sys02 :
SEQ -- AElement
C00020 ? sys02
C00026 ! u t sys02 ]
INT sys02 :
SEQ — AElement
C00021 ? sys02
C00039 ! u [ sys02 ]
INT sys02 :
SEQ -- AElement
C00022 ? sys02
C00031 ! u [ sys02 ]
INT sys02 :
SEQ -- AElement
C00023 ? sys02
C00027 ! u [ sys02 ]
REAL32 sysOl :
REAL32 sys02 :
SEQ — Plus
C00024 ? sysOl
C00025 ? sys02
C00028 ! sysOl + sys02
REAL32 sys02 :
SEQ --Times
C00026 ? sys02
C00030 ! r * sys02
REAL32 sys02 :
SEQ — Times
C00027 ? sys02
C00032 ! r * sys02
REAL32 sys02 :
SEQ --Times
C00028 ? sys02
C00034 ! r * sys02
REAL32 sysOl :
REAL32 sys02 :
SEQ — Plus
C00029 ? sysOl
C00030 ? sys02
C00035 ! sysOl + sys02
REAL32 sysOl :
REAL32 sys02 :
SEQ — Plus
C00031 ? sysOl
C00032 ? sys02
C00036 ! sysOl + sys02
REAL32 sysOl :
REAL32 sys02 :
SEQ --Plus
C00033 ? sysOl
C00034 ? sys02
C00045 ! sysOi + sys02
REAL32 sys02 :
SEQ --Times
C00035 ? sys02
C00038 ! r * sys02
REAL32 sys02 :
SEQ — Times
C00036 ? sys02
C00040 ! r * sys02
REAL32 sysOl :
REAL32 sys02 :
SEQ --Plus
C00037 ? sysOl
C00038 ? sys02
C00042 ! sysOl + sys02
REAL32 sysOl :
REAL32 sys02 :
SEQ --Plus
C00039 ? sysOl
C00040 ? sys02
C00041 ! sysOl + sys02
REAL32 sys02 :
SEQ --Times
C00041 ? sys02
C00043 ! t * sys02
REAL32 sysOl :
REAL32 sys02
SEQ — Plus
C00042 ? sysOl
C00043 ? sys02
C00044 ! sysOl + sys02
REAL32 sys02 :
SEQ — Times
C00044 ? sys02
G00046 ! t * sys02
REAL32 sysOl :
REAL32 sys02 :
SEQ --Plus
G00045 ? sysOl
C00046 ? sys02
COOOIO ! sysOl + sys02
PROC proc00002()
PAR
INT sys03 : -- RangeGenerator (142)
INT sys04 :
SEQ
sys04 :=(n -1 ) + 1
SEQ sys03 = 1 FOR sys04
PAR
C00008 ! sys03
C00808 sys03
C00807 sys03
C00806 sys03
C00805 sys03
C00804 sys03
C00803 sys03
C00802 sys03
C00801 sys03
proc00003()
C00999 ! FALSE
proc000340
PROC procOOOOl()
PAR
[N]REAL32 sysOl ;
INT sys04 :
SEQ
C00047 ? sys04
C00047 ? sysOl
[OUTPTA FROM 0 FOR sys04] := [sysOl FROM 0 FOR sys04]
PAR
proc00002 ()
procOOOOl ()
145
Loop12 Occam program
PROC Loopl2 ( V A L INT n , V A L []REAL32 y . [ ] REAL32 OUTPTA)
C H A N OF A N Y COOOll, C00004, C00401, C00008, C00009. COOOIO
C H A N OF A N Y C00999, C00006 :
PROC proc00007()
PAR
REAL32 sys02 :
B O O L sys03 :
[ ] REAL32 sys04 :
BO O L s y s te s t :
INT sysOl :
SEQ - - AGather (107)
s y s te s t := TRUE
sysO l := 0
WHILE s y s te s t
ALT
C00006 ? sys04 [ sysOl ]
SEQ
sysOl sysO l + 1
C00999 ? sys03
SEQ
s y s te s t := sys03
COOOll ! sysOl - 1
COOOll ! [ sys04 FR O M 0 FOR ( sysO l - 1 ) ]
PROC proc00003()
PAR
INT sysOl :
SEQ -- Plus
C00004 ? sysOl
C00008 ! sysOl + 1
INT sys02 :
SEQ -- AElement
C00401 ? sys02
COOOIO ! y [ sys02 ]
INT
SEQ
C00008 ?
C00009 !
sys02 :
sys02
y [ sys02 ]
AElement
REAL32
REAL32
SEQ
C00009
COOOIO
C00006
sysOl :
sys02 :
sysOl
sys02
sysOl - sys02
--Minus
PROC proc00002()
PAR
INT sys03 : — RangeGenerator (142)
INT sys04 :
SEQ
sys04 :=(n -1 ) + l
SEQ sys03 = 1 FOR sys04
PAR
C00004 ! sys03
C00401 ! sys03
proc00003()
C00999 ! FALSE
proc00007()
PROC procOOOOl()
PAR
[]REAL32 sysOl :
INT sys04 :
SEQ
COOOll ? sys04
COOOll ? sysOl
[OUTPTA FROM 1 FOR sys04] := [sysOl FROM 0 FOR sys04]
147
PAR
proc00002 ()
procOOOOl C)
Appendix F

BENCHMARK PROGRAM:
HISTOGRAMMING

%
% Histogramming
%
define hist, add1, hist1, hist2

type OneDim = array[integer];
type TwoDim = array[OneDim];

function add1 ( j : integer ; slot : OneDim
                returns integer )
  slot[j] + 1
end function % add1

function hist1 ( N : integer ; s, digit : OneDim ;
                 returns OneDim )
  for initial
    slot := s ;
    i := 0;
  while i < N
  repeat
    i := old i + 1 ;
    slot := old slot [ digit[i] : add1( digit[i], old slot ) ]
  returns value of slot
  end for
end function % hist1

function hist2 ( N : integer ; s, digit : OneDim ;
                 returns OneDim )
  for initial
    slot := s ;
    i := 0;
  while i < N
  repeat
    i := old i + 1 ;
    slot := old slot [ digit[i] : add1( digit[i], old slot ) ]
  returns value of slot
  end for
end function % hist2

function hist ( N : integer ; slot, digit : OneDim ;
                returns OneDim )
  let
    slot1 := hist1 (N, slot, digit) ;
    slot2 := hist2 (N, slot, digit) ;
  in
    for i in 0,7
      Rslot := slot1[i] + slot2[i]
      returns array of Rslot
    end for
  end let
end function % hist
Histogramming Occam program
PROC addl ( V A L INT j , V A L [ ] INT s l o t , INT OUTPTA)
C H A N OF A N Y C00004 :
PROC procOOOOl()
PAR
SEQ
C00004 ! s l o t [ j ]
INT sysO l :
SEQ —P lu s
C00004 ? sysOl
OUTPTA := sysO l + 1
PAR
procOOOOl()
PROC h i s t l ( V A L INT n , V A L [ ] INT s , V A L [ ] INT d i g i t , [ ] INT OUTPTA)
C H A N O F A N Y C00005. C00501, C00502, COOOll, CO llO l, C00013 :
PROC p ro c 00007()
PAR
OUTPTA :■ s l o t —F inalV alue
PROC proc00003()
PAR
INT sysOl
SEQ — P lu s
C00501 ? sysOl
C00005 ! sysOl + 1
INT sys02 :
SEQ - - AElement
C00502 ? sys02
COOOll ! d i g i t [ sys02 ]
152
C01101 ! d i g i t [ sys02 ]
INT
INT
SEQ
sys02 :
sys04 :
C a ll
COOOll ? sys02
addl ( sys02, s l o t , sys04 )
C00013 ! sys04
INT
INT
SEQ
sys02 :
sys03
- - AReplace
COilOl ? sys02
C00013 ? s l o t [ sys02 ]
PROC procOOOOl()
PAR
SEQ
s l o t := s
C00005 ! 0
INT sysOl :
C00005 ? sysOl
WHILE sysOl < n
PAR
C00502 ! sysOl
C00501 ! sysOl
C00005 ? sysOl
proc00003()
proc00007()
PAR
procOOOOl()
SEQ
153
PROC h is t2 ( V A L INT n , V A L []INT s . V A L []INT d i g i t .
[ ] INT OUTPTA)
C H A N OF A N Y C00005. C00501. C00502. COOOll, C01101, C00013 :
PROC proc00007()
PAR
OUTPTA := s l o t —F inalV alue
PROC proc00003()
PAR
INT
SEQ
sysOl
P lu s
C00501 ? sysO l
C00005 ! sysOl + 1
INT
SEq
sys02
- - AElement
C00502 ? sys02
COOOll ! d i g i t [ sys02
COllOl ! d i g i t [ sys02
INT
INT
SEQ
sys02
sys04
C a ll
COOOll ? sys02
addl ( sy s0 2 , s l o t , sys04 )
C00013 ! sys04
INT
INT
SEQ
sys02
sys03
— AReplace
COllOl ? sys02
C00013 ? s l o t [ sys02 ]
PROC procOOOOl()
PAR
SEQ
154
s l o t := s
C00005 ! 0
INT sysOl :
SEQ
C00005 ? sysOl
WHILE sysOl < n
PAR
C00502 ! sysOl
C00501 ! sysOl
C00005 ? sysOl
p ro c 000030
proc00007()
PAR
procOOOOl()
PROC h i s t ( V A L INT n . V A L []INT s l o t . V A L [ ] INT d i g i t .
[] INT OUTPTA)
C H A N OF A N Y C00013, C00007. C00701. COOOll. C00012, C00999
C H A N O F A N Y COOOIO :
PROC p ro c 00008()
PAR
INT sys02 :
B O O L sys03 :
[] INT sys04 :
B O O L s y s te s t :
INT sysOl :
SEQ - - AGather (107)
s y s te s t := TRUE
sysOl := 0
WHILE s y s te s t
ALT
COOOIO ? sys04 [ sysOl ]
SEQ
sysO l := sysOl + 1
C00999 ? sys03
SEQ
s y s te s t := sys03
C00013 ! sysOl - 1
C00013 ! C sys04 FR O M 0 FOR ( sysOl - 1 ) ]
PROC proc00005()
PAR
INT
SEQ
C00007 ?
COOOll !
sys02 :
sys02
s l o t l [ sys02 ]
— AElement
INT
SEQ
C00701 ?
C00012 !
sys02 :
sys02
s lo t2 [ sys02 ]
AElement
INT
INT
SEQ
COOOll ?
C00012 ?
COOOIO !
sysOl :
sys02 :
sysOl
sys02
sysOl + sys02
--P in s
PROC proc00004()
PAR
INT sys03 : - - R angeG enerator (142)
INT sys04 :
SEQ
sys04 : =( 7 - 0 ) + 1
SEQ sys03 = 0 FOR sys04
PAR
C00007 ! sys03
C00701 ! sys03
proc00005()
C00999 ! FALSE
156
proc00008()
PROC procOOOOl()
PAR
SEQ
h i s t l ( n s l o t , d i g i t , s l o t l
Call
SEQ Call
h is t2 ( n s l o t , d i g i t , s lo t2
□ INT
INT sys04 :
sysOl
SEQ
C00013 ? sys04
C00013 ? sysOl
[OUTPTA FR O M 0 FOR sys04] := [sysO l FR O M 0 FOR sys04]
PAR
proc00004 ()
procOOOOl ()
157
Appendix G

BENCHMARK PROGRAM
SHALLOW

% -------- SHALLOW : A weather program --------

% -------- Function Names --------
define Main,        % Main entry point to program.
       Initialize,  % Define initial condition.
       Fluxes,      % Compute the mass fluxes.
       Height,      % Compute the field height.
       Potential,   % Compute the total potential velocity.
       TimeStep     % Compute next time step.

% -------- Type Specifications --------
type GridVariables = array[array[real]] ;

% -------- Imported functions --------
global sin (x: real returns real)
global cos (x: real returns real)
% -------- I n i t i a l i z e --------
fu n c tio n I n i t i a l i z e (M: in te g e r ; d e lta : r e a l
r e tu rn s G rid V a ria b le s,
G rid V a ria b le s, G rid V a ria b les)
158
• •
l e t
P : G rid V a ria b le s;
P si : G rid V a ria b le s;
U : G rid V a ria b le s;
V : G rid V a ria b le s;
A m p : r e a l := 1.0E6; % Am plitude of waves
% in i n i t i a l c o n d itio n
P i
: r e a l := 3.14159265359;
e l : r e a l := r e a l(M )/d e lta ;
pc : r e a l := p ie * p ie * A m p * A m p / ( e l *
d e lta _ i : r e a l := 2 .0 * p ie / re a l(M );
d e lta ..j : r e a l := 2 .0 * p ie / real(M ) ;
P s i := f o r j in l.M +l c ro s s i in l.M +l
r e tu r n s a rra y of
A m p * s in ( d e l t a _ i * ( r e a l ( i ) - 0 .5 ) )
* s in ( d e l t a _ j * ( r e a l ( j ) - 0 .5 ) )
end f o r
in
f o r j in l.M +l c ro s s i in l.M +l
re tu r n s a rra y of
p c f * (cos ( 2 .0 * d e l t a _ i * r e a l ( i - l ) )
+ cos ( 2 .0 * d e l t a _ j * r e a l ( j- l ) ) ) + 50000.0
end f o r ,
f o r j in l.M
re tu r n s a rra y of
f o r i in l.M +l
r e tu r n s a rra y of
- ( P s i [ j + l , i ] - P s i [ j , i ] ) / d e lta
end f o r
end f o r
I I
f o r j in M+l.M+1
re tu r n s a rra y of
f o r i in l.M +l
r e tu r n s a rra y of
- ( P s i [ l , i ] - P s i [ j , i ] ) / d e lta
159
end f o r
end f o r ,
f o r j in l.M +i
re tu r n s a rra y of
f o r i in 1,M
r e tu r n s a rra y of
( P s i [j , i + 1 ] - P s i[j , i ] ) / d e lta
end f o r
II
f o r i in M+l.M+1
re tu r n s a rra y of
( P s i [ j ,1 ] - P s i [ j , i ] ) / d e lta
end f o r
end f o r
end l e t
end fu n c tio n % I n i t i a l i z e
% -------- Fluxes --------
fu n c tio n F luxes (M: in te g e r ;
P, U, V: G rid V ariab les
r e tu r n s G rid V a ria b le s,
f o r j in 1,M+1
r e tu r n s a rra y of
f o r i in 1,1
r e tu r n s a rra y of
0 .5 * ( P [ j , i ] + P [ j ,M+1]) * U [ j.i]
end f o r
I I
f o r i in 2.M+1
r e tu r n s a rra y of
0 .5 * (P C j.i] + P [j , i ~ l ] ) * U [ j,i]
end f o r
end f o r ,
f o r j in 1,1
r e tu r n s a rra y of
f o r i in l.M +l
re tu r n s a rra y of
G rid V a ria b les)
160
0 .5 * C P C j . i] + P [M+l ,i ] ) * V C j.i]
end f o r
end f o r
II
f o r j in 2 ,M+l
re tu r n s a rra y of
f o r i in 1,M+1
r e tu rn s a rra y of
0 .5 * (P [j ,i ] + P [ j - l . i ] ) * V C j.i]
end f o r
end f o r
end fu n c tio n 7 . Fluxes
7 . H e i g h t-------
fu n c tio n H eight (M: in te g e r ;
P, U, V: G rid V ariab les
re tu rn s G rid V ariab les)
f o r j in l.M
re tu r n s a rra y of
f o r i in l.M
re tu r n s a rra y of
P C j.i] + 0 .2 5 * (U [j,i+ l]* U [j,i+ l3 + U[j ,i]* U [j ,i ]
+ V[j + l .i ] * V [ j + l .i ] + V [ j ,i ] *V[j ,i ] )
end f o r
I I
f o r i in M+l,M+l
re tu r n s a rra y of
P C j.i] + 0 .2 5 * ( U [ j.l]* U [ j,l] + U [ j .i ] * U [ j.i ]
+ V[j + l.i] * V [ j + l . i ] + V [ j .i ] * V [ j .i ] )
end f o r
end f o r
II
f o r j in M+l,M+1
r e tu r n s a rra y of
f o r i in l.M
re tu r n s a rra y of
P C j.i] + 0 .2 5 * ( U [ j.i+ i]* U [ j.i+ l] + U [ j .i ] * U [ j.i ]
+ V [ l.i ] * V [ i.i ] + V [ j .i ] * V [ j .i ] )
end f o r
161
II
f o r i in M+l.M+l
re tu r n s a rra y of
P [ j , i ] + 0 .2 5 * ( U [ j.l]* U [ j,l] + U [ j ,i ] * U [ j,i ]
+ V [ i,i ] * V [ l,i ] + V [ j .i ] * V [ j .i ] )
end f o r
end f o r
end fu n c tio n % H eight
I t P o t e n t i a l --------
fu n c tio n P o te n tia l (M: in te g e r ;
P, U. V: G rid V a ria b les;
d e lta : r e a l
re tu r n s G rid V ariab les)
l e t
fsd x : r e a l := 4 .0 / d e lta ;
fsd y : r e a l := 4 .0 / d e lta ;
in
f o r j in 1.1
re tu r n s a rra y of
f o r i in 1,1
r e tu r n s a rra y of
(fsdx* (V[j ,i] - V [ j ,M+1]) - fsdy*(U [j ,i]-U [M +i ,i ] ) ) /
(P[M+l,M+1]+P[M+l, i]+ P [j,M+1]+P [ j , i ] )
end f o r
II
f o r i in 2 .M+l
r e tu r n s a rra y of
(fsd x * (V [j ,i] - V [ j , i - l ] ) - fsdy*(U [j ,i] - U [ M + l,i] ) ) /
(P[M+l ,i~ l] +P[M + l,i]+P [j ,i - l ] + P [ j . i ] )
end f o r
end f o r
II
f o r j in 2 ,M+l
r e tu r n s a rra y of
f o r i in 1,1
r e tu r n s a rra y of
(fsdx*(V [j ,i ] -V [j ,M+1] ) - fsdy* (U[j ,i ] -U [j-1 ,i ] ) ) /
(P [j - 1 ,M+1]+P[j - 1 ,i]+P[j,M + 1]+P [ j , i ] )
162
end for
I I
for i in 2.M+1
returns array of
(fsdx*(V[j ,i]-V[j ,i-l] ) - fsdy*(U[j ,i]-U[j-i,i]))/
(P[j-l.i-l]+P[j-l.i]+P[j.i-l]+P[j.i])
end for
end for
end let
end function % Potential
% Time Step----
function TimeStep (M: integer; deltat, delta: real;
Psmooth, Usmooth, Vsmooth: GridVariables;
Uflux, Vflux, H, PotVel: GridVariables
returns GridVariables, GridVariables, GridVariables)
let
deltat_8 : real := deltat / 8.0;
deltat_d : real := deltat / delta;
in
for j in l.M
returns array of
for i in l.M
returns array of
Psmooth[j,i]-deltat_d*(Uflux[j,i+1]-Uflux[j,i])
-deltat_d*(VfluxCj + 1,i]-Vflux[j ,i])
end for
I I
for i in M+l.M+l
returns array of
Psmooth[j,i]-deltat_d*(Uflux[j,1]-Uflux[j,i])
-deltat_d*(Vflux[j+1,i]-Vflux[j,i])
end for
end for
II
for j in M+l,M+1
returns array of
for i in l.M
163
returns array of
Psmooth[j,i]-deltat_d*(Uflux[j,i+l]-Uflux[j,i])
-deltat_d*(Vflux[1,i] -Vf lux[j ,i])
end for
I I
for i in M+l.M+l
returns array of
Psmooth [j,i]-deltat_d*(Uflux[j,1]-Uflux[j,i])
-deltat_d*(Vflux[1,i]-Vflux[j,i])
end for
end for,
for j in 1,M
returns array of
for i in 1,1
returns array of
UsmoothCj,i]+deltat_8*(PotVel[j+1,i]+PotVel[j,i])
* (Vf lux[j+l ,i]+Vf luxCj + 1 ,M+1] +Vf lux[j ,M+1] +Vflux[j ,i] )
-deltat_d*(H tj,i]-H[j,M+1])
end for
II
for i in 2,M+1
returns array of
UsmoothCj,i]+deltat_8*(PotVel[j+1,i]+PotVel[j ,i])
* (Vflux [j + 1 ,i] +Vf lux[j +1 ,i-l] +Vf lux[j , i-1] +Vf lux[ j ,i] )
-deltat_d*(H[j ,i]-H[j , i-1] )
end for
end for
II
for j in M+l.M+l
returns array of
for i in 1,1
returns array of
Usmooth[j,i]+deltat_8*(PotVel[1,i]+PotVel [ j , i])
*(Vflux[l,i]+Vflux[l,M+l]+Vflux[j,M+1]+Vflux[j,i])
-deltat_d*(H[j,i]-H[j,M+1])
end for
I I
for i in 2.M+1
returns array of
UsmoothCj ,i]+deltat_8* (PotVel[1 ,i]+PotVel[j ,i])
*(Vflux[l ,i]+Vflux[l ,i-l]+Vflux[j ,i-l]+VfluxCj ,i])
-deltat_d*(H[j , i] -H[j ,i-l] )
end for
end for,
for j in 1,1
returns array of
for i in 1,M
returns array of
VsmoothCj,i]-deltat_8*(PotVel(j,i+l]+PotVel[j,i])
*(Uflux[j,i+l]+UfluxCj,i]+Uflux[M+l,i]+UfluxCM+1,i+l])
-deltat_d*(H[j,i]-H[M+l,i])
end for
I I
for i in M+l.M+l
returns array of
VsmoothCj,i]-deltat_8*(PotVelCj,1]+PotVelCj,i])
*(Uflux(j ,1]+Uflux[j ,i]+Uflux(M+l,i]+Uflux(M+l,l])
-deltat_d*(H Cj,i]-H CM+1,i])
end for
end for
II
for j in 2.M+1
returns array of
for i in l.M
returns array of
VsmoothCj ,i] ~deltat_8* (PotVel Cj , i+l]+PotVel(j ,i])
*(UfluxCj,i+l]+UfluxCj,i]+UfluxCj-1,i]+UfluxCj-1,i+l])
-deltat_d*(H Cj,i]“H[j-1,i])
end for
II
for i in M+l.M+l
returns array of
Vsmooth Cj,i]-deltat_8*(PotVelCj,1]+PotVelCj,i])
*(Uf luxCj ,1]+Uflux[j ,i]+Uf luxCj-1 ,i]+UfltixCj-l, 1] )
-deltat_d*(H Cj,i]-H Cj-1,i])
end for
end for
end let
165
end function % Time Step
% ----- Smooth-----
function Smooth CM: integer;
X, Xsmooth, Xnext: GridVariables
returns GridVariables)
let
Alpha : real := 0.001 % Time filtering parameter.
in
for j in i,M+l cross i in 1,M+1
returns array of
X[j,i] + Alpha * (Xnext[j,i]-2.0*X[j,i]+Xsmooth[j,i])
end for
end let
end function % Smooth
% Main Program----
function Main (Minput: integer; lastiter: integer
returns array [real])
for initial
P : GridVariables; 5 » Pressure
U : GridVariables; % Velocity in east/west direction.
V : GridVariables; % Velocity in north/south direction.
Psmooth : GridVariables; % Time smoothed P.
Usmooth : GridVariables; % Time smoothed U.
Vsmooth : GridVariables; % Time smoothed V.
M : integer; ’ / , Deminsonality of system,
iter : integer; % Iteration count,
deltat : real; % Time step size in seconds,
delta : real; % Grid spacing.
M := Minput;
iter := 0;
deltat := 90.0;
delta := 1.0E5;
P. U. V := Initialize (M, delta);
Psmooth := P;
Usmooth := U;
Vsmooth := V
while
(iter < lastiter)
repeat
P, Psmooth, U, Usmooth, V, Vsmooth :=
let
Uflux : GridVariables; % Mass flux in east/west direction.
Vflux : GridVariables; 7 , Mass flux in north/south direction
H : GridVariables; 7 . A value related to the height of fluid
PotVel : GridVariables; % Potential velocity.
Pnext : GridVariables; % Intermediate results.
Unext : GridVariables; % Intermediate results.
Vnext : GridVariables; 7 . Intermediate results.
Uflux, Vflux := Fluxes (M, old P, old U, old V);
H := Height (M, old P, old U, old V);
PotVel := Potential (M, old P, old U, old V, delta);
Pnext, Unext, Vnext
:= TimeStep (M, old deltat, delta,
old Psmooth, old Usmooth, old Vsmooth,
Uflux, Vflux, H, PotVel)
in
if (old iter = 0) then
Pnext,
old P,
Unext,
old U,
Vnext,
old V
else
Pnext,
Smooth (M, old P, old Psmooth, Pnext),
Unext,
Smooth (M, old U, old Usmooth, Unext),
Vnext,
Smooth (M, old V, old Vsmooth, Vnext)
end if
167
end let;
deltat := if (old iter = 0) then
old deltat * 2.0
else
old deltat
end if;
iter := old iter + 1;
returns
array of P[2,2] % Element for checking correctness,
end for
end function * / . main
168
Bibliography
[1] Ackerman, W. B. and Dennis, J. B., “VAL: A Value-Oriented Algorithmic Language: Preliminary Reference Manual,” Tech. Rep. TR-218, Computation Structures Group, Laboratory for Computer Science, MIT, Cambridge, Mass., June 1979.
[2] Ackerman, W. B., “D ata Flow Languages,” IE E E Computer, February
1982, pp. 15-24.
[3] Aho, A. V., Sethi, R., and Ullman, J. D., “Compilers : Principles, Tech
niques, and Tools,” Addison-Wesley, 1986.
[4] Allan, S. J. and Oldehoeft, A. E., “A Flow Analysis Procedure for the
Translation of High Level Language to A D ata flow Language,” Proceedings
of the 1979 International Conference on Parallel Processing, August 1979,
pp. 26-34.
[5] Andrews, G .R.,et al., “Concepts and Notations for Concurrent Program
ming,” Computing Surveys, Vol.15, No. 1, M arch 1983.
[6] Arvind, and Thoms, R. E., “I-structures : An efficient d ata type for
functional languages,” Rep. LCS/TM -178, Lab. for Com puter Science,
M IT, June 1980.
[7] Arvind and Iannucci, R.A., “Two fundam ental issues in multiproces-
sors:the data-flow solutions,” M IT Laboratory for Com puter Science Tech
nical Report M IT/LCS/TM -241, September 1983.
[8] Arvind and Culler, D.E., “Dataflow Architectures” Annual Reviews in
Computer Science, 1986, Volume 1, pp. 225-253.
[9] Babb II, R. G., “Program m ing Parallel Processors,” Addison Wesley, 1988
[10] Backus, J., “Can program m ing be liberated from the von Neum ann style?
A functional style and its algebra of program s,” Communication of the
ACM , 21(8), August 1978, pp. 613-641.
11] Brinch Hansen, P., “The Program m ing Language Concurrent Pascal”
IE E E Transactions on Software Engineering, SE -l(2),June 1975, pp. 199-
206.
12] Brock, J. D. and Montz, L. B., “Translation and optim ization of D ata Flow
Program s,” Proceedings of the 1979 International Conference on Parallel
Processing, August 1979, pp. 46-54.
13] Campbell, M. L., “Static allocation for a dataflow multiprocessor” Pro
ceedings of the 1985 International Conference on Parallel Processing, Au
gust 1985, pp. 511-517.
14] Chandy, K. M. and M isra, J., “Parallel Program Design : A Foundation,”
Addison-Wesley, 1988.
15] Chu, W. W., Holloway, L. J., Lan, M. T ., and Efe, K., “Task Allocation
in D istributed D ata Processing,” IE E E Computer, November 1980.
16] Chu, W. W ., Lan, M. T ., “Task Allocation and Precedence Relations for
D istributed Real-Time Systems,” IE E E Transactions on Computers, June
1987, pp. 667-679.
17] Colin W hitby-Strevens, “The Transputer,” Proceedings of the 12th Inter
national Symposium on Computer Architecture, 1985, pp. 292-300.
18] Dennis, J.B., “First version of a d ata flow procedure language,” Pro
gramming Symp.: Proc. Colloque sur la Programmation (Paris, France,
Apr. 1974), B. Robinet, Ed., Lecture notes in Computer Science, vol. 19,
Springer-Verlag, New York, 1974, pp. 362-376.
19] Dennis, J.B ., “D ata Flow Supercom puters,” IE E E Computer, November
1980, pp. 48-56.
20] Efe, K., “Heuristic Models of Task Assignment Scheduling in D istributed
Systems,” IE E E Computer, June 1982, pp. 50-56.
21] Ezzat, A. K., Bergeron, R. D., and Pokoski, J. L., “Task Allocation
Heuristics For D istributed Computing Systems,” Proceedings of the 6th
International Conference on Distributed Computing Systems, May 1986,
pp. 337-346.
[22] Feo, J. T., “An analysis of the com putational and parallel complexity
of the Livermore Loops,” Parallel Computing, North-Holland Publishing
Company, July 1988, pp. 163-185.
[23] Gajski, D.D., Padua, D. A., Kuck, D. J., and Kuhn, R.H., “A second
opinion on data-flow machines and languages,” IE E E Computer, February
1982, pp. 58-69
[24] G anapathi, M. and Fischer, C. N., “A ttributed Linear Interm ediate Rep
resentations for Retargetable Code Generators,” Software-Practice and
Experiments, Vol. 1 4 (4), April 1984, pp. 347-364.
; [25] G anapathi, M. and Fischer, C. N., “Affix G ram m ar Driven Code Gener
ation,” A C M Transactions on Programming Languages and systems, Vol.
7, No. 4, October 1985, pp. 560-599. pp. 347-364.
[26] Gaudiot, J.L, Vedder, R.W ., Tucker, G.K, Finn, D., and Campbell, M.L.,
“A D istributed VLSI Architecture for Efficient Signal and D ata Process
ing,” IE E E Transactions on Computers, Special Issue on Distributed Com
puting Systems, December 1985.
[27] Gaudiot, J. L. and Ercegovac, M. D., “Performance Evaluation of a Simu
lated Data-Flow Com puter w ith Low-Resolution Actors” Journal of Par
allel and Distributed Computing, 2, pp. 321-351.
[28] Gaudiot,. J.L., “Structure handling in data-flow system s,” IE E E Trans
actions on Computers, June 1986, pp. 489-502
[29] Gaudiot, J.L., Dubois, M., Lee, L.T., and Tohme, N., “The TX16: a highly
program m able multimicroprocessor architecture,” IE E E Micro, October
1986.
[30] Gaudiot, J. L. and Lee, L. T., “Multiprocessor systems program m ing in a
high-level data-flow language,” Proceedings of the Conference on Parallel
Architectures and Languages Europe, Eindhoven, The Netherlands, June
1987.
[31] Gaudiot, J. L., Campbell M., and Pi, J. I, “Program graph allocation in
distributed m ulticom puters,” Journal of Parallel Computing, Vol. 7, No.
2, June 1988, pp. 227-247.
[32] Gaudiot, J. L. and Lee, L. T., “Occamflow : a methodology for pro
gram m ing multiprocessor systems,” Journal of Parallel and Distributed
Computing, Academic press, in press.
[33] Gehringer, E .F.,et al., “The Cm* Testbed,” IE E E Computer, October
1982.
[34] G raham , R. L., Lawler, E. L., Lenstra, J. K., and Rinnooy Kan, A. H.
G., “O ptim ization and Approximation in Deterministic Sequencing and
Scheduling : a survey, ” Annals of Discrete Mathematics, pp. 287-326,
North-Holland Publishing Company, 1979.
[35] Gurd, J. R., Kirkham , C.C., and W atson, I., “The M anchester data-flow
com puter,” Communications of the ACM , Vol. 28, Number 1, January
1985, pp. 34-52.
[36] Hoare, C.A.R., “Communicating sequential processes,” Communications of the ACM, Vol. 21, Number 8, August 1978.

[37] Hong, Y. C., Payne, T. H., and Ferguson, L. B. O., “Graph allocation in static data-flow systems,” Proceedings of the 13th International Symposium on Computer Architecture, ACM, Tokyo, Japan, June 1986, pp. 55-64.

[38] Huang, J. P., “Modeling of Software Partition for Distributed Real-Time Applications,” IEEE Transactions on Software Engineering, October 1985, pp. 1113-1126.

[39] Hwang, K. and Briggs, F. A., “Computer Architecture and Parallel Processing,” McGraw Hill, 1984.

[40] Hwang, K., “Advanced Parallel Processing with Supercomputer Architectures,” Proceedings of the IEEE, Vol. 75, No. 10, October 1987.

[41] Inmos, Ltd., “Occam 2 Reference Manual,” Prentice Hall, 1988.

[42] Inmos, Ltd., “Transputer Development System,” Prentice Hall, 1988.

[43] Kruskal, C. and Weiss, A., “Allocating Independent Subtasks on Parallel Processors,” IEEE Transactions on Software Engineering, Vol. SE-11, October 1985.

[44] Kuck, D. J., Kuhn, R. H., Padua, D. A., Leasure, B., and Wolfe, M., “Dependence Graphs and Compiler Optimizations,” Proceedings of the 8th ACM Symposium on the Principles of Programming Languages, January 1981.

[45] Lee, R. B-L, “Empirical Results on the Speed, Efficiency, Redundancy and Quality of Parallel Computations,” Proceedings of the 1980 International Conference on Parallel Processing, August 1980, pp. 91-96.

[46] Luenberger, D. G., “Linear and Nonlinear Programming,” Second Edition, Addison-Wesley Publishing Company, 1984.

[47] May, D., “Occam,” Inmos technical notes, 1983.

[48] McGraw, J. R., “The VAL Language: Description and Analysis,” ACM Transactions on Programming Languages and Systems, 4(1), January 1982, pp. 44-82.

[49] McGraw, J., and Skedzielewski, S., “SISAL: Streams and Iteration in a Single Assignment Language, Language Reference Manual, Version 1.2,” Lawrence Livermore National Laboratory Technical Report M-146, March 1985.

[50] Mowbray, T. J., “Language features for a static data-flow environment,” Ph.D. dissertation, University of Southern California, Los Angeles, CA, May 1983.
[51] Najjar, W. and Gaudiot, J. L., “Multi-Level Execution in Data-Flow Architectures,” Proceedings of the 1987 International Conference on Parallel Processing, August 1987, pp. 32-39.

[52] Paige, M. R., “On Partitioning Program Graphs,” IEEE Transactions on Software Engineering, November 1977, pp. 386-393.

[53] Papadimitriou, C. H. and Steiglitz, K., “Combinatorial Optimization: Algorithms and Complexity,” Prentice-Hall, 1982.

[54] Perrott, R. D., “Parallel Programming,” Addison-Wesley Publishing Company, 1987.

[55] Polychronopoulos, C. D., “Parallel Programming and Compilers,” Kluwer Academic Publishers, 1988.

[56] Rumbaugh, J., “A data flow multiprocessor,” IEEE Transactions on Computers, February 1977, pp. 138-146.

[57] Sargeant, K. and Kirkham, C. C., “Stored Data Structures on the Manchester Dataflow Machine,” Proceedings of the 13th International Symposium on Computer Architecture, ACM, Tokyo, Japan, June 1986, pp. 235-242.

[58] Sarkar, V., “Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors,” Technical Report CSL-TR-87-328, Stanford University, Computer Systems Laboratory, April 1987.

[59] Srini, V. P., “An Architectural Comparison of Data-flow Systems,” IEEE Computer, March 1986, pp. 68-88.

[60] Skedzielewski, S. and Glauert, J., “IF1 - An Intermediate Form for Applicative Languages,” Manual M-170, Lawrence Livermore National Laboratory, July 1985.

[61] Skedzielewski, S. and Welcome, M. L., “Data Flow Graph Optimization in IF1,” Functional Programming Languages and Computer Architecture, Springer-Verlag, September 1985, pp. 17-34.

[62] Treleaven, P. C., Brownbridge, D. R., and Hopkins, R. P., “Data-driven and demand-driven computer architecture,” ACM Computing Surveys, Vol. 14, No. 1, March 1982.

[63] Tremblay, J. P. and Sorenson, P., “The Theory and Practice of Compiler Writing,” McGraw-Hill, 1985.

[64] Yuba, T., Shimada, T., Hiraki, K., and Kashiwagi, H., “SIGMA-1: A Dataflow Computer for Scientific Computations,” Computer Physics Communications, Volume 37, 1985, pp. 141-148.